This tree-search framework hits 98.7% on documents where vector search fails

A new open-source framework called PageIndex tackles one of retrieval-augmented generation's (RAG) oldest problems: handling very long documents.

The classic RAG workflow (segment documents, compute embeddings, store them in a vector database, and retrieve top matches based on semantic similarity) works well for basic tasks like Q&A on small documents.

But as enterprises attempt to move RAG to higher-risk workflows — auditing financial statements, analyzing legal contracts, navigating pharmaceutical protocols — they’re hitting an accuracy hurdle that segment optimization can’t solve.

PageIndex abandons the standard "segment-and-embed" method entirely, treating document retrieval not as a search problem but as a navigation problem.

AlphaGo for documents

PageIndex addresses these limitations by borrowing a concept from game-playing AI rather than search engines: tree search.

When humans need to find specific information in a dense textbook or long annual report, they don’t scan each paragraph linearly. They consult the table of contents to identify the relevant chapter, then section, and finally the specific page. PageIndex forces the LLM to replicate this human behavior.

Instead of pre-calculating vectors, the framework builds a "global index" of the document's structure: a tree whose nodes represent chapters, sections, and subsections. When a query arrives, the LLM performs a tree search, implicitly classifying each node as relevant or irrelevant based on the full context of the user’s request.
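To make the navigation idea concrete, here is a minimal sketch in Python of what a tree-structured index and an LLM-guided traversal could look like. It is not PageIndex's actual API; the node fields and the `llm_is_relevant` stand-in (which would prompt an LLM in a real system) are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str                 # e.g. "Item 7: Management's Discussion"
    summary: str               # short description of what the section covers
    page_range: tuple          # (start_page, end_page) in the source document
    children: list = field(default_factory=list)

def llm_is_relevant(query: str, node: Node) -> bool:
    """Stand-in for an LLM call that judges a node's relevance from its
    title and summary plus the full conversation context."""
    text = (node.title + " " + node.summary).lower()
    return any(word.lower() in text for word in query.split())

def tree_search(query: str, node: Node) -> list:
    """Depth-first navigation: descend only into branches the model keeps,
    and return the leaf sections it should actually read."""
    if not llm_is_relevant(query, node):
        return []
    if not node.children:
        return [node]
    hits = []
    for child in node.children:
        hits.extend(tree_search(query, child))
    return hits
```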

"In computer science terms, a table of contents is a tree-structured representation of a document, and navigating it corresponds to a tree search," Zhang said. "PageIndex applies the same basic idea – tree search – to document retrieval, and can be thought of as an AlphaGo-style system for retrieval rather than a game."

This changes the architectural paradigm from passive retrieval, where the system simply fetches matching text, to active navigation, where an agentic model decides where to look.

Limits of semantic similarity

There is a fundamental flaw in how traditional RAG handles complex data. Vector retrieval assumes that the text most semantically similar to the user’s query is also the most relevant. In professional documents, this assumption repeatedly breaks down.

Zhang cites financial reporting as a prime example of this failure mode. If a financial analyst asks an AI about "EBITDA" (earnings before interest, taxes, depreciation and amortization), a standard vector database will retrieve every passage where that abbreviation or similar wording appears.

"Multiple sections may refer to EBITDA with similar wording, yet only one section defines the exact calculation, adjustment or reporting scope relevant to the question," Zhang told VentureBeat. "A similarity-based retriever struggles to distinguish these cases because the semantic signals are almost indistinguishable."

this is "intent vs content" Difference. User doesn’t want to find word "EBITDA"; They want to understand the “logic” behind it for that specific quarter.

Furthermore, traditional embedding strips the query of its context. Because embedding models have strict input-length limits, the retrieval system typically only looks at the specific question asked, ignoring previous turns of the conversation. This separates the retrieval step from the user’s reasoning process: the system matches documents against a short, decontextualized query rather than the full history of the problem the user is trying to solve.

Solving the multi-hop logic problem

This structural approach has its most visible real-world impact on "multi-hop" queries, which require the AI to follow a trail of breadcrumbs across different parts of a document.

On a recent benchmark test called FinanceBench, a system built on PageIndex, called "Mafin 2.5," achieved a state-of-the-art accuracy score of 98.7%. The performance difference between this approach and vector-based systems becomes apparent when analyzing how they handle internal references.

Zhang offers the example of a question about the total value of deferred assets in the Federal Reserve’s annual report. The body of the report describes the change in that value but never lists the total itself. A footnote in the text, however, points elsewhere: “For more detailed information see Appendix G of this report….”

Vector-based systems usually fail here. The text of Appendix G looks nothing like a user’s query about deferred assets; it is likely just a table of numbers. Because there is no semantic match, the vector database ignores it.

A reasoning-based retriever, by contrast, reads the hint in the main text, follows the structural link to Appendix G, locates the correct table, and returns the accurate figure.
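A rough sketch of how that reference-following step might work, reusing the `Node` structure from the earlier sketch: after reading a relevant section, the agent scans its text for cross-references and jumps to the named node. The regex and one-hop logic are illustrative assumptions, not PageIndex's implementation.

```python
import re

APPENDIX_REF = re.compile(r"see (Appendix [A-Z])", re.IGNORECASE)

def find_by_title(root, title):
    """Return the first node whose title contains the given text, if any."""
    if title.lower() in root.title.lower():
        return root
    for child in root.children:
        hit = find_by_title(child, title)
        if hit:
            return hit
    return None

def follow_references(root, section_text, visited):
    """Follow 'see Appendix X' hints in a section to the referenced nodes."""
    targets = []
    for ref in APPENDIX_REF.findall(section_text):
        node = find_by_title(root, ref)
        if node and node.title not in visited:
            visited.add(node.title)
            targets.append(node)
    return targets
```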

Latency trade-offs and infrastructure changes

For enterprise architects, the immediate concern with an LLM-powered search process is latency. Vector lookups complete in milliseconds; an LLM "reading" a table of contents suggests a far slower user experience.

However, Zhang points out that due to how retrieval is integrated into the generation process, the perceived latency to the end user may be negligible. In a classic RAG setup, retrieval is a blocking step: the system must search the database before it can begin generating answers. With PageIndex, retrieval occurs inline during the model’s reasoning process.

"The system can start streaming immediately, and retrieve when generated," Zhang said. "This means that PageIndex does not add an additional ‘retrieval gate’ before the first token, and the Time to First Token (TTFT) is equivalent to a normal LLM call."

This architectural change also simplifies data infrastructure. By removing the dependency on embeddings, enterprises no longer need to maintain a dedicated vector database. Tree-structured indexes are lightweight enough to sit in a traditional relational database like PostgreSQL.
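As a rough illustration of how lightweight such an index can be, the tree can live as an adjacency-list table in an ordinary relational database. The schema below is an assumption for illustration (using the standard-library sqlite3 module as a stand-in for PostgreSQL), not PageIndex's actual storage format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE doc_tree (
        node_id    INTEGER PRIMARY KEY,
        doc_id     TEXT NOT NULL,
        parent_id  INTEGER REFERENCES doc_tree(node_id),  -- NULL for the root node
        title      TEXT NOT NULL,
        summary    TEXT,
        start_page INTEGER,
        end_page   INTEGER
    )
""")
conn.execute(
    "INSERT INTO doc_tree VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "annual_report_2023", None, "Appendix G", "Deferred asset tables", 112, 118),
)
conn.commit()
```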

It also addresses a growing problem in LLM systems with retrieval components: the complexity of keeping the vector store in sync with live documents. PageIndex’s structure separates indexing from text extraction. If a contract is amended or a policy is updated, the system can handle small edits by reindexing only the affected subtree rather than reprocessing the entire document corpus.
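A minimal sketch of that incremental update, again assuming the `Node` structure from the earlier sketch and a `summarize` callable that would wrap an LLM call:

```python
def reindex_subtree(node, summarize):
    """When a section is edited, refresh only that node and its descendants;
    the rest of the tree (and its stored rows) stays untouched."""
    node.summary = summarize(node)
    for child in node.children:
        reindex_subtree(child, summarize)
```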

A decision matrix for enterprises

While the increase in accuracy is compelling, tree-search retrieval is not a universal replacement for vector search. The technology is best viewed as a specialized tool for "deep work" rather than a catch-all for every retrieval task.

For small documents such as emails or chat logs, the entire content often fits in the context window of a modern LLM, making any retrieval system unnecessary. In contrast, for purely semantic tasks, such as recommending similar products or finding content with a similar "feel," vector embeddings remain the better choice because the goal is proximity, not logic.

PageIndex fits right in the middle: long, highly structured documents where the cost of error is high. This includes technical manuals, FDA filings, and merger agreements. In these scenarios, the need is for auditability. An enterprise system should be able to explain not only the answer, but also the path it took to find it (for example, confirming that it checked Section 4.1, followed the references to Appendix B, and synthesized the data found there).

The future of agentic retrieval

The rise of frameworks like PageIndex signals a broader trend in the AI stack: the move toward "agentic RAG." As models become more capable of planning and reasoning, the responsibility for finding data is moving from the database layer to the model layer.

We’re already seeing this in the coding field, where agents like Claude Code and Cursor are moving away from simple vector lookups in favor of active codebase exploration. Zhang believes general document retrieval will follow the same trajectory.

"Vector databases still have suitable use cases," Zhang said. "But their historical role as the default database for LLM and AI will become less clear over time."
