
Data teams building AI agents keep running into the same failure modes. Queries that require linking structured data with unstructured content, such as sales figures with customer reviews or citation counts with academic papers, break down in single-turn RAG systems.
New research from Databricks puts a number on that failure. The company's AI research team tested a multi-stage agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise intelligence tasks and recorded gains of 20% or more on Stanford's STaRK benchmark suite, along with consistent improvements on Databricks' own CarlBench evaluation framework. Databricks argues that the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural issue, not a model quality issue.
This work builds on Databricks' earlier directed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. The latest research brings structured data sources, such as relational tables and SQL warehouses, into the same reasoning loop, addressing a class of questions that current agent architectures fail to answer.
"RAG works, but it doesn’t scale," Michael Bendersky, director of research at Databricks, told VentureBeat. "If you want to make your agents even better, and understand why your sales are declining, you can now help agents look at tables and view sales data. Your RAG pipeline will be disabled in that function."
Single-turn retrieval cannot encode structural constraints
The main finding is that standard RAG systems fail when a query mixes a precisely structured filter with an open-ended semantic search.
Consider a question like "Which of our products have seen sales decline over the past three months, and what potentially related issues have come up in customer reviews on various seller sites?" Sales data resides in a warehouse. Review sentiment resides in unstructured documents on vendor sites. A single-turn RAG system cannot split that query, route each half to the correct data source, and combine the results.
To confirm that this is an architectural issue rather than a model quality issue, Databricks also published a STaRK baseline using a current state-of-the-art foundation model. Even that strong model trails the multi-step agent by 21% on the academic domain and by 38% on the biomedical domain.
STaRK is a benchmark published by Stanford researchers that covers three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph, and a biomedical knowledge base.
How does the supervisor agent handle what RAG can’t?
Databricks built the Supervisor Agent as a production implementation of this research approach, and its architecture explains why the benefits are consistent across task types. The approach involves three main steps:
Parallel tool decomposition. Instead of issuing a blanket query and hoping the results meet both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without needing to normalize the data first.
Self-correction. When the initial retrieval attempt reaches an impasse, the agent detects the failure, reformulates the query, and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first runs SQL and vector search queries in parallel. When the two result sets show no overlap, it adapts and issues a SQL JOIN across the two constraints, then calls the vector search system to verify the result before returning an answer.
Declarative configuration. The agent is not tailored to a specific dataset or task. Connecting it to a new data source means writing a natural-language description of what that source contains and what kinds of questions it can answer. No custom code is required.
"The agent can perform tasks such as decomposing queries into SQL queries and performing search queries out of the box," Bendersky said. "It can combine the results of SQL and RAG, reason about those results, create follow-up queries, and then reason about whether the final answer was actually found."
It’s not just about hybrid retrieval
The distinction Databricks draws is not about the retrieval technology; it is about the architecture.
"We don’t see it nearly as hybrid retrieval where you combine embeddings and search results, or embeddings and tables," He said. "We think of it as an agent that has access to many tools."
The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code.
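Such a declarative registration might look something like the sketch below. The field names and the keyword-matching router are illustrative assumptions, not the product's actual configuration format; in the real system the agent itself reads the descriptions and decides which tools to call:

```python
# Hypothetical source registry: plain-text descriptions, no retrieval code.
sources = [
    {
        "name": "sales_warehouse",
        "type": "sql",
        "description": "Monthly sales figures by product and region; "
                       "answers quantitative questions about revenue trends.",
    },
    {
        "name": "customer_reviews",
        "type": "vector_index",
        "description": "Unstructured customer reviews from seller sites; "
                       "answers questions about sentiment and issues.",
    },
]

def route(question, sources):
    # Toy router: keyword-match the question against each description.
    words = question.lower().split()
    return [s["name"] for s in sources
            if any(w in s["description"].lower() for w in words)]
```

Adding a third source under this scheme is a matter of appending one more entry, which is the scaling property the research emphasizes.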
Custom RAG pipelines require converting the data into a format that the retrieval system can read, typically text fragments with embeddings. SQL tables need to be flattened, JSON needs to be normalized. Each new data source added to the pipeline means more transformation work. Databricks’ research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format.
"Just bring the agent to the data," Bendersky said. "You basically give the agent more sources, and he will learn to use them well."
What does this mean for enterprises?
For data engineers deciding whether to build a custom RAG pipeline or adopt a declarative agent framework, the research offers clear direction: if the task involves queries that span structured and unstructured data, building custom retrieval is the harder path. Across all tested tasks, only the instructions and tool descriptions differed between deployments; the agent handled the rest.
Practical limitations are real but manageable. This approach works well with five to ten data sources. Tying too many together, without deciding which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling sequentially and confirming the results at each step, rather than combining all available data up front.
Data accuracy is a prerequisite. The agent can query across mismatched formats, SQL sales tables alongside JSON review feeds, without normalization, but it cannot correct source data that is factually wrong. Adding a natural-language description of each data source at ingestion time helps the agent route queries correctly from the start.
The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories, and external data feeds. The research argues that the declarative approach streamlines that scaling, since adding a new source remains a configuration task rather than an engineering project.
"It’s like a ladder," Bendersky said. "The agent will gradually receive more and more information and then gradually improve overall."