Databricks built a RAG agent it says can handle every kind of enterprise search

Most enterprise RAG pipelines are optimized for a single search behavior and fail silently at the others. A model trained to synthesize cross-document reports handles constraint-driven entity discovery poorly; a model built for simple lookups fails at multi-step reasoning over internal notes. Most teams only find out when something breaks.

Databricks aims to fix this with KARL, short for Knowledge Agents through Reinforcement Learning. The company trained an agent simultaneously on six different enterprise search behaviors using a new reinforcement learning algorithm. The result, the company claims, is a model that matches Claude Opus 4.6 on a purpose-built benchmark at 33% lower cost per query and 47% lower latency, trained entirely on synthetic data the agent generated itself, with no human labeling. The comparison is based on KARLBench, a benchmark Databricks created to evaluate enterprise search behaviors.

"The big reinforcement learning wins we've seen in the community in the last year have been on verifiable tasks where there is a right and a wrong answer," Jonathan Frankle, chief AI scientist at Databricks, told VentureBeat in an exclusive interview. "The tasks we are working on for KARL, and which are common to most enterprises, cannot be rigorously verified in the same way."

Those tasks include synthesizing intelligence from product manager meeting notes, reconstructing competitive deal outcomes from fragmented customer records, answering questions about account history where no single document holds the complete answer, and generating battle cards from unstructured internal data. None of them have a single correct answer that the system can automatically check.

"Doing reinforcement learning in a world where you don't have strict right and wrong answers, and figuring out how to guide the process and make sure the reward isn't hacked – it's really non-trivial," Frankle said. "Very little of what companies do in day-to-day knowledge work is verifiable."

The generalization problem in enterprise RAG

Standard RAG collapses on vague, multi-step queries over fragmented internal data it was never designed to interrogate.

To evaluate KARL, Databricks created the KARLBench benchmark to measure performance across six enterprise search behaviors: constraint-driven entity search, cross-document report synthesis, long-document traversal with tabular numerical reasoning, whole-entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal company notes. That final task is PMBench, built from Databricks' own product manager meeting notes – fragmented, vague, and unstructured in a way frontier models handle poorly.

Training on one task and testing on another usually produces poor results. The KARL paper shows that multi-task RL generalizes in ways single-task training does not: the team trained KARL on synthetic data for two of the six tasks and found it performed well on the four tasks it had never seen.

For example, to create competitive battle cards for a financial services client, the agent must identify relevant accounts, filter for recency, reconstruct past competitive deals, and predict outcomes – none of which is labeled anywhere in the data.
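The kind of multi-step, unlabeled retrieval described above can be sketched as a simple agent loop. This is a hypothetical illustration, not Databricks' actual API; `vector_search` and `llm` are placeholder callables:

```python
def build_battle_card(client: str, vector_search, llm, max_steps: int = 200) -> str:
    """Toy multi-step retrieval loop for battle-card generation.

    Hypothetical sketch: the agent iteratively queries a vector database,
    accumulates evidence, and lets the model decide when it has enough.
    """
    facts: list[str] = []
    query = f"accounts related to {client}"
    for _ in range(max_steps):
        hits = vector_search(query, top_k=5)  # one vector-DB call per step
        facts.extend(hits)
        # Ask the model whether the evidence suffices, or what to search next.
        decision = llm(
            "Facts so far:\n" + "\n".join(facts) +
            "\nReply DONE if you can write the battle card, "
            "otherwise reply with the next search query."
        )
        if decision.strip() == "DONE":
            break
        query = decision
    return llm("Write a competitive battle card from:\n" + "\n".join(facts))
```

Nothing in this loop requires labeled intermediate steps, which is the point: only the quality of the final card can be judged.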

Frankle describes what KARL does as grounded reasoning: running a difficult chain of logic in which every step is grounded in retrieved facts. "You can think of it as RAG," he said, "but like RAG plus-plus-plus-plus-plus-plus, up to 200 vector database calls."

The RL engine: why OAPL matters

KARL's training is powered by OAPL, short for Optimal Advantage-based Policy Optimization with Lagged Inference Policy. This is a new approach, jointly developed by researchers at Cornell, Databricks, and Harvard and published in a separate paper a week before KARL.

Standard LLM reinforcement learning uses on-policy algorithms such as GRPO (Group Relative Policy Optimization), which assume that the model generating the training data and the model being updated are in sync. In distributed training, they never are. Previous approaches corrected for this with importance sampling, which introduced variance and instability. OAPL instead embraces the off-policy nature of distributed training, using a regression objective that remains stable at a policy lag of more than 400 gradient steps, 100 times more off-policy than previous approaches. In code generation experiments, it matched a GRPO-trained model using roughly a third of the training samples.
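The contrast between the two styles of objective can be illustrated with a toy example. The first function is an importance-sampled policy-gradient surrogate of the kind GRPO-style methods use; the second is a generic regression-style objective that fits log-probabilities toward advantage-derived targets. This is an illustration of the two families, not OAPL's published loss:

```python
import math

def importance_sampled_loss(pi_new: float, pi_old: float, adv: float) -> float:
    """On-policy-style surrogate for one sampled action.

    The ratio pi_new/pi_old explodes in variance as the training policy
    drifts away from the stale rollout policy (large policy lag).
    """
    ratio = pi_new / pi_old
    return -ratio * adv

def regression_style_loss(pi_new: float, adv: float, beta: float = 1.0) -> float:
    """Off-policy-friendly alternative (toy version, NOT OAPL's exact loss).

    Regresses the log-probability toward an advantage-derived target.
    No probability ratio appears, so stale rollouts stay usable.
    """
    target = beta * adv
    return (math.log(pi_new) - target) ** 2
```

With a badly stale rollout (say `pi_old = 1e-4` while `pi_new = 0.5`), the ratio term in the first loss reaches the thousands, while the regression term stays bounded regardless of how far the policies have drifted.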

OAPL's sample efficiency is what keeps the training budget affordable. Reusing previously collected rollouts rather than requiring fresh on-policy data for each update meant the full KARL training run fit within a few thousand GPU hours – the difference between a research project and something an enterprise team can realistically attempt.

Agent, memory and context stack

In recent months there has been much discussion in the industry about how RAG could be replaced with episodic memory, sometimes called agentic memory.

For Frankle, this is not an either/or debate; he sees it as a layered stack. A vector database with millions of entries sits at the base, far too large to fit in context. The LLM's context window sits at the top. Between them, compression and caching layers are emerging that determine how much of what an agent has already learned can be carried forward.

For KARL, this is not abstract. Some KARLBench tasks require up to 200 sequential vector database queries, in which the agent refines searches, verifies details, and cross-references documents before answering – exhausting the context window multiple times along the way. Instead of training a separate summarization model, the team let KARL learn compression end-to-end through RL: when the context grows too large, the agent compresses it and continues, with the reward at the end of the task as the only training signal. Removing learned compression dropped accuracy on one benchmark from 57% to 39%.
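Learned compression of this kind can be sketched as a budgeted agent loop: when accumulated context exceeds a budget, the agent summarizes it and continues searching. Names here are hypothetical, and the compression step is a fixed call; in KARL the compression behavior itself is learned via RL from the end-of-task reward:

```python
def agent_with_compression(question, vector_search, llm,
                           budget: int = 4000, max_steps: int = 200):
    """Toy search loop with in-context compression (hypothetical sketch)."""
    context = ""
    query = question
    for _ in range(max_steps):
        context += "\n".join(vector_search(query, top_k=5)) + "\n"
        if len(context) > budget:
            # Compress when over budget. In KARL this behavior is learned
            # end-to-end via RL; here it is a hard-coded summarization call.
            context = llm("Compress, keeping only facts relevant to: "
                          + question + "\n" + context)
        decision = llm("Context:\n" + context +
                       "\nAnswer or give the next search query for: " + question)
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision
    return llm("Best-effort answer for: " + question + "\n" + context)
```

The design choice the article describes is exactly this: no separate summarizer, no auxiliary loss, just the final reward deciding what the agent keeps.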

"We simply let the model figure out how to compress its own context," Frankle said. "And it worked phenomenally well."

Where KARL falls short

Frankle was candid about the failure modes. KARL struggles most on questions with significant ambiguity, where multiple valid answers exist and the model cannot tell whether the question is genuinely open-ended or simply hard. That judgment call remains unresolved.

The model also exhibits what Frankle described as giving up early on some questions – abstaining rather than producing a final answer. He pushed back on framing this as a failure, noting that the most expensive queries are usually the ones the model gets wrong anyway. Stopping is often the right decision.

KARL was also trained and evaluated specifically on vector search. Tasks requiring SQL queries, file searches, or Python-based calculations are not yet in scope. Frankle said those capabilities are next on the roadmap, but they're not in the current system.

What this means for enterprise data teams

KARL surfaces three decisions worth reconsidering for teams evaluating their retrieval infrastructure.

The first is pipeline architecture. If your RAG agent is optimized for one search behavior, the KARL results suggest it is silently failing at others. Multi-task training across diverse retrieval behaviors produces models that generalize; narrow pipelines do not.

The second is why RL matters here – and it is not just a training detail. Databricks tested an alternative: distilling from expert models via supervised fine-tuning. That approach improved in-distribution performance but produced negligible gains on tasks the model had never seen. RL, by contrast, produced general search behaviors that transferred. For enterprise teams facing heterogeneous data and unpredictable query types, that difference is the whole game.

The third is what RL efficiency actually means in practice. A model trained to search well completes tasks in fewer steps, stops earlier on questions it cannot answer, diversifies its searches rather than repeating failed queries, and compresses its own context rather than running out of room. The argument for training purpose-built search agents rather than routing everything through a general-purpose frontier API isn't primarily about cost. It's about building a model that knows how to do the job.
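One of the efficiency behaviors above – stopping early instead of grinding out a wrong answer – can be made concrete with a toy abstention rule: give up once repeated searches stop surfacing new evidence. This is a hand-written heuristic for illustration, not KARL's learned policy:

```python
def search_with_abstention(question, vector_search, llm, max_steps: int = 10):
    """Toy search loop that abstains when searches stop yielding new evidence."""
    seen: set[str] = set()
    stale = 0
    for step in range(max_steps):
        query = llm(f"Next search query for: {question} (step {step})")
        hits = vector_search(query, top_k=5)
        new = [h for h in hits if h not in seen]
        if not new:
            stale += 1
            if stale >= 2:   # two fruitless searches in a row: abstain
                return None  # cheaper than producing a confident wrong answer
        else:
            stale = 0
            seen.update(new)
    return llm("Answer from evidence:\n" + "\n".join(sorted(seen)))
```

In KARL, per the article, this kind of stopping behavior emerges from training rather than from a fixed threshold.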


