Databricks' OfficeQA uncovers disconnect: AI agents ace abstract tests but stall below 45% on enterprise docs

There is no shortage of AI benchmarks on the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among many others.

AI agents excel at solving abstract math problems and passing the PhD-level exams on which most benchmarks are based, but Databricks has a question for the enterprise: Can they really handle the document-heavy work that most enterprises need to do?

According to new research from the data and AI platform company, the answer is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks reflecting real enterprise workloads, highlighting a significant gap between academic benchmarks and business reality.

"If we focus our research efforts on improving [existing benchmarks]So we’re probably not solving the right problems to make Databricks a better platform," Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. "So we were looking around. How do we create a benchmark that, if we get better at this, we’re actually getting better at solving our customers’ problems?"

The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks, which focus on abstract capabilities, OfficeQA serves as a proxy for the economically valuable tasks that enterprises actually perform.

Why academic benchmarks miss the mark for enterprises

According to Elsen, popular AI benchmarks have several shortcomings from an enterprise perspective.

HLE includes questions requiring PhD-level expertise across a variety of fields. ARC-AGI evaluates abstract reasoning through visual manipulation of colored grids. Both push the boundaries of AI capability, but neither reflects daily enterprise work. Even GDPval, which was created specifically to evaluate economically useful work, misses the target.

"We come from a very deep science or engineering background, and sometimes we make assessments that reflect that," Elsen said. " So they’re either extremely math-heavy, which is a great, useful function, but pushing the limits of human mathematics is not what customers are trying to do with Databricks."

While AI is commonly applied to customer support and coding, Databricks' customer base has a broader set of needs. Answering questions about a document, or a collection of documents, is a common enterprise task, Elsen said. These tasks require parsing complex tables with nested headers, retrieving information across dozens or hundreds of documents, and performing calculations where a single-digit error can lead to a wrong business decision.

Creating a benchmark that reflects enterprise document complexity

To create a meaningful test of grounded reasoning capabilities, Databricks needed a dataset that approximates the messy reality of proprietary enterprise document corpora while remaining freely available for research. The team settled on the US Treasury Bulletin, published monthly for five decades beginning in 1939 and quarterly thereafter.

The Treasury Bulletin checks every box for enterprise document complexity. Each bulletin runs 100 to 200 pages and contains prose, complex tables, charts and figures describing Treasury operations: where federal money came from, where it went, and how it financed government operations. The collection comprises approximately 89,000 pages across eight decades. Until 1996, bulletins were scans of physical documents; later issues were digitally generated PDFs. USAFacts, an organization whose mission is "to make government data easier to access and understand," partnered with Databricks to develop the benchmark, identifying the Treasury Bulletins as the corpus and ensuring questions reflected realistic use cases.

The 246 queries require agents to handle real-world messy-document challenges: scanned images, hierarchical table structures, temporal data spread across multiple reports, and the need for external knowledge such as inflation adjustments. Questions range from simple value lookups to multi-step analyses requiring statistical calculations and cross-year comparisons.
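The inflation adjustment mentioned above is one of the external-knowledge steps such questions can require, and it reduces to a CPI ratio. A minimal sketch, with index values that are purely illustrative assumptions (not from the benchmark):

```python
def adjust_for_inflation(nominal: float, cpi_base: float, cpi_target: float) -> float:
    """Convert a nominal dollar amount into target-year dollars via the CPI ratio."""
    return nominal * (cpi_target / cpi_base)

# Hypothetical example: $100 in a base year with CPI 24.1,
# expressed in a later year with CPI 296.8 (illustrative values only).
print(round(adjust_for_inflation(100.0, 24.1, 296.8), 2))
```

A single slip in the base or target index silently shifts every downstream figure, which is exactly the class of single-digit error the benchmark is designed to surface.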

To ensure the benchmark requires true document-grounded retrieval, Databricks filtered out questions that LLMs could answer using parametric knowledge or web search alone. This eliminated simple queries as well as some surprisingly complex ones where models benefited from historical financial figures memorized during pretraining.

Each question has a verifiable ground-truth answer (usually a number, sometimes dates or short lists), which enables automated evaluation without human judges. This design choice matters: It allows reinforcement learning (RL) approaches that require verifiable rewards, much as when models train on coding problems.
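Because each answer is a number, date, or short list, grading can be a deterministic comparison rather than an LLM judge. A minimal sketch of what such an automated grader might look like (the normalization rules and numeric tolerance here are assumptions, not OfficeQA's actual spec):

```python
def grade(predicted: str, ground_truth: str, rel_tol: float = 1e-4) -> bool:
    """Exact-match grading with a numeric fallback: numbers compare within a
    relative tolerance; everything else (dates, short lists) compares as
    normalized strings."""
    def normalize(s: str) -> str:
        # Lowercase, drop thousands separators, collapse whitespace.
        return " ".join(s.strip().lower().replace(",", "").split())

    p, g = normalize(predicted), normalize(ground_truth)
    try:
        # Numeric answers: strip currency signs and compare within tolerance.
        pv, gv = float(p.lstrip("$")), float(g.lstrip("$"))
        return abs(pv - gv) <= rel_tol * max(abs(gv), 1.0)
    except ValueError:
        return p == g

print(grade("$1,234.50", "1234.5"))    # numeric match despite formatting
print(grade("March 1996", "march 1996"))
```

A verifier this cheap is what makes both large-scale evaluation and RL reward signals practical: every rollout can be scored in microseconds with no human in the loop.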

Current performance highlights fundamental shortcomings

Databricks tested a Claude Opus 4.5 agent (using Claude's SDK) and a GPT-5.1 agent (using OpenAI's file search API). The results should give pause to any enterprise betting heavily on current agent capabilities.

When provided with raw PDF documents:

  • The Claude Opus 4.5 agent (with default intelligence=high) achieved 37.4% accuracy.

  • The GPT-5.1 agent (with reasoning_effort=high) achieved 43.5% accuracy.

However, performance improved significantly when pre-parsed versions of the pages were provided using Databricks' ai_parse_document. This indicates that the poor raw-PDF performance stems from the LLM APIs struggling with parsing rather than reasoning. Even with parsed documents, the experiments show room for improvement.

When pre-parsed documents were provided using Databricks' ai_parse_document:

  • The Claude Opus 4.5 agent achieved 67.8% accuracy (a 30.4-percentage-point improvement).

  • The GPT-5.1 agent achieved 52.8% accuracy (a 9.3-percentage-point improvement).
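The reported gains are straightforward percentage-point differences between the parsed and raw-PDF runs; a quick check of the arithmetic:

```python
# Accuracy figures from the OfficeQA results reported above.
results = {
    "Claude Opus 4.5": {"raw_pdf": 37.4, "parsed": 67.8},
    "GPT-5.1": {"raw_pdf": 43.5, "parsed": 52.8},
}

for agent, r in results.items():
    # Percentage-point improvement from pre-parsing the documents.
    delta = r["parsed"] - r["raw_pdf"]
    print(f"{agent}: {r['raw_pdf']}% -> {r['parsed']}% (+{delta:.1f} pts)")
```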

Three findings that matter for enterprise deployments

The experiments surfaced important insights for practitioners:

Parsing remains the fundamental bottleneck: Complex tables with nested headers, merged cells and unusual formatting often produce misaligned values. Even when given the exact oracle pages, agents struggled, primarily due to parsing errors, although performance nearly doubled with pre-parsed documents.

Document versioning creates ambiguity: Financial and regulatory documents are revised and reissued, meaning multiple valid answers can exist depending on publication date. Once agents find a credible answer, they often stop searching and miss more authoritative sources.

Visual reasoning is a differentiator: About 3% of queries require chart or graph interpretation, where current agents consistently fail. For enterprises where data visualizations communicate critical insights, this represents a meaningful capability limitation.

How enterprises can use OfficeQA

The design of the benchmark enables specific improvement paths beyond simple scoring.

"Since you are able to see the correct answer, it is easier to tell if the error is coming from parsing," Elsen explained.

This automated evaluation enables fast iteration on parsing pipelines. Verified ground-truth answers also enable RL training similar to coding benchmarks, since no human judgment is required.

Elsen said the benchmark provides "a really strong feedback signal" for developers working on retrieval solutions. However, he cautioned against treating it as training data.

"At least in my imagination, the goal of releasing this is as an evaluation and not as a source of raw training data," He said. "If you tune specifically to this environment, it’s unclear how generalizable your agent’s results will be."

What this means for enterprise AI deployment

For enterprises currently deploying, or planning to deploy, document-heavy AI agent systems, OfficeQA provides a sobering reality check. Even the latest frontier models achieve less than 45% accuracy on unprocessed PDFs and fall below 70% accuracy even with optimal document parsing. Performance on the most difficult questions plateaus around 40%, leaving considerable room for improvement.

Three immediate implications:

Evaluate the complexity of your documents: If your documents resemble the complexity profile of a Treasury Bulletin (scanned images, nested table structures, cross-document references), expect significantly lower accuracy than vendor marketing claims suggest. Test with your real documents before production deployment.

Plan for custom parsing: Test results show that parsing remains a fundamental bottleneck. Budget time and resources for custom parsing solutions rather than assuming off-the-shelf OCR is sufficient.

Plan for failure modes on difficult questions: Even with optimal parsing, agents plateau around 40% on complex multi-step queries. For mission-critical document workflows that require multi-document analysis, statistical calculations or visual reasoning, current agent capabilities may not be ready without significant human oversight.

For enterprises seeking to lead in AI-powered document intelligence, this benchmark provides a solid evaluation framework and identifies specific capability gaps that need to be addressed.


