The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

There is no shortage of generic AI benchmarks designed to measure the performance and accuracy of a given model across a variety of helpful enterprise tasks – from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks share a major flaw: they measure AI’s ability to complete specific tasks and requests, not how factual the model’s output is – how reliably it produces objectively correct information tied to real-world data – especially when that information is contained in images or graphics.

For industries where accuracy is paramount – legal, finance and medical – the lack of a standardized way to measure factuality has been a serious blind spot.

That changes today: Google’s FACTS team and its data science unit Kaggle released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.

The accompanying research papers offer a more nuanced definition of the problem, partitioning "factuality" into two distinct operating scenarios: "grounded factuality" (anchoring responses in the data provided) and "world knowledge factuality" (retrieving information from memory or the web).

While the headline news is Gemini 3 Pro’s top-tier placement, the deeper story for builders is the industry-wide "factuality wall."

According to the preliminary results, none of the models – including Gemini 3 Pro, GPT-5 and Claude 4.5 Opus – managed to score above 70% accuracy across the problem set. For tech leaders, this is the signal: the era of "trust but verify" is not over yet.

Breaking down the benchmark

FACTS Suite goes beyond simple questions and answers. It is composed of four separate tests, each of which simulates a different real-world failure mode that developers encounter in production:

  1. Parametric benchmark (internal knowledge): Can the model accurately answer trivia-style questions using only its training data?

  2. Search benchmark (tool use): Can the model effectively use web search tools to retrieve and synthesize live information?

  3. Multimodal benchmark (vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?

  4. Grounding benchmark v2 (source fidelity): Can the model stick strictly to a given source text?

Google has released 3,513 examples to the public, while Kaggle holds back a private set to prevent developers from training on the test data – a common problem known as "contamination."

The leaderboard: a game of inches

In the initial round of benchmarks, Gemini 3 Pro leads with an overall factuality score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds lie for engineering teams.

| Model | FACTS score (average) | Search (RAG capability) | Multimodal (vision) |
|---|---|---|---|
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

Data is taken from the FACTS team’s release notes.

For builders: the "search" versus "parametric" gap

For developers building RAG (Retrieval-Augmented Generation) systems, the search benchmark is the most important metric.

The data shows a huge gap between a model’s ability to "know" things (parametric) and its ability to "find" things (search). For example, Gemini 3 Pro scores a high 83.8% on search tasks, but only 76.4% on parametric tasks.

This validates the current enterprise architecture standard: do not rely on the model’s internal memory for important facts.

If you’re building an internal knowledge bot, the FACTS results show that linking your model to a search tool or vector database is not optional – it’s the only way to push accuracy toward acceptable production levels.
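For illustration, here is a minimal sketch of that pattern: retrieve relevant documents first, then instruct the model to answer only from the retrieved context. The naive keyword-overlap retrieval stands in for a real vector database, and all function names and documents are hypothetical.

```python
def retrieve(query, documents, top_k=2):
    """Naive keyword-overlap retrieval – a stand-in for a vector-DB lookup."""
    query_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(query, documents):
    """Assemble a prompt that pins the model to the retrieved context."""
    context = "\n---\n".join(retrieve(query, documents))
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Hypothetical internal knowledge-base snippets:
docs = [
    "Refunds are processed within 14 days of a return request.",
    "Shipping is free on orders over $50.",
    "Our support line is open 9am-5pm on weekdays.",
]
prompt = build_grounded_prompt("How long do refunds take?", docs)
print(prompt)
```

The key design choice is the instruction to refuse when the context is silent – it trades coverage for the grounded factuality that the FACTS results show models cannot yet deliver from memory alone.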

The multimodal warning

The most worrisome data point for product managers is performance on multimodal tasks. Scores here are universally low: even the category-leading Gemini 2.5 Pro achieved only 46.9% accuracy.

The benchmark tasks included reading charts, interpreting diagrams, and identifying objects in images. With accuracy below 50% across the board, multimodal AI is not yet ready for unsupervised data extraction.

Bottom line: If your product roadmap involves AI automatically extracting data from invoices or interpreting financial charts without human-in-the-loop review, you are potentially introducing significant error rates into your pipeline.

Why this matters to your stack

The FACTS benchmark is likely to become a standard reference point for procurement. When evaluating models for enterprise use, technology leaders should look beyond the overall score and focus on the specific sub-benchmarks that match their use case:

  • Building a customer support bot? Check the grounding score to make sure the bot sticks to your policy documents. (Gemini 2.5 Pro actually beats Gemini 3 Pro here, 74.2 vs. 69.0.)

  • Building a research assistant? Prioritize search score.

  • Building an image analysis tool? Proceed with extreme caution.

As the FACTS team noted in their release, "All evaluated models achieved overall accuracy below 70%, leaving much room for future advancements." For now, the message to the industry is clear: models are getting smarter, but they are still not infallible. Design your system on the assumption that, roughly one-third of the time, the raw model may be wrong.
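One way that assumption can translate into system design is a review gate: auto-approve only answers that pass a grounding check, and route everything else to a human. The sketch below uses a crude lexical check and an illustrative threshold – a stand-in for a real verification model, not anything from the FACTS release.

```python
import string

def grounding_check(answer, source_text):
    """Crude lexical check: what fraction of the answer's content words
    appear verbatim in the source text? (Illustrative stand-in for a
    real verification model or citation checker.)"""
    words = [w.strip(string.punctuation) for w in answer.lower().split()]
    content_words = [w for w in words if len(w) > 3]
    if not content_words:
        return 0.0
    hits = sum(1 for w in content_words if w in source_text.lower())
    return hits / len(content_words)

def route(answer, source_text, threshold=0.7):
    """Auto-approve answers that look well grounded; send the rest
    to human review. The threshold is illustrative."""
    if grounding_check(answer, source_text) >= threshold:
        return "auto"
    return "human_review"

source = "Invoice 4812 totals $1,200, due on March 3."
print(route("The invoice totals $1,200.", source))  # well grounded -> "auto"
print(route("The invoice totals $2,100.", source))  # wrong amount -> "human_review"
```

The exact check matters less than the shape of the pipeline: no raw model output reaches the user, or a downstream system, without either passing verification or passing a human.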


