
The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as fast as the models are improving. On Monday, Artificial Analysis, an independent AI benchmarking organization whose rankings are closely watched by developers and enterprise buyers, released a major overhaul of its Intelligence Index that fundamentally changes how the industry measures AI progress.
The new Intelligence Index v4.0 comprises 10 evaluations spanning agentic capability, coding, scientific reasoning, and general knowledge. But the changes go deeper than swapping test names. The organization removed three key benchmarks – MMLU-Pro, AIME 2025, and LiveCodeBench – that AI companies have long cited in their marketing materials. In their place, the new index adds evaluations designed to measure whether AI systems can accomplish the kind of work people actually get paid to do.
"This index change reflects a broader change: intelligence is being measured less by memory and more by economically useful action," said Arvind Sundar, a researcher who responded to the announcement on X (formerly Twitter).
Why AI benchmarks are breaking: The problem with tests that top models have already mastered
The benchmark overhaul addresses a growing crisis in AI evaluation: leading models have become so capable that traditional tests can no longer meaningfully distinguish between them. The new index deliberately makes the curve harder to climb. According to Artificial Analysis, top models now score 50 or below on the new v4.0 scale, compared with roughly 73 under the previous version – a recalibration designed to restore headroom for future improvements.
This saturation problem has plagued the industry for months. When every frontier model scores in the 90th percentile on a given test, the test loses its usefulness as a decision-making tool for enterprises choosing which AI system to deploy. The new methodology attempts to solve this by weighting four categories equally – agentic tasks, coding, scientific reasoning, and general knowledge – while introducing evaluations on which even the most advanced systems still struggle.
Under the new framework, OpenAI's GPT-5.2 with extended reasoning effort claims the top spot, followed by Anthropic's Claude Opus 4.5 and Google's Gemini 3 Pro. OpenAI describes GPT-5.2 as "the most capable model series ever for professional knowledge work," while Anthropic's Claude Opus 4.5 scores higher than GPT-5.2 on SWE-Bench Verified, a test set that evaluates software engineering capabilities.
GDPval-AA: New benchmark testing whether AI can do your job
The most significant addition to the new index is GDPval-AA, an evaluation based on OpenAI's GDPval dataset that tests AI models on real-world, economically valuable tasks across 44 occupations in nine major industries. Unlike traditional benchmarks, which ask models to solve abstract math problems or answer multiple-choice trivia, GDPval-AA measures whether AI can produce the deliverables professionals actually create: documents, slides, diagrams, spreadsheets, and multimedia content.
Models are given shell access and web browsing capabilities through an agentic harness Artificial Analysis calls "Stirrup." Scores are derived from blind pairwise comparisons, with Elo ratings frozen at the time of evaluation to ensure index stability.
Under this framework, OpenAI's GPT-5.2 with extended reasoning leads with an Elo score of 1442, while the non-thinking version of Anthropic's Claude Opus 4.5 sits at 1403 and Claude Sonnet 4.5 at 1259.
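Artificial Analysis has not published its exact rating implementation, but the mechanics of deriving Elo scores from blind pairwise comparisons can be sketched as follows. The K-factor, starting ratings, and match outcome below are illustrative assumptions, not the organization's actual data:

```python
# Minimal sketch of Elo rating updates from pairwise preference votes.
# K-factor and ratings here are illustrative, not Artificial Analysis's values.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: float, k: float = 32.0):
    """Return new (r_a, r_b) after one comparison; a_won is 1, 0, or 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (a_won - e_a), r_b + k * ((1 - a_won) - (1 - e_a))

# Two hypothetical models start equal; A wins one blind comparison.
ra, rb = update(1400.0, 1400.0, a_won=1.0)
print(ra, rb)  # 1416.0 1384.0 – points gained by A equal points lost by B
```

Freezing ratings at evaluation time, as the article describes, means later comparisons against newly added models do not retroactively shift a published score.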
According to OpenAI, on the original GDPval evaluation, GPT-5.2 beat or tied top industry professionals on 70.9% of well-specified tasks. The company claims GPT-5.2's "specialized knowledge spanning 44 professions outperforms industry professionals on job tasks," and that customers including Notion, Box, Shopify, Harvey, and Zoom have seen its "state-of-the-art long-horizon reasoning and tool-calling performance."
The emphasis on economically measurable outputs marks a philosophical shift in how the industry thinks about AI capability. Instead of asking whether a model can pass the bar exam or solve competition math problems – accomplishments that generate headlines but don't necessarily translate into workplace productivity – the new benchmarks ask whether AI can actually do the job.
Research-level physics problems highlight the limitations of today's most advanced AI models
While GDPval-AA measures practical productivity, another new evaluation, CritPt, shows how far AI systems remain from genuine scientific reasoning. The benchmark tests language models on unpublished, research-level reasoning tasks in modern physics, including condensed matter, quantum physics, and astrophysics.
CritPt was developed by more than 50 active physics researchers from over 30 leading institutions. Its 71 holistic research challenges simulate entry-level, full-scale research projects – the equivalent of warm-up exercises a hands-on principal investigator might assign to junior graduate students. Each problem is hand-curated to have a guess-resistant, machine-verifiable answer.
The results are sobering. Current state-of-the-art models are far from reliably solving full research-scale challenges. GPT-5.2 with extended reasoning leads the CritPt leaderboard with a score of just 11.5%, followed by Google's Gemini 3 Pro Preview and the Thinking variant of Anthropic's Claude Opus 4.5. These scores show that despite significant progress on consumer-facing tasks, AI systems still struggle with the deep reasoning required for scientific discovery.
AI hallucination rates: why the most accurate models aren’t always the most reliable
Perhaps the most revealing new evaluation is AA-Omniscience, which measures factual recall and hallucination across 6,000 questions covering 42 economically relevant topics within six domains: business, health, law, software engineering, humanities and social sciences, and science, engineering, and mathematics.
The evaluation produces an Omniscience Index that rewards accurate knowledge while penalizing hallucinated responses – providing insight into whether a model can distinguish what it knows from what it doesn't. The findings highlight an inconvenient truth: higher accuracy does not guarantee fewer hallucinations. Models with the highest accuracy often fail to lead the Omniscience Index because they guess rather than abstain when uncertain.
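The exact formula behind the index is not reproduced in the article, but a score that rewards correct answers, penalizes hallucinations, and treats abstentions as neutral can be sketched as below. The +1/−1/0 weighting and the example answer counts are assumptions for illustration only:

```python
# Illustrative omniscience-style score: +1 for a correct answer,
# -1 for a confident wrong answer (a hallucination), 0 for abstaining.
# The weighting Artificial Analysis actually uses is an assumption here.

def omniscience_score(correct: int, hallucinated: int, abstained: int) -> float:
    """Net score as a percentage of all questions asked."""
    total = correct + hallucinated + abstained
    return 100.0 * (correct - hallucinated) / total

# A "guesser" with higher raw accuracy can trail a model that abstains when unsure.
guesser = omniscience_score(correct=540, hallucinated=460, abstained=0)      # 54% accurate
abstainer = omniscience_score(correct=430, hallucinated=250, abstained=320)  # 43% accurate
print(guesser, abstainer)  # 8.0 18.0 – the less accurate model scores higher
```

This is the dynamic the article describes: under any scheme that subtracts hallucinations, knowing when *not* to answer matters as much as raw accuracy.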
Google’s Gemini 3 Pro Preview leads the Omniscience Index with a score of 13, followed by Claude Opus 4.5 Thinking and Gemini 3 Flash Reasoning, both at 10. However, the breakdown between accuracy and hallucination rates reveals a more complex picture.
On raw accuracy, Google’s two models lead at 54% and 51% respectively, followed by Claude Opus 4.5 Thinking at 43%. But Google’s models also exhibit higher hallucination rates than peer models, at 88% and 85%. Anthropic’s Claude Sonnet 4.5 Thinking and Claude Opus 4.5 Thinking show hallucination rates of 48% and 58%, respectively, while GPT-5.1 with high reasoning effort comes in at 51% – the second-lowest hallucination rate tested.
Omniscience accuracy and hallucination rate each contribute 6.25% of the weight in the overall Intelligence Index v4.0.
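Combining component scores under such weights is a straightforward weighted average. In the sketch below, only the two 6.25% omniscience weights come from the article; the component names, the other weights, and all scores are hypothetical placeholders:

```python
# Sketch: folding component evaluation scores into one index score via
# normalized weights. Only the 6.25% omniscience weights are from the
# article; everything else here is a hypothetical placeholder.

def weighted_index(scores: dict, weights: dict) -> float:
    """Weighted average of component scores, normalized by total weight."""
    total_w = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_w

scores = {"gdpval_aa": 45.0, "critpt": 11.5,
          "omniscience_acc": 54.0, "omniscience_hall": 12.0}
weights = {"gdpval_aa": 0.25, "critpt": 0.25,
           "omniscience_acc": 0.0625, "omniscience_hall": 0.0625}
print(round(weighted_index(scores, weights), 2))  # 29.2
```

Normalizing by the total weight means a partial set of components (as in this sketch) still yields a score on the same 0–100 scale.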
Inside the AI arms race: How OpenAI, Google and Anthropic fare under new testing
The benchmark shuffle comes at a particularly turbulent moment for the AI industry. All three major frontier model developers have launched major new models within a few weeks of one another – and Gemini 3 still holds the top spot on most leaderboards on LMArena, a widely cited benchmarking tool for comparing LLMs.
Google’s November release of Gemini 3 prompted OpenAI to announce a "code red" effort to improve ChatGPT. OpenAI is relying on its GPT family of models to justify its $500 billion valuation and planned spending of more than $1.4 trillion. "We actually announced this code red to signal to the company that we wanted to put resources into a particular area," said Fidji Simo, OpenAI's CEO of Applications. Sam Altman told CNBC that he expects OpenAI to move out of its code red by January.
Anthropic responded with Claude Opus 4.5 on November 24, achieving a SWE-Bench Verified accuracy score of 80.9% – reclaiming the coding crown from both GPT-5.1-Codex-Max and Gemini 3. The launch marked Anthropic's third major model release in two months. Microsoft and Nvidia have since announced billions of dollars of investment in Anthropic, boosting its valuation to nearly $350 billion.
How Artificial Analysis tests AI models: A look at the independent benchmarking process
Artificial Analysis runs all of its evaluations independently using a standardized methodology. The organization says its "methodology emphasizes objectivity and real-world applicability," and estimates a 95% confidence interval of less than ±1% for the Intelligence Index, based on more than 10 repeated runs on some models.
The organization's published methodology defines key terms that enterprise buyers should understand. According to the documentation, Artificial Analysis considers an "endpoint" to be a hosted instance of a model accessible via API – meaning a single model can have multiple endpoints across different providers. A "provider" is a company that hosts and offers access to one or more model endpoints or systems. Critically, Artificial Analysis distinguishes between "open weights" models, whose weights are publicly released, and truly open-source models – given that many open LLMs ship with licenses that do not meet the full definition of open-source software.
The methodology also clarifies how the organization standardizes token measurement: it uses OpenAI tokens, measured with OpenAI's tiktoken package, as a standard unit across all providers to enable fair comparisons.
What the new AI Intelligence Index means for enterprise technology decisions in 2026
For technology decision makers evaluating AI systems, the Intelligence Index v4.0 provides a more nuanced picture of capability than previous benchmark compilations. The equal weighting of agentic tasks, coding, scientific reasoning, and general knowledge means that enterprises with specific use cases may want to examine category-specific scores rather than relying solely on the overall index.
The introduction of hallucination measurement as a specific, weighted factor addresses one of the most persistent concerns in enterprise AI adoption. A model that appears highly accurate but hallucinates when uncertain creates significant risk in regulated industries such as healthcare, finance, and law.
The Artificial Analysis Intelligence Index is described as "a text-only, English-language evaluation suite." The organization benchmarks models separately for image input, speech input, and multilingual performance.
Reaction to the announcement has been largely positive. "It's great to see the index evolving to reduce saturation and focus more on agentic performance," wrote one commenter on the X post. "Incorporating real-world functions such as GDPval-AA makes the scores more relevant for practical use."
Others made more ambivalent comments. "The new wave of models that is about to come will leave them all behind," one observer predicted. "By the end of the year the singularity will be undeniable."
Whether that prediction proves prescient or premature, one thing is already clear: the era of judging AI by how well it answers test questions is coming to an end. The new standard is simpler and far more consequential – can it do the work?