Frontier Models Are Failing One In Three Production Attempts — And Getting Harder To Audit

AI agents are now embedded in real enterprise workflows, and they are still failing in one out of three attempts on structured benchmarks. The gap between capability and reliability is the defining operational challenge for IT leaders in 2026, according to Stanford HAI’s ninth annual AI Index report.

This uneven, unpredictable performance is called an AI index. "crooked border," A term coined by AI researcher Ethan Mollick to describe the threshold where AI excels and then suddenly fails.

“AI models can win gold medals at the International Mathematical Olympiad, but still can’t reliably tell the time,” Stanford HAI researchers explain.

How do models proceed in 2025?

Adoption of enterprise AI has reached 88%. Notable achievements in 2025 and early 2026:

The Frontier Model improved the Humanities Last Exam (HLE) by 30% in just one year, which includes 2,500 questions in mathematics, natural sciences, ancient languages, and other specific subfields. HLE was designed to be tough for AI and friendly to human experts.
Leading models scored more than 87% on MMLU-Pro, which tests multi-step reasoning based on 12,000 human-reviewed questions across more than a dozen topics. The Stanford HAI researchers say this shows “how competitive the frontier has become on broader knowledge tasks.”
The top models, including Cloud Opus 4.5, GPT-5.2, and Queue3.5, scored between 62.9% and 70.2% on τ-bench. The benchmark tests agents on real-world tasks in realistic domains that include chatting with a user and calling external tools or APIs.
Model accuracy on GAIA, which benchmarks general AI assistants, increased by about 20% to 74.5%.
Agent performance on SWE-Bench Verified increased from 60% to near 100% in just one year. The benchmark evaluates models based on their ability to solve real-world software issues.
The success rate on WebArena increased from 15% in 2023 to 74.3% in early 2026. This benchmark presents a realistic web environment for evaluating autonomous AI agents, tasking them with information retrieval, site navigation, and content configuration.
Agent performance on MLE-Bench, which evaluates machine learning (ML) engineering capabilities, increased from 17% in 2024 to nearly 65% in early 2026.

AI agents are showing increasing capabilities in cybersecurity. For example, the Frontier model solved 93% of the problems on Cybench, a benchmark that includes 40 professional-level tasks in six capture-the-flag categories, including cryptography, web security, reverse engineering, forensics, and exploitation.

This is compared to 15% in 2024 and represents the “fastest improvement rate”, indicating that cybersecurity tasks “are appropriate for current agent capabilities.”

There has also been significant development in video generation over the past year; Models can now capture how objects behave. For example, Google DeepMind’s Veo 3 was tested on over 18,000 generated videos, and demonstrated the ability to simulate jumps and solve mazes without being trained on those tasks.

“Video generation models are no longer simply producing realistic-looking content,” the researchers write. “Some people are starting to learn how the physical world really works.”

Overall, AI is being used across many areas in the enterprise – knowledge management, software engineering and IT, marketing and sales – and expanding into specialized domains such as tax, mortgage processing, corporate finance and legal reasoning, where accuracy ranges from 60 to 90%.

“AI capability is not static,” says Stanford HAI. “It’s growing faster and reaching more people than ever before.”

AI capabilities have increased, but reliability has decreased

Multimodal models now meet or exceed human baselines on PhD-level science questions, multimodal reasoning, and competing mathematics. For example, Gemini Deep Think earned a gold medal at the 2025 International Mathematical Olympiad (IMO) by solving five of six problems from start to finish in natural language within a 4.5-hour time limit – a marked improvement from the silver-level score in 2024.

Yet these same AI systems still fail in about one in three attempts, according to Stanford HAI, and have trouble with basic perception tasks. On Clockbench – a test covering 180 clock designs and 720 questions – Gemini DeepThink achieved only 50.1% accuracy, compared to about 90% for humans. GPT-4.5 High reached an almost identical score of 50.6%.

“Many multimodal models still struggle with something that seems routine to most humans: telling time,” the Stanford HAI report notes. This seemingly simple task combines visual perception with simple arithmetic, identifying the hands of a clock and their positions, and converting them into a time value. Ultimately, errors in any of these steps can add up, leading to inaccurate results, according to the researchers.

In the analysis, the models were shown to be of a range of watch styles: standard analog, watches without a second hand, watches with hands, others with black dials or Roman numerals. But even after fine-tuning on 5,000 synthetic images, the models only improved on familiar motifs and failed to generalize to real-world variations (like distorted dials or thin hands).

The researchers concluded that, when models confused the hour and minute hands, their ability to interpret direction deteriorated, suggesting that the challenge was not just in the data, but in integrating multiple visual cues.

Stanford HAI notes, “Even as models narrow the gap with human experts on knowledge-intensive tasks, this kind of visual reasoning remains a persistent challenge.”

Hallucinations and multi-step logic remain key differences

Even though models are accelerating their reasoning, hallucinations remain a major concern.

For example, in one benchmark, hallucination rates across 26 major models ranged from 22% to 94%. The accuracy of some models declined rapidly when put under scrutiny – for example, GPT-4o’s accuracy dropped from 98.2% to 64.4%, and DeepSeek R1’s accuracy dropped from more than 90% to 14.4%.

On the other hand, Grok 4.20 beta, Cloud 4.5 Haiku, and mimo-v2-pro showed the lowest rates.

Furthermore, models continue to struggle with multi-step workflows, even when they are tasked with more of them. For example, on the τ-Bench benchmark – which evaluates tool usage and multi-turn reasoning – no model exceeds 71%, suggesting that “managing multiturn conversations while using tools correctly and adhering to policy constraints remains difficult even for frontier models,” according to the Stanford HAI report.

models are becoming opaque

The Stanford HAI report states that leading models are now “virtually indistinguishable” from each other in terms of performance. Open-weight models are more competitive than ever, but they are consolidating.

As capability is no longer a “clear differentiator”, competitive pressure is shifting towards cost, reliability and real-world usability.

Frontier Labs is disclosing less information about its models, evaluation methods are rapidly losing relevance, and independent testing cannot always confirm developer-reported metrics.

As Stanford HAI points out: “The most efficient systems are now the least transparent.”

The training code, parameter calculations, dataset size and duration are often constrained by companies like OpenAI, Anthropic and Google. And transparency is declining more broadly: In 2025, 80 out of 95 models were released without associated training code, while only four made their code fully open source.

Furthermore, after rising between 2023 and 2024, the score on the Foundation Model Transparency Index – which ranks leading foundation developers on 100 transparency indicators – has since fallen. The average score is now 40, which represents a decrease of 17 points.

According to the report, “Major gaps remain in disclosure of training data, compute resources, and post-deployment impact.”

Benchmarking AI is becoming harder and less reliable

Benchmarks used to measure AI progress are facing increasing reliability issues, with error rates reaching 42% on a widely used assessment. “More ambitious testing of AI is being conducted in reasoning, security, and real-world task performance,” yet “those measurements are becoming increasingly difficult to trust,” the Stanford report said.

Major challenges include:

“Sparse and decreasing” reporting on developers’ bias
benchmark contamination, or when models are exposed to test data; This may cause scores to be “incorrectly inflated”
Discrepancies between developer-reported results and independent testing
“Poorly constructed” means lack of documentation, statistical significance and details on reproducible scripts
“Increasing opacity and non-standard signals” that make model-to-model comparisons unreliable

According to the report, “Even if benchmark scores are technically valid, strong benchmark performance does not always translate into real-world utility.” Furthermore, “AI capability is outstripping the benchmarks designed to measure it.”

This is leading to “benchmark saturation”, where models achieve such high scores that tests can no longer distinguish between them. More complex, interactive forms of intelligence are becoming increasingly difficult to benchmark. Some are calling for assessments that measure human-AI collaboration rather than individual AI performance, but this technology is in the early stages of development.

According to Stanford HAI, “Evaluations intended to remain challenging for years become saturated within months, narrowing the window in which benchmarks remain useful for tracking progress.”

Are we here? "extreme data"?

As builders move toward more data-intensive heuristics, concerns about data bottlenecks and scaling stability are increasing. Leading researchers are warning that the available pool of high-quality human text and web data has been “exhausted” – a situation known as “peak data”.

According to Stanford HAI, hybrid approaches by combining real and synthetic data can “significantly accelerate training” – sometimes by a factor of 5 to 10 – and small models trained on purely synthetic data have shown promise for narrowly defined tasks such as classification or code generation.

The report states that synthetically generated data can be effective in improving model performance in post-training settings, including fine-tuning, alignment, instruction tuning, and reinforcement learning (RL). However, “these benefits have not generalized to larger, general-purpose language models.”

Instead of “blindly” scaling data, researchers are turning to cutting, curating, and refining the input, and improving performance by cleaning labels, deduplicating samples, and building overall higher-quality datasets.

According to the report, “Discussions over data availability often overlook an important shift in recent AI research.” “Performance gains come from improving the quality of existing datasets, not from acquiring more.”

Responsible AI is being left behind

According to Stanford HAI, while the infrastructure for responsible AI is growing, progress has been “uneven” and unable to rapidly achieve potential.

While almost all leading frontier AI model developers report results on capability benchmarks, related reporting on safety and responsibility is inconsistent and “spotty”.

Documented AI incidents increased significantly year over year – 362 in 2025 compared to 233 in 2024. And, while several Frontier models received “very good” or “good” security ratings under standard use (according to the Illuminate benchmark, which assesses generative AI across 12 “threat” categories), security performance fell across all models when tested against jailbreak attempts using adversarial signals.

“AI models perform well on security tests under normal circumstances, but their security weakens under a deliberate attack,” notes Stanford HAI.

Adding to this challenge, builders have pointed out that improvements in one dimension, such as safety, may impair another dimension, such as accuracy. According to Stanford researchers, “The infrastructure for responsible AI is growing, but progress has been uneven, and it is not keeping pace with the pace of AI deployment.”

The Stanford data makes one thing clear: The difference that matters in 2026 is not between AI and human performance. It’s between what the AI can do in demo and what it does reliably in production. Right now – with less transparency from labs and benchmarks that become saturated before they are useful – it is harder than ever to measure that difference.

<a href

Frontier models are failing one in three production attempts — and getting harder to audit

How do models proceed in 2025?

AI capabilities have increased, but reliability has decreased

Hallucinations and multi-step logic remain key differences

models are becoming opaque

Benchmarking AI is becoming harder and less reliable

Are we here? "extreme data"?

Responsible AI is being left behind

Like this:

Related

Leave a Comment Cancel reply

How do models proceed in 2025?

AI capabilities have increased, but reliability has decreased

Hallucinations and multi-step logic remain key differences

models are becoming opaque

Benchmarking AI is becoming harder and less reliable

Are we here? "extreme data"?

Responsible AI is being left behind

Share this:

Like this:

Related

Leave a Comment Cancel reply