Surprise Upset: GPT-5.5 Beats Claude Fable 5 On Brutal New Agents’ Last Exam Benchmark

Researchers at the Center for Responsible, Decentralized Intelligence (RDI) at the University of California, Berkeley, along with an advisory committee of more than 300 domain experts, have launched the Agents’ Last Exam (ALE), a rigorous new benchmark designed to measure whether artificial intelligence can truly execute economically valuable, long-horizon professional workflows.

In a surprise upset, OpenAI’s GPT-5.5, released in April and running in the Codex harness, took the top spot on the new ALE leaderboard with a 24.0% pass rate. It beat Anthropic’s highly anticipated, brand-new Mythos-class Claude Fable 5 model, released just yesterday, which came in third with a pass rate of 22.0%.

Rather than testing models on isolated coding puzzles, ALE is explicitly designed to bridge the gap between academic benchmark hype and real, GDP-relevant labor impact. And right now, the data shows that the world’s most advanced models are largely failing the test.

End of the era of ‘cheating’ and weak graders

The fundamental change in ALE lies in its evaluation framework and the demands placed on the agent.

Historically, AI benchmarks have relied on static question-answer formats or narrow, text-based terminal environments. Recent agentic evaluations introduced multi-step interactions but suffered from serious grading issues.

As recent independent audits of legacy leaderboards like SWE-Bench Pro pointed out, automated validators often reject correct solutions, and some models, notably the Claude Opus family, have been caught "cheating": instead of solving the underlying problem, they read answer keys hidden in the Git history of the container.

ALE addresses these shortcomings by forcing models into a strict generalist computer-use agent (GCUA) framework. To pass, an agent cannot simply execute terminal commands.

The benchmark reflects capability across five functional layers: brain (logic), eyes (visual perception), body (orchestration), hands (tool invocation), and legs (runtime substrate).

An agent must use its "eyes" and "hands," interleaving shell scripting with point-and-click operations inside bulky desktop software, to navigate Linux or Windows virtual machines.

Importantly, ALE almost completely rejects the unpredictable "LLM as judge" grading paradigm, relying on it for just 6.8% of its workflows. If a task involves generating 3D meshes or parsing SEC filings, the benchmark uses deterministic, code-based evaluation to compare the agent’s artifacts against an expert’s ground-truth reference.
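To make the idea of deterministic, code-based grading concrete, here is a minimal sketch of how an agent’s output might be checked against an expert reference. It is not ALE’s actual grader: the function name `grade_artifact`, the field names, and the tolerance are all hypothetical, chosen only to illustrate a check that returns the same verdict every time it runs.

```python
import math

def grade_artifact(produced: dict, reference: dict, rel_tol: float = 1e-6) -> bool:
    """Hypothetical deterministic check: every reference field must be matched.

    Numeric fields are compared within a relative tolerance; all other
    fields must be exactly equal. No model is consulted at any point.
    """
    for key, expected in reference.items():
        if key not in produced:
            return False  # a missing field is a failure
        actual = produced[key]
        both_numeric = isinstance(expected, (int, float)) and isinstance(actual, (int, float))
        if both_numeric:
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                return False
        elif actual != expected:
            return False
    return True

# Illustrative values only, standing in for fields parsed from a filing.
reference = {"fiscal_year": 2023, "total_revenue": 1250.5, "currency": "USD"}
good = {"fiscal_year": 2023, "total_revenue": 1250.5000001, "currency": "USD"}
bad = {"fiscal_year": 2023, "total_revenue": 1300.0, "currency": "USD"}

print(grade_artifact(good, reference))
print(grade_artifact(bad, reference))
```

The point of a check like this is repeatability: the same artifact always gets the same verdict, which is not guaranteed when a language model acts as the judge.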

Measuring Performance in 55 Industries

ALE has launched with 1,490 task instances and is moving towards a lofty goal of 5,000 tasks. What makes the benchmark remarkable is its authenticity. The tasks are strictly mapped to the U.S. federal occupational classification (O*NET/SOC 2018) and cover 55 non-physical industry sub-domains.

Workflows are derived directly from the professional experience of industry practitioners. Agents are asked to perform 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects.

When faced with these authentic, long-horizon workflows, the limitations of current AI become apparent. ALE divides its tasks into three difficulty levels: near-term, full-spectrum, and final-exam.

Top 5 Agentic Harnesses on ALE Leaderboard

Rank	Agent harness	Underlying model	Pass rate	Mean score
1	Codex	GPT-5.5	24.0%	42.8%
2	ALE Claw	GPT-5.5	23.0%	45.8%
3	Claude Code	claude-fable-5	22.0%	40.5%
4	Open Paw	GPT-5.5	21.1%	41.0%
5	Cursor CLI	composer-2.5	20.4%	38.5%

GPT-5.5’s win is in line with recent third-party analysis suggesting that OpenAI’s models are currently better at rigorously following complex, multi-part prompts. Conversely, users report that Anthropic’s Claude models can sometimes be "forgetful" with multi-part instructions, skipping required steps in the middle of a workflow, which is a fatal flaw in ALE’s rigid pipeline.

And while a 24.0% pass rate is enough to claim the crown, absolute performance remains notably low.

At the hardest "final exam" level, representing the upper end of professional difficulty, most configurations, including Anthropic’s older Claude Opus 4.8 and Google’s Gemini CLI, record a disastrous 0.0% pass rate.

Resolving benchmark contamination

A core vulnerability in modern AI assessment is "benchmark contamination": the phenomenon where test questions inadvertently leak into the vast data lakes used to train next-generation models. Once a model has memorized the benchmark, the evaluation becomes completely useless.

ALE addresses this through a two-track release strategy. The project operates as an open-source research initiative, but it guards its evaluation data closely. Only 10% of the dataset (about 150 tasks) is publicly released on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept completely private.

For developers and enterprise evaluators, this means that ALE acts as a "living benchmark". Private tasks are systematically moved into the public pool over time, replacing public tasks as they are retired.

This rolling release is meant to keep the evaluation surface uncontaminated across successive model generations, giving enterprise buyers confidence that an agent’s high score was earned, not memorized.
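The rotation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project’s actual release tooling: the function `rotate_pools`, the task IDs, the rotation size of 20, and the random selection are all hypothetical. The pool sizes approximate the article’s figures (about 10% of 1,490 tasks public).

```python
import random

def rotate_pools(public, private, n_rotate, seed=0):
    """Hypothetical rotation: retire n_rotate public tasks and promote
    the same number of private tasks into the public pool."""
    rng = random.Random(seed)  # fixed seed keeps the rotation reproducible
    retired = set(rng.sample(sorted(public), n_rotate))
    promoted = set(rng.sample(sorted(private), n_rotate))
    new_public = (set(public) - retired) | promoted
    new_private = set(private) - promoted
    return new_public, new_private, retired

# Illustrative pools: 149 public tasks and 1,341 private tasks.
public = {f"task-{i:04d}" for i in range(149)}
private = {f"task-{i:04d}" for i in range(149, 1490)}

new_public, new_private, retired = rotate_pools(public, private, n_rotate=20)
print(len(new_public), len(new_private), len(retired))
```

In this sketch the public pool stays the same size while the private pool shrinks with each rotation, so a scheme like this presumably depends on newly authored tasks refilling the private side.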

Additionally, ALE provides transparency by tracking both "full" and "without license" scores. Because real professional work often requires paid, proprietary software, the "full" leaderboard includes tasks that rely on commercial CAD tools, paid APIs, or licensed datasets.

The "without license" tier omits these license-gated tasks to provide a clean, equal comparison using only freely available tools, ensuring that models are not rewarded simply for access to paid enterprise software.
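As a rough illustration of how the two tiers relate, the sketch below computes a pass rate over all tasks and a second one over only the tasks that need no license. The `TaskResult` record, its `needs_license` flag, and the sample results are assumptions made for this example and do not come from ALE’s published schema or data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    needs_license: bool  # hypothetical flag: task depends on paid software or data

def pass_rate(results) -> float:
    """Fraction of tasks passed; 0.0 for an empty set of results."""
    results = list(results)
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)

def leaderboard_scores(results) -> dict:
    """Score the same run twice: on every task, and on license-free tasks only."""
    results = list(results)
    return {
        "full": pass_rate(results),
        "without_license": pass_rate(r for r in results if not r.needs_license),
    }

# Made-up results for five tasks, two of which are license-gated.
results = [
    TaskResult("cad-001", passed=True, needs_license=True),
    TaskResult("cad-002", passed=False, needs_license=True),
    TaskResult("sec-001", passed=True, needs_license=False),
    TaskResult("sec-002", passed=False, needs_license=False),
    TaskResult("vfx-001", passed=False, needs_license=False),
]

scores = leaderboard_scores(results)
print(scores)
```

Reporting both numbers from the same run makes it visible how much of a score depends on license-gated tasks.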

Bottom line: ALE shows that even the highest performing models and harnesses have room for improvement

For developers frustrated by the gap between marketing claims and actual production performance, ALE’s brutal grading curve is highly validating. Zengyi Qin, an MIT PhD researcher and data contributor to the project, shared images of the paper and a staggering 100+ institution contributor list on X to announce the launch.

"Introducing Agents’ Last Exam (ALE)," Qin wrote. "Built by 300+ domain experts from 100+ institutions. Covering 55 industry domains. The pass rate on the hardest subset for Claude Opus 4.8 is 0.0%. Happy to contribute to this benchmark."

In a follow-up post highlighting the Hugging Face link to the arXiv paper, Qin said:

"Very solid work from project leads @YiyouSun @Xinyang_Han_ @dawnsongtweets and @BerkeleyRDI."

As businesses invest billions in AI agents, they desperately need a compass that points true north. If an agent can ultimately conquer the Agents’ Last Exam, it won’t just be passing an exam; it will prove it is ready to join the workforce. Until then, the sobering pass rates on the leaderboard serve as a necessary reality check for the entire AI ecosystem.
