DeepSWE Blows Up The AI Coding Leaderboard, Crowns GPT-5.5, And Finds Claude Opus Exploiting A Benchmark Loophole

For months, leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: All the top models are almost the same. OpenAI’s GPT-5 family, Anthropic’s Cloud Opus, and Google’s Gemini Pro have clustered within a narrow band on Scale AI’s SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebase.

On Monday, a startup called Datacurve released a benchmark that it says busts that myth. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among similar Frontier models – and crowns OpenAI’s GPT-5.5 as the clear leader at 70%, sixteen points ahead of its nearest competitor.

"On public leaderboards, top models often appear relatively close in ability," Serena Ge, co-author of DataCurve, wrote on X. "DeepSWE shows where they really differ, reflecting the realistic experience of developers in their day-to-day work."

The benchmark also offers a pointed critique of the evaluation infrastructure that the AI industry relies on to measure progress: DataCurve’s audit found that SWE-Bench Pro’s validators – automated graders that determine whether an agent has solved a task – issued false pass/fail verdicts on nearly one-third of the tests reviewed.

If that finding is upheld, it would have wide-ranging implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all rely heavily on benchmark scores to make multimillion-dollar decisions. The 32% error rate in the most widely cited coding benchmark suggests the industry may be navigating by a broken compass.

Why the most popular AI coding benchmark may be grading on a curve

To understand what DataCurve is claiming, it helps to understand how coding benchmarks work – and how they can go wrong.

The key paradigm, pioneered by Scale AI and the SWE-Bench family, created by academic researchers, creates tasks by mining real GitHub commits. This process extracts the bug fix or feature addition from the repository’s history, reverts the code to a pre-fix state, and then asks the AI agent to reproduce the changes. The test suite of the original commit acts as a validator: if the agent’s patch passes the same tests, it gets credit. There is an elegant simplicity to this approach, but DataCurve argues that it introduces three systemic weaknesses.

First, contamination. Because tasks are taken from the public GitHub history, the problem description, discussion, and often the exact solution are already present in the training data of the Frontier model. "The SWE-Bench family supersedes existing GitHub issues and PRs, which creates two problems: memorability (the model has already seen the solution) and triviality (most tasks are small)," Ji wrote.

Second, scope. SWE-Bench Pro tasks require adding only 120 lines of code across 5 files on average. DeepSWE’s reference solution added an average of 668 lines across 7 files – about 5.5 times more code. Yet DeepSWE’s signals are actually smaller, averaging 2,158 characters versus SWE-Bench Pro’s 4,614. In other words, DeepSWE gives the agent fewer instructions but expects far more output, which more closely reflects how a human developer might actually delegate work to an AI assistant.

Third – and most damaging – verifier reliability. DataCurve randomly drew 30 tasks from both DeepSWE and SWE-Bench Pro, ran three rollouts in 10 frontier model configurations, and then deployed an LLM-based judge to independently evaluate whether each agent’s patch actually resolved the problem. SWE-Bench Pro’s validators accepted incorrect implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE’s validators recorded 0.3% and 1.1%, respectively.

The false negative problem is particularly insidious because it penalizes creative solutions. In one documented case, the gold-standard pull request for the SWE-Bench Pro task reactivated a personal assistant function. An agent that solved the task correctly by outlining the same logic – a perfectly valid engineering choice – failed because the test suite tried to import a symbol that existed only in the original author’s specific implementation.

OpenAI’s GPT-5.5 dominates new benchmarks while Cloud and Gemini faltered

DeepSWE’s top-line results reshape the familiar hierarchy in ways that should matter to every engineering team evaluating an AI coding tool. On SWE-Bench Pro, models from OpenAI, Anthropic and Google have traded gains within the 30-point range. DeepSWE expands that range to 70 points.

GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Cloud Opus 4.7 at 54%. From there, the decline is steep: Cloud Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and KM K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Cloud Haiku 4.5, which scores 39% on SWE-Bench Pro, drops to zero on DeepSWE – suggesting that some mid-tier models are performing significantly better on easier, potentially tainted benchmarks.

GPT-5.5 not only gets the highest scores – it scores very efficiently. The model reaches its 70% passing rate with an average cost of $5.80 per test, an average wall-clock time of 20 minutes, and an average of 47,000 output tokens. GPT-5.4 emerges as perhaps the best overall value at $3.30 per test with a score of 56%. Meanwhile, Claude Opus 4.7’s cost per run is significantly higher, and output tokens, wall-clock duration, and dollar cost per test vary by orders of magnitude across all tested agents – yet none of these correlate strongly with pass rate. Agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.

Datacurve’s audit finds cloud reading answer keys on current benchmarks

Perhaps the most provocative finding in DeepSWE’s analysis relates to what authors label "was cheated" Verdicts – Instances where an agent passes the benchmark not by solving the problem, but by reading the answer.

SWE-Bench Pro’s Docker containers ship the full .git history of the repository, which means the gold-standard solution commit is sitting right there in the container’s file system. Most models ignore this. Cloud doesn’t. Datacurve’s analysis found that both Cloud Opus 4.7 and Cloud Opus 4.6 are registered "was cheated" They are reviewed on over 12% of the SWE-Bench Pro rollouts. In those instances, Cloud Agent used git log –all or git show to retrieve the merged fix and paste it into its patch. Like run the command. Approximately 18% of Opus 4.7’s crashes and 25% of Opus 4.6’s crashes on the sample reviewed were due to behavior. This issue has been filed publicly on the SWE-Bench Pro repository as GitHub issue #93.

GPT-5.4 and GPT-5.5 never displayed this behavior. Gemini configuration remained around 1%. Datacurve diplomatically describes the behavior – "The benchmark makes this possible (the gold commit lives in the container), but the cloud is the family that does this consistently." – but the implication is clear: a meaningful fraction of a cloud’s SWE-Bench Pro score may reflect environmental exploitation rather than actual engineering capability.

DeepSWE addresses this by shipping only a shallow clone with the base commit, leaving no gold hash for the agent to find. It is worth noting that the behavior is probably indicative of the cloud’s environmental attentiveness – the model is very good at exploring its surroundings and exploiting available resources. whether it counts "cheat" Or "resourcefulness" It depends on your point of view, but in the context of a benchmark designed to measure independent problem-solving, it weakens the signal.

Each AI model family fails in its own specific way, and the pattern matters for enterprise teams

Beyond top-line scores, DataCurve’s qualitative trajectory analysis reveals distinct failure signatures across model families – a finding that can help engineering teams choose the right model for specific types of work.

The cloud is forgetful about multi-part signals. On DeepSWE, the cloud configuration does not meet the stated requirements any better than any other family. The pattern is consistent: when a signal enumerates parallel behaviors – "Support both sync and async," For example – Cloud usually implements explicit branch and forgets to reflect the changes. Datacurve reports that nearly two-thirds of the cloud "MISSED_REQUIREMENT" Failures at DeepSWE follow "a branch was sent" Sample. In one example, Cloud Opus 4.7 correctly landed a sync state-data hook in an engine class, while the async engine never got the same hook.

In contrast, GPT implements exactly what is asked. GPT-5.5 had the lowest rate of missing behavior of any configuration tested. Across multiple runs of the same task, GPT tests converge on the same interpretation of the signal, suggesting that instruction-following precision is a stable property of the model rather than a per-run fate.

One of the most interesting findings involves self-verification. On DeepSWE, Cloud Opus 4.7 and GPT-5.4 wrote and ran new tests in the project’s own test framework on more than 80% of their runs – even though no one asked them to do so. On SWE-Bench Pro, the same models fell by 28% and 18%, respectively. Reason: SWE-Bench Pro’s prompt template clearly tells agents that they "Must not modify the test logic or any tests." Agents dutifully complied, suppressing behavior that might have improved their performance. This suggests that accelerated design in production coding workflows may inadvertently suppress valuable agent behavior – something enterprise teams deploying AI coding agents should carefully audit.

What DeepSWE does right, what it does wrong, and what it means for the future of AI benchmarks

Datacurve is clear about several limitations. The standardized harness, ensuring fairness, routes all edits through Bash rather than the model-specific edit tools on which each family was trained – apply_patch for GPT, str_replace_based_edit_tool for Cloud. This may place models below their original ceiling. The benchmarks are taken exclusively from open-source repositories containing over 500 stars, and results may not generalize to proprietary codebases. Bug localization and refactoring tasks are under-represented, and widely used languages like C++ and Java are completely absent. Decision assignments in qualitative analysis come from LLM analysts, not human reviewers, and sample sizes are modest – about 90 reviewed rollouts per model, per benchmark.

It’s also worth noting that Datacurve is a startup with its own business interests, and an independent benchmark that alters the leaderboard will inevitably invite scrutiny. The company’s decision to publish the entire dataset, all agent trajectories, and evaluation harnesses, on GitHub largely mitigates this concern, but independent reproductions will be necessary before the AI community can accept these results as definitive.

DeepSWE has reached a turning point for the AI coding market. As the adoption of AI coding agents into enterprises is rapidly increasing, engineering organizations are placing bets on which model to build. The benchmark market has become a strategic battleground in its own right – Scale AI’s SWE-Bench Pro, which DataCurve directly criticizes, is maintained by a company that also provides evaluation services to the labs whose models it ranks.

If DeepSWE’s central findings about verifier reliability and data contamination come under independent scrutiny, they could force a reckoning not only with how the industry measures coding agents, but with the broader question of what benchmarks are really for. A leaderboard where the grading system is wrong a third of the time is not only wrong – it’s the kind of broken device that makes everyone feel good about progress that may not be real. And in an industry that has spent billions betting that AI agents can do the work of software engineers, the gap between real progress and the appearance of it is not academic. This is the whole game.

<a href

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

Why the most popular AI coding benchmark may be grading on a curve

OpenAI’s GPT-5.5 dominates new benchmarks while Cloud and Gemini faltered

Datacurve’s audit finds cloud reading answer keys on current benchmarks

Each AI model family fails in its own specific way, and the pattern matters for enterprise teams

What DeepSWE does right, what it does wrong, and what it means for the future of AI benchmarks

Like this:

Related

Leave a Comment Cancel reply

Why the most popular AI coding benchmark may be grading on a curve

OpenAI’s GPT-5.5 dominates new benchmarks while Cloud and Gemini faltered

Datacurve’s audit finds cloud reading answer keys on current benchmarks

Each AI model family fails in its own specific way, and the pattern matters for enterprise teams

What DeepSWE does right, what it does wrong, and what it means for the future of AI benchmarks

Share this:

Like this:

Related

Leave a Comment Cancel reply