Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question that we should be asking is: can models actually write good code?
We’re excited to introduce FrontierCode, a benchmark that measures how well models can truly meet the standards of high-quality production codebases. What sets us apart:
- Would the maintainer actually merge this PR? We’re the first benchmark to measure code mergeability. Our criteria assess end-to-end code quality — correctness, test quality, scope discipline, style, and adherence to codebase standards. This employs a novel ensemble of grading techniques, including unit tests, rubrics, and new types of verifiers.
- Crafted by open-source maintainers. 20+ world-class open-source developers built realistic, diverse, and challenging coding tasks from the repos they maintain, spending more than 40 hours per task. They define what “mergeable” means in their repo.
- Rigorous quality control. Rubric grading is subjective, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review, where every task is manually reviewed by a Cognition researcher. We achieve an 81% lower false positive rate compared to SWE-Bench Pro.
Our benchmark provides the strongest available signal of a model’s ability to write high-quality, maintainable code. We find that even today’s most capable models struggle on this new standard.
- 20+ world-class open-source maintainers, with 40 hours of effort per task
- Every task manually reviewed by Cognition researchers
- 81% lower false positive rate compared to SWE-Bench Pro
- First-ever benchmark measuring code quality and subtle human preferences
We present three nested subsets of FrontierCode at increasing difficulty: Extended, Main, and Diamond. Diamond comprises the 50 hardest tasks, Main the 100 hardest (including Diamond), and Extended the full set of 150.
We report two metrics, pass rate and score:
- A solution passes if it clears all blocker criteria, i.e., criteria that a maintainer would consider hard stops during code review, and fails otherwise.
- A solution’s score is a weighted aggregate of the rubric items. Solutions that do not pass all blocker criteria receive 0.
Each model is run 5 times at every available reasoning effort. For each effort, we average the metric across the 5 trials, then report each model’s score at its best performing reasoning level.
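The aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `mean` and `best_effort_mean` are hypothetical, each trial is assumed to already be summarized as a single benchmark-level number (a pass rate or a score), and metrics are assumed to be non-negative.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Mean of one metric over the trials run at a single reasoning effort.
double mean(const std::vector<double>& trials) {
    assert(!trials.empty());
    return std::accumulate(trials.begin(), trials.end(), 0.0) /
           static_cast<double>(trials.size());
}

// Reported value: the highest per-effort mean across all reasoning efforts.
// Keys are effort labels (hypothetical, e.g. "low", "high").
double best_effort_mean(
    const std::map<std::string, std::vector<double>>& by_effort) {
    assert(!by_effort.empty());
    double best = 0.0;  // metrics are assumed to be non-negative
    for (const auto& entry : by_effort) {
        best = std::max(best, mean(entry.second));
    }
    return best;
}
```

Averaging within an effort before taking the maximum means a single lucky trial cannot set the reported number; only an effort level that does well on average can.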
FrontierCode Diamond remains unsaturated: the best performing model, Claude Opus 4.8, achieves a score of only 13.4%. Other models score significantly lower: GPT-5.5 scores 6.3%, Gemini 3.1 Pro 4.7%, and others even less. However, GPT-5.5 consistently uses up to 4x fewer tokens than Opus 4.8, achieving a better cost-intelligence tradeoff.
On FrontierCode Main and Extended, Opus 4.8 still maintains a clear lead, at 34.3% and 51.8%, respectively. We also observe a large gap between open-source models and the frontier. Kimi K2.6, the best-performing open-source model, achieves just 3.8% on Diamond, 16% on Main and 37% on Extended.
The rest of this post will be a deep dive into why and how we built FrontierCode.
The first generation of coding benchmarks, such as SWE-Bench Verified and Pro, were designed for less capable models. They fall short on many measures of realism and robustness.
Fundamentally, they only test functional correctness, not quality. Moreover, these benchmarks are prone to misclassification errors. Experiments from METR have found that high-scoring models on these benchmarks often produce patches that wouldn’t be accepted by human maintainers.
How do we define misclassifications? These fall under two categories:
- False Positives: The verifier should not reward solutions that are wrong. Test coverage may be incomplete, allowing the model to write an incorrect solution that’s still accepted.
- False Negatives: The verifier should not penalize solutions that are correct. Tests can be either too specific, e.g. checking for exact error strings or function names, or unsolvable, testing for a behavior not in the instruction or in the codebase.
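The two error types defined above can be stated as a small decision table. The sketch below is illustrative only: `Outcome` and `classify` are hypothetical names, and the ground-truth flag is assumed to come from a human review of the solution.

```cpp
enum class Outcome { TruePositive, TrueNegative, FalsePositive, FalseNegative };

// verifier_accepts: the benchmark's verdict on a solution.
// actually_correct: ground truth from manual review (an assumption here).
Outcome classify(bool verifier_accepts, bool actually_correct) {
    if (verifier_accepts) {
        // Accepted: a misclassification only if the solution is wrong.
        return actually_correct ? Outcome::TruePositive
                                : Outcome::FalsePositive;
    }
    // Rejected: a misclassification only if the solution is correct.
    return actually_correct ? Outcome::FalseNegative : Outcome::TrueNegative;
}
```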
We show through analysis of agent trajectories that FrontierCode produces 81% fewer misclassification errors than other leading benchmarks. This means that FrontierCode scores are the most accurate ranking currently available.
Existing benchmarks also suffer from lack of diversity in several ways.
While other benchmarks generated issues from single PRs via programmatic scraping, FrontierCode tasks are hand-selected by repo maintainers from multi-PR chains and freeform requests. We also cover three times as many languages as SWE-Bench Pro.
It’s also known that existing benchmarks provide too much guidance in the form of overly specified and detailed prompts. Today’s frontier models need far less hand-holding. FrontierCode expects the agent to infer the maintainer’s intent, given the same context as a human contributor.
Our prompts contain two parts. The first is the task description. The second is a set of codebase guidelines covering generic testing, lint, and style practices, like those found in AGENTS.md. The task descriptions are humanlike and deliberately concise: a third the length of SWE-Bench Pro’s.
Furthermore, we’ve chosen to scale the difficulty of tasks using quality rubrics, rather than simply increasing patch size. Despite having smaller patches than benchmarks like DeepSWE, FrontierCode is harder for agents to solve.
To produce an evaluation for code quality as ambitious as FrontierCode, we had to embed quality into every step of the benchmark creation process.
A Team of Open Source Maintainers
FrontierCode aims to measure whether models can produce code that would be merged into production codebases. To ensure this, we collaborated directly with the maintainers of 36 flagship open-source repositories. This team of all-star experts has collectively reviewed and merged thousands of commits to their codebases. They can apply deep stylistic and design knowledge to every PR they see.
Each maintainer invested more than 40 hours per task, undergoing multiple rounds of iteration with other eval engineers and Cognition researchers. They’ve distilled their judgment into concrete evaluation criteria: any PR that satisfies these standards would actually be approved.
Here’s what they say about FrontierCode:
“Working with the team behind FrontierCode was a privilege. Taking on the AI evaluation problem felt like nothing less than an art… Where others grade like a CI, FrontierCode grades like a tech lead.”
Tomer Nosrati, CEO and Tech Lead of Celery (28.6k stars)
“What sets FrontierCode apart is the attention to detail. Each task is calibrated to a depth that simply hasn’t been seen before in LLM benchmarking. We should be moving away from benchmarks that can be gamed and instead using ones like FrontierCode to demonstrate genuine model intelligence and creativity.”
Martin McKeaveney, Co-Founder and CTO of Budibase (28k stars)
“I’m grateful to have worked with leading experts in the Open Source community. We had deep discussions on correctness versus quality and what mergeability means in the context of their repository. FrontierCode is a milestone for AI models respecting subjective quality in the real world.”
Merlijn Vos, Core Maintainer of uppy (30.8k stars)
“FrontierCode’s unique value comes from the human experience encoded in its evals: years of judgment about what makes code high-quality and worthy of merging. The almost obsessive care brought to every criterion is why I believe this benchmark sets a new bar for SWE evaluation.”
Claudio Costa, Core Maintainer of Mattermost (37k stars)
Beyond Unit Tests
FrontierCode measures mergeability by evaluating code along the following axes:
- Behavioral correctness: Does the patch successfully solve the problem?
- Regression safety: Does it break anything in the existing codebase?
- Mechanical cleanliness: Does it pass the project’s build, lint, and style checks?
- Test correctness: Do the agent’s tests actually capture the desired behavior?
- Scope: Does the patch touch only what it needs to?
- Code quality: Does the code conform to codebase conventions, follow sound design patterns, and remain readable to collaborators?
The following table describes how we use both classical unit tests and novel methods, such as adaptive classical grading, scope, and reverse-classical tests (more on these methods below), to evaluate these criteria.
| Category | Method | How it works | Passes when |
|---|---|---|---|
| Behavioral correctness | classical | Injects test files into the repository, runs them, then cleans up. | All injected tests pass |
| Mechanical cleanliness, regression safety | command | Runs a shell command. | Exit code 0 |
| Test correctness | reverse-classical | Runs agent’s submitted tests against the base commit. | The tests fail |
| Behavioral correctness for complex tasks | adaptive classical grading | Uses an LLM to adapt reference tests or application code to align with the implementation. | Adapted tests pass |
| Scope | scope | Checks file boundaries, diff size constraints, and optionally semantic locality of changes. | Diff within constraints |
| Code quality | prompt | An LLM reviews agent’s diff against a natural-language prompt. | LLM score meets threshold |
Each criterion is either a blocker or a non-blocker:
Blockers represent mergeability requirements, i.e., criteria that a maintainer would consider hard stops during code review. These include correctness checks, as well as non-correctness concerns like performance or scope restrictions.
Non-blockers represent quality signals such as code style, type safety, and readability, which would not necessarily block a merge.
If a solution satisfies all the blockers, it is considered passing, and its score is the weighted aggregate of all the rubric items it passes. Otherwise it receives a score of zero.
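A minimal sketch of this pass-and-score rule is shown below. The `Criterion` struct and its fields are hypothetical, and normalizing by the total weight of all criteria (blockers included) is an assumption; the text above only says the score is a weighted aggregate of the rubric items that pass.

```cpp
#include <vector>

// One rubric item after grading (hypothetical representation).
struct Criterion {
    bool blocker;   // true if failing it is a hard stop for merging
    double weight;  // relative importance within the rubric
    bool passed;    // verdict from the criterion's grader
};

// A solution passes only if every blocker criterion passed.
bool passes(const std::vector<Criterion>& rubric) {
    for (const auto& c : rubric) {
        if (c.blocker && !c.passed) return false;
    }
    return true;
}

// Score in [0, 1]: zero if any blocker failed, otherwise the passed weight
// divided by the total weight (the normalization is an assumption).
double score(const std::vector<Criterion>& rubric) {
    if (!passes(rubric)) return 0.0;
    double earned = 0.0;
    double total = 0.0;
    for (const auto& c : rubric) {
        total += c.weight;
        if (c.passed) earned += c.weight;
    }
    return total > 0.0 ? earned / total : 0.0;
}
```

Under this sketch, a solution that clears every blocker but misses a non-blocker still passes and simply earns a lower score, while a single failed blocker zeroes the score regardless of how many quality items it satisfies.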
Novel Grading Methods
We’ve introduced three main techniques to strengthen criteria against misclassifications, while allowing space for multiple valid solutions:
Reverse-Classical: The reverse-classical criterion is a way to ensure that agent-written tests are meaningful: when we run them on the original, broken codebase, they must fail. This gives us an automated, deterministic check that the agent understood the problem well enough to write an effective test for it.
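The check can be sketched in a few lines. Everything here is an assumption for illustration: `TestRunner` is a hypothetical callback that checks out a commit, runs the agent’s submitted tests, and returns the process exit code, and "the tests fail" is taken to mean a nonzero exit code.

```cpp
#include <functional>
#include <string>

// Hypothetical runner: checks out `commit`, runs the agent's submitted tests,
// and returns the exit code of the test command (0 means all tests passed).
using TestRunner = std::function<int(const std::string& commit)>;

// Reverse-classical: the criterion passes when the agent's tests FAIL on the
// base (unfixed) commit, showing that they exercise the changed behavior.
bool reverse_classical_passes(const TestRunner& run_agent_tests,
                              const std::string& base_commit) {
    return run_agent_tests(base_commit) != 0;
}
```

A test suite that also succeeds on the base commit would pass with or without the agent’s change, so it provides no evidence that the change works.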
Code Scope: A good PR should exercise restraint: it modifies only what it needs to, without touching unrelated files or introducing unnecessary refactors. The scope criterion is an automated check that enforces these boundaries. It combines three types of constraints:
- files: For fast, deterministic checks on which files can be allowed, denied, or must be deleted.
- size: To enforce limits on the number of changed lines, net line growth, or total files modified.
- semantic: For LLM-based checks that verify the locality or nature of a change within a specific part of a file (e.g., inside a single function).
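The deterministic parts of such a check (files and size) might look like the sketch below. The struct names, fields, and prefix-based path matching are all hypothetical simplifications; must-delete rules, net line growth, and the LLM-based semantic constraint are left out.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One file touched by the patch (hypothetical representation of a diff).
struct FileChange {
    std::string path;
    int added;    // lines added
    int removed;  // lines removed
};

// Hypothetical scope limits for a task.
struct ScopeLimits {
    std::vector<std::string> allowed_prefixes;  // empty: any path allowed
    std::vector<std::string> denied_prefixes;
    int max_changed_lines;
    int max_files;
};

static bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// True when the diff respects the file and size constraints.
bool within_scope(const std::vector<FileChange>& diff,
                  const ScopeLimits& limits) {
    if (static_cast<int>(diff.size()) > limits.max_files) return false;
    int changed = 0;
    for (const auto& f : diff) {
        changed += f.added + f.removed;
        for (const auto& d : limits.denied_prefixes) {
            if (starts_with(f.path, d)) return false;
        }
        if (!limits.allowed_prefixes.empty()) {
            const bool allowed = std::any_of(
                limits.allowed_prefixes.begin(), limits.allowed_prefixes.end(),
                [&](const std::string& a) { return starts_with(f.path, a); });
            if (!allowed) return false;
        }
    }
    return changed <= limits.max_changed_lines;
}
```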
Adaptive Classical Grading: Open-ended coding tasks can have many valid solutions. Static unit tests are too rigid; good solutions can fail for superficial differences like function names or error wording. We resolve this conflict with mutagent, a tool we built that uses an LLM to surgically patch the test environment (or the application code) and align with the agent’s implementation details, allowing us to run rigorous, deterministic tests on open-ended solutions.
Example Task
Andrew He (ecnerwala) is the second highest rated US competitor on Codeforces, two-time IOI gold medalist, a founding engineer at Cognition and our resident C++ expert. He personally reviewed the models’ behavior on this task.
This task is based on the jsonschema repo, which is written in C++. It requires implementing a new function auto LOG_WARNING() -> std::ostream & that should be used in every instance of printing warning: in the codebase. The helper should prefix log messages with warning:, print to stderr, and ignore the --verbose flag.
The task seems simple: a passing solution has to just identify all places in the given codebase that print warning: and replace them with a call to a newly implemented LOG_WARNING() function. However, models fail this task in a somewhat surprising way. One of the blocking criteria requires that multi-line warning messages idiomatically call LOG_WARNING, like so:
```cpp
LOG_WARNING() << "You are opting in to remove schema identifiers... \n"
              << "The only legit use case...\n"
              << "non-compliant...\n" << ... ;
```
Claude Opus 4.8, on the other hand, consistently opts for the following implementation:
```cpp
LOG_WARNING() << "You are opting in to remove schema identifiers...\n";
std::cerr << "The only legit use case...\n";
std::cerr << "non-compliant...\n";
```
These two are behaviorally the same; in both cases a multi-line warning message will be printed to stderr. However, the agent solution bakes in the assumption at the call site that LOG_WARNING() and std::cerr are the same stream, which could change in a future modification of LOG_WARNING().
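To make the risk concrete, here is a self-contained sketch of what happens if the helper is later pointed at a different stream. It is not the repository’s implementation: `warning_sink`, `idiomatic_output`, and `mixed_output` are hypothetical, and the redirectable sink only stands in for "a future modification of LOG_WARNING()".

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical: the stream warnings are written to. Defaults to stderr.
std::ostream*& warning_sink() {
    static std::ostream* sink = &std::cerr;
    return sink;
}

// Same signature as in the task: prefixes the message and returns the stream.
auto LOG_WARNING() -> std::ostream& {
    return *warning_sink() << "warning: ";
}

// Idiomatic form: every line is chained onto the stream LOG_WARNING() returns.
std::string idiomatic_output() {
    std::ostringstream captured;
    warning_sink() = &captured;  // simulate the future redirection
    LOG_WARNING() << "first line\n"
                  << "second line\n";
    warning_sink() = &std::cerr;
    return captured.str();
}

// Mixed form: continuation lines go to std::cerr directly and bypass the sink.
std::string mixed_output() {
    std::ostringstream captured;
    warning_sink() = &captured;  // simulate the future redirection
    LOG_WARNING() << "first line\n";
    std::cerr << "second line\n";
    warning_sink() = &std::cerr;
    return captured.str();
}
```

Once the sink is redirected, `idiomatic_output()` captures both lines, while `mixed_output()` captures only the first and leaks the second to stderr, splitting one warning across two destinations.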
Quality Control
How do we iterate on rubric quality?
Improving binary verifiers like unit tests is relatively tractable because every solution falls into one of two buckets — correct or incorrect. You can examine each rollout, check its bucket, and strengthen the tests accordingly.
Hardening prompt-based criteria is a much harder QC problem. Rubrics introduce a spectrum of correctness: two solutions for the same task can both be functionally correct yet score differently on every criterion. We can no longer look at solutions in isolation. We have to compare within a group of solutions and verify that their relative scores actually separate better solutions from worse ones.
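One simple way to express that group-level check is sketched below. It assumes a reviewer has already ordered a set of solutions from worst to best; the function name is hypothetical, and requiring a strictly increasing score is one possible definition of "separates", not necessarily the one used in our pipeline.

```cpp
#include <cstddef>
#include <vector>

// scores[i] is the rubric score of the i-th solution, where solutions are
// ordered from the one a reviewer judges worst to the one judged best.
// Returns true when the rubric strictly separates each solution from the
// next-worse one.
bool separates_in_order(const std::vector<double>& scores) {
    for (std::size_t i = 1; i < scores.size(); ++i) {
        if (scores[i] <= scores[i - 1]) return false;
    }
    return true;
}
```

A tie or an inversion between two adjacent solutions points at a criterion that is either missing or weighted too lightly to distinguish them.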
Rubric design is also inherently subjective and requires domain expertise. For each criterion, the maintainer must decide whether it’s a blocker or non-blocker, assign its weight relative to other criteria, and ensure complete coverage so that models cannot exploit gaps in the rubric.
Our rubric creation process
1. Design
We prefer classical tests for things that can be checked deterministically, such as correctness. For complex tasks, we favor behavioral tests that are robust to superficial differences in implementation details.
For soft qualities, we prefer LLM grading. This is better for assessing, say, idiomatic code, readability, or adherence to a preferred architectural pattern.
Based on these principles, we first ask the task creator to manually audit each rubric item and document its rationale.
2. Hack report
To prevent false positives, the task author imitates a lazy or adversarial programmer and tries to get a passing score with a deliberately incorrect or incomplete solution. This exposes criteria that can be improved.
To prevent false negatives, the task author tries to write a perfectly valid, alternative solution that is different from the canonical one. If this solution fails the evaluation, the rubric is too rigid.
We augment the hack report process by also asking Devin to come up with novel ways to hack the rubric.
3. Rubric calibration
To ensure that the rubric has sufficient resolution, the author must write four distinct solutions that target a range of scores from 0 to 100%.
4. Review
Each contributor belongs to an eval pod led by an experienced pod lead, who acts as the first quality gate. The lead reviews the full eval candidate and iterates with the contributor through multiple rounds. Once the eval candidate passes all pod-level checks, a Cognition researcher conducts a final review along with the pod lead and contributor. For a random subset, researchers also solve the tasks themselves to verify that instructions are clear and grading is fair.
5. Re-Review
At any stage, reviewers can send the task back for revision. Most tasks cycle through multiple iterations before passing.
The result of this extensive process is a suite of durable, difficult tasks that reflect the high standards of the world’s top open-source repositories.
FrontierCode is the benchmark for the next generation of coding agents. We are confident developers, enterprises, and researchers can trust it to evaluate the production readiness of their strongest models. While we don’t currently plan to release the tasks publicly to avoid contamination, we are opening up our evaluation to all model creators, in the hope that we can push the frontier even further in the coming months.
Acknowledgments
FrontierCode is the product of close collaboration across research, design, and a community of practitioners who lent their expertise to vet tasks and shape the rubric. Thank you to everyone listed below.
- Research
- Eric Lu, Ben Pan, Deniz Birlikci, Sam Lee, Ray Wang, Rohan Choudhury, Fermi Ma, TC Qin, Carlo Baronio, Silas Alberti
- Design
- Katie Cheng, Joseph Alessio
- Outstanding External Contributors
- Claudio Costa, Martin McKeaveney, Lance Fuchia, Merlijn Vos, Tomer Nosrati, Swyx