Introducing FrontierCode | Cognition

By Eric Lu, Ben Pan, Deniz Birlikci, Sam Lee, Ray Wang, Rohan Choudhury, Fermi Ma, TC Qin, Carlo Baronio, Silas Alberti, and more
06.08.26

Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question that we should be asking is: can models actually write good code?

We’re excited to introduce FrontierCode, a benchmark that measures how well models can truly meet the standards of high-quality production codebases. What sets us apart:

Our benchmark provides the strongest available signal of a model’s ability to write high-quality, maintainable code. We find that even today’s most capable models struggle on this new standard.

20+ world-class open-source maintainers, with 40 hours of effort per task

Every task manually reviewed by Cognition researchers

81% lower false positive rate compared to SWE-Bench Pro

First-ever benchmark measuring code quality and subtle human preferences

We present three nested subsets of FrontierCode at increasing difficulty: Extended, Main, and Diamond. Diamond comprises the 50 hardest tasks, Main the 100 hardest (including Diamond), and Extended the full set of 150.

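We report two metrics, pass rate and score. A solution passes when it satisfies every blocking criterion; its score is the weighted aggregate of the rubric items it passes, or zero if any blocker fails.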

Each model is run 5 times at every available reasoning effort. For each effort, we average the metric across the 5 trials, then report each model’s score at its best performing reasoning level.
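
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `TrialsByEffort` container and the function names are hypothetical, and the metric values are assumed to be fractions in [0, 1].

```cpp
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container: reasoning-effort name -> metric value from each trial.
using TrialsByEffort = std::map<std::string, std::vector<double>>;

// Average one effort level's trials (5 per effort in the setup described above).
double mean_over_trials(const std::vector<double>& trials) {
    assert(!trials.empty());
    return std::accumulate(trials.begin(), trials.end(), 0.0) /
           static_cast<double>(trials.size());
}

// Return the (effort, mean) pair with the highest per-effort mean,
// i.e. the number that would be reported for the model.
std::pair<std::string, double> best_effort(const TrialsByEffort& results) {
    assert(!results.empty());
    auto it = results.begin();
    std::pair<std::string, double> best(it->first, mean_over_trials(it->second));
    for (++it; it != results.end(); ++it) {
        const double avg = mean_over_trials(it->second);
        if (avg > best.second) {
            best = std::make_pair(it->first, avg);
        }
    }
    return best;
}
```

Averaging within an effort level first, and only then taking the maximum, means a single lucky trial cannot set the reported number on its own.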

FrontierCode Diamond remains unsaturated: the best performing model, Claude Opus 4.8, achieves a score of only 13.4%. Other models score significantly lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less. However, GPT-5.5 consistently uses up to 4x fewer tokens than Opus 4.8, achieving a better cost-intelligence tradeoff.

On FrontierCode Main and Extended, Opus 4.8 still maintains a clear lead, at 34.3% and 51.8%, respectively. We also observe a large gap between open-source models and the frontier. Kimi K2.6, the best-performing open-source model, achieves just 3.8% on Diamond, 16% on Main and 37% on Extended.

The rest of this post will be a deep dive into why and how we built FrontierCode.

The first generation of coding benchmarks, such as SWE-Bench Verified and Pro, was designed for less capable models. These benchmarks fall short on many measures of realism and robustness.

Fundamentally, they only test functional correctness, not quality. Moreover, these benchmarks are prone to misclassification errors. Experiments from METR have found that high-scoring models on these benchmarks often produce patches that wouldn’t be accepted by human maintainers.

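How do we define misclassifications? These fall under two categories: false positives, where an incorrect or incomplete solution passes, and false negatives, where a valid solution fails.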

Trajectory false positive and false negative rates by benchmark

We show through analysis of agent trajectories that FrontierCode produces 81% fewer misclassification errors than other leading benchmarks. This means that FrontierCode scores are the most accurate ranking currently available.
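
As a rough illustration of how such rates could be computed from reviewed trajectories, here is a minimal sketch. The `Trajectory` fields and the choice of denominators (false positives over all unacceptable trajectories, false negatives over all acceptable ones) are assumptions made for this example, not the definitions used in the analysis above.

```cpp
#include <cassert>
#include <vector>

// One reviewed agent trajectory. Both fields are hypothetical names.
struct Trajectory {
    bool grader_passed;     // verdict from the benchmark's automated grading
    bool truly_acceptable;  // ground-truth label from manual review
};

struct ErrorRates {
    double false_positive;  // unacceptable trajectories that the grader passed
    double false_negative;  // acceptable trajectories that the grader failed
};

ErrorRates misclassification_rates(const std::vector<Trajectory>& trajectories) {
    int acceptable = 0, unacceptable = 0, fp = 0, fn = 0;
    for (const Trajectory& t : trajectories) {
        if (t.truly_acceptable) {
            ++acceptable;
            if (!t.grader_passed) ++fn;
        } else {
            ++unacceptable;
            if (t.grader_passed) ++fp;
        }
    }
    ErrorRates rates;
    // Guard against empty classes so the rates stay well defined.
    rates.false_positive = unacceptable ? static_cast<double>(fp) / unacceptable : 0.0;
    rates.false_negative = acceptable ? static_cast<double>(fn) / acceptable : 0.0;
    return rates;
}
```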

Existing benchmarks also suffer from a lack of diversity in several ways.

While other benchmarks generated issues from single PRs via programmatic scraping, FrontierCode is hand-selected by repo maintainers from multi-PR chains and freeform requests. We also triple the number of represented languages relative to SWE-Bench Pro.

Language composition by benchmark, normalized for task count

It’s also known that existing benchmarks provide too much guidance in the form of overly specified and detailed prompts. Today’s frontier models need far less hand-holding. FrontierCode expects the agent to infer the maintainer’s intent, given the same context as a human contributor.

Our prompts contain two parts: the task description, and the codebase guidelines for generic testing, lint, and style practices, just like those found in AGENTS.md. The task descriptions are humanlike and deliberately concise, a third the length of SWE-Bench Pro’s.

FrontierCode prompt length distribution
Example prompts from each benchmark, shown at the same scale.

Furthermore, we’ve chosen to scale the difficulty of tasks using quality rubrics, rather than simply increasing patch size. Despite having smaller patches than benchmarks like DeepSWE, FrontierCode is harder for agents to solve.

FrontierCode patch size distribution

To produce an evaluation for code quality as ambitious as FrontierCode, we had to embed quality into every step of the benchmark creation process.

A Team of Open Source Maintainers

FrontierCode aims to measure whether models can produce code that would be merged into production codebases. To ensure this, we collaborated directly with the maintainers of 36 flagship open-source repositories. This team of all-star experts has collectively reviewed and merged thousands of commits to their codebases. They can apply deep stylistic and design knowledge to every PR they see.

Each maintainer invested more than 40 hours per task, undergoing multiple rounds of iteration with other eval engineers and Cognition researchers. They’ve distilled their judgment into concrete evaluation criteria: any PR that satisfies these standards would actually be approved.

Here’s what they say about FrontierCode:

“Working with the team behind FrontierCode was a privilege. Taking on the AI evaluation problem felt like nothing less than an art… Where others grade like a CI, FrontierCode grades like a tech lead.”

Tomer Nosrati, CEO and Tech Lead of Celery (28.6k stars)

“What sets FrontierCode apart is the attention to detail. Each task is calibrated to a depth that simply hasn’t been seen before in LLM benchmarking. We should be moving away from benchmarks that can be gamed and instead using ones like FrontierCode to demonstrate genuine model intelligence and creativity.”

Martin McKeaveney, Co-Founder and CTO of Budibase (28k stars)

“I’m grateful to have worked with leading experts in the Open Source community. We had deep discussions on correctness versus quality and what mergeability means in the context of their repository. FrontierCode is a milestone for AI models respecting subjective quality in the real world.”

Merlijn Vos, Core Maintainer of uppy (30.8k stars)

“FrontierCode’s unique value comes from the human experience encoded in its evals: years of judgment about what makes code high-quality and worthy of merging. The almost obsessive care brought to every criterion is why I believe this benchmark sets a new bar for SWE evaluation.”

Claudio Costa, Core Maintainer of Mattermost (37k stars)

Beyond Unit Tests

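FrontierCode measures mergeability by evaluating code along several axes: behavioral correctness, mechanical cleanliness and regression safety, test correctness, scope, and code quality.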

The following table describes how we use both classical unit tests and novel methods, such as adaptive classical grading, scope, and reverse-classical tests (more on these methods below) to evaluate these criteria.

| Category | Method | How it works | Passes when |
| --- | --- | --- | --- |
| Behavioral correctness | classical | Injects test files into the repository, runs them, then cleans up. | All injected tests pass |
| Mechanical cleanliness, regression safety | command | Runs a shell command. | Exit code 0 |
| Test correctness | reverse-classical | Runs agent’s submitted tests against the base commit. | The tests fail |
| Behavioral correctness for complex tasks | adaptive classical grading | Uses an LLM to adapt reference tests or application code to align with the implementation. | Adapted tests pass |
| Scope | scope | Checks file boundaries, diff size constraints, and optionally semantic locality of changes. | Diff within constraints |
| Code quality | prompt | An LLM reviews agent’s diff against a natural-language prompt. | LLM score meets threshold |

Each criterion is either a blocker or a non-blocker:

Blockers represent mergeability requirements, i.e., criteria that a maintainer would consider hard stops during code review. These include correctness checks, as well as non-correctness concerns like performance or scope restrictions.

Non-blockers represent quality signals such as code style, type safety, and readability, which would not necessarily block a merge.

If a solution satisfies all the blockers, it is considered passing, and its score is the weighted aggregate of all the rubric items it passes. Otherwise it receives a score of zero.
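
The blocker rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the `CriterionResult` type is hypothetical, blockers are assumed to carry weight like any other rubric item, and the score is assumed to be normalized by the total weight so that it lands in [0, 1].

```cpp
#include <cassert>
#include <string>
#include <vector>

// Outcome of one rubric criterion for one solution (hypothetical type).
struct CriterionResult {
    std::string name;
    bool blocker;   // true if a maintainer would treat a failure as a hard stop
    double weight;  // relative weight assigned by the maintainer
    bool passed;
};

// A solution passes only if every blocker passes.
bool passes_all_blockers(const std::vector<CriterionResult>& rubric) {
    for (const CriterionResult& c : rubric) {
        if (c.blocker && !c.passed) return false;
    }
    return true;
}

// Zero if any blocker fails; otherwise the weighted share of rubric items passed.
// Normalizing by total weight is an assumption of this sketch.
double score(const std::vector<CriterionResult>& rubric) {
    if (!passes_all_blockers(rubric)) return 0.0;
    double total = 0.0, earned = 0.0;
    for (const CriterionResult& c : rubric) {
        assert(c.weight >= 0.0);
        total += c.weight;
        if (c.passed) earned += c.weight;
    }
    return total > 0.0 ? earned / total : 0.0;
}
```

Under this scheme a solution that fails a single blocker scores zero no matter how many quality criteria it satisfies, which is what separates the score from a plain weighted checklist.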

Novel Grading Methods

We’ve introduced three main techniques to strengthen criteria against misclassifications, while allowing space for multiple valid solutions:

Reverse-Classical: The reverse-classical criterion is a way to ensure that agent-written tests are meaningful: when we run them on the original, broken codebase, they must fail. This gives us an automated, deterministic check that the agent understood the problem well enough to write an effective test for it.

Code Scope: A good PR should exercise restraint: it modifies only what it needs to, without touching unrelated files or introducing unnecessary refactors. The scope criterion is an automated check that enforces these boundaries. It combines three types of constraints: file boundaries, diff size limits, and, optionally, the semantic locality of changes.

Adaptive Classical Grading: Open-ended coding tasks can have many valid solutions. Static unit tests are too rigid; good solutions can fail for superficial differences like function names or error wording. We resolve this conflict with mutagent, a tool we built that uses an LLM to surgically patch the test environment (or the application code) and align with the agent’s implementation details, allowing us to run rigorous, deterministic tests on open-ended solutions.
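
The two deterministic checks above, reverse-classical and scope, could look roughly like this. This is a sketch only: the `TestRunner` callback, the `"base"` checkout label, and the `FileChange` and `ScopeLimits` types are hypothetical, and the optional semantic-locality check is left out because it is not deterministic.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical hook: runs the agent's submitted tests against the named
// checkout and returns the process exit code (0 means all tests passed).
using TestRunner = std::function<int(const std::string& checkout)>;

// Reverse-classical: the agent's tests must FAIL on the original base commit.
bool reverse_classical_passes(const TestRunner& run_agent_tests) {
    return run_agent_tests("base") != 0;
}

// One file touched by the agent's diff (hypothetical type).
struct FileChange {
    std::string path;
    int lines_added;
    int lines_removed;
};

// Assumed shape of the scope constraints: allowed path prefixes plus a size cap.
struct ScopeLimits {
    std::vector<std::string> allowed_prefixes;
    int max_changed_lines;
};

bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Scope: every touched file sits inside an allowed prefix and the
// total number of changed lines stays within the limit.
bool scope_passes(const std::vector<FileChange>& diff, const ScopeLimits& limits) {
    int changed = 0;
    for (const FileChange& file : diff) {
        bool allowed = false;
        for (const std::string& prefix : limits.allowed_prefixes) {
            if (starts_with(file.path, prefix)) {
                allowed = true;
                break;
            }
        }
        if (!allowed) return false;
        changed += file.lines_added + file.lines_removed;
    }
    return changed <= limits.max_changed_lines;
}
```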

Example Task

Andrew He (ecnerwala) is the second highest rated US competitor on Codeforces, two-time IOI gold medalist, a founding engineer at Cognition and our resident C++ expert. He personally reviewed the models’ behavior on this task.

This task is based on the jsonschema repo, which is written in C++. It requires implementing a new function auto LOG_WARNING() -> std::ostream & that must be used everywhere the codebase prints the warning: prefix. The helper should prefix log messages with warning:, print to stderr, and ignore the --verbose flag.

The task seems simple: a passing solution just has to identify all places in the given codebase that print warning: and replace them with a call to a newly implemented LOG_WARNING() function. However, models fail this task in a somewhat surprising way. One of the blocking criteria requires that multi-line warning messages idiomatically call LOG_WARNING, like so:

```cpp
LOG_WARNING() << "You are opting in to remove schema identifiers... \n"
              << "The only legit use case...\n"
              << "non-compliant...\n" << ... ;
```

Idiomatic multi-line LOG_WARNING usage

Claude Opus 4.8, on the other hand, consistently opts for the following implementation:

```cpp
LOG_WARNING() << "You are opting in to remove schema identifiers...\n";
std::cerr << "The only legit use case...\n";
std::cerr << "non-compliant...\n";
```

Claude Opus 4.8 mixed LOG_WARNING and std::cerr usage

These two are behaviorally the same; in both cases a multi-line warning message will be printed to stderr. However, the agent solution bakes in the assumption at the call site that LOG_WARNING() and std::cerr are the same stream, an assumption that would break if a future modification of LOG_WARNING() changed where it writes.
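
To make the fragility concrete, here is a small self-contained sketch. It is not the jsonschema implementation: the `warning_sink` pointer, the two `warn_*` helpers, the `"warning: "` prefix with a trailing space, and the placeholder messages are all assumptions introduced for illustration. Only the `LOG_WARNING()` signature comes from the task.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical: the stream LOG_WARNING() writes to. It is stderr today,
// but a later change could point it at a log file or a buffer.
std::ostream* warning_sink = &std::cerr;

// Signature as given in the task; the body is an assumed implementation.
auto LOG_WARNING() -> std::ostream& {
    *warning_sink << "warning: ";
    return *warning_sink;
}

// Idiomatic: the whole message is chained onto the stream LOG_WARNING() returns.
void warn_idiomatic() {
    LOG_WARNING() << "first line\n"
                  << "second line\n";
}

// Mixed: later lines go straight to std::cerr, assuming it is the same stream.
void warn_mixed() {
    LOG_WARNING() << "first line\n";
    std::cerr << "second line\n";
}
```

If `warning_sink` is pointed at a `std::ostringstream`, `warn_idiomatic()` delivers both lines to the new sink, while `warn_mixed()` delivers only the first and leaks the second to stderr.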

Quality Control

How do we iterate on rubric quality?

Improving binary verifiers like unit tests is relatively tractable because every solution falls into one of two buckets — correct or incorrect. You can examine each rollout, check its bucket, and strengthen the tests accordingly.

Hardening prompt-based criteria is a much harder QC problem. Rubrics introduce a spectrum of correctness: two solutions for the same task can both be functionally correct yet score differently on every criterion. We can no longer look at solutions in isolation. We have to compare within a group of solutions and verify that their relative scores actually separate better solutions from worse ones.

Rubric design is also inherently subjective and requires domain expertise. For each criterion, the maintainer must decide whether it’s a blocker or non-blocker, assign its weight relative to other criteria, and ensure complete coverage so that models cannot exploit gaps in the rubric.

Our rubric creation process

Rubric hardening pipeline
  1. Design

    We prefer classical tests for things that can be checked deterministically, such as correctness. For complex tasks, we favor behavioral tests that are robust to superficial differences in implementation details.

    For soft qualities, we prefer LLM grading. This is better for assessing, say, idiomatic code, readability, or adherence to a preferred architectural pattern.

    Based on these principles, we first ask the task creator to manually audit each rubric item and document its rationale.

  2. Hack report

    To prevent false positives, the task author imitates a lazy or adversarial programmer and tries to get a passing score with a deliberately incorrect or incomplete solution. This exposes criteria that can be improved.

    To prevent false negatives, the task author tries to write a perfectly valid, alternative solution that is different from the canonical one. If this solution fails the evaluation, the rubric is too rigid.

    We augment the hack report process by also asking Devin to come up with novel ways to hack the rubric.

  3. Rubric calibration

    To ensure that the rubric has sufficient resolution, the author must write four distinct solutions that target a range of scores from 0 to 100%.

  4. Review

    Each contributor belongs to an eval pod led by an experienced pod lead, who acts as the first quality gate. The lead reviews the full eval candidate and iterates with the contributor through multiple rounds. Once the eval candidate passes all pod-level checks, a Cognition researcher conducts a final review along with the pod lead and contributor. For a random subset, researchers also solve the tasks themselves to verify that instructions are clear and grading is fair.

  5. Re-Review

    At any stage, reviewers can send the task back for revision. Most tasks cycle through multiple iterations before passing.
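
The rubric calibration step above lends itself to a mechanical sanity check. The sketch below is one possible form, not the pipeline's actual tooling: the `CalibrationSolution` type and both function names are hypothetical, and scores are assumed to be fractions in [0, 1].

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One author-written calibration solution (hypothetical type).
struct CalibrationSolution {
    std::string label;
    double target;  // score the author intended this solution to earn
    double actual;  // score the rubric actually assigned
};

// True when the rubric ranks the solutions in the same strict order as the
// author's targets, so no pair of solutions is tied or inverted.
bool ranking_preserved(std::vector<CalibrationSolution> solutions) {
    std::sort(solutions.begin(), solutions.end(),
              [](const CalibrationSolution& a, const CalibrationSolution& b) {
                  return a.target < b.target;
              });
    for (std::size_t i = 1; i < solutions.size(); ++i) {
        if (solutions[i].actual <= solutions[i - 1].actual) return false;
    }
    return true;
}

// Gap between the highest and lowest actual score: a crude measure of
// how much of the 0 to 100% range the rubric can resolve.
double score_spread(const std::vector<CalibrationSolution>& solutions) {
    assert(!solutions.empty());
    auto by_actual = [](const CalibrationSolution& a, const CalibrationSolution& b) {
        return a.actual < b.actual;
    };
    auto range = std::minmax_element(solutions.begin(), solutions.end(), by_actual);
    return range.second->actual - range.first->actual;
}
```

A rubric that compresses all four solutions into a narrow band, or that swaps two of them, would fail one of these checks and go back for another round of hardening.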

The result of this extensive process is a suite of durable, difficult tasks that reflect the high standards of the world’s top open-source repositories.

FrontierCode is the benchmark for the next generation of coding agents. We are confident that developers, enterprises, and researchers can trust it to evaluate the production readiness of their strongest models. To avoid contamination, we don’t currently plan to release the tasks publicly, but we are opening up our evaluation to all model creators, in the hope that we can push the frontier even further in the coming months.

Acknowledgments

FrontierCode is the product of close collaboration across research, design, and a community of practitioners who lent their expertise to vet tasks and shape the rubric. Thank you to everyone listed below.

Research
Eric Lu, Ben Pan, Deniz Birlikci, Sam Lee, Ray Wang, Rohan Choudhury, Fermi Ma, TC Qin, Carlo Baronio, Silas Alberti
Design
Katie Cheng, Joseph Alessio

Outstanding External Contributors
Claudio Costa, Martin McKeaveney, Lance Fuchia, Merlijn Vos, Tomer Nosrati, Swyx


