Scaling long-running autonomous coding

We’ve spent the past several weeks experimenting with long-running autonomous coding agents.

Our goal is to understand how far we can push the limits of agentic coding for projects that would typically take human teams months to complete.

This post explains what we’ve learned from running hundreds of concurrent agents on a single project: coordinating their work and watching them write a million lines of code while burning through trillions of tokens.

Limitations of a single agent

Today’s agents work well for focused tasks, but are slow for complex projects. The natural next step is to run multiple agents in parallel, but figuring out how to coordinate them is challenging.

Our first instinct was that planning everything up front would be very difficult: the path of a large project, and the right division of labor, is rarely clear at the outset. So we started with dynamic coordination, where agents decide what to do based on what others are currently doing.

Learning to coordinate

Our initial approach gave agents equal status and let them self-coordinate through a shared file. Each agent would check what the others were doing, claim a task, and update its status. To prevent two agents from grabbing the same task, we used a locking mechanism.
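A minimal sketch of this scheme, assuming a JSON task list guarded by a lock file on a shared filesystem (the file names, schema, and `claim_next_task` helper are illustrative, not our actual harness):

```python
import fcntl
import json
from pathlib import Path

TASKS = Path("coordination/tasks.json")  # illustrative shared task file
LOCK = Path("coordination/tasks.lock")   # illustrative lock file

def claim_next_task(agent_id: str) -> dict | None:
    """Claim the first open task while holding an exclusive file lock."""
    with open(LOCK, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # block until the lock is ours (Unix-only)
        try:
            state = json.loads(TASKS.read_text())
            for task in state["tasks"]:
                if task["status"] == "open":
                    task["status"] = "claimed"
                    task["owner"] = agent_id
                    TASKS.write_text(json.dumps(state, indent=2))
                    return task
            return None  # nothing left to claim
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)  # release even if the agent errors
```

Correctness here hinges on every agent releasing the lock promptly, which is exactly where things broke down.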

It failed in interesting ways:

  1. Agents held locks for too long, or forgot to release them altogether. Even when locking worked correctly, it was still a hindrance: twenty agents would slow to the effective throughput of two or three, with most of their time spent waiting.
  2. The system was brittle: agents could crash while holding a lock, try to acquire a lock that was already held, or update the coordination file without taking the lock at all.

We tried replacing locks with optimistic concurrency control. Agents could read state freely, but a write failed if the state had changed since it was read. This was simpler and more robust, but deeper problems remained.
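In sketch form, assuming a version counter on the shared state (illustrative; a real implementation would also need the check-and-write itself to be atomic, for example via an atomic rename or a database compare-and-swap):

```python
import json
from pathlib import Path

STATE = Path("coordination/state.json")  # illustrative shared state file

class StaleWrite(Exception):
    """The state changed since it was read; re-read and retry."""

def read_state() -> tuple[dict, int]:
    state = json.loads(STATE.read_text())
    return state, state["version"]

def write_state(new_state: dict, expected_version: int) -> None:
    """Succeed only if nobody has written since our read."""
    current = json.loads(STATE.read_text())
    if current["version"] != expected_version:
        raise StaleWrite  # optimistic failure: no locks held, caller just retries
    new_state["version"] = expected_version + 1
    STATE.write_text(json.dumps(new_state, indent=2))
```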

With no hierarchy, agents became risk-averse. They avoided difficult tasks and instead made small, safe changes. No agent took responsibility for hard problems or end-to-end implementation, so the project churned for long stretches without real progress.

Planners and workers

Our next approach was to separate the roles. Instead of a flat structure where every agent does everything, we created a pipeline with different responsibilities.

  • Planners continually explore the codebase and create tasks. They can spawn sub-planners for specific areas, making planning itself parallel and iterative.
  • Workers pick up tasks and focus entirely on completing them. They don’t coordinate with other workers or worry about the bigger picture; they work on their assigned task until it’s done, then commit their changes.

At the end of each cycle, a judge agent decides whether to continue, and the next iteration starts fresh. This solved most of our coordination problems and let very large projects move forward without any single agent developing tunnel vision.
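One cycle of the pipeline, in outline (the agent calls are stubs; the structure is the point, not the API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub agent calls for illustration; in the real system each would drive a
# full coding agent against the repository.
def plan(repo: str) -> list[str]:
    return []        # planners explore the codebase and emit concrete tasks

def work(repo: str, task: str) -> None:
    pass             # a worker runs one task to completion, then commits

def judge(repo: str) -> bool:
    return False     # the judge decides whether another cycle is worthwhile

def run_iteration(repo: str, num_workers: int = 100) -> bool:
    """One planner -> workers -> judge cycle; True means iterate again."""
    tasks = plan(repo)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Workers run independently: no cross-coordination, one task each.
        list(pool.map(lambda t: work(repo, t), tasks))
    return judge(repo)

while run_iteration("path/to/repo"):
    pass  # each cycle starts fresh, which limits tunnel vision
```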

Running for weeks

To test this system, we pointed it toward an ambitious goal: building a web browser from the ground up. The agents worked for about a week and wrote more than 1 million lines of code in 1,000 files. You can explore the source code on GitHub.

Regardless of the codebase’s size, new agents can still understand it and make meaningful progress. Hundreds of agents run concurrently on the same branch with minimal conflict.

A screenshot can make the result look simple, but building a browser from scratch is extremely difficult.

Another experiment was an in-place migration of the Cursor codebase from Solid to React. The resulting +266K/-193K diff took over three weeks. As we’ve begun testing the changes, we’re increasingly confident the migration can be merged.

Another experiment improved an upcoming product: a long-running agent made video rendering 25x faster by replacing it with an efficient Rust implementation. It also added support for smooth zooming and panning, with natural spring transitions and motion blur while following the cursor. This code has been merged and will ship soon.
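The "natural spring" behavior is the standard critically damped spring driving the camera toward its target; a minimal sketch of the technique (our reconstruction, not the merged Rust code):

```python
# Critically damped spring for smooth zoom/pan toward a target; a sketch of
# the general technique, not the merged implementation.
def spring_step(pos: float, vel: float, target: float,
                stiffness: float = 120.0, dt: float = 1 / 60) -> tuple[float, float]:
    damping = 2.0 * stiffness ** 0.5  # critical damping: fast settle, no overshoot
    accel = stiffness * (target - pos) - damping * vel
    vel += accel * dt
    pos += vel * dt
    return pos, vel
```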

Several other interesting runs are still in progress:

  • Java LSP: 7.4K commits, 550K LoC
  • Windows 7 Emulator: 14.6K commits, 1.2M LoC
  • Excel: 12K commits, 1.6M LoC
  • FX1: 9.5K commits, 1.2M LoC

What we have learned

We have spent billions of tokens on agents working toward the same goal. The system is far from perfectly efficient, but it is more effective than we expected.

Model choice matters for extremely long-running tasks. We found that GPT-5.2 models are much better at extended autonomous work: following instructions, staying focused, avoiding drift, and executing tasks accurately and completely.

Opus 4.5, by contrast, tends to stop earlier and take shortcuts when convenient, optimizing for handing control back quickly. We also found that different models excel in different roles: GPT-5.2 is a better planner than GPT-5.1-Codex, even though the latter is trained specifically for coding. We now use the best-fit model for each role instead of a single universal model.
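In practice this amounts to a small role-to-model map; the planner pick below restates the observation above, while the worker and judge picks are assumptions for illustration, not our exact configuration:

```python
# Illustrative role-to-model assignment. Only the planner entry is stated
# above; the worker and judge entries are assumptions for illustration.
MODEL_FOR_ROLE = {
    "planner": "gpt-5.2",       # better long-horizon planning and focus
    "worker": "gpt-5.1-codex",  # assumption: coding-tuned model for execution
    "judge": "gpt-5.2",         # assumption: instruction-following, resists drift
}

def model_for(role: str) -> str:
    return MODEL_FOR_ROLE[role]
```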

Many of our improvements came from removing complexity rather than adding it. We initially added an integrator role for quality control and conflict resolution, but it created more problems than it solved: workers were already able to resolve conflicts themselves.

The best systems are often simpler than you expect. We initially borrowed patterns from distributed computing and organizational design, but not all of them translate to agents.

The right amount of structure is somewhere in the middle. Too little structure and agents conflict, duplicate work, and drift; too much structure creates fragility.

A surprising amount of the system’s behavior depends on how we prompt the agents. The prompts took extensive experimentation to get agents to coordinate well, avoid pathological behaviors, and stay focused over long horizons. Harness and model matter, but the prompts matter more.
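For instance, a worker prompt has to pin down scope and stopping behavior explicitly; something in this spirit (our illustration, not the actual prompt):

```python
# An illustrative worker prompt; the real prompts took extensive iteration.
WORKER_PROMPT = """\
You are one of many workers on this codebase. You have exactly one task:

{task}

Work only on this task. Do not refactor unrelated code, do not pick up other
tasks, and do not coordinate with other agents. When the task is fully
complete and the tests pass, commit your changes and stop.
"""
```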

What’s next

Multi-agent coordination remains a difficult problem. Our current system works, but we are still nowhere near optimal. Planners need to wake up at the right moments to plan next steps as work completes. Agents sometimes run for too long. We still need occasional fresh starts to counter drift and tunnel vision.

But the answer to our original question, whether we can scale autonomous coding by putting more agents on a problem, is more optimistic than we expected. Hundreds of agents can work together in the same codebase for weeks, making real progress on ambitious projects.

The techniques we develop here will ultimately inform Cursor’s agent capabilities. If you’re interested in working on the hardest problems in AI-assisted software development, we’d love to hear from you at hiring@cursor.com.


