Effective harnesses for long-running agents

As AI agents become more capable, developers are increasingly asking them to perform complex tasks that require hours or even days of work. However, getting agents to make consistent progress across multiple context windows remains an open problem.

The main challenge of long-running agents is that they must work across multiple sessions, and each new session starts with no memory of the previous one. Imagine a software project with engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. Because context windows are limited, and because most complex projects cannot be completed in a single window, agents need a way to bridge the gap between coding sessions.

We developed a two-part solution to enable the Claude Agent SDK to work effectively across multiple context windows: an initializer agent that sets up the environment in the first session, and a coding agent that makes incremental progress in each subsequent session while leaving clear artifacts for the next one. You can find code examples in the accompanying quickstart.

The long-running agent problem

The Claude Agent SDK is a powerful, general-purpose agent harness that is adept at coding as well as other tasks that require using tools to gather context, plan, and execute. It has context-management capabilities such as compaction, which enable an agent to work on a task without exhausting the context window. In theory, given this setup, an agent should be able to continue performing useful work for an arbitrarily long time.

However, compaction is not enough. Out of the box, even frontier coding models like Opus 4.5, running on the Claude Agent SDK in a loop across multiple context windows, will fall short of building a production-quality web app if they are only given a high-level prompt such as "Clone claude.ai."

Claude's failures appeared in two patterns. At first, the agent tried to do too much at once, essentially trying to one-shot the app. This often led to the model running out of context in the middle of an implementation, leaving the next session to start with half-finished and undocumented features. The next agent would then have to guess what had happened and spend considerable time getting the app working again. This happens even with compaction, which does not always leave perfectly clear instructions for the next agent.

The second failure mode often occurs later in a project. After some features have already been built, a later agent instance will look around, see that progress has been made, and prematurely declare the work complete.

This decomposes the problem into two parts. First, we need to establish an initial environment that lays the foundation for all the features required by a given prompt and prepares the agent to work step by step, feature by feature. Second, we must prompt each agent to make incremental progress toward the goal while leaving the environment in a clean state at the end of its session. By "clean state" we mean code that would be suitable for merging into the main branch: there are no major bugs, the code is organized and well documented, and, in general, a developer could start work on a new feature without first having to clean up an unrelated mess.

When experimenting internally, we addressed these issues using a two-part solution:

  1. Initializer agent: The very first agent session uses a special prompt that tells the model to set up the initial environment: an init.sh script, a claude-progress.txt file that keeps a log of actions taken by the agents, and an initial git commit that records which files were added.
  2. Coding agent: Each subsequent session tells the model to make incremental progress and then emit structured updates.1

The key insight was to find a way for agents to quickly understand the state of the work when starting from a fresh context window, which is accomplished with the claude-progress.txt file along with the git history. The inspiration for these practices came from observing what effective software engineers do every day.
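To make this concrete, below is a minimal sketch of what such a harness loop could look like. It assumes a hypothetical project layout and a run_agent_session helper that shells out to the Claude Code CLI in non-interactive (-p) mode; the same loop could instead be driven through the Claude Agent SDK. The prompt templates it reads are sketched later in this post.

    import subprocess
    from pathlib import Path

    PROJECT_DIR = Path("app")                            # working directory for the agents (assumed layout)
    PROGRESS_FILE = PROJECT_DIR / "claude-progress.txt"

    # Prompt templates; condensed sketches of both appear later in this post.
    INITIALIZER_PROMPT = Path("prompts/initializer.md").read_text()
    CODING_PROMPT = Path("prompts/coding.md").read_text()

    def run_agent_session(prompt: str) -> None:
        """Run one agent session in the project directory (hypothetical helper)."""
        subprocess.run(["claude", "-p", prompt], cwd=PROJECT_DIR, check=True)

    def main(max_sessions: int = 50) -> None:
        PROJECT_DIR.mkdir(exist_ok=True)
        for _ in range(max_sessions):
            if not PROGRESS_FILE.exists():
                # First session: set up init.sh, the feature list, the progress
                # file, and an initial git commit.
                run_agent_session(INITIALIZER_PROMPT)
            else:
                # Later sessions: pick one feature, implement and test it,
                # then commit and update the progress file.
                run_agent_session(CODING_PROMPT)

    if __name__ == "__main__":
        main()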

Environment setup

In the updated Claude 4 prompting guide, we've shared best practices for multi-context-window workflows, including a harness structure that uses "a separate prompt for the first context window." This separate prompt asks the initializer agent to set up the environment with all the context that future coding agents will need to work effectively. Below, we go deeper on some of the key components of such an environment.
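For concreteness, that first-window prompt might look something like the condensed sketch below. This is illustrative rather than the exact prompt used in our experiments; the file names match the ones described in this post.

    # Condensed, illustrative sketch of the initializer prompt (not the exact
    # prompt used in our experiments); {user_spec} is filled in by the harness.
    INITIALIZER_PROMPT = """
    You are setting up a new long-running software project based on the spec below.

    1. Write an init.sh script that installs dependencies and starts the development server.
    2. Expand the spec into features.json: a JSON list of end-to-end feature
       descriptions with test steps, each initially marked "passes": false.
    3. Create claude-progress.txt describing the project layout and how to run it.
    4. Initialize a git repository and make an initial commit of everything above.

    Do not implement product features in this session; only set up the environment.

    <spec>
    {user_spec}
    </spec>
    """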

Feature list

To address the agent's tendency to one-shot the app or declare a project prematurely complete, we prompted the initializer agent to expand the user's initial prompt into an extensive file of feature requirements. In the claude.ai clone example, this meant over 200 features, such as "a user can open a new chat, type a query, press enter, and see an AI response." All features were initially marked as failing, so that later coding agents would have a clear picture of what full functionality looked like.

{
    "category": "functional",
    "description": "New chat button creates a fresh conversation",
    "steps": [
      "Navigate to main interface",
      "Click the 'New Chat' button",
      "Verify a new conversation is created",
      "Check that chat area shows welcome state",
      "Verify conversation appears in sidebar"
    ],
    "passes": false
  }

We prompt coding agents to edit this file only by changing the value of the passes field, and we use strongly worded instructions such as "Removing or editing tests is unacceptable, as it may cause missing or degraded functionality." After some experimentation, we landed on JSON for this file, as JSON files are less likely than Markdown files to be inappropriately altered or overwritten by the model.
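Strongly worded instructions reduce, but do not eliminate, the risk of the model tampering with this file, so a harness can additionally verify after each session that only the passes fields changed. The check below is a hypothetical sketch that assumes the list lives in a git-tracked features.json file at the repository root.

    import json
    import subprocess

    def features_at(git_ref: str, path: str = "features.json") -> list[dict]:
        """Load the feature list as it existed at a given git ref (e.g. 'HEAD~1')."""
        raw = subprocess.run(
            ["git", "show", f"{git_ref}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(raw)

    def check_feature_list(before: list[dict], after: list[dict]) -> None:
        """Raise if features were added, removed, or edited beyond the 'passes' flag."""
        if len(before) != len(after):
            raise ValueError("feature list length changed: features were added or removed")
        for old, new in zip(before, after):
            old_rest = {k: v for k, v in old.items() if k != "passes"}
            new_rest = {k: v for k, v in new.items() if k != "passes"}
            if old_rest != new_rest:
                raise ValueError(f"feature description was edited: {old['description']!r}")

    # Compare the previous commit's feature list against the current working tree.
    with open("features.json") as f:
        check_feature_list(features_at("HEAD~1"), json.load(f))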

Incremental progress

Given this initial environment scaffolding, each coding agent was then asked to work on only one feature at a time. This incremental approach proved important for addressing the agent's tendency to do too much at once.

Even when working incrementally, the model still needs to leave the environment in a clean state after making code changes. In our experiments, the best way to elicit this behavior was to ask the model to commit its progress with descriptive commit messages and to write a summary of its progress to the progress file. This also allowed the model to use git to revert bad code changes and recover a working state of the codebase.

These practices also improved efficiency, since they eliminated the need for an agent to guess what had happened and spend time getting the app working again.
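A harness can also enforce part of this hygiene itself rather than relying on the prompt alone. The sketch below, which reuses the file names from above, checks after a session that the agent actually committed its work, and records a fallback commit and progress note if it did not.

    import subprocess
    from datetime import datetime, timezone
    from pathlib import Path

    def head_commit(cwd: str) -> str:
        """Return the current HEAD commit hash of the project repo."""
        return subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd, capture_output=True, text=True, check=True,
        ).stdout.strip()

    def ensure_session_artifacts(cwd: str, head_before: str) -> None:
        """If a session left uncommitted changes and no new commit, commit them
        and append a note so the next session starts from a recorded state."""
        dirty = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=cwd, capture_output=True, text=True, check=True,
        ).stdout.strip()
        if head_commit(cwd) == head_before and dirty:
            subprocess.run(["git", "add", "-A"], cwd=cwd, check=True)
            subprocess.run(
                ["git", "commit", "-m", "WIP: uncommitted changes at end of agent session"],
                cwd=cwd, check=True,
            )
            timestamp = datetime.now(timezone.utc).isoformat()
            with (Path(cwd) / "claude-progress.txt").open("a") as f:
                f.write(f"{timestamp}: session ended without a commit; harness committed WIP\n")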

Testing

A final major failure mode we observed was Claude's tendency to mark a feature as complete without proper testing. Absent clear instructions, Claude tended to make code changes and even test them with unit tests or curl commands against a development server, but would fail to recognize that the feature did not work end to end.

In the case of building a web app, Claude performed well at verifying features end to end once it was explicitly prompted to use browser automation tools and to perform all testing the way a human user would.

    Screenshots taken by Claude via the Puppeteer MCP server while testing the claude.ai clone.

Providing these testing tools to Claude dramatically improved performance, as the agent was able to identify and fix bugs that were not obvious from the code alone.
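How those browser tools are exposed depends on your setup. One possibility, sketched below, is for the harness to write a project-level MCP configuration that registers the reference Puppeteer MCP server; the .mcp.json file name, schema, and package name here are assumptions to verify against your version of Claude Code.

    import json
    from pathlib import Path

    # Assumed schema for a project-scoped .mcp.json registering the reference
    # Puppeteer MCP server; check the exact format against your CLI/SDK version.
    mcp_config = {
        "mcpServers": {
            "puppeteer": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-puppeteer"],
            }
        }
    }

    Path("app/.mcp.json").write_text(json.dumps(mcp_config, indent=2) + "\n")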

Some issues still remain: limitations of this approach and of the browser automation tools make it difficult to catch every type of bug. For example, Claude cannot see browser-native alert modals through the Puppeteer MCP server, so features that rely on these modals remained broken as a result.

Getting up to speed

Taking all of the above into account, each coding agent is prompted to go through a series of steps at the start of its session, some of which are quite basic but helpful nonetheless:

  1. Run pwd to see the directory you are working in. You will only be able to edit files in this directory.
  2. Read the git log and the progress file to get up to speed on what was recently worked on.
  3. Read the feature list file and select the highest-priority feature that has not yet been worked on.

It also helps to ask the initializer agent to write an init.sh script that starts the development server; this saves Claude time and tokens each session, since it does not have to figure out how to run the app. Each coding agent then runs a basic end-to-end test before implementing a new feature.

In the case of the claude.ai clone, this meant the agent always started the local development server and used the Puppeteer MCP server to start new chats, send messages, and receive responses. This ensured that Claude could immediately tell if the app had been left in a broken state and quickly fix any existing bugs. Had the agent started implementing a new feature instead, it could have made the problem worse.
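Putting these steps together, the per-session coding prompt can spell out this routine explicitly. The template below is a condensed, illustrative sketch rather than the exact prompt from our experiments; the file names match those used above.

    # Condensed, illustrative sketch of the per-session coding prompt.
    CODING_PROMPT = """
    You are continuing work on a long-running software project.

    1. Run `pwd` to see the directory you are working in. Only edit files in this directory.
    2. Read claude-progress.txt and `git log --oneline -20` to see what was recently done.
    3. Run ./init.sh to start the development server, then verify one or two core
       features end to end with the browser automation tools before writing any code.
    4. Read features.json and pick ONE feature with "passes": false to implement.
    5. Implement the feature and test it end to end like a human user would; only
       then set its "passes" field to true. Never remove or edit feature descriptions.
    6. Commit your work with a descriptive message and append a short summary of what
       you did (and anything left unfinished) to claude-progress.txt.
    """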

Given all this, a typical session starts with messages like the following:

[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use] 
[Tool Use] 
[Tool Use] 
[Assistant] Let me check the git log to see recent work.
[Tool Use] 
[Assistant] Now let me check if there's an init.sh script to restart the servers.

[Assistant] Excellent! Now let me navigate to the application and verify that some fundamental features are still working.

[Assistant] Based on my verification testing, I can see that the fundamental functionality is working well. The core chat features, theme switching, conversation loading, and error handling are all functioning correctly. Now let me review the tests.json file more comprehensively to understand what needs to be implemented next.

Agent failure modes and solutions

| Failure mode | Initializer agent behavior | Coding agent behavior |
| --- | --- | --- |
| Claude declares victory over the entire project too early. | Set up a feature list file: based on the input specification, create a structured JSON file with a list of end-to-end feature descriptions. | Read the feature list file at the beginning of the session and select one feature to work on. |
| Claude leaves the environment in a state with bugs or undocumented progress. | Initialize a git repo and write a progress notes file. | Start the session by reading the progress notes file and the git commit log, and run a basic test of the development server to catch any unnoticed bugs. End the session with a git commit and a progress update. |
| Claude marks features as done prematurely. | Set up a feature list file. | Verify each feature yourself, and mark features as "passed" only after careful testing. |
| Claude has to spend time figuring out how to run the app. | Write an init.sh script that starts the development server. | Start the session by running init.sh. |
A summary of four common failure modes and solutions in long-running AI agents.

Future work

This research demonstrates a potential set of solutions for long-running agent harnesses that enable models to make incremental progress across multiple context windows. However, open questions remain.

Most notably, it remains unclear whether a single, general-purpose coding agent performs best in all contexts, or whether better performance can be achieved with multi-agent architectures. It seems plausible that specialized agents, such as testing agents, quality assurance agents, or code cleanup agents, could do an even better job on subtasks of the software development lifecycle.

Additionally, this demo is optimized for full-stack web app development. A future direction is to generalize these findings to other domains. It is likely that some or all of these lessons could apply to other types of agentic tasks that run over long time horizons, for example scientific research or financial modeling.

Acknowledgements

Written by Justin Young. Special thanks to David Hershey, Prithvi Rajasekaran, Jeremy Hadfield, Naya Bouscal, Michael Tingley, Jesse Mu, Jake Eaton, Marius Bullendara, Maggie Vo, Pedram Navid, Nadine Yasser, and Alex Notov for their contributions.

This work reflects the collective efforts of multiple teams at Anthropic that made it possible for Claude to safely perform long-horizon autonomous software engineering, particularly the Code RL and Claude Code teams. Interested candidates who want to contribute are welcome to apply at anthropic.com/careers.

Footnotes

1. We refer to these as different agents here only because they are given different initial user prompts. The system prompts, set of tools, and overall agent harness were otherwise identical.


