
A code migration agent completes its run, and the pipeline looks green. But many pieces were never compiled, and it took days to catch them. This is not a model failure; it’s an agent declaring the work done before it actually is.
Many enterprises are now finding that production AI agent pipelines fail not because of the model’s capabilities, but because the model behind the agent decides to stop. Several methods to prevent premature task exit are now available from LangChain, Google, and OpenAI, although these often rely on different evaluation systems. The latest comes from Anthropic: /goal in Claude Code, which formally separates task execution from task evaluation.
Coding agents work in a loop: they read files, run commands, edit the code, and then check whether the task is completed.
Claude Code’s /goal essentially adds a second layer to that loop. After the user defines a goal, Claude continues to work in turns, but after each step an evaluator steps in to review the work and decide whether the goal has been achieved.
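In pseudocode, the pattern looks roughly like this. This is a minimal sketch, not vendor code; run_agent_step and evaluate_goal are placeholder stand-ins for the working model and the evaluator:

```python
def run_agent_step(goal, transcript):
    # placeholder for the working model's next turn: read files, run commands, edit code
    return f"step toward: {goal}"

def evaluate_goal(goal, transcript):
    # placeholder for the independent evaluator's binary verdict on the goal
    return len(transcript) >= 3

def run_to_goal(goal, max_steps=50):
    """Second-layer loop: the worker keeps taking turns; only the evaluator can declare done."""
    transcript = []
    for _ in range(max_steps):
        transcript.append(run_agent_step(goal, transcript))
        if evaluate_goal(goal, transcript):  # independent check after each step
            return True
    return False  # step budget exhausted without meeting the goal
```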
Two models, kept separate
Orchestration platforms from all three vendors have identified the same bottleneck, but their approaches differ. OpenAI leaves the loop alone and lets the model decide when it is finished, though users can attach their own evaluators. LangGraph and Google’s Agent Development Kit allow independent evaluation, but developers need to define the critic node, write the termination logic, and configure the observability themselves.
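For contrast, here is a hedged sketch of the wiring LangGraph expects developers to build themselves: a critic node, explicit termination logic, and the edges between them. The worker and critic bodies are placeholders, not production code:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    attempts: int
    done: bool

def worker(state: AgentState) -> AgentState:
    # the builder model would edit code and run commands here
    return {**state, "attempts": state["attempts"] + 1}

def critic(state: AgentState) -> AgentState:
    # the judge model would verify the end state here; placeholder check
    return {**state, "done": state["attempts"] >= 3}

graph = StateGraph(AgentState)
graph.add_node("worker", worker)
graph.add_node("critic", critic)
graph.set_entry_point("worker")
graph.add_edge("worker", "critic")
graph.add_conditional_edges(
    "critic",
    lambda s: "finish" if s["done"] else "retry",  # termination logic lives with the critic
    {"finish": END, "retry": "worker"},
)
app = graph.compile()
result = app.invoke({"task": "fix the auth tests", "attempts": 0, "done": False})
```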
Claude Code’s /goal makes the independent evaluator the default, whether the user wants the agent to run for a long or a short stretch. The developer sets the conditions for completing the goal through a prompt, for example: all tests in test/auth pass and the lint step is clean. Claude Code then runs, and each time the agent tries to finish its work, the evaluation model, Haiku by default, checks the result against those conditions. If the conditions are not met, the agent keeps running. If they are met, it logs the satisfied conditions in the agent’s conversation transcript and clears the goal. The evaluator makes only a binary decision, complete or not, which is why the small Haiku model works well.
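Anthropic has not published the evaluator’s internals, but the behavior described maps to a simple pattern: hand a small model the goal conditions and the transcript, and ask for a binary verdict. A hedged sketch using the Anthropic Python SDK (the model alias and prompt wording are assumptions, not Claude Code’s implementation):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def goal_met(conditions: str, transcript: str) -> bool:
    """Ask a lightweight evaluator model whether the stated conditions are satisfied."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed evaluator model alias
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                "Goal conditions:\n" + conditions
                + "\n\nAgent transcript:\n" + transcript
                + "\n\nReply with exactly COMPLETE or INCOMPLETE."
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("COMPLETE")
```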
Claude Code makes this possible by separating the model that attempts to complete a task from the evaluator model that confirms the task is actually complete. This keeps the agent from conflating what it has already finished with what still needs to be done. With this approach, Anthropic said, there is no need for a third-party observability platform (although enterprises are free to keep using one alongside Claude Code), no need for custom logs, and less reliance on post-mortem reconstruction.
Competitors such as Google ADK support similar evaluation patterns; Google ADK ships a LoopAgent, but developers must architect that logic themselves.
In its documentation, Anthropic states that the most successful use cases usually have:
- A measurable end state: a test result, a build exit code, a file count, an empty queue
- A stated check: how Claude should prove it, such as “npm test exits 0” or “git status is clean”
- Constraints that matter: anything that must not change along the way, such as “no other test files are modified”
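Those three criteria translate naturally into machine-checkable conditions. A short illustrative sketch (the commands are examples, not Anthropic’s syntax):

```python
import subprocess

def exits_zero(cmd: list[str]) -> bool:
    """A stated check: prove completion by running a command and inspecting its exit code."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def goal_satisfied() -> bool:
    tests_pass = exits_zero(["npm", "test"])  # measurable end state: the suite exits 0
    tree_clean = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True
    ).stdout.strip() == ""  # stated check: "git status is clean"
    tests_untouched = exits_zero(
        ["git", "diff", "--quiet", "HEAD~1", "--", "test/"]
    )  # constraint: no test files were modified along the way
    return tests_pass and tree_clean and tests_untouched
```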
Reliability in the loop
For enterprises already managing sprawling tool stacks, the appeal is a built-in evaluator that doesn’t add another system to maintain.
This is part of a broader trend in the agentic field, particularly as stateful, long-lived, self-learning agents become more of a reality. Evaluator models, verification systems, and other independent decision layers are beginning to appear in reasoning systems and, in some cases, in coding agents such as Devin or SWE-agent.
Sean Brownell, solutions director at Sprinklr, told VentureBeat in an email that there is interest in this kind of loop, where the task and the judge are separated, but that he sees nothing unique about Anthropic’s approach.
"Yes, the loop works. Separating the builder from the judge is perfect design, because fundamentally, you can’t trust a model to judge its homework. The working model is the worst judge of whether the job is done or not." Brownell said. "That being said, Anthropic isn’t the first company to come to market. The most interesting story here is that two of the largest AI labs in the world sent the same command just a few days apart, but each of them came to completely different conclusions about who would declare ‘done’."
Brownell said the loop works best "for deterministic work with verifiable end-states such as migrations, fixing broken test suites, clearing the backlog," but for more nuanced tasks, or tasks that require a design decision, it’s much more important to have a human making that call.
Bringing that evaluator/task split down to the agent-loop level shows how companies like Anthropic are moving agents and orchestration toward more auditable, observable systems.