An update on recent Claude Code quality reports \ Anthropic

Over the past month, we’ve seen reports that Claude’s responses have degraded for some users. We identified three separate changes behind these reports, affecting Claude Code, the Claude Agent SDK, and Claude Cowork. The API was not affected.

All three issues have now been resolved as of April 20 (v2.1.116).

In this post, we explain what we found, what we fixed, and what we would do differently to make sure similar problems are less likely to happen again.

We take reports about degradation very seriously. We never intentionally degrade our models, and we were able to immediately confirm that our API and inference layer were unaffected.

After investigation, we identified three separate issues:

  1. On March 4, we changed Claude Code’s default reasoning effort from high to medium, to reduce the very long latency – enough to make the UI appear frozen – that some users were seeing at high effort. This was the wrong tradeoff. We rolled back this change on April 7 after users told us they preferred higher intelligence by default, opting into lower effort for simple tasks. This affected Sonnet 4.6 and Opus 4.6.
  2. On March 26, we shipped a change to remove Claude’s prior thinking blocks from sessions that had been inactive for more than an hour, to reduce latency when users resume those sessions. A bug caused this to happen on every turn for the rest of the session instead of just once, making Claude feel forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
  3. On April 16, we added a system prompt directive to reduce verbosity. In combination with other prompt changes, this hurt coding quality; it was reverted on April 20. This affected Sonnet 4.6, Opus 4.6, and Opus 4.7.

Because each change affected a different portion of traffic on a different timeline, the overall effect looked like widespread, inconsistent degradation. When we began investigating the reports in early March, it was initially challenging to distinguish them from normal variation in user feedback, and neither our internal usage nor our evals reproduced the issues at first.

This is not the experience users should expect from Claude Code. Starting April 23, we are resetting usage limits for all customers.

Changes to Claude Code’s default reasoning effort

When we released Opus 4.6 in Claude Code in February, we set the default reasoning effort to high.

Shortly thereafter, we received feedback from users that Claude Opus 4.6 would sometimes think for too long at high effort, causing the UI to appear frozen and producing uneven latency and token usage for those users.

In general, the longer the model thinks, the better the output. The effort level that Claude Code lets users set controls that tradeoff: more thinking versus lower latency and fewer usage-limit hits. As we calibrate effort levels for our models, we take this tradeoff into account to choose points along the test-time compute curve that give people the best options. In the product layer, we then choose which point along this curve to set as the default; that is the value we send to the Messages API as the effort parameter, and we expose the other options via /effort.
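As a rough sketch of that plumbing, the effort setting is simply a field the product layer attaches to each request body before it reaches the API. The field and model names below are illustrative, not the exact Claude Code internals:

```python
# Sketch: how a client like Claude Code might thread the user's effort
# setting through to a Messages API request. Names are illustrative.

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a request body carrying the effort setting."""
    assert effort in {"low", "medium", "high", "xhigh"}
    return {
        "model": "claude-opus-4-6",          # illustrative model id
        "max_tokens": 4096,
        "effort": effort,                     # the thinking-vs-latency knob
        "messages": [{"role": "user", "content": prompt}],
    }

# The product layer picks the default; /effort overrides it per session.
request = build_request("Refactor this function", effort="medium")
```

The design question discussed in this section is only about which value the product sends by default; the tradeoff curve itself is fixed at model-calibration time.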


In our internal evals and testing, medium effort achieved slightly lower intelligence with significantly lower latency on most tasks. It didn’t suffer from the occasional too-long thinking stalls, and it helped users get more out of their usage limits. As a result, we rolled out a change making medium the default effort, and explained the reasoning in an in-product dialog.


Soon after the rollout, users began reporting that Claude Code felt less intelligent. We shipped several design iterations to surface the current effort setting and alert people that they could change the default (notices on startup, an inline effort selector, and bringing back ultrathink), but most users stayed on the medium effort default.

After hearing feedback from more customers, we reversed this decision on April 7. All users now default to xhigh effort for Opus 4.7, and high effort for all other models.

A caching optimization that removed prior thinking

When Claude reasons through a task, that thinking is typically kept in the conversation history so that on every subsequent turn, Claude can see why it made the edits and tool calls it did.

On March 26, we shipped what was intended to be an efficiency improvement here. We use prompt caching to make back-to-back API calls cheaper and faster for users. The Claude API writes a request’s input tokens to the cache, and after a period of inactivity the prompt is flushed from the cache to make room for other prompts. Cache usage is something we manage carefully (more on our approach below).

The design was meant to be simple: if a session has been inactive for more than an hour, we can reduce the cost to users of resuming that session by clearing out old thinking blocks. Since the resuming request will be a cache miss anyway, removing unneeded messages reduces the number of uncached tokens sent to the API. From then on, we resume sending the full thinking history. To do this we used clear_thinking_20251015 with keep: 1 via an API header.
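Sketching that intended design: the clearing directive should attach only to the single request that resumes a stale session. The request shape around clear_thinking_20251015 below is an assumption for illustration, not the exact API schema:

```python
IDLE_LIMIT_SECONDS = 3600  # one hour of inactivity

def build_resume_request(messages: list, idle_seconds: float) -> dict:
    """Build the next request for a session; clear old thinking blocks
    only when resuming after the cache has likely expired (a sketch)."""
    request = {
        "model": "claude-opus-4-6",   # illustrative model id
        "max_tokens": 4096,
        "messages": messages,
    }
    if idle_seconds > IDLE_LIMIT_SECONDS:
        # The resume is a cache miss anyway, so stripping stale thinking
        # reduces uncached input tokens. Keep only the most recent block.
        request["context_management"] = {
            "edits": [{"type": "clear_thinking_20251015", "keep": 1}]
        }
    return request
```

The key property is that the directive applies to exactly one request; every turn after the resume should go back to sending the full thinking history.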

There was a bug in the implementation. Instead of clearing the thinking history once, it was cleared on every turn for the rest of the session. Once a session crossed the idle threshold, every subsequent request told the API to keep only the most recent thinking block and discard everything before it. This compounded: if you sent a follow-up message while Claude was in the middle of a tool call, a new turn started under the broken flag, so the thinking from the current turn was removed as well. Claude would keep executing, but it wouldn’t remember why it had decided to do what it did. This produced the forgetfulness, repetition, and strange tool choices that people reported.
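The faulty state handling can be modeled in a few lines (class and method names here are invented for illustration): the buggy version latches a session-level flag that is never reset, while the fix makes the decision per request.

```python
class Session:
    """Minimal model of the bug described above."""

    IDLE_LIMIT = 3600  # seconds

    def __init__(self):
        self.clear_thinking = False  # session-level flag (the problem)

    def should_clear_buggy(self, idle_seconds: float) -> bool:
        # Bug: the flag is set on the first stale resume and never reset,
        # so every later turn also strips prior thinking blocks.
        if idle_seconds > self.IDLE_LIMIT:
            self.clear_thinking = True
        return self.clear_thinking

    def should_clear_fixed(self, idle_seconds: float) -> bool:
        # Fix: decide per request; only the resuming request clears.
        return idle_seconds > self.IDLE_LIMIT
```

In the buggy version, a quick follow-up sent seconds after a stale resume still reports `True`, which is exactly the every-turn clearing behavior users experienced.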

Because the bug continually removed thinking blocks from subsequent requests, those requests also resulted in cache misses. We believe this explains the individual reports of usage limits being exhausted faster than expected.


Two unrelated changes made this issue hard to reproduce at first: an internal-only server-side experiment related to message queuing, and an orthogonal change to how we display thinking, both suppressed the bug in most CLI sessions, so we didn’t catch it even when testing external builds.

This bug sat at the intersection of Claude Code’s context management, the Anthropic API, and extended thinking. The change that introduced it went through multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated validation, and dogfooding. Because it only triggered in a corner case (stale sessions) and was difficult to reproduce, it took us over a week to discover and confirm the root cause.

As part of the investigation, we back-tested code review on the offending pull request using Opus 4.7. When given the code repositories needed to gather the full context, Opus 4.7 found the bug, while Opus 4.6 did not. To help prevent this from happening again, we now support additional repositories as references for code review.

We fixed this bug in v2.1.101 on April 10.

A system prompt change to reduce verbosity

Our latest model, Claude Opus 4.7, has one notable behavioral quirk compared to its predecessor: as we wrote at launch, it’s quite talkative. This makes it smarter on harder problems, but it also generates more output tokens.

A few weeks before releasing Opus 4.7, we started tuning Claude Code in preparation. Each model behaves slightly differently, and before each release we spend time optimizing the harness and product for it.

We have many tools for reducing verbosity: model training, prompting, and in-product UX changes. We ultimately used all of these, but one additional line in the system prompt had an outsized impact on intelligence in Claude Code:

“Length limit: Keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.”

After several weeks of internal testing with no regressions in the set of evals we ran, we felt confident in the change and shipped it with Opus 4.7 on April 16.

As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evals. One of these evals showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.
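The ablation procedure is straightforward to sketch: score the full prompt, then re-score it with each line removed and compare. Here `run_eval` is a hypothetical stand-in for a real evaluation harness:

```python
def ablate(system_prompt: str, run_eval) -> dict:
    """Per-line ablation of a system prompt.

    run_eval(prompt) -> score is a stand-in for a real eval harness.
    Returns {line: score_without_line - baseline}; a positive delta
    means the eval scored higher with that line removed (the line hurt).
    """
    lines = system_prompt.splitlines()
    baseline = run_eval("\n".join(lines))
    impact = {}
    for i, line in enumerate(lines):
        without = "\n".join(lines[:i] + lines[i + 1:])
        impact[line] = run_eval(without) - baseline
    return impact
```

This is the shape of analysis that surfaced the 3% drop: one line’s removal moved the score while the others left it unchanged.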

Going forward

We’re going to do several things differently to avoid issues like these: we’ll ensure that a large portion of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features), and we’ll improve the code review tooling we use internally and ship the improved version to customers.

We’re also adding tighter controls on system prompt changes. We will run a comprehensive suite of per-model evals for Claude Code system prompt changes, continue ablations to understand the impact of each line, and we have built new tooling to simplify the review and audit of prompt changes. We’ve additionally added guidance to our CLAUDE.md to ensure that model-specific changes are tested against the specific model they target. For changes whose effects may be counterintuitive, we will add soak periods, a comprehensive eval suite, and gradual rollouts so we can catch issues early.

We recently created @ClaudeDevs on X as a place to explain product decisions and the reasoning behind them in depth. We will share similar updates in centralized threads on GitHub.

Finally, we would like to thank our users: the people who used the /feedback command to share issues with us (and those who posted specific, reproducible examples online) are what ultimately allowed us to identify and fix these problems. Today we’re resetting usage limits for all customers.

We are extremely grateful for your feedback and your patience.


