When Claude changed, everything changed: Managing AI blast radius in production

Our system did one thing, and did it well: it transformed natural language queries into API calls.

Our users were analysts, account managers and heads of operations. They knew what data they needed, but manually assembling it meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like "Compile a report on sales volume from January to March 2026 for the Northeast region, broken down by city" was translated into an API call on which the system could act:

json

{
  "description": "User requested sales volume for given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "Northeast"
  }
}

The rest of the pipeline was conventional engineering. The system routed the call to the right backend (we had integrations with internal reporting portals, Salesforce, and multiple in-house services), applied an LLM-generated JSON query to filter and shape the response, and either delivered the result as a Drive document via email or presented it as a chart in the browser.

By mid-2025, the system was producing several hundred reports per month. These reports were consumed by leadership and analysts and shared with external stakeholders. The system had become the default way for most teams to pull ad-hoc data.

As described in the example above, the contract between the LLM and the rest of the system was a structured JSON object.

json

{
  "description": "User requested sales volume for given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "Northeast"
  }
}

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and then to 4.0 without incident. By the time Sonnet 4.5 shipped, we were convinced the LLM was stable and predictable, and we considered it a solved problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model started folding the content of post_body into the description field. We observed two failure modes.

First, the filter parameters never reached the API. Our system reads post_body as the source of truth for the request payload, and that field came back empty. The API call was made without date range or region filters. Depending on the specific API being called, the backend either returned sales volume for all time and all regions or returned a 500 error.
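
One way to keep an empty post_body from reaching a backend is to check for required filters before dispatch. The following is a minimal sketch under stated assumptions, not the system described here: the REQUIRED_FILTERS table, the filter key names (start_date, end_date, region) and the function name are all hypothetical.

```python
# Hypothetical table: the filters each endpoint must receive before it is called.
REQUIRED_FILTERS = {
    "/api/sales_volume": {"start_date", "end_date", "region"},
}


def missing_filters(call: dict) -> set:
    """Return the required filter names that are absent or empty in post_body."""
    required = REQUIRED_FILTERS.get(call.get("api_call"), set())
    body = call.get("post_body")
    if not isinstance(body, dict):
        body = {}
    present = {key for key, value in body.items() if value not in (None, "")}
    return required - present


# A response shaped like the regression: filters folded into the description.
regressed = {
    "description": "POST /api/sales_volume with start_date=2026-01-01 ...",
    "api_call": "/api/sales_volume",
    "post_body": {},
}

gaps = missing_filters(regressed)
if gaps:
    print("refusing to dispatch, missing:", sorted(gaps))
```

A guard like this turns a silent "all time, all regions" query into an explicit refusal, at the cost of maintaining the table for every endpoint.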

Second, the model started asking clarifying questions in its responses. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, sometimes asked a question instead. Our system had no way to handle this. It was built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to fail in several ways.
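
The missing path for a reply that is not an API call can be sketched as a small classification step ahead of dispatch. This is an illustrative sketch only: classify_reply, the "call" and "non_call" labels, and EXPECTED_KEYS are hypothetical names, and the check covers only whether the reply parses into an object with the three expected fields.

```python
import json

# The three top-level fields of the contract between the LLM and the pipeline.
EXPECTED_KEYS = {"description", "api_call", "post_body"}


def classify_reply(text: str):
    """Return ("call", parsed_dict) or ("non_call", original_text)."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Free text, such as a clarifying question, ends up here.
        return ("non_call", text)
    if isinstance(parsed, dict) and EXPECTED_KEYS.issubset(parsed):
        return ("call", parsed)
    # Valid JSON that is not shaped like the contract.
    return ("non_call", text)


kind, payload = classify_reply("Which cities should the Northeast breakdown include?")
print(kind)
```

Anything labeled "non_call" would need somewhere to go, such as a queue for a human or a reply to the user, which is exactly the component the original design lacked.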

We rolled back to 4.0. This was harder than it should have been: between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which had been qualified against 4.5. Rolling back the model meant re-qualifying each of them against 4.0 under time pressure.
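
The rollback cost comes from integrations that were only ever qualified against the newer model. A minimal sketch of tracking that, assuming a simple record of which model versions each integration has passed against (the integration names and the QUALIFIED table are hypothetical):

```python
# Hypothetical record: model versions each integration has been qualified against.
QUALIFIED = {
    "sales_volume": {"4.0", "4.5"},
    "new_integration_a": {"4.5"},
    "new_integration_b": {"4.5"},
}


def needs_requalification(qualified: dict, target_model: str) -> list:
    """List integrations with no passing qualification on target_model."""
    return sorted(
        name for name, models in qualified.items() if target_model not in models
    )


# Rolling back to 4.0 exposes every integration added after the upgrade.
print(needs_requalification(QUALIFIED, "4.0"))
# → ['new_integration_a', 'new_integration_b']
```

Keeping such a record current makes the cost of a rollback visible before an incident instead of during one.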

Why traditional engineering discipline fails here

Software engineering is built on the ability to limit the impact of change. When you upgrade a driver or library, you read the release notes to see whether breaking changes are expected. Unit tests pin down what could possibly shift. You rely on the following property: the system being changed is deterministic enough that its behavior can be predicted, or at least sampled well enough to give you confidence. The blast radius is bounded by construction.

LLM-enabled systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality that your system depends on.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be calculated in advance because both the input space (natural language) and the failure modes (anything the model might do differently) are unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been underspecified. We asked the model to return a JSON object with three fields. We explained what each field is for. We did not explicitly state that the description should be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, apparently trying to be more "helpful" in its formatting choices, decided that asking for clarification or including the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction. However, it violated the assumptions our system was built on.

The bug was not in the model. The bug was in our assumption that the model would keep filling the gaps in our specification the way it always had. Three successful upgrades had trained us to believe that those gaps were safe.

Structured output mode and the tool-use API would have caught this specific failure at the schema level. We were not using them, for engineering reasons beyond the scope of this article. But a schema constrains only syntax, not semantics. A schema cannot specify that a clarifying question should not appear in a system that has no path to clarification, or that a date range should never silently default to all time. Schemas solve the easy half of the problem.
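
The gap between syntax and semantics can be illustrated with two checks on the same response. This is a sketch under assumptions: matches_schema is a hand-rolled stand-in for a real schema validator, the key names start_date and end_date are assumed, and MAX_SPAN_DAYS is a hypothetical policy chosen only for the example.

```python
from datetime import date

MAX_SPAN_DAYS = 366  # hypothetical policy, not taken from the article


def matches_schema(resp: dict) -> bool:
    """Structural check: required fields exist and have the right types."""
    body = resp.get("post_body")
    return (
        isinstance(resp.get("description"), str)
        and isinstance(resp.get("api_call"), str)
        and isinstance(body, dict)
        and isinstance(body.get("start_date"), str)
        and isinstance(body.get("end_date"), str)
    )


def date_range_is_bounded(resp: dict) -> bool:
    """Semantic check: dates parse, are ordered, and span a plausible window."""
    body = resp["post_body"]
    try:
        start = date.fromisoformat(body["start_date"])
        end = date.fromisoformat(body["end_date"])
    except (KeyError, TypeError, ValueError):
        return False
    return start <= end and (end - start).days <= MAX_SPAN_DAYS


# Well-formed, yet effectively an "all time" query.
all_time = {
    "description": "Sales volume for the Northeast",
    "api_call": "/api/sales_volume",
    "post_body": {"start_date": "1970-01-01", "end_date": "9999-12-31"},
}
print(matches_schema(all_time), date_range_is_bounded(all_time))
# → True False
```

The response passes the structural check and fails the semantic one, which is the half of the problem a schema alone does not cover.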

Evals-first architecture

The discipline that bridges this gap is to treat evaluation suites, not prompts, as the formal specification of the system. The prompt is one implementation of the specification. The model is one interpreter. The evals are the specification, and any model or prompt change is valid only after it passes them.

In practice, an eval is a triple: an input, a property that the output must satisfy, and a scoring function. For our system, the eval capturing the 4.5 regression looks approximately like this:

Python

def test_description_contains_no_serialized_payload(response):
    # Compare case-insensitively so "CURL" or "Post_Body" are caught too.
    desc = response["description"].lower()
    # Tokens that suggest the request payload leaked into the description.
    forbidden = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"Description leaked structured content: {response['description']}"
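
The triple described above (input, property, scoring function) can also be written down as data. The following is a minimal sketch, not the authors' implementation: EvalCase, binary_score and no_payload_in_description are hypothetical names, and the response is hand-written to stand in for real model output.

```python
from dataclasses import dataclass
from typing import Callable


def binary_score(passed: bool) -> float:
    """Simplest scoring function: 1.0 when the property holds, else 0.0."""
    return 1.0 if passed else 0.0


@dataclass
class EvalCase:
    prompt: str                                    # the input
    holds: Callable[[dict], bool]                  # property the output must satisfy
    score: Callable[[bool], float] = binary_score  # the scoring function


def no_payload_in_description(response: dict) -> bool:
    desc = response["description"].lower()
    return not any(tok in desc for tok in ("curl", "post_body", "{"))


case = EvalCase(
    prompt="Sales volume for the Northeast, January to March 2026, by city",
    holds=no_payload_in_description,
)

# Hand-written response standing in for what a model returned for case.prompt.
response = {
    "description": "Q1 2026 sales volume by city",
    "api_call": "/api/sales_volume",
    "post_body": {"region": "Northeast"},
}
print(case.score(case.holds(response)))
# → 1.0
```

Separating the property from the scoring function leaves room for graded scores, for example when an LLM judge rates a fuzzy property instead of a boolean assertion.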

A few hundred such properties, some hand-written for known-important invariants, some generated as regression tests from real production traffic, some scored by LLM-as-judge for fuzzy properties like tone, become the gate that every change must pass. Model upgrades and prompt changes should be treated as pull requests that need a green suite before they can be merged.

Building and maintaining evals is expensive. As your product changes, they drift. LLM-as-judge introduces its own variance into scoring results. And the suite can only catch the failure modes you’ve thought to specify; you can’t eval your way to protection against a class of failure you’ve never imagined. We learned this lesson the hard way: no one on our team had written an assertion like "Description field should not contain a curl command," because no one thought the model would ever put one there.

Evals are not a silver bullet. They give you the ability to limit the blast radius of change when the underlying function is a black box: by densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior shifts.
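
Treating the suite as a merge gate can be sketched as a pass-rate check over repeated samples, since a single run of a non-deterministic model says little. Everything here is an assumption for illustration: the function names, the default of five samples per case, the threshold, and the two stand-in "models", which are plain functions and not real model calls.

```python
def pass_rate(model, cases, samples_per_case: int = 5) -> float:
    """Fraction of sampled outputs that satisfy their case's property.

    Assumes a non-empty list of (prompt, check) pairs.
    """
    results = [
        check(model(prompt))
        for prompt, check in cases
        for _ in range(samples_per_case)
    ]
    return sum(results) / len(results)


def may_deploy(model, cases, threshold: float = 1.0) -> bool:
    """Gate a model or prompt change on the suite, like a required CI check."""
    return pass_rate(model, cases) >= threshold


# Stand-in "models": plain functions from prompt to an already-parsed response.
def old_model(prompt):
    return {"description": "Sales volume report", "post_body": {"region": "Northeast"}}


def new_model(prompt):
    return {"description": "post_body: {region: Northeast}", "post_body": {}}


cases = [("Sales volume for the Northeast", lambda r: bool(r["post_body"]))]
print(may_deploy(old_model, cases), may_deploy(new_model, cases))
# → True False
```

A threshold below 1.0 may suit properties scored by a noisy judge, while hard invariants such as a non-empty post_body would plausibly stay at 1.0.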

The road ahead

The engineering community has not yet developed a body of knowledge for writing effective evals. There are no widely accepted standards for what ‘coverage’ means in natural language input spaces. CI/CD systems were not designed to handle probabilistic test results. As agents take on more autonomous tasks (writing code, moving money, scheduling changes to infrastructure), the gap between "the model passed our smoke tests" and "we know what this system will do in production" becomes the central engineering problem of the next several years.

The teams that close that gap will be the ones that stop treating evals as after-the-fact quality assurance and start treating them as the actual specification of their system.

Vijay Sagar Gullapalli is the founding AI engineer and USPTO-patented inventor at Adopt AI.

Sharat Mahavartayjula is a senior software engineer at Sherwin-Williams.


