AI Agents Are Quietly Generating Chaos Engineering Failures Enterprises Don’t Track Yet

There is a category of production event that engineering teams are not yet tracking – because it does not fit any existing postmortem template.

The agent initiated an action. The agent’s action was technically correct given the context. The context was incomplete. The infrastructure collapsed. And, by the time the incident was reviewed, three teams were debating whether it was an agent failure or an infrastructure failure, because the framework for thinking about these two things had never been connected.

The scale of this problem is no longer theoretical. 79% of organizations now have some form of AI agent in production, and 96% plan to expand. Gartner predicts that 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.

No statistic can capture the failure modes that occur between those two numbers: agents that are running, which have not been canceled, and which are quietly generating infrastructure incidents that no one has classified as a risk.

I spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading an AI-powered lifecycle platform deployed at over 20 global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments).

During that time I also filed a patent on an intent-based chaos engineering methodology. And through it all, I kept seeing organizations make the same structural mistake: treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.

The judgment call that agents leave out

To understand why this matters, you need to understand what’s really wrong with how enterprises handle chaos today, before adding agents into the picture.

Most mature engineering organizations have invested in chaos engineering programs: game days, blast radius controls, SLO-gated experiments. When a human engineer begins a chaos experiment, the sequence has one important quality: a human decides whether the system has the capacity to absorb the disturbance right now. They check the dashboard. They look at the error budget burn rate. They assess whether dependencies are stable. It's imperfect and often relies on intuition, but before anything moves there is at least one person in the loop asking the right question.

That question disappears when you introduce an autonomous remediation agent that can restart services, re-route traffic, scale resources, or modify configurations in response to detected anomalies. The agent notices an anomaly. The agent takes action. That action is a chaos event. There is no SLO burn rate check. There is no blast radius calculation. There is no human judgment about whether now is the right time to introduce additional stress into a system that may already be under pressure from three other directions.

Here is the typical failure mode I've seen play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster, a reasonable action given its training data and its narrow view of the anomaly. What the agent doesn't know: three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.

What started out as a latency spike that the agent was designed to fix becomes a cascade that the agent was never designed to model. The blast radius of that agent action was not the restarted service. It was everything downstream of the restart, in a system state the agent did not have a complete picture of.

No one's chaos engineering program had tested for that specific combination. The agent was not included as an actor in anyone's blast radius calculations, because we don't think of agents as chaos actors. We need to.

According to the AI Incident Database, reported AI-related incidents are expected to increase by 21% from 2024 to 2025. This count almost certainly underestimates the true risk, as most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The event is logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.

Absorptive capacity is a resource; most systems don't treat it this way

The underlying problem is that there is no shared language for absorptive capacity across enterprise systems – a real-time estimate of how much additional stress a system can take before it violates its SLO commitments. Chaos engineering programs manage this indirectly through human judgment and static limits, which fire once a threshold has already been exceeded. Agents don’t manage this at all.

Through structured primary research with site reliability engineering (SRE) and platform engineering practitioners at organizations including Intuit and GPTZero, I am developing a resiliency budgeting model. The basic idea is to treat absorptive capacity as a continuously recalculated, consumable resource rather than a static limit that you try not to violate.

A resiliency budget is based on four live signal categories.

The SLO burn rate is the primary input, as it directly encodes the distance between current system behavior and the commitment that actually matters. If a system is spending its monthly error budget at five times the expected rate, the resiliency budget is close to zero, regardless of what the CPU utilization looks like.
The P99 latency trend matters more than absolute latency, because a service that has been trending upward for more than forty minutes tells you something different from a service that has remained stable at the same absolute value.
The dependency saturation state is the most commonly missed signal. A chaos experiment or an agent action that assumes a shared connection pool is freely available when it is sitting at 87% will generate failure modes that no one has designed for.
Application behavior signals (session completion rates, API call pattern changes, conversion declines) surface system stress earlier than infrastructure metrics do, because users experience degradation before Prometheus reports it.
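As a rough illustration of how the four signal categories could be folded into one number, here is a minimal sketch. The field names, the thresholds (5x burn rate, a 40-minute trend, a 90% session completion floor) and the choice of taking the minimum are all assumptions made for this example, not the author's model. In particular, the sketch sets the budget to exactly zero at a 5x burn rate, where the text only says it is close to zero.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Live inputs. All field names are hypothetical, not a real API."""
    burn_rate: float              # error budget burn rate; 1.0 = on pace for the SLO window
    p99_trend_minutes: float      # minutes that p99 latency has been trending upward
    dependency_saturation: float  # busiest shared dependency, 0.0 to 1.0
    session_completion: float     # share of user sessions completing, 0.0 to 1.0


def clamp(x: float) -> float:
    """Limit a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def resiliency_budget(s: Signals) -> float:
    """Return a score in [0, 1]; 0 means no room for additional stress."""
    # Burn rate dominates: at 5x the expected rate this sketch leaves no headroom.
    burn_headroom = clamp((5.0 - s.burn_rate) / 4.0)
    # A sustained upward p99 trend (40 minutes or more) removes all headroom.
    trend_headroom = clamp(1.0 - s.p99_trend_minutes / 40.0)
    # Headroom left in the most saturated shared dependency.
    dep_headroom = clamp(1.0 - s.dependency_saturation)
    # User-facing behavior, treated as fully degraded at 90% completion (assumed).
    app_headroom = clamp((s.session_completion - 0.90) / 0.10)
    # Take the minimum so one exhausted signal cannot be averaged away.
    return min(burn_headroom, trend_headroom, dep_headroom, app_headroom)
```

Taking the minimum rather than a weighted average is one possible design choice: it keeps a healthy CPU or latency reading from masking a connection pool that is already at 87%.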

What makes it a budget rather than a limit is that it is consumable. Each chaos experiment draws from the available capacity. Every agent action draws from it too. In multi-team organizations where multiple experiments and multiple agents may run simultaneously, the budget is shared.

Without a shared ledger of consumption, two teams experimenting against overlapping dependencies produce a joint blast radius that neither team had planned for. Add autonomous agents operating entirely outside the ledger, and accounting collapses.
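One way to picture a shared ledger is as a small accounting object that every experiment and every agent must debit before acting. The sketch below is an assumption-heavy illustration: the class name, the per-dependency budget units and the actor labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Shared record of who is consuming absorptive capacity, per dependency."""
    capacity: dict[str, float]  # dependency name -> available budget units
    debits: list[tuple[str, str, float]] = field(default_factory=list)

    def remaining(self, dependency: str) -> float:
        """Budget left on a dependency after all recorded debits."""
        spent = sum(cost for _, dep, cost in self.debits if dep == dependency)
        return self.capacity[dependency] - spent

    def reserve(self, actor: str, dependency: str, cost: float) -> bool:
        """Debit the budget if there is room; refuse otherwise."""
        if cost > self.remaining(dependency):
            return False
        self.debits.append((actor, dependency, cost))
        return True


# Two actors drawing on the same dependency: the second request is refused.
ledger = BudgetLedger({"shared-pool": 1.0})
first = ledger.reserve("team-a-experiment", "shared-pool", 0.6)
second = ledger.reserve("remediation-agent", "shared-pool", 0.6)
```

The point of the sketch is only that the agent's request goes through the same `reserve` call as the human-run experiment, so overlapping consumption is visible before it becomes a joint blast radius.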

Where language models help, and exactly where they fail

Many engineering organizations are now running experiments using large language models (LLMs) to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are genuinely useful. Language models surface potential failure modes that experienced SREs consider worth testing, and they generate hypotheses faster than manual processes, especially when working from rich postmortem histories.

The limit is the staleness of the dependency graph, and it is a hard limit. A hypothesis generated from a graph that does not reflect last month's service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake. The problem is that the model doesn't know it is making a mistake. It will be confidently wrong about a system boundary that no longer exists, and in chaos engineering, a confident mistake in production means an unplanned outage.
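A simple guard against this failure mode is to refuse any generated hypothesis whose source graph predates the last known topology change. The following is a minimal sketch; the function name, the timestamps it expects and the seven-day maximum age are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def graph_is_stale(graph_built_at, last_topology_change,
                   max_age=timedelta(days=7), now=None):
    """Return True if the dependency graph should not be trusted for hypotheses.

    graph_built_at: when the graph snapshot was generated.
    last_topology_change: most recent known service extraction, new shared
        dependency, or similar change (where this comes from is assumed).
    """
    now = now or datetime.now(timezone.utc)
    # Any topology change after the snapshot invalidates blast radius assumptions.
    if last_topology_change > graph_built_at:
        return True
    # Even with no recorded change, an old snapshot is treated as suspect.
    return now - graph_built_at > max_age
```

This does not make the model aware of its own mistakes. It only keeps hypotheses built on a known-outdated graph from reaching production.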

Stanford's Trusted AI Research Lab found that model-level guardrails alone are insufficient: fine-tuning attacks bypassed leading models in most tested cases. The implication for chaos hypothesis generation is direct: a model that cannot reliably maintain its safety bounds cannot be trusted to accurately model the blast radius of an action it has never seen, in a dependency graph it has not verified.

When hypothesis generation is derived from postmortem corpora, the problem of staleness is substantially reduced. The postmortem describes the failures that actually occurred in the system at a specific moment in time. The signal is naturally validated by the production reality. This is a tractable near-term AI application in this area, and it is really useful for organizations with mature incident documentation practices.

What AI cannot do, and should not be asked to do, is make execution decisions when the signal is ambiguous. That decision requires awareness of things that remain completely outside any monitoring system: a pending deployment that changed the dependency landscape an hour earlier, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable before Monday.

A model without access to that context should not make that call. This is not a temporary limitation pending a more capable model. This is a structural constraint on what machine observations can reflect, and building an agent architecture that ignores this is creating one that will ultimately make consequential decisions with incomplete information – and with no human in the loop to catch it.

What does this mean for how enterprises control agents in production?

The governance implications are simple to describe and harder to implement. Every autonomous agent action touching the infrastructure needs to be registered against the same live signal layer that controls the chaos experiments. The same SLO burn rates, latency trends, and dependency saturation signals that a human engineer would check before starting an experiment should determine what an agent is allowed to do and when. If the resiliency budget is below a defined floor, the agent waits or escalates. It does not act.
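The gating rule can be sketched as a single function that every agent action passes through before it executes. This is an illustrative sketch only: the floor value, the notion of an action "cost" and the `urgent` flag are assumptions, and the below-floor response is modeled here as either waiting or handing off to a human.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    WAIT = "wait"
    ESCALATE = "escalate"


def gate_agent_action(budget: float, action_cost: float,
                      floor: float = 0.2, urgent: bool = False) -> Decision:
    """Decide whether an agent may act, given the current budget score.

    budget: current resiliency budget score in [0, 1] (hypothetical scale).
    action_cost: estimated budget the action would consume (assumed estimate).
    floor: minimum budget that must remain after the action.
    """
    # Block the action if the budget is already below the floor,
    # or if taking the action would push it below the floor.
    if budget < floor or budget - action_cost < floor:
        return Decision.ESCALATE if urgent else Decision.WAIT
    return Decision.ALLOW
```

Under this sketch, a restart costing 0.1 is allowed at a budget of 0.5 but held at 0.25, because it would leave the system below the 0.2 floor.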

Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question is not just whether the restart completed successfully. It is whether the blast radius of that action was proportional to the absorptive capacity available, and what cascading effects it produced in dependencies. That's chaos engineering data. It feeds back into the budget model, which informs the next decision the agent or the team has to make.
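To make the difference between an event log entry and an experiment record concrete, here is a minimal sketch of what such a record might hold. The class, its fields and the proportionality check are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class AgentActionExperiment:
    """An agent action recorded with the fields a chaos experiment would have."""
    agent: str
    action: str
    target: str
    budget_before: float
    budget_after: float
    expected_blast_radius: set[str]
    observed_blast_radius: set[str] = field(default_factory=set)

    def budget_consumed(self) -> float:
        """How much absorptive capacity the action used."""
        return self.budget_before - self.budget_after

    def unplanned_impact(self) -> set[str]:
        """Components affected that nobody expected the action to touch."""
        return self.observed_blast_radius - self.expected_blast_radius

    def was_proportional(self) -> bool:
        """True if impact stayed inside expectations and the available budget."""
        return not self.unplanned_impact() and self.budget_after >= 0.0


record = AgentActionExperiment(
    agent="remediation-agent", action="restart", target="checkout",
    budget_before=0.3, budget_after=-0.2,
    expected_blast_radius={"checkout"},
    observed_blast_radius={"checkout", "shared-pool", "orders-db"},
)
```

A plain event log would only say that the restart of `checkout` succeeded. The record above also captures that two unexpected dependencies were affected and that the action overdrew the budget.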

And when signals are genuinely ambiguous, when budget scores are borderline, when recent deployments have changed the topology in ways that the agent's context window doesn't capture, when dependency states are in flux, execution decisions need to go to a human. Not as a permanent limit on the agent's autonomy, but as a hard engineering requirement for the current state of the technology.

A circuit breaker that delegates ambiguous cases to a human is not a weakness in the agent architecture. It is what makes the architecture reliable enough to actually run in production. Intent-based validation formalizes this: defining what the correct behavior of the agent looks like before deployment, then continuously checking whether those bounds hold under live system conditions.
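A minimal sketch of such a circuit breaker follows. It routes the decision to a human whenever any of the ambiguity conditions described above holds. The margin around the floor and the 60-minute settling window after a deployment are assumed values, and the function name is hypothetical.

```python
def needs_human(budget: float, floor: float,
                minutes_since_last_deploy: float,
                dependencies_in_flux: bool,
                margin: float = 0.1,
                settle_minutes: float = 60.0) -> bool:
    """Return True when the execution decision should go to a human.

    The budget counts as ambiguous when it sits within `margin` of the floor,
    on either side. A deployment inside the settling window means the topology
    may have changed in ways the agent's inputs do not reflect.
    """
    borderline = abs(budget - floor) <= margin
    recent_deploy = minutes_since_last_deploy < settle_minutes
    return borderline or recent_deploy or dependencies_in_flux
```

A clear pass or a clear fail stays automated in this sketch. Only the ambiguous band around the floor, a fresh deployment, or unstable dependencies trips the breaker.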

The organizations that deploy autonomous agents reliably at scale are not those with the most sophisticated models. They are the ones that realized, before anything went wrong, that every agent action is a chaos event, and built their governance accordingly.

The practical first step is straightforward: audit every autonomous agent currently touching the infrastructure, map its action surface against your live SLO burn rate signal, and define clear floor conditions below which the agent is required to wait or escalate. That audit will expose agents working completely outside of your resiliency accounting.

Most organizations running agents at scale today have several such agents. Find them before production does.

Sayali Patil has spent over 6 years at Cisco Systems and Splunk building the reliability and automation systems that keep enterprise AI infrastructure running at scale.
