
“When you get a demo and something works 90% of the time, that’s just the first nine.” — Andrej Karpathy
The “March of Nines” captures a common production reality: a strong demo gets you to roughly 90% reliability, and each additional nine often requires comparable engineering effort. For enterprise teams, the distance between “works in a demo” and “works like trusted software” determines adoption.
The math behind the March of Nines
“Each nine takes the same amount of work.” — Andrej Karpathy
Failures compound in agent workflows. A typical enterprise flow may include intent parsing, context retrieval, planning, one or more tool calls, validation, formatting, and audit logging. If a workflow has n steps and each step succeeds with probability p, end-to-end success is approximately p^n.
In a 10-step workflow, small per-step failure rates compound into a large end-to-end failure rate. And unless you harden shared dependencies, correlated outages (auth, rate limits, connectors) will dominate.
| Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day | What this means in practice |
| --- | --- | --- | --- | --- |
| 90.00% | 34.87% | 65.13% | ~6.5 interruptions/day | Prototype territory. Most workflows are disrupted. |
| 99.00% | 90.44% | 9.56% | ~1 every 1.0 days | Fine for a demo, but interruptions are still frequent in real use. |
| 99.90% | 99.00% | 1.00% | ~1 every 10 days | Still feels unreliable because failures remain visible. |
| 99.99% | 99.90% | 0.10% | ~1 every 3.3 months | This is where it starts to feel like reliable enterprise-grade software. |
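The compounding shown above takes only a few lines to reproduce; this small sketch prints the same per-step vs. end-to-end numbers:

```python
def end_to_end_success(p_step: float, n_steps: int) -> float:
    """Probability that an n-step workflow completes when every step
    succeeds independently with probability p_step."""
    return p_step ** n_steps

# Per-step reliability vs. success of a 10-step workflow.
for p in (0.90, 0.99, 0.999, 0.9999):
    print(f"per-step {p:.2%} -> end-to-end {end_to_end_success(p, 10):.2%}")
```

Note the independence assumption: correlated failures (a shared auth outage, a rate-limited connector) make real numbers worse than this model predicts.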
Define Reliability as Measurable SLOs
“It’s more worthwhile to spend a little more time making your signals more concrete.” — Andrej Karpathy
Teams make real progress by turning reliability into measurable objectives, then investing in the controls that reduce variance. Start with a small set of SLIs that describe both model behavior and the surrounding system:
- Workflow completion rate (end-to-end success without human intervention).
- Tool-call success rate within timeouts, with strict schema validation on input and output.
- Schema-validated output rate for each structured response (JSON/arguments).
- Policy compliance rate (PII, secrets, and guardrails).
- p95 end-to-end latency and cost per workflow.
- Fallback rate (fallback models, cached data, or human review).
Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so rollouts stay controlled.
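Error-budget accounting is simple to mechanize. A minimal sketch, with an illustrative 99.9% SLO and made-up run counts:

```python
def error_budget_remaining(slo_target: float, total_runs: int, failed_runs: int) -> float:
    """Fraction of the error budget left in the current window.
    slo_target=0.999 means 0.1% of runs are allowed to fail."""
    budget = (1.0 - slo_target) * total_runs  # failures the SLO allows
    if budget <= 0:
        return 0.0
    return max(0.0, 1.0 - failed_runs / budget)

# Example: 99.9% SLO over 10,000 runs with 4 failures -> 60% of the budget left.
print(f"{error_budget_remaining(0.999, 10_000, 4):.0%}")
```

When the remaining budget approaches zero, pause autonomy expansion and risky rollouts until reliability work restores headroom.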
Nine levers that reliably buy the next nine
1) Bound autonomy with an explicit workflow graph
Reliability increases when the system has bounded states and deterministic handling of retries, timeouts, and terminal outcomes.
- Put each model call inside a state machine or DAG, where every node defines allowed tools, maximum attempts, and a success predicate.
- Persist state with idempotency keys so that retries are safe and debuggable.
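One way to make those node definitions concrete is a small frozen spec per step. All names here (`StepSpec`, `crm.search`, `lookup_customer`) are illustrative, not a real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class StepSpec:
    """One node in the workflow graph: what it may do, how often it may
    retry, and what counts as success."""
    name: str
    allowed_tools: frozenset  # tool names this node may invoke
    max_attempts: int
    timeout_s: float
    success_predicate: Callable[[object], bool]

# Hypothetical node: look up a customer record via a CRM connector.
lookup = StepSpec(
    name="lookup_customer",
    allowed_tools=frozenset({"crm.search"}),
    max_attempts=3,
    timeout_s=10.0,
    success_predicate=lambda out: isinstance(out, dict) and "customer_id" in out,
)
```

Freezing the spec keeps the graph declarative: the executor reads limits from the node instead of scattering them through prompt or glue code.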
2) Enforce contracts at every boundary
Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.
- Use JSON Schema/Protobuf for every structured output and validate server-side before any tool executes.
- Use enums and canonical IDs, and normalize time (ISO-8601 + timezone) and units (SI).
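A minimal sketch of server-side validation before tool execution, using only the standard library. The `create_ticket` contract and its fields are hypothetical:

```python
import json
from datetime import datetime

# Hypothetical contract for a create_ticket tool call.
REQUIRED_FIELDS = {"ticket_id": str, "priority": str, "created_at": str}
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def validate_tool_args(raw: str) -> dict:
    """Reject model output that violates the contract before any tool runs."""
    args = json.loads(raw)  # malformed JSON fails here with JSONDecodeError
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(args.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    if args["priority"] not in ALLOWED_PRIORITIES:  # enum check
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    # Normalize time to ISO-8601; anything unparsable is rejected.
    datetime.fromisoformat(args["created_at"])
    return args
```

In production you would typically reach for a schema library instead of hand-rolled checks, but the placement is the point: validation sits server-side, between the model and the tool.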
3) Layer validators: syntax, semantics, business rules
Schema validation catches formatting problems. Semantic and business-rule checks prevent plausible-looking answers that break the system.
- Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.
- Business rules: approvals for write operations, data residency constraints, and tenant-level limits.
4) Route risk using uncertainty signals
High-impact actions deserve high assurance. Risk-based routing turns uncertainty into a product feature.
- Use confidence signals (classifiers, consistency checks, or second-model validators) to decide routing.
- Gate risky steps behind stronger models, additional validation, or human approval.
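The routing decision itself can stay deterministic even when the confidence signal is fuzzy. A sketch with illustrative thresholds and path names:

```python
def route(action_risk: str, confidence: float) -> str:
    """Pick an assurance path from action impact and model confidence.
    Thresholds here are illustrative; tune them per workflow tier."""
    if action_risk == "high":
        # High-impact actions always get extra assurance.
        return "human_approval" if confidence < 0.95 else "strong_model_plus_validation"
    if confidence < 0.80:
        # Low confidence on low-impact work: fall back or queue for review.
        return "fallback_or_review"
    return "auto"
```

Keeping the router a pure function makes its behavior testable and auditable, which matters once the routes carry compliance weight.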
5) Engineer tool calls like a distributed system
Connectors and dependencies often dominate failure rates in agentic systems.
- Enforce per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.
- Version tool schemas and validate tool responses to avoid silent breakage when APIs change.
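Backoff with jitter is one of the cheapest of these levers to implement. A sketch of the "full jitter" variant, with illustrative base and cap values:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)]. The randomness spreads retries out
    so many clients don't hammer a recovering service in lockstep."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

Pair this with a retry budget and a circuit breaker: jitter prevents synchronized retry storms, while the breaker stops retrying a dependency that is clearly down.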
6) Make retrieval predictable and observable
Retrieval quality determines how grounded your application will be. Treat the index like a versioned data product with coverage metrics.
- Track empty-retrieval rate, document freshness, and hit rate on labeled queries.
- Ship index changes behind a canary, so you learn whether something will fail before users do.
- Enforce least-privilege access and redaction at the retrieval layer to reduce leakage risk.
7) Create a production evaluation pipeline
The later nines depend on finding rare failures quickly and preventing regressions.
- Maintain an incident-derived golden set built from production traffic and run it on every change.
- Run shadow mode and A/B canaries with automatic rollback on SLI regression.
8) Invest in observability and operational feedback
Once failures become rare, speed of diagnosis and remediation becomes the limiting factor.
- Emit traces/spans per step, store redacted prompts and tool I/O with strong access controls, and classify every failure into a taxonomy.
- Use runbooks and “safe mode” toggles for fast mitigation (disable risky tools, switch models, require human approval).
9) Ship an autonomy slider with deterministic fallbacks
Fallible systems require supervision, and production software needs a safe way to gain autonomy over time. Treat autonomy as a knob, not a switch, and make the safe path the default.
- Default to read-only or reversible operations; require explicit confirmation (or an approval workflow) for writes and irreversible operations.
- Build deterministic fallbacks: retrieval-only replies, cached responses, rule-based handlers, or escalation to human review when confidence is low.
- Expose a per-tenant safe mode: disable risky tools/connectors, enforce a stronger model, lower the temperature, and tighten timeouts during incidents.
- Design resumable handoffs: persist state, show the plan and diffs, and let a reviewer approve and resume from the exact step with an idempotency key.
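The "knob, not a switch" idea can be encoded directly. A sketch with illustrative autonomy levels and a deterministic gate; the level names and policy are assumptions, not a standard:

```python
from enum import Enum

class Autonomy(Enum):
    """Autonomy as a knob, not a switch (levels are illustrative)."""
    READ_ONLY = 1         # retrieval and drafting only
    REVERSIBLE = 2        # writes allowed if they can be rolled back
    CONFIRMED_WRITES = 3  # irreversible actions permitted via approval flow
    FULL = 4              # earned after sustained SLO compliance

def allowed(level: Autonomy, action: str, reversible: bool) -> bool:
    """Deterministic gate: does this autonomy level permit the action?"""
    if action == "read":
        return True
    if level is Autonomy.READ_ONLY:
        return False
    if reversible:
        return True
    # Irreversible writes need at least the confirmed-writes level.
    return level in (Autonomy.CONFIRMED_WRITES, Autonomy.FULL)
```

Because the gate is pure and enumerable, you can exhaustively test the policy and tie level changes to error-budget status rather than to ad-hoc judgment.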
Implementation sketch: a bounded step wrapper
A small wrapper around each model/tool step turns unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.
def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # Trace all retries within one span
    span = start_span(name)
    for attempt in range(1, max_attempts + 1):
        try:
            # Bound latency so one step can't stall the workflow
            with timeout(timeout_s):
                out = attempt_fn()
            # Gate: schema + semantics + business invariants
            validate_fn(out)
            # Success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # Transient failure: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "error": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # Bad output: retry in "safe" mode (low temperature / strict prompt)
            span.log({"attempt": attempt, "error": str(e)})
            attempt_fn = lambda: attempt_fn_safe(mode="safe")
    # Fallback: keep the system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")
Why enterprises insist on the later nines
The reliability gap translates into business risk. McKinsey’s 2025 Global Survey reports that 51% of organizations using AI have experienced at least one negative outcome, and nearly one-third have reported consequences tied to AI inaccuracy. Results like these raise the bar for robust measurement, safeguards, and operational controls.
Closing checklist
- Pick one top workflow, define its completion SLO, and enumerate its terminal states.
- Add contracts + validators around every model output and tool input/output.
- Treat connectors and retrieval as first-class reliability surfaces (timeouts, circuit breakers, canaries).
- Route high-impact actions through higher-assurance paths (verification or approval).
- Turn every incident into a regression test in your golden set.
The nines come from disciplined engineering: bounded workflows, tight interfaces, resilient dependencies, and fast operational learning loops.
Nikhil Mungail has been building distributed systems and AI teams at SaaS companies for over 15 years.