{"id":512149,"date":"2026-03-09T02:14:14","date_gmt":"2026-03-09T02:14:14","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/512149\/"},"modified":"2026-03-09T02:14:14","modified_gmt":"2026-03-09T02:14:14","slug":"karpathys-march-of-nines-shows-why-90-ai-reliability-isnt-even-close-to-enough","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/512149\/","title":{"rendered":"Karpathy\u2019s March of Nines shows why 90% AI reliability isn\u2019t even close to enough"},"content":{"rendered":"<p>\u201cWhen you get a demo and something works 90% of the time, that\u2019s just the first nine.\u201d \u2014 <a href=\"https:\/\/www.dwarkesh.com\/p\/andrej-karpathy\" rel=\"nofollow noopener\" target=\"_blank\">Andrej Karpathy<\/a><\/p>\n<p>The \u201c<a href=\"https:\/\/www.superagent.sh\/blog\/the-march-of-nines\" rel=\"nofollow noopener\" target=\"_blank\">March of Nines<\/a>\u201d frames a common production reality: You can reach the first 90% reliability with a strong demo, and each additional nine often requires comparable engineering effort. For enterprise teams, the distance between \u201cusually works\u201d and \u201coperates like dependable software\u201d determines adoption.<\/p>\n<p>The compounding math behind the March of Nines<\/p>\n<p>\u201cEvery single nine is the same amount of work.\u201d \u2014 Andrej Karpathy<\/p>\n<p>Agentic workflows compound failure. A typical enterprise flow might include: intent parsing, context retrieval, planning, one or more tool calls, <a href=\"https:\/\/venturebeat.com\/security\/when-ai-lies-the-rise-of-alignment-faking-in-autonomous-systems\" rel=\"nofollow noopener\" target=\"_blank\">validation<\/a>, formatting, and <a href=\"https:\/\/venturebeat.com\/orchestration\/shadow-mode-drift-alerts-and-audit-logs-inside-the-modern-audit-loop\" rel=\"nofollow noopener\" target=\"_blank\">audit logging<\/a>. 
If a workflow has n steps and each step succeeds independently with probability p, end-to-end success is approximately p^n.<\/p>\n<p>In a 10-step workflow, small per-step failure rates compound into a much larger end-to-end failure rate. Correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.<\/p>\n<table>\n<thead>\n<tr><th>Per-step success (p)<\/th><th>10-step success (p^10)<\/th><th>Workflow failure rate<\/th><th>Interruptions at 10 workflows\/day<\/th><th>What this means in practice<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>90.00%<\/td><td>34.87%<\/td><td>65.13%<\/td><td>~6.5 interruptions\/day<\/td><td>Prototype territory. Most workflows get interrupted.<\/td><\/tr>\n<tr><td>99.00%<\/td><td>90.44%<\/td><td>9.56%<\/td><td>~1 per day<\/td><td>Fine for a demo, but interruptions are still frequent in real use.<\/td><\/tr>\n<tr><td>99.90%<\/td><td>99.00%<\/td><td>1.00%<\/td><td>~1 every 10 days<\/td><td>Still feels unreliable because misses remain common.<\/td><\/tr>\n<tr><td>99.99%<\/td><td>99.90%<\/td><td>0.10%<\/td><td>~1 every 3.3 months<\/td><td>This is where it starts to feel like dependable enterprise-grade software.<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Define reliability as measurable SLOs<\/p>\n<p>\u201cIt makes a lot more sense to spend a bit more time to be more concrete in your prompts.\u201d \u2014 <a href=\"https:\/\/singjupost.com\/andrej-karpathy-software-is-changing-again\/\" rel=\"nofollow noopener\" target=\"_blank\">Andrej Karpathy<\/a><\/p>\n<p>Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance. 
Start with a small set of SLIs that describe both model behavior and the surrounding system:<\/p>\n<p>Workflow completion rate (success or explicit escalation).<\/p>\n<p>Tool-call success rate within timeouts, with strict schema validation on inputs and outputs.<\/p>\n<p>Schema-valid output rate for every structured response (JSON\/arguments).<\/p>\n<p>Policy compliance rate (PII, secrets, and security constraints).<\/p>\n<p>p95 end-to-end latency and cost per workflow.<\/p>\n<p>Fallback rate (safer model, cached data, or human review).<\/p>\n<p>Set SLO targets per workflow tier (low\/medium\/high impact) and manage an error budget so experiments stay controlled.<\/p>\n<p>Nine levers that reliably add nines<\/p>\n<p>1) Constrain autonomy with an explicit workflow graph<\/p>\n<p>Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.<\/p>\n<p>Place model calls inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate.<\/p>\n<p>Persist state with idempotency keys so retries are safe and debuggable.<\/p>\n<p>2) Enforce contracts at every boundary<\/p>\n<p>Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.<\/p>\n<p>Use JSON Schema\/protobuf for every structured output and validate server-side before any tool executes.<\/p>\n<p>Use enums and canonical IDs, and normalize time (ISO 8601 with timezone) and units (SI).<\/p>\n<p>3) Layer validators: syntax, semantics, business rules<\/p>\n<p>Schema validation catches formatting errors. 
Semantic and business-rule checks prevent plausible answers that break systems.<\/p>\n<p>Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.<\/p>\n<p>Business rules: approvals for write actions, data residency constraints, and customer-tier constraints.<\/p>\n<p>4) Route by risk using uncertainty signals<\/p>\n<p>High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.<\/p>\n<p>Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing.<\/p>\n<p>Gate risky steps behind stronger models, additional verification, or human approval.<\/p>\n<p>5) Engineer tool calls like distributed systems<\/p>\n<p>Connectors and dependencies often dominate failure rates in agentic systems.<\/p>\n<p>Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.<\/p>\n<p>Version tool schemas and validate tool responses to prevent silent breakage when APIs change.<\/p>\n<p>6) Make retrieval predictable and observable<\/p>\n<p>Retrieval quality determines how grounded your application will be. 
Treat it like a versioned data product with coverage metrics.<\/p>\n<p>Track empty-retrieval rate, document freshness, and hit rate on labeled queries.<\/p>\n<p>Ship index changes behind canaries so regressions surface on a small slice of traffic before full rollout.<\/p>\n<p>Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk.<\/p>\n<p>7) Build a production evaluation pipeline<\/p>\n<p>The later nines depend on finding rare failures quickly and preventing regressions.<\/p>\n<p>8) Invest in observability and operational response<\/p>\n<p>Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.<\/p>\n<p>Emit traces\/spans per step, store redacted prompts and tool I\/O with strong access controls, and classify every failure into a taxonomy.<\/p>\n<p>Use runbooks and \u201csafe mode\u201d toggles (disable risky tools, switch models, require human approval) for fast mitigation.<\/p>\n<p>9) Ship an autonomy slider with deterministic fallbacks<\/p>\n<p>Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time. 
Treat <a href=\"https:\/\/venturebeat.com\/orchestration\/vibe-coding-with-overeager-ai-lessons-learned-from-treating-google-ai-studio\" rel=\"nofollow noopener\" target=\"_blank\">autonomy<\/a> as a knob, not a switch, and make the safe path the default.<\/p>\n<p>Default to read-only or reversible actions, and require explicit confirmation (or approval workflows) for writes and irreversible operations.<\/p>\n<p>Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low.<\/p>\n<p>Expose per-tenant safe modes: disable risky tools\/connectors, force a stronger model, lower temperature, and tighten timeouts during incidents.<\/p>\n<p>Design resumable handoffs: persist state, show the plan\/diff, and let a reviewer approve and resume from the exact step with an idempotency key.<\/p>\n<p>Implementation sketch: a bounded step wrapper<\/p>\n<p>A small wrapper around each model\/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.<\/p>\n<p>from time import sleep<\/p>\n<p># assumed helpers, defined elsewhere: start_span, deadline, metric, jittered_backoff, UpstreamError, ValidationError, EscalateToHuman<\/p>\n<p>def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0# trace all retries under one span<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0span = start_span(name)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0# extra kwargs for attempt_fn; switched to \"safer\" mode after a validation failure<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0attempt_kwargs = {}<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0for attempt in range(1, max_attempts + 1):<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# bound latency so one step can\u2019t stall the workflow<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0with deadline(timeout_s):<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0out = attempt_fn(**attempt_kwargs)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# gate: schema + semantic + business invariants<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0validate_fn(out)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# success path<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0metric(\"step_success\", name, attempt=attempt)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return out<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0except (TimeoutError, UpstreamError) as e:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# transient: retry with jitter to avoid retry storms<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0span.log({\"attempt\": attempt, \"err\": str(e)})<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0sleep(jittered_backoff(attempt))<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0except ValidationError as e:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# bad output: retry in \u201csafer\u201d mode (lower temp \/ stricter prompt) so the next attempt is validated too<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0span.log({\"attempt\": attempt, \"err\": str(e)})<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0attempt_kwargs = {\"mode\": \"safer\"}<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0# fallback: keep system safe when retries are exhausted<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0metric(\"step_fallback\", name)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0return EscalateToHuman(reason=f\"{name} failed\")<\/p>\n<p>Why enterprises insist on the later nines<\/p>\n<p>Reliability gaps translate into business risk. 
<a href=\"https:\/\/www.mckinsey.com\/capabilities\/quantumblack\/our-insights\/the-state-of-ai\" rel=\"nofollow noopener\" target=\"_blank\">McKinsey\u2019s 2025 global survey<\/a> reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. These outcomes drive demand for stronger measurement, guardrails, and operational controls.<\/p>\n<p>Closing checklist<\/p>\n<p>Pick a top workflow, define its completion SLO, and instrument terminal status codes.<\/p>\n<p>Add contracts + validators around every model output and tool input\/output.<\/p>\n<p>Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).<\/p>\n<p>Route high-impact actions through higher assurance paths (verification or approval).<\/p>\n<p>Turn every incident into a regression test in your golden set.<\/p>\n<p>The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.<\/p>\n<p><a href=\"https:\/\/mungel.com\" rel=\"nofollow noopener\" target=\"_blank\">Nikhil Mungel<\/a> has been building distributed systems and AI teams at SaaS companies for more than 15 years.  
<\/p>\n","protected":false},"excerpt":{"rendered":"\u201cWhen you get a demo and something works 90% of the time, that\u2019s just the first nine.\u201d \u2014&hellip;\n","protected":false},"author":2,"featured_media":512150,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[182,181,507,74],"class_list":{"0":"post-512149","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/512149","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=512149"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/512149\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/
media\/512150"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media?parent=512149"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=512149"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=512149"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}