Arya's Substack

Iterative Refinement of a Multi-Agent Pipeline: A Case Study

Arya Tafvizi — Wed, 13 May 2026 15:52:23 GMT

Most of software engineering is an iterative practice of evaluation and debugging. You build a system, measure how it performs, find what’s broken, fix it, and measure again. We’ve always had this loop for traditional ML systems: train a model, evaluate on a held-out set, inspect the errors, retrain. For multi-agent systems, the evaluation part is starting to mature, but the debugging part has lagged behind. When a pipeline of 8 LLM-powered agents produces a wrong answer, the trace tells you every agent passed. It doesn’t tell you which agent to fix.

At Google, we relied on established workflows to make ML systems for Search highly deterministic and controllable. Those tools didn’t exist at DeepMind, and they are largely missing from today’s multi-agent ecosystems. I’m working to create the science and tools to bring that same rigor to AI systems. (For background on why emergent behavior in multi-agent systems makes this kind of tooling necessary, see the previous posts.) The counterfact tool we use below is a step in that direction.

The key advantage of counterfactual attribution: it analyzes multiple failure modes simultaneously and tells you where to invest your debugging time. Reading a single trace through many agents is already a cumbersome task for an engineer. Reading dozens of traces is worse, and each one may point at a different agent as the culprit. As shown later in this post, standard LLM-based trace analysis fails to identify cascading errors, confusing correlation with causation. Counterfact evaluates every possible combination of agents in minutes and returns a quantitative, causal report: not “this agent looks suspicious” but “this agent degrades precision by 0.236 on average across your eval set.”

Below, we walk through this workflow on an 8-agent financial RAG pipeline using counterfact: identify the failure, diagnose it with Shapley attribution, apply fixes one at a time, and re-run the diagnosis after each step to verify improvement and re-prioritize.

The Pipeline

This is a sequential pipeline: each agent receives the output of all previous agents and adds its own contribution to the shared state. The first four agents prepare data (parsing the question, retrieving the document, extracting numbers, adding context). The second four progressively refine a single analysis field (synthesizing an answer, verifying it, editing the tone, formatting the output).

The prompts share a common practice: relying on the LLM’s own judgment or world knowledge. This discouraged shortcut often leads to subtle bugs:

Table Extractor: “Present values in a clean, readable format.”
Context Enricher: “Based on your knowledge, add relevant industry context.”
Fact Checker: “Minor rounding differences and presentation choices are acceptable.”
Tone Editor: “Simplify large numbers for readability.”

No obviously wrong instructions. The bugs are in how Claude interprets these subjective terms and in architectural decisions like letting an LLM generate context from training data rather than a verified source.

The Failure

We ran the pipeline on 5 FinanceBench questions from 3M’s FY2018 cash flow statement. The correct answer for each is a single dollar figure from the 10-K filing.

Evaluation criteria: We enforce strict classification, requiring exact figures, exact units, and zero fabricated claims. Financial figures must be precise enough to audit back to the source document. This strictness eliminates ambiguity in the before-and-after comparison.

Bold = wrong figure. Italics = fabricated claims from Claude’s training data, not in the source document.

Every answer follows the same pattern: exact figures from the 10-K are rounded and converted from millions to billions, and each response includes fabricated peer comparisons. All eight agents pass in every trace. Nothing indicates which agent introduced the rounding or the fabricated claims.

Diagnosis: 5-Query Aggregate Shapley Attribution

Counterfact ablates each agent in every possible coalition, re-runs the pipeline with real Claude calls, and computes Shapley values. Three classifiers score each run:

Accuracy: Does the output contain the correct dollar figure?
Precision: Are the numbers exact, not rounded?
Grounding: Are all claims traceable to the source document?

Each classifier scores on a 0-to-1 scale. Average quality is the weighted mean of the three: accuracy (weight 2.0), precision (1.5), grounding (1.0). A perfect pipeline scores 1.0.
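As a rough illustration of the scoring arithmetic, the sketch below computes the weighted mean from the three classifier scores. The weights come from the text above; the name `average_quality` and the normalization by the sum of the weights (which is what makes a perfect pipeline score 1.0) are assumptions for illustration, not counterfact's actual implementation.

```python
# Weights as stated in the post; the function itself is a hypothetical sketch.
WEIGHTS = {"accuracy": 2.0, "precision": 1.5, "grounding": 1.0}


def average_quality(scores: dict) -> float:
    """Weighted mean of the classifier scores, each on a 0-to-1 scale."""
    total_weight = sum(WEIGHTS.values())  # 4.5
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# Illustrative run: correct figure, but rounded and with ungrounded claims.
example = average_quality({"accuracy": 1.0, "precision": 0.0, "grounding": 0.0})
print(round(example, 3))  # 2.0 / 4.5, about 0.444
```

Because accuracy carries the largest weight, a run that finds the right figure but rounds it and adds ungrounded claims still scores well below half.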

36 coalition evaluations per query, 5 queries run in parallel:

Avg baseline quality: 0.378

Agent                  Avg Shapley
──────────────────────────────────────────
tone_editor            -0.088  █          ← most harmful
context_enricher       -0.061  █
table_extractor        -0.047
query_parser           +0.029
fact_checker           +0.065  █
output_formatter       +0.134  ██
synthesizer            +0.160  ███
doc_retriever          +0.250  ████       ← most helpful

Per-Classifier Breakdown

accuracy  : table_extractor=-0.049    ...  doc_retriever=+0.250
precision : tone_editor=-0.236        ...  doc_retriever=+0.250
grounding : query_parser=-0.224       ...  synthesizer=+0.160

Each classifier surfaces a different bottleneck:

Accuracy: The table extractor is the worst agent (-0.049). Its “clean, readable format” instruction rounds source figures before downstream agents see them.
Precision: The tone editor is the worst agent (-0.236). “Simplify large numbers” triggers M-to-B unit conversion, the single largest source of precision loss.
Grounding: The query parser is the worst agent (-0.224). Misclassifying queries as “estimation” rather than “lookup” causes the synthesizer to approximate rather than cite exact figures.

No single agent is the worst on every classifier. A single-metric evaluation would have prioritized the wrong fix.

Iterative Debugging in Action

The diagnosis points at three agents to fix. We apply these prompt edits cumulatively, isolating variables and re-running the Shapley diagnosis after each fix to measure the marginal impact.

The prompt edits

Step 0: Broken pipeline

Avg Quality: 0.378 · Exact Answers: 0/5

Agent                  Avg Shapley
──────────────────────────────────────────
tone_editor            -0.088  █
context_enricher       -0.061  █
table_extractor        -0.047
query_parser           +0.029
fact_checker           +0.065  █
output_formatter       +0.134  ██
synthesizer            +0.160  ███
doc_retriever          +0.250  ████

Per-classifier worst:
  accuracy:   table_extractor  -0.049
  precision:  tone_editor      -0.236
  grounding:  query_parser     -0.224

Step 1: Fix context enricher

Avg Quality: 0.356 · Exact Answers: 0/5

Quality didn’t improve. The context enricher fix removed the fabricated benchmarks, but the tone editor is still converting millions to billions, which destroys precision on every query. Removing the fabricated context actually made the synthesizer lean harder on rounded numbers from the table extractor, which the tone editor then amplifies.

Agent                  Avg Shapley
──────────────────────────────────────────
tone_editor            -0.107  ██   ← worse than before
table_extractor        -0.020
context_enricher       +0.002       ← fixed, now neutral
query_parser           +0.061  █
synthesizer            +0.069  █
output_formatter       +0.074  █
fact_checker           +0.137  ██
doc_retriever          +0.183  ███

Per-classifier worst:
  accuracy:   context_enricher -0.115
  precision:  tone_editor      -0.219  (still dominant)
  grounding:  synthesizer      -0.216

The tool now tells us: fix the tone editor next.

Step 2: + Fix tone editor

Avg Quality: 0.747 · Exact Answers: 4/5

M-to-B conversion was the primary bottleneck. Preventing it with this single edit improves exact matches from 0/5 to 4/5 and roughly doubles average quality (0.356 to 0.747).

Agent                  Avg Shapley
──────────────────────────────────────────
context_enricher       -0.057  █    ← re-emerges
fact_checker           -0.012
query_parser           -0.009
tone_editor            -0.006       ← fixed, now neutral
synthesizer            +0.007
table_extractor        +0.008
output_formatter       +0.092  █
doc_retriever          +0.163  ███

Per-classifier worst:
  accuracy:   context_enricher -0.193
  precision:  context_enricher -0.127
  grounding:  synthesizer      -0.259

With tone fixed, the context enricher surfaces again as worst on accuracy and precision. The synthesizer is worst on grounding. One query (OpCF) still returns “$6.4 billion.”

Step 3: + Fix fact checker

Avg Quality: 0.836 · Exact Answers: 4/5

The stricter fact checker catches the remaining rounding error on OpCF and corrects fabricated context that slipped through. But Buybacks still returns “$4.87 billion.” The per-query Shapley points at the table extractor: its “clean, readable format” instruction rounds $4,870 before downstream agents even see it.

Agent                  Avg Shapley
──────────────────────────────────────────
context_enricher       -0.015
query_parser           -0.005
synthesizer            +0.012
output_formatter       +0.022
doc_retriever          +0.025
tone_editor            +0.026
fact_checker           +0.032
table_extractor        +0.063  █

All agents within [-0.015, +0.063]

Step 4: + Fix table extractor

Avg Quality: 0.822 · Exact Answers: 5/5

“Present values in a clean, readable format” → “Present exact values as reported in the source document. Do NOT round.” Buybacks now returns “$4,870 million.” All five queries produce exact answers.

Agent                  Avg Shapley
──────────────────────────────────────────
query_parser           -0.061  █
table_extractor        +0.003
context_enricher       +0.022
tone_editor            +0.026
output_formatter       +0.088  █
synthesizer            +0.092  █
fact_checker           +0.103  ██
doc_retriever          +0.137  ██

No agent degrading quality beyond noise

Incremental summary

By addressing the root causal bottlenecks rather than the symptoms, four prompt edits doubled the pipeline’s overall quality and achieved 5/5 exact answers.
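The diagnose, fix, re-diagnose cycle used in Steps 0 through 4 can be sketched as a small loop. Everything here is hypothetical scaffolding: `refine`, `diagnose`, `apply_fix`, and the `noise_floor` of 0.02 are illustrative names and values, not part of counterfact. The toy stand-ins reuse three Step 0 Shapley values and assume a fix simply neutralizes an agent, which ignores the interaction effects seen in Step 1. The loop also always fixes the current worst agent first, whereas the walkthrough above started with the context enricher.

```python
from typing import Callable, Dict, List


def refine(
    diagnose: Callable[[], Dict[str, float]],
    apply_fix: Callable[[str], None],
    noise_floor: float = 0.02,   # assumed threshold for "within noise"
    max_steps: int = 8,
) -> List[str]:
    """Repeatedly fix the agent with the most negative attribution."""
    fixed: List[str] = []
    for _ in range(max_steps):
        shapley = diagnose()                     # re-run attribution on the current pipeline
        worst = min(shapley, key=shapley.get)    # most harmful agent right now
        if shapley[worst] >= -noise_floor:       # nothing harmful beyond noise: stop
            break
        apply_fix(worst)
        fixed.append(worst)
    return fixed


# Toy stand-ins: a subset of the Step 0 values, with fixes that zero out the harm.
current = {"tone_editor": -0.088, "context_enricher": -0.061,
           "table_extractor": -0.047, "query_parser": 0.029}


def toy_diagnose() -> Dict[str, float]:
    return dict(current)


def toy_fix(agent: str) -> None:
    current[agent] = 0.0


order = refine(toy_diagnose, toy_fix)
print(order)
```

Re-running the diagnosis inside the loop, rather than planning all fixes from the first report, is what lets the ranking change after each edit.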

Key Takeaways

Semantic correctness does not guarantee behavioral correctness. When we rely on LLMs to interpret vague instructions using their own judgment, “minor rounding differences are acceptable” and “simplify large numbers” become failure modes. The instructions read correctly to a human reviewer, but Claude’s interpretation at inference time produces systematically wrong outputs.

The worst agent varies by query. On CapEx, the query parser is worst. On Dividends, it’s the table extractor. On OpCF, it’s the synthesizer. A single-query evaluation would have prioritized the wrong fix.
Agent fixes interact. Fixing the context enricher alone made quality worse (-0.022), because the tone editor was still converting M to B. The tool detects this: after Step 1, it re-prioritizes tone editor as the worst agent.
Diagnosis is state-dependent. After fixing the tone editor in Step 2, the context enricher re-emerges as the worst agent on accuracy and precision. The diagnostic loop surfaces whatever matters most given the current state of the pipeline.
Four prompt edits doubled quality. The diagnosis identified which four prompts to rewrite, in what order, and which quality dimension each was degrading.
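The step-to-step changes behind these takeaways can be recomputed from the reported averages. This snippet only does arithmetic on the Avg Quality numbers listed in Steps 0 through 4; the variable names are illustrative.

```python
# Avg Quality per step, copied from the walkthrough above.
quality = [
    ("Step 0: broken pipeline", 0.378),
    ("Step 1: fix context enricher", 0.356),
    ("Step 2: + fix tone editor", 0.747),
    ("Step 3: + fix fact checker", 0.836),
    ("Step 4: + fix table extractor", 0.822),
]

# Marginal change contributed by each cumulative edit.
deltas = [
    round(after - before, 3)
    for (_, before), (_, after) in zip(quality, quality[1:])
]
overall_ratio = round(quality[-1][1] / quality[0][1], 2)

for (label, _), delta in zip(quality[1:], deltas):
    print(f"{label}: {delta:+.3f}")
print(f"Step 4 vs Step 0: {overall_ratio}x")
```

The first edit lowers average quality by 0.022 and the last lowers it by 0.014 even as exact answers reach 5/5, so average quality and exact-match count are worth tracking side by side.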

Every failure in this pipeline is local: a single prompt instruction producing unintended behavior. There are no feedback loops, no architectural mismatches, and no emergent interactions between agents beyond simple data propagation. These are the easy cases. Future posts will cover systemic failures, cascading error amplification, conflicting objectives between agents, and architectural bottlenecks, where counterfactual attribution becomes essential rather than merely useful.

Baseline: Can an LLM Diagnose This From Traces?

We gave Claude Sonnet all 5 execution traces (every agent’s prompt and output) plus the classifier scores and asked it to rank the agents by severity. Here’s what it returned:

The LLM gets the top two right. After that, it diverges.

Where it goes wrong

Confuses correlation with causation. The LLM ranks output_formatter #3 because it “displays rounded numbers.” But Shapley shows output_formatter has a positive contribution (+0.134). It’s not causing the rounding; it’s just receiving already-rounded input from the tone editor. The LLM can’t distinguish “this agent received bad data” from “this agent caused bad data.”
Misses the table extractor. Shapley ranks table_extractor #3 because its “clean, readable format” instruction rounds $1,577 to $1,600 before any downstream agent sees it. The LLM never flags this because the table extractor’s output looks reasonable in the trace. You can only detect its impact by removing it and seeing what changes.
Can’t predict interaction effects. The LLM would suggest fixing context_enricher and tone_editor independently. It has no way to know that fixing context_enricher alone makes quality worse (-0.022) because the tone editor is still converting M to B. Shapley catches this because it tests every combination of agents.
Blames agents that are helping. The LLM flags synthesizer as #5, but Shapley shows synthesizer has a strongly positive contribution (+0.160). Removing it would make the pipeline worse, not better.

The LLM diagnosis is plausible, well-reasoned, and partially correct. That’s exactly what makes it dangerous: an engineer reading it would feel confident acting on it, but two of the five recommendations would either waste time (output_formatter) or actively degrade the pipeline (synthesizer). Causal attribution avoids this by testing each agent’s actual marginal contribution rather than reasoning about what looks wrong in a trace.
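One way to make this comparison mechanical is to cross-check any list of suspected agents against the sign of their Shapley values. The sketch below uses the Step 0 averages from this post. The `llm_flagged` list contains only the four agents the text names as flagged; the LLM's fourth-ranked agent is not stated here and is left out. The variable names are illustrative.

```python
# Step 0 average Shapley values, copied from the diagnosis above.
shapley = {
    "tone_editor": -0.088, "context_enricher": -0.061, "table_extractor": -0.047,
    "query_parser": 0.029, "fact_checker": 0.065, "output_formatter": 0.134,
    "synthesizer": 0.160, "doc_retriever": 0.250,
}

# Agents the LLM flagged, as named in the text (its #4 pick is not given).
llm_flagged = ["tone_editor", "context_enricher", "output_formatter", "synthesizer"]

# Flagged agents whose measured contribution is actually positive.
false_alarms = [a for a in llm_flagged if shapley[a] > 0]
# Agents with negative contribution that the LLM did not flag.
missed = [a for a, v in shapley.items() if v < 0 and a not in llm_flagged]

print("false alarms:", false_alarms)
print("missed:", missed)
```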

Reproduce It

pip install "counterfact[anthropic] @ git+https://github.com/counterfact-labs/counterfact.git"
export ANTHROPIC_API_KEY=...

# Diagnose the broken pipeline
python examples/financebench_diagnose.py

# Incremental fix cycle (Steps 0-4)
python examples/financebench_improve_prompts.py

Source: counterfact-labs/counterfact. Claude Sonnet 4.6 (synthesizer) + Claude Haiku 4.5 (all other agents).

Full execution trace (all 8 agents, input/output for each): Appendix on GitHub

Beyond Tracing: Understanding Multi-Agent Systems Requires Causal Inference

Arya Tafvizi — Fri, 08 May 2026 22:07:57 GMT

As AI architecture trends from monolithic LLMs toward multi-agent pipelines, often featuring ten or more interacting components on frameworks like LangGraph or AutoGen, our diagnostic methods are lagging behind.

We have robust methods for analyzing traditional machine learning models, like feature importance analysis, and mature interpretability tools for individual LLMs. But for multi-agent systems, standard diagnostic methods simply don’t exist, making the theoretical promise of “easier to debug modular systems” difficult to achieve in production.

In a modular system, each agent handles a distinct and well-defined responsibility: a researcher gathers context, a synthesizer drafts a summary, and a critic validates the output. Individually, frontier models handle these bounded tasks well. However, because agents pass qualitative natural language between one another, minor misinterpretations tend to compound across the pipeline. The interaction of multiple generative agents yields emergent and unpredictable behavior.

To solve the debugging challenge this creates, I built the Counterfactual Debugger. It is an open-source causal diagnostic toolkit designed specifically for multi-agent systems. Understanding why such a tool is necessary requires looking at how these systems actually fail.

The Computational Irreducibility of AI Systems

Having guided multi-agent systems from prototype to production, I’ve noticed they behave much like cellular automata: simple, localized rules produce highly complex macro-patterns.

Multi-agent systems are similarly computationally irreducible. You cannot predict the system’s final state or its potential failure modes purely through static analysis or by reading prompt code. The only way to understand the complex interaction of these agents is to run the system.

This presents a practical engineering dilemma. If you cannot predict a system’s behavior analytically, how do you debug it when it produces an unexpected result?

The Limits of Tracing

Standard observability tooling in AI relies heavily on tracing. We log the raw input and output strings of every agent in the chain.

Tracing is useful but backward-looking. It records exactly what happened. However, as pipelines grow to ten or more interacting components, manually parsing these logs becomes impractical for a human developer. Furthermore, in a non-deterministic pipeline, knowing what happened is rarely enough.

If an N-agent pipeline outputs a well-formatted but historically inaccurate research report, the tracing logs will show the exact sequence of events. They will not explain why the semantic drift occurred. They won’t specify if the error originated from the researcher retrieving an ambiguous source or if the critic failed to challenge a false premise set by the user. Tracing highlights the symptom but does not isolate the root cause.

Applying Causal Inference

Debugging emergent behavior requires counterfactual simulations. We have to ask what would have happened if the searcher agent had returned paragraph B instead of paragraph A. Would the synthesizer still hallucinate if the user’s prompt omitted a misleading premise?

This requires applying causal inference. By systematically perturbing the system’s graph, for example by ablating components, we can measure the causal impact of any single node on the final result. Using techniques inspired by Shapley values and agent-boundary perturbation, we can isolate the exact node and semantic shift responsible for a systemic hallucination, even if every individual agent appeared to be following its instructions.
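To make the ablation idea concrete, here is a minimal sketch of exact Shapley attribution over a three-agent pipeline. It is the textbook Shapley formula applied to a made-up quality function, not the toolkit's implementation: `shapley_values` and `toy_quality` are hypothetical names, and the scores inside `toy_quality` are invented for illustration. In a real diagnosis, the value function would re-run the pipeline with only the listed agents active and score the output.

```python
from itertools import combinations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley values: each agent's weighted average marginal
    contribution to `value` over all coalitions of the other agents."""
    n = len(agents)
    phi = {}
    for agent in agents:
        others = [a for a in agents if a != agent]
        total = 0.0
        for k in range(n):  # k = size of the coalition the agent joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                members = frozenset(coalition)
                total += weight * (value(members | {agent}) - value(members))
        phi[agent] = total
    return phi


def toy_quality(active):
    """Hypothetical quality score when only the `active` agents run."""
    score = 0.0
    if "searcher" in active:
        score += 0.5
        if "synthesizer" in active:  # synthesis only helps with retrieved context
            score += 0.4
    if "critic" in active:
        score -= 0.1                 # a critic that slightly degrades the output
    return score


phi = shapley_values(["searcher", "synthesizer", "critic"], toy_quality)
for name, v in phi.items():
    print(f"{name}: {v:+.3f}")
```

The values sum to the gap between the full pipeline and the empty one (0.8 here), which is what lets them be read as each agent's share of the outcome. Exact enumeration costs 2^N pipeline runs, so larger graphs typically need sampling.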

The Counterfactual Debugger

This is the exact problem the Counterfactual Debugger solves. Instead of generating a static trace of a failure, the toolkit automatically orchestrates probabilistic stress tests and counterfactual simulations across the agent graph.

When an error occurs, it systematically perturbs the pipeline’s intermediate steps to calculate a causal attribution score. This quantifies each agent’s contribution to the failure.

Here is a brief look at how it works:

from counterfact import StateGraph, END

# MyState, searcher_fn, synthesizer_fn, and critic_fn are assumed to be defined elsewhere.

# 1. Build your pipeline with counterfact's StateGraph (same API as LangGraph)
graph = StateGraph(MyState)

graph.add_node("searcher", searcher_fn)
graph.add_node("synthesizer", synthesizer_fn)
graph.add_node("critic", critic_fn)

graph.set_entry_point("searcher")
graph.add_edge("searcher", "synthesizer")
graph.add_edge("synthesizer", "critic")
graph.add_edge("critic", END)

compiled = graph.compile()

# 2. Run it
result = compiled.invoke({"query": "Did Thomas Edison electrocute Topsy?"})

# 3. Diagnose: actually re-runs the pipeline with agents ablated
report = compiled.diagnose(
    input_state={"query": "Did Thomas Edison electrocute Topsy?"},
    domain="rag",
)

# 4. Real output: Shapley values with bootstrap CIs, not "causal impact %"
print(report.shapley_values)

# {"searcher": -0.11, "synthesizer": +0.78, "critic": +0.10}
# Result: The synthesizer is highly accountable (+0.78) for the failure.
# The debugger proves the hallucination was generated during synthesis,
# not driven by poor context retrieved by the searcher.

print(report.classification)       # FailureClassification(failure_type="local", ...)
print(report.recommendations)      # [Recommendation(...)]

The tool converts qualitative failures into localized, actionable bugs. It enables offline testing of fixes in a shadow environment before pushing updates to production.

As we rely more heavily on multi-agent architectures, our maintenance methods must adapt. Print statements and standard trace logs are insufficient for computationally irreducible systems. Understanding these pipelines requires moving beyond observation and adopting causal inference.

The code and documentation are available below. If you are building complex agentic systems and looking for more rigorous ways to evaluate them, feel free to review the repository or open an issue.

https://github.com/counterfact-labs/counterfact

Emergent Complex Behavior in Simple Multi-Agent Systems

Arya Tafvizi — Thu, 05 Mar 2026 17:10:16 GMT

Disclaimer: The views and opinions expressed in this article are my own and do not necessarily reflect the official policy or position of Google or DeepMind.

From 2019 to 2022, I led the analysis of primary machine learning systems for Search Ranking at Google. My team’s job was to discover and help address unexpected behaviors in some of the most widely used predictive models in the world. I later contributed to AI Overviews in Search and to Gemini, both of which use networks of generative LLMs to fulfill a user’s query.

As I look across the broader AI industry today, I see a collective shift from shallow ensembles of discriminative models to deep ensembles of generative models. This fundamental change in system architecture will introduce novel vulnerabilities and incomprehensible behaviors.

While pieces of this puzzle are actively debated in disparate academic forums, particularly within AI alignment and safety research, they are rarely viewed through the lens of production infrastructure. My goal here is to offer a new perspective: translating these theoretical alignment challenges into a practical, systems-engineering framework.

Traditional debugging approaches are not suited to addressing emergent behaviors in complex agent ecosystems. In multi-agent generative systems, the failure mode is no longer a broken component. It is a coherent interaction.

The Cellular Automata Problem

Cellular automata are discussed early in computer science theory classes to demonstrate how complex behaviors emerge from very simple rules. See, for example, Conway’s Game of Life, in which a grid of cells governed by very simple deterministic rules produces complex, unpredictable, and sometimes chaotic macro-patterns. These systems are also computationally irreducible: the only way to predict their outcome is to run them.

In a modern AI pipeline, a web of specialized LLMs and tools processes an input to produce an output. Even if each component is perfectly prompted and highly capable, inter-agent interactions compound. These emergent complexities quickly exceed a human engineer’s ability to manually trace.

The Semantic Shift

Systems engineers might ask: “We’ve handled distributed microservices for decades. Why are multi-agent AI systems any different?”

Unpredictable chain reactions aren’t new. We see them in stock markets, power grids, and traditional software. The real difference is what the nodes send each other.

Traditional microservices exchange strict, structured data like JSON. If a service gets bad data, it usually throws a clear exception and stops. LLM agents, however, exchange natural language. If an agent receives a slightly inconsistent context, instead of crashing it will confidently process and further escalate the inconsistency before passing it to the next agent.
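The contrast can be sketched in a few lines. Both functions are hypothetical stand-ins: `strict_service` mimics a microservice that validates a structured payload (the field name `capex_musd` is invented), and `lenient_agent` mimics an LLM agent that accepts whatever text it receives. No real LLM is called.

```python
import json


def strict_service(payload: str) -> int:
    """Microservice-style consumer: rejects malformed input immediately."""
    data = json.loads(payload)        # raises json.JSONDecodeError on non-JSON text
    value = data["capex_musd"]        # raises KeyError if the field is missing
    if not isinstance(value, int):
        raise TypeError("capex_musd must be an integer number of millions")
    return value


def lenient_agent(message: str) -> str:
    """Stand-in for an LLM agent: accepts any text and builds on it."""
    return f"Given the upstream note ({message}), the figure appears consistent."


vague_input = "roughly $1.6 billion"

try:
    strict_service(vague_input)
    crashed = False
except ValueError:                    # JSONDecodeError is a subclass of ValueError
    crashed = True

propagated = lenient_agent(vague_input)
print(crashed)
print(propagated)
```

The strict path fails loudly at the boundary, while the lenient path carries the imprecise figure forward inside fluent text that the next agent will treat as context.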

The new challenge isn’t fixing hard crashes; it’s stopping these small errors in reasoning from snowballing across the system.

Below we look at a concrete example to understand exactly how this semantic drift bypasses standard checks.

A Prototypical Systemic Failure

Consider a simple three-agent research pipeline:

A Searcher retrieves documents.
A Synthesizer drafts the response.
A Critic checks citations and logical flow.

Each agent performs its local task correctly. Now, introduce a flawed prompt:

“Detail how Thomas Edison used the execution of Topsy the elephant in his ‘War of the Currents’ campaign to prove AC power was dangerous.”

Historically, Edison’s company filmed Electrocuting an Elephant in 1903. The elephant was killed using AC power. But the execution was ordered by Luna Park, not Edison. The causal framing in the prompt is misleading.

The failure unfolds cleanly:

The Searcher retrieves accurate facts. No hallucination occurs at the retrieval layer.
The Synthesizer attempts to answer the question as written. Its objective is relevance and coherence, not premise validation. The user’s framing becomes a constraint.
When retrieval reveals that Edison did not order the execution, the model resolves the tension by introducing a bridging claim: Edison did not initiate the event, but he leveraged it as anti-AC propaganda.

To understand why, we have to look at what the model is actually doing. It bridges conflicting facts by generating semantic glue. The model simply generates the most plausible text that satisfies both the user’s framing and the retrieved facts, rather than optimizing for objective historical truth.

No facts are fabricated. Causality is.

The Critic then verifies the output: Claims have citations, the reasoning is internally consistent, and the question is answered.

The output passes. The result is a coherent, citation-backed narrative that is historically misleading. Nothing broke. Retrieval worked. Generation worked. Critique worked.

The failure is architectural. Local objectives are satisfied, but no agent is responsible for rejecting the global premise. Once a flawed assumption enters the pipeline, it propagates forward as a structural constraint. The system optimizes around it.

The Scaling Multiplier

While this specific hallucination is easy to autopsy in a three-agent setup, production systems are rarely so contained. As we move toward N-agent systems, the interaction surfaces grow combinatorially.

In an N-agent directed graph, the number of possible interaction paths grows superlinearly. Each additional node introduces new composition routes for semantic error propagation, creating a mathematical escalation of risk that renders manual oversight impossible.
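As one way to put numbers on this growth, the sketch below counts directed paths in the densest feed-forward case, where every agent can read the output of every earlier agent. That topology is an assumption chosen for illustration, and real graphs may be sparser. In that case each subset of two or more agents, taken in order, is one path, so the count is 2^N - N - 1. The function name `count_paths` is illustrative.

```python
def count_paths(n: int) -> int:
    """Count directed paths with at least one edge in a DAG on n agents
    where agent i feeds agent j for every i < j."""
    # paths_ending[j] = number of paths that end at agent j
    paths_ending = [0] * n
    for j in range(n):
        # A path ending at j is either the single edge i -> j,
        # or any path ending at i extended by the edge i -> j.
        paths_ending[j] = sum(1 + paths_ending[i] for i in range(j))
    return sum(paths_ending)


for n in (3, 5, 8, 10):
    print(n, count_paths(n), 2**n - n - 1)
```

Under this assumption, going from three agents to eight raises the number of routes an error can travel from 4 to 247.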

Why State-of-the-Art Models Still Fail

The goal of the previous example is to demonstrate a fundamental principle: complexity and unpredictability emerge purely from the interactions themselves. An engineer’s first instinct is to patch this. “I’ll just add a Premise-Checking Agent,” or “I’ll rewrite the Synthesizer’s system prompt to be more skeptical.” But these remedies are subject to the exact same issues. This is the Whack-a-Mole fallacy of agent architectures. By adding more agents or rigid rules, you aren’t eliminating emergent behavior; you are simply increasing the number of network interactions, moving the complexity and the vulnerability to a new, untested part of the system.

Even state-of-the-art models fail inside poorly governed multi-agent architectures for three reasons:

Local Optimization Blindness: The Critic agent optimizes strictly for its local objective function, remaining blind to the global truth of the narrative.
Cascading Drift: When Agent A introduces a minor semantic error, the context it passes to Agent B becomes slightly distorted. As information flows through the network, these compounding distortions degrade the system’s overall reasoning capabilities. Agents blindly build their logic on increasingly unstable foundations, turning small misinterpretations into systemic hallucinations.
Reward Hacking & The Sycophancy Trap: Models are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to prioritize user satisfaction. When faced with a prompt containing a false premise, the model might treat the user’s framing as a heavy statistical prior, agreeing with the user rather than zooming out to correct the fundamental error.

The New Diagnostic Playbook

To build reliable multi-agent systems, we need bespoke diagnostic tooling designed to untangle, trace, and manage compounding interactions.

This requires three diagnostic pillars:

1. The Shadow Sandbox (Offline Diagnosis & Counterfactuals): When a failure occurs, traditional print logs are useless. New observability tools can trace prompts, responses, and agent calls across a system, giving us visibility into what happened. But tracing alone is insufficient. In generative systems, errors are rarely isolated bugs; instead, they are small misunderstandings that gradually compound as agents interact.

Observability shows what happened; a Shadow Sandbox would let us test what would have happened instead.

Since a single failing node cannot always be pinpointed, standard evaluation frameworks must evolve into full trajectory simulation. We need probabilistic stress tests and counterfactuals: pausing a trace, injecting corrections, and measuring the fix’s impact across historical runs.
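A counterfactual replay of this kind can be sketched as a pipeline runner that accepts injected corrections. All names here (`replay`, `inject`, the toy agents, and the state fields) are hypothetical, and the agents are plain functions rather than LLM calls.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]
Agent = Callable[[State], State]


def replay(
    agents: List[Tuple[str, Agent]],
    state: State,
    inject: Optional[Dict[str, State]] = None,
) -> State:
    """Run agents in order. Where `inject` names an agent, use the supplied
    output instead of calling that agent, then let the rest run normally."""
    inject = inject or {}
    state = dict(state)
    for name, agent in agents:
        update = inject[name] if name in inject else agent(state)
        state = {**state, **update}
    return state


def searcher(state: State) -> State:
    return {"context": "paragraph A"}


def synthesizer(state: State) -> State:
    return {"answer": f"Answer drawn from {state['context']}"}


pipeline = [("searcher", searcher), ("synthesizer", synthesizer)]

factual = replay(pipeline, {"query": "example question"})
counterfactual = replay(
    pipeline,
    {"query": "example question"},
    inject={"searcher": {"context": "paragraph B"}},
)
print(factual["answer"])
print(counterfactual["answer"])
```

Comparing the two final states across many historical inputs is what turns a single "what if" into a measured effect of the correction.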

2. The AI Judge (Process Reward Models & Out-of-Band Evaluation): You cannot rely on agents inside the loop to police the loop. Expanding on the LLM-as-a-Judge paradigm, a secondary AI must continuously evaluate the trajectory against core system rules. Crucially, this cannot just be an Outcome Reward Model (ORM) that grades the final output (which easily falls for the Topsy trap).

It must act as a Process Reward Model (PRM), evaluating the step-by-step reasoning trace to catch the exact moment semantic drift occurs.

Guided by explicit epistemic constraints rather than user satisfaction, this out-of-band judge catches the emergent errors that local agents miss.

3. The Circuit Breaker (Representation Engineering & Latent Monitoring): We need live monitoring to evaluate system stability. Rather than relying solely on semantic guardrails that parse text outputs, advanced circuit breakers must utilize probes on the model’s intermediate activations.

Existing techniques such as Representation Engineering (RepE) allow us to mathematically map these internal states during generation. Combined with Latent Monitoring, this detects internal signatures of uncertainty, contradiction, or cognitive conflict before a faulty generation, halting the pipeline proactively.

These techniques are admittedly difficult to operationalize and scale in production environments.

The Frontier of Multi-Agent Systems

Understanding these emergent failure modes is not just an engineering challenge; it defines the next generation of multi-agent AI architectures. Designing networks where agents interact predictably, ground each other in reality, and can be practically audited for truth is an open research frontier.

Print-statement debugging is insufficient: emergent failure is a system property, not broken code. Our diagnostic tools must evolve accordingly.

The World Stayed Bright

Arya Tafvizi — Mon, 02 Mar 2026 19:32:15 GMT

The colors seemed more vibrant that day than they had in several months. The trees in Boston Common looked greener, the sky above them bluer. It wasn’t long after my left leg was strong enough to walk there without assistance. I was on my way to my gym by the park to exercise on the stair machine.

This was the fall of 2017. It was a few months after my skydiving accident, a 75-foot fall that brought with it ten broken bones and a traumatic brain injury.

The colors came back a few days later, then again after a shorter period, then every other day, until they stopped flickering. The world stayed bright.

Realizing Boston couldn’t offer the quiet refuge I needed to heal, I moved to my sister’s basement in Los Altos. There, I studied graduate textbooks in statistics, slowly relearning a language I once spoke fluently. By the time I went to Google for my on-site interviews, the experience felt more like teaching than interviewing.

A job at Google was the stable platform I needed to rebuild myself physically, mentally, and emotionally. But it was a lot more than that.

When I was around ten years old, my dad bought a bulky desktop computer with a CRT monitor for our home. He had someone show us how to connect to the internet with a dial-up modem. Navigating to google.com, he emphasized that it was the best way to find anything online. “Look, it found us a hundred thousand results in a fraction of a second.”

Later at MIT, our professor in Mathematics for Computer Science used Google’s PageRank algorithm to teach us about Markov Chains, one of my favorite mathematical constructs. In the early 2000s, it made Google the place to go to find results that were not only topical but also trustworthy.

Search Ranking was still the crown jewel of Google when I arrived. I learned as much as I could about the new ways Google’s algorithms found trustworthy sites and worked to improve them. Over the following eight years, I contributed to projects that were both interesting and important, helping shape how billions of people find and interact with the world’s information.

I learned and I taught. I mentored and was mentored. I found things to improve and created entirely new things: from analysis methods to production serving, and from internal articles to external papers. I moved from seeking stability to seeking problems worth solving. That pursuit led me from Search to DeepMind, where I explored novel problems at the frontier of AI in my final year at Google.

I am grateful for what I got to build. I am most proud of the person I became in the process. Recovery taught me a lesson that my engineering experiences reinforced: progress is rarely a breakthrough; it is a practice of trial, learning, and growth, repeated time after time.

I’m most fulfilled when my life serves a purpose larger than myself. Now, I choose how to serve the world in my own way, the best I can. The world has stayed bright, and I have never been more prepared for the road ahead.

North Shore of Oahu, Summer 2020

My Question for AI Maximalists

Arya Tafvizi — Sun, 22 Feb 2026 00:17:09 GMT

Whenever someone declares that AI has achieved or will achieve “superhuman judgment” (a common occurrence these days), I always ask the same question: How do you even measure that?

The raw power of modern models is not in dispute. If the goal is passing a standardized exam or writing code to a clear spec, the debate is over. These are closed systems with a clear ground truth, agreed upon and defined by humans. Because the metric for success is perfectly definable, an optimization engine can directly solve for it.

However, as AI has saturated traditional metrics that reward the recall of training data, the industry is shifting toward benchmarks that measure fluid reasoning. The ARC Prize’s AGI metrics are a prime example; they measure intelligence through skill acquisition efficiency (how quickly a system learns a new task) by presenting puzzles where the AI isn’t even told the “rules of the game” beforehand. It’s an attempt to see if a model can reason from first principles rather than relying on past examples.

But even as we improve these benchmarks, the vast majority of human choices happen in environments where no perfect metric exists. Consider a senior product engineer making an architectural decision. They aren’t just optimizing for clean code; they are balancing system latency, launch deadlines, costs, and team morale, among others.

The technical bottleneck here is incommensurability. In a closed system, every variable can be converted into a single “score.” But in the real world, you cannot mathematically trade 50ms of latency for a 10% drop in team morale. There is no universal exchange rate between “code health” and “market timing.” Because machine learning requires a scalar objective, a single number to maximize, it is forced to collapse these distinct categories into an arbitrary calculation.
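
A small sketch can make the "arbitrary exchange rate" concrete. The two options, their numbers, and the weights below are invented purely for illustration, and `scalar_score` and `best_option` are hypothetical helpers. The point is only that the "best" decision flips when the weight on latency changes, even though nothing about the options themselves has changed.

```python
# Two hypothetical architectural options with incommensurable attributes.
options = {
    "low_latency": {"latency_ms": 70.0, "morale": 0.5},
    "high_morale": {"latency_ms": 120.0, "morale": 0.8},
}


def scalar_score(option, w_latency, w_morale):
    """Collapse latency and morale into one number using chosen weights."""
    return w_morale * option["morale"] - w_latency * option["latency_ms"]


def best_option(options, w_latency, w_morale):
    """Return the option name that maximizes the scalarized objective."""
    return max(options, key=lambda name: scalar_score(options[name], w_latency, w_morale))


# Same options, two equally defensible "exchange rates" between ms and morale.
choice_a = best_option(options, w_latency=0.001, w_morale=1.0)
choice_b = best_option(options, w_latency=0.01, w_morale=1.0)
print(choice_a, choice_b)
```

Neither weighting is more correct than the other; the optimizer simply inherits whichever exchange rate it is handed.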

When we can’t define a goal mathematically, we rely on proxy metrics. Consider Nick Bostrom’s Paperclip Maximizer thought experiment. If you task an AI with a single, measurable goal, like maximizing the production of paperclips, it will ruthlessly optimize for that metric at the expense of everything else, eventually consuming all human resources to build more paperclips. The AI isn’t malicious; it is simply following a mathematically defined target that lacks the unmeasurable context of human values.

This is the essence of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. By optimizing for a proxy, the AI games the metric while abandoning the spirit of the task.

This measurement problem explains why AI acts as a precise tool for measurable work, yet hits a wall the moment it encounters complex human trade-offs. Until we solve a philosophical measurement problem humanity hasn’t cracked in thousands of years, the claim that AI will replace judgment remains unfalsifiable. True judgment resists quantification, and for now, that remains our moat.

From Prototype to Production: Guiding Agents at Scale

Arya Tafvizi — Fri, 06 Feb 2026 06:21:00 GMT

The barrier to building software has collapsed. Today, someone with little formal engineering experience can generate thousands of lines of code in a single afternoon using intent-first development, colloquially known as “vibe coding.”

This shift is undeniably exciting. It lowers the barrier to entry and accelerates the “zero to one” phase of building. But as we recently discovered while helping a friend set up the engineering stack for their early-stage startup, this speed comes with a hidden tax.

While looking into modifying features in an open-source library, we found ourselves leaning heavily on AI agents to get quick wins. In one specific instance, the agent was more than willing to cut corners to implement the narrow feature we requested. It worked, but it introduced a fragility that would clearly prove costly as the codebase grew.

It clarified a distinction that is often lost in the hype: You can vibe code a prototype. You cannot vibe code a system.

The Illusion of Velocity

The mistake many teams are making right now isn’t using AI to write code. It is assuming that generating code is the same as engineering a system.

When you “vibe code,” you are optimizing for immediate functionality. You describe an outcome, and the model gives you the shortest path to that outcome. In a prototype, this is perfect. You want to validate an idea, test an API, or build an internal tool.

However, production software lives under different constraints. It requires reliability, observability, and the ability to be maintained by a team (human or AI) over years.

There is a counter-argument that AI reduces the cost of maintenance, so code quality matters less. While it is true that AI can help write tests or refactor functions, that is beside the point. AI agents struggle with context windows just as humans struggle with cognitive load. If you don’t have a good setup, clear boundaries, good hygiene, and logical flow, things get exponentially difficult for both humans and AI.

To survive the transition from prototype to production, you need to invest in the environment.

1. Magic Requires Modularity

A “magical” user experience is rarely the result of one massive, intelligent algorithm. It is usually the result of many small, boring components handling edge cases gracefully.

To create magic, you need modularity.

A system of interacting modules, where each is responsible for a simple, well-defined job, allows you to tweak and evaluate components independently. This is critical for AI-assisted development. If an AI agent attempts to modify a monolithic script, the blast radius of a mistake is the entire system. If it modifies a contained module with a defined interface, the risk is capped.
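
One way to picture a "contained module with a defined interface" is a small protocol that the rest of the system depends on. This is a minimal sketch, not a prescription: the `Summarizer` interface, both implementations, and `build_digest` are hypothetical names chosen for illustration.

```python
from typing import List, Protocol


class Summarizer(Protocol):
    """The interface the rest of the system depends on."""

    def summarize(self, text: str) -> str: ...


class FirstSentenceSummarizer:
    """Keeps only the first sentence of the text."""

    def summarize(self, text: str) -> str:
        return text.split(".")[0].strip() + "."


class TruncatingSummarizer:
    """Keeps only the first max_chars characters of the text."""

    def __init__(self, max_chars: int) -> None:
        self.max_chars = max_chars

    def summarize(self, text: str) -> str:
        return text[: self.max_chars]


def build_digest(docs: List[str], summarizer: Summarizer) -> List[str]:
    """Caller code: knows only the interface, not the implementation."""
    return [summarizer.summarize(d) for d in docs]


docs = ["Hello world. More text."]
digest_a = build_digest(docs, FirstSentenceSummarizer())
digest_b = build_digest(docs, TruncatingSummarizer(max_chars=5))
```

If an agent rewrites one summarizer badly, `build_digest` and every other caller stay untouched, and each implementation can be evaluated on its own.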

2. Guide the Ghost (Prompting Architecture)

One of the most surprising findings from our work is that AI agents actually respond well to design principles, but only if you explicitly express them.

Agents default to the path of least resistance. If you ask for a feature, they will give you the quickest implementation (often “spaghetti code”). However, if you explicitly guide them by regularly reminding them of your modularity constraints, error-handling standards, and separation of concerns, they can be excellent architectural partners.

You cannot just prompt for code; you must prompt for architecture.

3. Simplicity is Long-Term Performance

There is a tension between simplicity and “performance” in the short term. You can vibe code a Proof of Concept (PoC) in a single day without caring about the overall health of the system.

While that implementation might appear efficient, it often creates a spaghetti codebase that halts future development. Real performance is not just runtime speed; it is velocity over time. Simplicity allows you, and your AI agents, to understand the state of the system six months from now.

4. The “Rewrite” Myth

A common pitfall in engineering is deprioritizing hygiene under the assumption that “one day we will replace this with a clean version.”

History teaches us this is a lie. Users and downstream tasks (including other AI agents) get used to the quirks of your existing system. Replacing it requires reproducing those quirks or suffering a painful migration. The messy system you build today is likely the system you will be maintaining in three years.

5. Debugging is Development

In the age of probabilistic software (LLMs), software engineering looks less like writing logic and more like shepherding behavior.

A massive portion of the workflow is the debug-and-develop cycle: identifying problematic examples, tracing them through the system, and improving the module responsible. Therefore, investing in observability is non-negotiable. You need the ability to easily trace an arbitrary example’s path through the system. If you cannot see where the agent went wrong, you cannot fix it.
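
The kind of tracing described above can start very small. The sketch below records each stage's input and output for one example and then reports the first stage whose output fails a check. The stage names, the `Trace` structure, and the validity check are all hypothetical stand-ins; a production system would more likely emit these records to a tracing backend.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Trace:
    """Per-example record of (stage name, input, output) for every stage."""

    example_id: str
    steps: List[Tuple[str, Any, Any]] = field(default_factory=list)


def run_traced(example_id: str, value: Any,
               stages: List[Tuple[str, Callable[[Any], Any]]]) -> Tuple[Any, Trace]:
    """Run the stages in order, recording each stage's input and output."""
    trace = Trace(example_id)
    for name, fn in stages:
        out = fn(value)
        trace.steps.append((name, value, out))
        value = out
    return value, trace


def first_bad_step(trace: Trace, is_valid: Callable[[Any], bool]) -> Optional[str]:
    """Name of the first stage whose output fails the check, or None."""
    for name, _inp, out in trace.steps:
        if not is_valid(out):
            return name
    return None


# Toy pipeline: the last stage silently drops information.
stages = [
    ("clean", str.strip),
    ("upper", str.upper),
    ("truncate", lambda s: s[:3]),
]
result, trace = run_traced("example-1", "  hello ", stages)
culprit = first_bad_step(trace, is_valid=lambda s: len(s) >= 4)
```

A per-stage check like this only localizes failures that are visible in a single stage's output, but it at least turns "the answer is wrong" into "this stage is where it first went wrong".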

Conclusion: The Environment is the Product

Before adding extensive new features, we recommend investing first in the environment:

Observability: Can you see what’s happening?
Protection: Do you have tests that catch regressions?
Decoupling: Are you building on external modules rather than in them?

These concerns sound remote when you are rushing to ship a demo, but the pain from careless design arrives faster than you expect, likely within weeks.

AI makes it easier than ever to build software quickly. But engineering discipline is still the only thing that makes it possible to build software that lasts. You can vibe code a prototype, but you can’t vibe code a platform.

Not yet.

Coming soon

Arya Tafvizi — Mon, 01 Dec 2025 23:36:50 GMT

This is Arya's Substack.
