Why 88% of AI Agents Fail in Production: The 5-Layer Harness Architecture Is the Answer
When GPT-4 first came out, I—along with most developers around me—shared the same misconception: "Isn't a good model all you need?" We'd slap a few prompt lines together, build a prototype that seemed to work, and ship it to production. Then, not long after, we'd find ourselves facing agents deleting the wrong files, losing context and spinning in infinite loops, or burning through tokens at an alarming rate.
A number commonly cited across harness engineering resources is that 88% of AI agent projects never reach production (Faros.ai, Milvus Blog). Not because the models are bad. Because the surrounding systems don't exist. We've moved from writing prompts well, to assembling context well, to now designing the entire environment in which agents operate. This third stage is Harness Engineering.
In this post, we'll look at each layer of the 5-layer harness architecture for production-grade AI agents—what each layer is, what breaks without it, and how to implement it—with code. By the end, you'll have code patterns you can apply to each layer immediately. This assumes intermediate Python or above.
Core Concepts
Why Do We Need a Harness?
Using a car analogy: the model is the engine. The harness is everything that makes that engine actually drive on the road—the transmission, brakes, dashboard, and seatbelts.
Harness: The operational layer surrounding a language model that handles context assembly, tool access, memory persistence, control loop execution, and quality gate enforcement. The infrastructure that transforms a model from a text generator into an autonomous execution agent.
Now that GPT-4-class models are pouring out of multiple companies, real product differentiation has shifted to how well you wrap the model. The field that systematically addresses all of this wrapping work is harness engineering.
Full Structure of the 5-Layer Architecture
A production-grade harness consists of the following five layers. Each layer should be independently designable and testable.
| Layer | Name | One-Line Role | Without It? |
|---|---|---|---|
| 1 | Tool Orchestration | How the agent reaches into the external world | Stops at text generation |
| 2 | Verification Loops | The checkpoint between "did it" and "did it correctly" | Declares completion but actually failed |
| 3 | Context & Memory | Managing memory and state across conversations and tasks | Loses the goal in long tasks |
| 4 | Guardrails | Policies that enforce lines that must never be crossed | The agent deletes your files |
| 5 | Observability | The eyes that let you understand what happened after the fact | Failures occur but you don't know why |
Layer 1 — Tool Orchestration
The backbone that connects the agent to the file system, shell commands, internal APIs, and external services. Without it, the agent remains a tool that only generates text.
MCP (Model Context Protocol), released by Anthropic in late 2024, has greatly accelerated standardization at this layer. Because it standardizes how tools and models communicate, a tool implemented once works across all compatible clients. In short, it provides a common interface for plugging tools in and out like plugins.
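To make the "implement once, plug in anywhere" idea concrete, here is a minimal sketch of the harness side of tool orchestration: a registry that maps tool names to plain Python handlers and routes a tool call to the right one. The names `ToolRegistry`, `register`, `dispatch`, and `word_count` are hypothetical and belong to neither MCP nor any SDK. The call shape (a tool name plus an input dict) is an assumption chosen for the sketch.

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """Maps tool names to handlers so the agent loop never calls tools directly."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that registers a function under a tool name."""
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._handlers[name] = fn
            return fn
        return decorator

    def dispatch(self, name: str, tool_input: Dict[str, Any]) -> Dict[str, Any]:
        """Run one tool call and always return a structured result."""
        handler = self._handlers.get(name)
        if handler is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "output": handler(**tool_input)}
        except Exception as exc:  # report the failure instead of crashing the loop
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


registry = ToolRegistry()


@registry.register("word_count")
def word_count(text: str) -> int:
    # Stand-in tool for the sketch; a real harness would register read_file, run_tests, etc.
    return len(text.split())
```

Returning a structured error for unknown tools and bad arguments, rather than raising, lets the harness hand the failure back to the model as a tool result so the model can correct itself.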
# Tool definitions passed directly to the Anthropic Messages API.
# (An MCP server advertises its tools in a similar name / description / JSON Schema form.)
from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Reads and returns the contents of a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute path of the file to read"
                }
            },
            "required": ["path"]
        }
    },
    {
        "name": "run_tests",
        "description": "Runs the specified test suite and returns the results",
        "input_schema": {
            "type": "object",
            "properties": {
                "test_path": {
                    "type": "string",
                    "description": "Path to the test file or directory"
                }
            },
            "required": ["test_path"]
        }
    }
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": "Read src/auth.py and run the tests"}]
)

Layer 2 — Verification Loops
Honestly, this is simultaneously the most important and the most frequently omitted layer. Just because an agent declares "task complete" doesn't mean it actually is.
The tendency for agents to declare completion without actually verifying the result is called "Victory Declaration Bias" in practice. This was the thing that burned me the most when I first deployed agents. You get situations where an agent reports a failed task as "successful" and proceeds to the next step. The PEV (Plan-Execute-Verify) pattern is the core structure that resolves this.
# PEV pattern implementation example
# agent and verifier are assumed to be objects provided by your own harness
async def pev_loop(task: str, max_retries: int = 3) -> dict:
    plan = await agent.plan(task)  # 1. Create a plan
    result = None  # stays None if max_retries is 0
    for attempt in range(max_retries):
        result = await agent.execute(plan)  # 2. Execute
        # Delegate verification to an independent function, not the agent itself
        verification = await verifier.check(
            task=task,
            plan=plan,
            result=result
        )  # 3. Verify
        if verification.passed:
            return {"status": "success", "result": result}
        # On verification failure, incorporate feedback, revise the plan, and retry
        plan = await agent.replan(
            original_task=task,
            failed_result=result,
            feedback=verification.feedback
        )
    return {"status": "failed", "last_result": result}

Phase Gate: An architectural pattern that forces progression through each stage only after the previous stage's verification passes. Prevents agents from skipping intermediate results and generating only final output.
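A phase gate can be sketched without any model at all. The example below is a minimal illustration, assuming each phase is a pair of plain functions: one that produces output and one deterministic check that decides whether the next phase may start. `Phase` and `run_with_gates` are hypothetical names, and the three toy phases only simulate plan, implement, and test.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Phase:
    name: str
    run: Callable[[Any], Any]      # produces this phase's output from the previous output
    verify: Callable[[Any], bool]  # deterministic check, independent of whoever ran the phase


def run_with_gates(phases: List[Phase], initial: Any) -> Tuple[bool, str, Any]:
    """Run phases in order and stop at the first phase whose verification fails.

    Returns (all_passed, name_of_last_phase_attempted, last_output).
    """
    output = initial
    last = ""
    for phase in phases:
        last = phase.name
        output = phase.run(output)
        if not phase.verify(output):
            return False, last, output  # gate closed: later phases never run
    return True, last, output


# Toy phases standing in for plan / implement / test.
phases = [
    Phase("plan",
          lambda task: {"task": task, "steps": ["write", "test"]},
          lambda plan: len(plan["steps"]) > 0),
    Phase("implement",
          lambda plan: {**plan, "code": ""},  # simulates an agent returning nothing
          lambda state: bool(state["code"].strip())),
    Phase("test",
          lambda state: {**state, "tests_passed": True},
          lambda state: state["tests_passed"]),
]

ok, stopped_at, state = run_with_gates(phases, "add login endpoint")
```

Because the "implement" gate rejects the empty result, the "test" phase never gets a chance to report success on code that does not exist.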
Layer 3 — Context & Memory
An LLM's context window is finite. In long tasks or multi-session agents, previous task content, decision history, and current goal state must be persistently managed somewhere.
A common way to manage this is to split memory into short-term, medium-term, and long-term tiers. When I first applied this split, the trickiest part was deciding how to define medium-term memory. Here's the guideline I use in practice:
- Short-term memory: Conversation history within the current session. Maintained directly within the context window.
- Medium-term memory: Intermediate results of the current task, list of executed tools, partial artifacts. Discarded at task end or promoted to long-term.
- Long-term memory: User preferences, project conventions, recurring domain knowledge. Persists across sessions.
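The three tiers above can be sketched as a small in-memory structure. This is only an illustration under simple assumptions: short-term memory is trimmed to a fixed number of turns, and a caller-supplied predicate decides which medium-term notes get promoted at task end. `TieredMemory`, `note`, and `end_task` are hypothetical names, and a real long-term tier would live in a database or a memory library rather than a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TieredMemory:
    short_term: List[Dict[str, str]] = field(default_factory=list)  # chat turns in this session
    medium_term: List[str] = field(default_factory=list)            # notes for the current task
    long_term: List[str] = field(default_factory=list)              # survives across sessions
    max_turns: int = 6                                               # crude context-window budget

    def add_turn(self, role: str, content: str) -> None:
        self.short_term.append({"role": role, "content": content})
        # Keep only the most recent turns so the prompt stays bounded.
        self.short_term = self.short_term[-self.max_turns:]

    def note(self, item: str) -> None:
        self.medium_term.append(item)

    def end_task(self, should_promote: Callable[[str], bool]) -> None:
        """At task end, promote selected notes to long-term and discard the rest."""
        for item in self.medium_term:
            if should_promote(item) and item not in self.long_term:
                self.long_term.append(item)
        self.medium_term.clear()


mem = TieredMemory(max_turns=2)
mem.add_turn("user", "Refactor auth.py")
mem.add_turn("assistant", "Reading the file")
mem.add_turn("user", "Use pytest")
mem.note("convention: tests live in tests/")
mem.note("ran read_file on src/auth.py")
mem.end_task(lambda item: item.startswith("convention:"))
```

After `end_task`, only the project convention survives in long-term memory, while the record of which tool ran is dropped along with the rest of the task state.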
# Hierarchical memory management example
# mem0: Python library that provides long-term memory for agents (pip install mem0ai)
from mem0 import Memory

memory = Memory()

class StatefulAgent:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.session_history = []  # Short-term: maintained only within the session

    async def process(self, message: str) -> str:
        # Retrieve context relevant to the current message from long-term memory
        long_term = memory.search(
            query=message,
            user_id=self.user_id,
            limit=5
        )
        # Synthesize short-term + long-term memory to build the system prompt
        context = self._build_context(
            session=self.session_history,
            long_term=long_term
        )
        # llm is an assumed async model wrapper; it is not part of mem0
        response = await llm.complete(system=context, user=message)
        # Update short-term memory
        self.session_history.append({"role": "user", "content": message})
        self.session_history.append({"role": "assistant", "content": response})
        # Save important information (project conventions, user preferences, etc.) to long-term memory
        memory.add(
            messages=[
                {"role": "user", "content": message},
                {"role": "assistant", "content": response}
            ],
            user_id=self.user_id
        )
        return response

    def _build_context(self, session: list, long_term) -> str:
        # Depending on the mem0 version, search() returns a list or a dict with a "results" key
        hits = long_term.get("results", []) if isinstance(long_term, dict) else long_term
        # Insert long-term memory at the beginning of the system prompt
        memory_context = "\n".join(f"- {m['memory']}" for m in hits)
        # Append the session turns so short-term memory also reaches the model
        history = "\n".join(f"{t['role']}: {t['content']}" for t in session)
        return f"Past memories:\n{memory_context}\n\nConversation so far:\n{history}\n"

Layer 4 — Guardrails
Cases where AI-generated code introduces security vulnerabilities continue to be reported (arXiv 2511.07669). Actions like an agent deleting files, transmitting data externally, or changing configurations are hard to reverse. Guardrails are the layer that places mandatory checkpoints on these high-risk actions.
# High-risk action guardrail implementation
from enum import Enum
from typing import Callable

class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

# It's best to define risk level and policy together at the time of tool registration
ACTION_POLICIES = {
    "read_file": RiskLevel.LOW,
    "write_file": RiskLevel.MEDIUM,
    "delete_file": RiskLevel.CRITICAL,
    "send_http_request": RiskLevel.HIGH,
    "run_shell_command": RiskLevel.HIGH,
}

async def guarded_execute(
    action: str,
    params: dict,
    human_approval_fn: Callable
) -> dict:
    # Unregistered actions are treated as HIGH risk by default
    risk = ACTION_POLICIES.get(action, RiskLevel.HIGH)
    if risk == RiskLevel.CRITICAL:
        # CRITICAL actions require explicit human approval
        approved = await human_approval_fn(
            action=action,
            params=params,
            reason=f"Risk level CRITICAL: approval required to execute {action}"
        )
        if not approved:
            return {"status": "rejected", "reason": "Not approved by user"}
    elif risk == RiskLevel.HIGH:
        # HIGH can be auto-approved after a policy check
        if not policy_check(action, params):  # policy check helper
            return {"status": "blocked", "reason": "Policy violation"}
    return await execute_action(action, params)  # actual tool execution

In the healthcare domain, this goes a step further. When the model outputs a medication care plan, the guardrail layer immediately cross-validates drug interactions against a deterministic medical dictionary tool. Rather than trusting the model's judgment, it cross-references against a verifiable external source.
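The cross-validation idea can be sketched as a plain lookup that runs on the model's output before anything reaches the user. Everything in this example is a placeholder: the drug names, the interaction table, and the reasons are invented for illustration and carry no medical meaning. `KNOWN_INTERACTIONS`, `check_plan`, and `guard_output` are hypothetical names, and a real system would query a maintained clinical database.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

# Hypothetical lookup table with placeholder names and reasons.
# A real system would query a maintained clinical database, not a hard-coded dict.
KNOWN_INTERACTIONS: Dict[FrozenSet[str], str] = {
    frozenset({"drug_a", "drug_b"}): "placeholder reason 1",
    frozenset({"drug_c", "drug_d"}): "placeholder reason 2",
}


def check_plan(medications: List[str]) -> List[str]:
    """Return one warning per known interacting pair in the model's plan."""
    normalized = sorted({m.strip().lower() for m in medications})
    warnings = []
    for first, second in combinations(normalized, 2):
        reason = KNOWN_INTERACTIONS.get(frozenset({first, second}))
        if reason:
            warnings.append(f"{first} + {second}: {reason}")
    return warnings


def guard_output(medications: List[str]) -> Dict[str, object]:
    """Block the model's plan if the deterministic check finds any conflict."""
    warnings = check_plan(medications)
    if warnings:
        return {"status": "blocked", "warnings": warnings}
    return {"status": "allowed", "warnings": []}
```

The point of the design is that the check is deterministic: the same plan always produces the same verdict, regardless of how confident the model sounded.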
Layer 5 — Observability
When an agent fails in production, logs alone aren't enough to understand "why it failed." You need to be able to trace which tool was called at which layer, how many tokens were used, and where the loop broke.
I once suffered greatly trying to add tracing after the fact. Once the architecture gets tangled, just figuring out where to attach what becomes its own job. It's much better to include it in your first agent.
# Agent tracing with LangSmith
# LangSmith: observability tool for tracking agent execution flow, token usage, and failure points
# pip install langsmith
from langsmith import traceable

@traceable(name="agent-main-loop", tags=["production"])
async def agent_loop(task: str) -> str:
    """@traceable automatically traces the entire function execution."""
    plan = await create_plan(task)
    results = []
    for step in plan.steps:
        result = await execute_step(step)
        results.append(result)
    return synthesize_results(results)

# Applying it to individual steps records each step as a nested run with its inputs, outputs, and duration
@traceable(name="execute-step")
async def execute_step(step) -> dict:
    result = await run_tool(step.tool, step.params)  # actual tool call
    return {"step_type": step.type, "result": result}

In an enterprise data pipeline case, it was reported that 65% of AI failures due to context drift, schema mismatches, and state degradation were blocked at the harness layer (Milvus Blog). For that blocking to be possible, observability must come first.
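The questions this layer has to answer (which tool was called, how many tokens were used, where the loop broke) can be captured even without a tracing service. The sketch below uses only the standard library and is meant to show what a span needs to record, not to replace a real tracer. `TRACE`, `span`, and `run_step` are hypothetical names, and the token count is a hard-coded placeholder.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

TRACE: List[Dict[str, Any]] = []  # in-memory sink; a real setup would export these spans


@contextmanager
def span(name: str, **attrs: Any) -> Iterator[Dict[str, Any]]:
    """Record duration, status, and custom attributes for one unit of work."""
    record: Dict[str, Any] = {"name": name, "status": "ok", **attrs}
    start = time.perf_counter()
    try:
        yield record  # callers can attach data such as token counts
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise  # tracing must not swallow the failure
    finally:
        record["duration_s"] = time.perf_counter() - start
        TRACE.append(record)


def run_step(tool: str, fail: bool = False) -> str:
    with span("tool_call", tool=tool) as rec:
        if fail:
            raise RuntimeError(f"{tool} timed out")
        rec["tokens"] = 120  # placeholder; real counts come from the API response
        return f"{tool} done"


run_step("read_file")
try:
    run_step("run_tests", fail=True)
except RuntimeError:
    pass

# Answering "where did the loop break?" becomes a simple filter over the trace.
failed = [r for r in TRACE if r["status"] == "error"]
```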
Practical Application
Example 1: Anthropic 3-Agent Coding Harness
This is the 3-agent pattern reportedly used internally at Anthropic. A short brief is transformed into a full product spec, code is implemented, and verification follows automatically.
from anthropic import AsyncAnthropic

# Async client, so the awaited calls below do not block the event loop
client = AsyncAnthropic()

async def planner_agent(brief: str) -> dict:
    """Short brief → complete product spec (returns JSON)"""
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        system=(
            "You are a product architect.\n"
            "Expand the simple requirements provided into a technical spec.\n"
            "Always return in JSON format:\n"
            '{"features": [...], "api_contracts": {...}, "test_criteria": [...]}'
        ),
        messages=[{"role": "user", "content": brief}]
    )
    return parse_spec(response.content[0].text)  # JSON parsing helper

async def generator_agent(spec: dict, sprint_idx: int) -> dict:
    """Spec → sprint-unit code implementation"""
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        system="You are a senior engineer. Implement code according to the specification.",
        messages=[{
            "role": "user",
            "content": f"Implement Sprint {sprint_idx}:\n{spec['features'][sprint_idx]}"
        }]
    )
    return {"code": response.content[0].text, "sprint": sprint_idx}

async def evaluator_agent(code: dict, test_criteria: list) -> dict:
    """Run tests, analyze results, and determine pass/fail"""
    # run_tests: helper that calls actual test tools like pytest / Playwright
    test_results = await run_tests(code["code"], test_criteria)
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        system="Analyze the test results and provide a pass/fail judgment with specific feedback.",
        messages=[{"role": "user", "content": f"Test results:\n{test_results}"}]
    )
    return {
        "passed": "PASS" in response.content[0].text,
        "feedback": response.content[0].text
    }

async def three_agent_harness(brief: str) -> str:
    spec = await planner_agent(brief)  # Planning
    final_code = []
    for sprint_idx in range(len(spec["features"])):
        retries = 0
        while retries < 3:
            code = await generator_agent(spec, sprint_idx)  # Implementation
            evaluation = await evaluator_agent(  # Layer 2: verification loop
                code, spec["test_criteria"]
            )
            if evaluation["passed"]:
                final_code.append(code)
                break
            # Incorporate feedback into the spec for use in the next attempt
            spec["features"][sprint_idx] += f"\nFeedback: {evaluation['feedback']}"
            retries += 1
        else:
            # All attempts failed: stop instead of silently shipping without this sprint
            raise RuntimeError(f"Sprint {sprint_idx} failed verification after 3 attempts")
    return assemble_final_code(final_code)  # Assemble per-sprint code into final deliverable

| Component | Primary Harness Layer | Role |
|---|---|---|
| planner_agent | 1 (Tool Orchestration) | Converts requirements → actionable spec |
| generator_agent | 1, 3 (Tool + Memory) | Sprint implementation, maintains previous feedback context |
| evaluator_agent | 2 (Verification Loop) | Independent verification; does not self-verify its own output |
| while retries < 3 | 2 (Verification Loop) | Automatic retry with feedback incorporated on verification failure |
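One fragile spot in the evaluator is the substring check `"PASS" in ...`, which would also match a reply such as "The tests did not PASS". A more defensive variant, sketched below, asks the evaluator for a small JSON verdict and treats anything unparseable as a failure. The verdict format and the name `parse_verdict` are assumptions made for this sketch and are not part of the pattern described above.

```python
import json
from typing import Any, Dict


def parse_verdict(text: str) -> Dict[str, Any]:
    """Parse an evaluator reply of the form {"verdict": "pass"|"fail", "feedback": "..."}.

    Anything that is not valid JSON with an explicit verdict counts as a failure,
    so an ambiguous reply can never be mistaken for success.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"passed": False, "feedback": f"unparseable evaluator reply: {text[:80]}"}
    if not isinstance(data, dict) or data.get("verdict") not in ("pass", "fail"):
        return {"passed": False, "feedback": "evaluator reply missing a valid 'verdict'"}
    return {"passed": data["verdict"] == "pass", "feedback": str(data.get("feedback", ""))}
```

Failing closed here matches the purpose of the verification layer: when the evaluator's answer is unclear, the harness retries rather than declaring victory.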
Example 2: Enterprise Coding Agent — Full 5-Layer Integration
A TypeScript setup that applies all five layers to an AI agent that creates and reviews PRs, a scenario frequently encountered in practice.
// Full 5-layer harness setup in TypeScript
import Anthropic from "@anthropic-ai/sdk";
import { traceable } from "langsmith/traceable"; // Layer 5: Observability
import { Memory } from "mem0ai"; // Layer 3: Context & Memory

const client = new Anthropic();
// The mem0 import path and the search/add signatures vary by package version;
// check them against the version you install.
const memory = new Memory();

// Layer 4: Guardrails, a list of high-risk operations to block
const BLOCKED_OPERATIONS = ["force_push", "delete_branch_main", "drop_table"];

// Wrapping with traceable() automatically records all executions to LangSmith
const codingAgentHarness = traceable(
  async (task: string, repoContext: string) => {
    // Layer 3: Load project long-term memory
    const projectMemory = await memory.search({ query: task, limit: 10 });

    // Layer 4: Pre-task risk check
    const isBlocked = BLOCKED_OPERATIONS.some(op => task.includes(op));
    if (isBlocked) {
      return { status: "blocked", reason: "Policy-violating operation detected" };
    }

    // Layer 1: Tool definitions (Tool Orchestration)
    const tools: Anthropic.Tool[] = [
      {
        name: "read_file",
        description: "Reads and returns the contents of a file",
        input_schema: {
          type: "object" as const,
          properties: {
            path: { type: "string", description: "Path of the file to read" }
          },
          required: ["path"]
        }
      },
      {
        name: "write_file",
        description: "Writes content to a file",
        input_schema: {
          type: "object" as const,
          properties: {
            path: { type: "string", description: "Path of the file to write" },
            content: { type: "string", description: "Content to write to the file" }
          },
          required: ["path", "content"]
        }
      },
      {
        name: "run_tests",
        description: "Runs the test suite and returns the results",
        input_schema: {
          type: "object" as const,
          properties: {
            test_path: { type: "string", description: "Path to the test directory" }
          },
          required: ["test_path"]
        }
      }
    ];

    let passed = false;
    let attempts = 0;
    let feedback = ""; // test output from the previous failed attempt

    // Layer 2: PEV verification loop
    while (!passed && attempts < 3) {
      // A complete loop would also execute any tool_use blocks in the response
      // and send back tool_result messages; that part is omitted here.
      await client.messages.create({
        model: "claude-opus-4-7",
        max_tokens: 8192,
        tools,
        system: buildSystemPrompt(projectMemory, repoContext), // inject memory
        messages: [{
          role: "user",
          content: feedback ? `${task}\n\nPrevious attempt failed:\n${feedback}` : task
        }]
      });
      // Verify with an independent function, not the agent itself
      // runTestSuite is an assumed helper returning { passed, output }
      const testResult = await runTestSuite();
      passed = testResult.passed;
      feedback = testResult.output ?? "";
      attempts++;
    }

    // Layer 3: Save task result to long-term memory
    await memory.add({ task, result: passed ? "completed" : "failed", attempts });
    return { status: passed ? "success" : "failed", attempts };
  },
  { name: "coding-agent", tags: ["production"] }
);

Pros and Cons Analysis
The most painfully memorable lesson was token explosion. I enthusiastically added verification loops early on, then was shocked to see a simple code change consuming four times the tokens. It's realistic to lighten the verification loop for low-risk tasks.
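One way to lighten the verification loop for low-risk tasks is to pick the verification depth from the task's risk tier instead of using one loop for everything. The sketch below is only an illustration: the tier names, the retry counts, and the helper `max_model_calls` are hypothetical, and the numbers should be tuned to your own cost and risk tolerance.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VerificationPolicy:
    max_retries: int      # how many replan/retry rounds are allowed
    run_full_suite: bool  # full test suite vs. a quick smoke check
    use_llm_review: bool  # extra model call to review each attempt


# Hypothetical tiers for illustration.
POLICIES: Dict[str, VerificationPolicy] = {
    "low": VerificationPolicy(max_retries=0, run_full_suite=False, use_llm_review=False),
    "medium": VerificationPolicy(max_retries=1, run_full_suite=True, use_llm_review=False),
    "high": VerificationPolicy(max_retries=3, run_full_suite=True, use_llm_review=True),
}


def policy_for(risk: str) -> VerificationPolicy:
    """Unknown risk labels get the strictest policy rather than the cheapest."""
    return POLICIES.get(risk, POLICIES["high"])


def max_model_calls(policy: VerificationPolicy) -> int:
    """Upper bound on model calls: one execution per attempt, plus an optional review each."""
    attempts = 1 + policy.max_retries
    return attempts * (2 if policy.use_llm_review else 1)
```

With these example tiers, a low-risk task costs at most one model call, while a high-risk task may cost up to eight, which makes the trade-off between safety and token spend explicit.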
Advantages
| Item | Details |
|---|---|
| Reliability | In a reality where 88% of AI agent projects fall short of production, the 5-layer harness is the most proven means of closing that gap |
| Model independence | The harness provides a consistent interface, minimizing changes to agent logic when switching from GPT-4 to Claude |
| Policy centralization | Security, compliance, and cost control concentrated in a single harness layer for ease of auditing and management |
| Per-layer debugging | Failure root causes can be traced to a specific layer, reducing debugging cost |
| Potential for autonomous evolution | Evolving toward using observability data as feedback to automatically improve the harness itself (Agentic Harness Engineering) (arXiv 2604.25850) |
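The model independence row can be illustrated with a thin adapter interface. In this sketch, agent logic depends only on a `ModelClient` protocol, and each vendor gets its own adapter. `ModelClient`, `EchoModel`, `UpperModel`, and `run_agent` are hypothetical names, and the two adapters are stand-ins that do not call any real API.

```python
from typing import Protocol


class ModelClient(Protocol):
    def complete(self, system: str, user: str) -> str: ...


class EchoModel:
    """Stand-in for one vendor's adapter; a real one would wrap that vendor's SDK."""

    def complete(self, system: str, user: str) -> str:
        return f"echo: {user}"


class UpperModel:
    """Stand-in for a second vendor's adapter with different behavior."""

    def complete(self, system: str, user: str) -> str:
        return user.upper()


def run_agent(model: ModelClient, task: str) -> str:
    # Agent logic depends only on the ModelClient interface, not on a vendor SDK.
    return model.complete(system="You are a coding agent.", user=task)
```

Swapping models then means writing one new adapter, while `run_agent` and the rest of the harness stay unchanged.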
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Performance overhead | Verification and retry policies can increase runtime and token consumption by up to 2x | Lighten or skip the verification loop for low-risk tasks |
| Token explosion | Token usage can increase by an order of magnitude or more when switching to autonomous loops | Context compression, summarization strategies, cost budget limits |
| Complexity | Correctly designing and testing all five layers requires significant engineering investment | Incremental adoption. The order 5→4→2 is effective in practice |
| Data quality limits | 27% of agent failures are data quality issues — the harness alone cannot solve them (Milvus Blog) | Maintain data pipeline quality in parallel |
| Victory Declaration Bias | Agents tend to declare task completion without verification | Separate verification into an independent agent or deterministic function |
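The "cost budget limits" mitigation in the table can be as simple as a counter that the loop must charge before every step. The following is a minimal sketch with simulated costs. `TokenBudget`, `BudgetExceeded`, and `run_loop` are hypothetical names, and in a real loop the token numbers would come from the usage data in each API response.

```python
class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    """Tracks cumulative token usage for one task and stops the loop at a hard limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)


def run_loop(budget: TokenBudget, cost_per_step: int, max_steps: int) -> int:
    """Simulated agent loop; returns how many steps completed before the budget ran out."""
    completed = 0
    for _ in range(max_steps):
        try:
            # Simulated cost; a real loop would pass the counts reported by the API.
            budget.charge(input_tokens=cost_per_step, output_tokens=0)
        except BudgetExceeded:
            break
        completed += 1
    return completed
```

A hard stop like this turns a runaway loop into a bounded, visible failure instead of a surprise on the invoice.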
Context Drift: The phenomenon where the initial goal or state gradually degrades over the course of a long agent execution. Especially severe when the memory layer is absent.
Most Common Mistakes in Practice
- Delegating verification to the agent itself: When an agent verifies its own output, bias is introduced. I didn't fully grasp this principle until I actually witnessed an agent passing its own bugs. It's best to separate verification into a dedicated agent or a deterministic function (test execution, schema validation, etc.).
- Registering high-risk tools without guardrails: The "attach everything first, restrict later" approach is dangerous in production. Defining the risk level and policy together at the time of tool registration is far safer in the long run.
- Adding observability last: Tracing and cost tracking are very difficult to retrofit. Once the architecture gets tangled, just figuring out where to put @traceable becomes its own job. Including Layer 5 from the very first agent is ultimately faster.
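For the second mistake, one way to make "define the risk level at registration time" hard to skip is to require it in the registration call itself. The sketch below uses a decorator with a keyword-only `risk` argument that has no default. `Risk`, `tool`, `call_tool`, and the two toy tools are hypothetical names for illustration, and the approval flag stands in for a real human-approval step.

```python
from enum import Enum
from typing import Any, Callable, Dict, Tuple


class Risk(Enum):
    LOW = 1
    HIGH = 2
    CRITICAL = 3


TOOLS: Dict[str, Tuple[Risk, Callable[..., Any]]] = {}


def tool(name: str, *, risk: Risk) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a tool; risk is keyword-only with no default, so it cannot be forgotten."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        if name in TOOLS:
            raise ValueError(f"tool already registered: {name}")
        TOOLS[name] = (risk, fn)
        return fn
    return decorator


def call_tool(name: str, approved: bool = False, **kwargs: Any) -> Any:
    """Execute a registered tool, refusing CRITICAL tools without explicit approval."""
    risk, fn = TOOLS[name]
    if risk is Risk.CRITICAL and not approved:
        raise PermissionError(f"{name} requires human approval")
    return fn(**kwargs)


@tool("read_note", risk=Risk.LOW)
def read_note(note_id: int) -> str:
    return f"note {note_id}"


@tool("delete_note", risk=Risk.CRITICAL)
def delete_note(note_id: int) -> str:
    return f"deleted {note_id}"
```

Calling `tool("x")` without a risk level raises a `TypeError` at import time, so an unclassified tool never reaches production.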
Closing Thoughts
Harness engineering is the infrastructure that elevates AI agents from "demos that seem like they might work" to "products that actually work." After reading this post, if the question "how many layers does this agent have right now?" naturally comes to mind when you look at an agent again, that's enough. You don't need to perfectly implement all five layers at once. Building them up one by one, starting with what you can do right now, is more than enough to make a significant difference.
Three steps you can start right now:
- Turn on observability first: Attaching LangSmith or Langfuse to an existing agent takes just a few lines of code. Running pip install langsmith and adding @traceable to the agent entry point (with a LangSmith API key configured) is enough to visualize token usage and failure points. You'll start seeing intermediate results that were silently failing.
- Add a guardrail to your most dangerous tool: Pick one tool that's hard to reverse, such as file deletion or external API calls, assign it a risk level, and add an approval gate. You'll immediately feel the accident-prevention effect.
- Apply a PEV loop in one place: Adding code that verifies the result with an independent function (test execution, schema validation, etc.) after agent execution will immediately show you just how different completion declarations are from actual completion.
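For the third step, the independent function can start as a plain schema check on whatever the agent reports. The sketch below assumes the agent returns a dict describing a code change. The field names in `REQUIRED_FIELDS`, the cross-field rule, and the function name `verify_output` are hypothetical and only illustrate the shape of a deterministic verifier.

```python
from typing import Any, Dict, List

# Hypothetical output contract for an agent that reports a code change.
REQUIRED_FIELDS: Dict[str, type] = {
    "files_changed": list,
    "tests_passed": bool,
    "summary": str,
}


def verify_output(output: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the output passed verification."""
    problems = []
    for field_name, expected in REQUIRED_FIELDS.items():
        if field_name not in output:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(output[field_name], expected):
            problems.append(f"{field_name} should be {expected.__name__}")
    # Example cross-field rule: a claim of success must be backed by evidence.
    if output.get("tests_passed") is True and not output.get("files_changed"):
        problems.append("tests_passed is true but no files were changed")
    return problems
```

Feeding the returned problem list back to the agent as feedback is what turns this check into the Verify step of a PEV loop.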
References
- What Is Harness Engineering for AI Agents? | Milvus Blog
- Harness Engineering: Making AI Coding Agents Work in 2026 | Faros.ai
- What Is Harness Engineering AI? The Definitive 2026 Guide | Atlan
- AI Harness Engineering: The Layer That Makes Your LLM Applications Actually Work | Pinggy
- Harness Engineering for AI Coding Agents | Augment Code
- Agentic Harness Engineering: Observability-Driven Automatic Evolution | arXiv 2604.25850
- Making LLMs Reliable: A Five-Layer Architecture for High-Stakes Decisions | arXiv 2511.07669
- Agent Harness Engineering: The Rise of the AI Control Plane | Medium
- GitHub: awesome-harness-engineering | ai-boost