DEV BAK - TECH BLOG
AI

Why 88% of AI Agents Fail in Production: The 5-Layer Harness Architecture Is the Answer

When GPT-4 first came out, I—along with most developers around me—shared the same misconception: "Isn't a good model all you need?" We'd slap a few prompt lines together, build a prototype that seemed to work, and ship it to production. Then, not long after, we'd find ourselves facing agents deleting the wrong files, losing context and spinning in infinite loops, or burning through tokens at an alarming rate.

A figure cited across harness engineering resources is that 88% of AI agent projects never reach production (Faros.ai, Milvus Blog). The usual cause is not a bad model but a missing surrounding system. The craft has moved through three stages: writing prompts well (prompt engineering), assembling context well (context engineering), and now designing the entire environment in which agents operate. This third stage is Harness Engineering.

In this post, we'll look at each layer of the 5-layer harness architecture for production-grade AI agents—what each layer is, what breaks without it, and how to implement it—with code. By the end, you'll have code patterns you can apply to each layer immediately. This assumes intermediate Python or above.


Core Concepts

Why Do We Need a Harness?

Using a car analogy: the model is the engine. The harness is everything that makes that engine actually drive on the road—the transmission, brakes, dashboard, and seatbelts.

Harness: The operational layer surrounding a language model that handles context assembly, tool access, memory persistence, control loop execution, and quality gate enforcement. The infrastructure that transforms a model from a text generator into an autonomous execution agent.

Now that GPT-4-class models are pouring out of multiple companies, real product differentiation has shifted to how well you wrap the model. The field that systematically addresses all of this wrapping work is harness engineering.

Full Structure of the 5-Layer Architecture

A production-grade harness consists of the following five layers. Each layer should be independently designable and testable.

| Layer | Name | One-Line Role | Without It? |
| --- | --- | --- | --- |
| 1 | Tool Orchestration | How the agent reaches into the external world | Stops at text generation |
| 2 | Verification Loops | The checkpoint between "did it" and "did it correctly" | Declares completion but actually failed |
| 3 | Context & Memory | Managing memory and state across conversations and tasks | Loses the goal in long tasks |
| 4 | Guardrails | Policies that enforce lines that must never be crossed | The agent deletes your files |
| 5 | Observability | The eyes that let you understand what happened after the fact | Failures occur but you don't know why |

Layer 1 — Tool Orchestration

The backbone that connects the agent to the file system, shell commands, internal APIs, and external services. Without it, the agent remains a tool that only generates text.

MCP (Model Context Protocol), released by Anthropic in late 2024, has greatly accelerated standardization at this layer. Because it standardizes how tools and models communicate, a tool implemented once works with any compatible client. In short, it gives tools a common interface so they can be plugged in and removed like plugins.

python
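# Tool definitions passed directly to the Anthropic Messages API (Python SDK).
# This is not MCP itself; MCP servers describe tool inputs with the same kind of JSON Schema.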
from anthropic import Anthropic
 
client = Anthropic()
 
tools = [
    {
        "name": "read_file",
        "description": "Reads and returns the contents of a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute path of the file to read"
                }
            },
            "required": ["path"]
        }
    },
    {
        "name": "run_tests",
        "description": "Runs the specified test suite and returns the results",
        "input_schema": {
            "type": "object",
            "properties": {
                "test_path": {
                    "type": "string",
                    "description": "Path to the test file or directory"
                }
            },
            "required": ["test_path"]
        }
    }
]
 
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": "Read src/auth.py and run the tests"}]
)
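
The request above only declares the tools. When the model answers with a `tool_use` block, the harness still has to run the matching function and send back a `tool_result`. Here is a minimal sketch of that dispatch step. The handlers (`read_file`, `run_tests`) and the `TOOL_HANDLERS` registry are my own illustrative names, the blocks are plain dicts shaped like the API's content blocks, and no network call is made.

```python
from pathlib import Path
from typing import Any, Callable

# Hypothetical local handlers; the names match the tool definitions above.
def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

def run_tests(test_path: str) -> str:
    # Placeholder: a real harness would invoke a test runner such as pytest here.
    return f"(not executed) would run tests under {test_path}"

TOOL_HANDLERS: dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "run_tests": run_tests,
}

def dispatch_tool_use(block: dict[str, Any]) -> dict[str, Any]:
    """Turn one tool_use block into a tool_result block."""
    handler = TOOL_HANDLERS.get(block["name"])
    if handler is None:
        content, is_error = f"Unknown tool: {block['name']}", True
    else:
        try:
            content, is_error = handler(**block["input"]), False
        except Exception as exc:  # report the failure back to the model
            content, is_error = f"{type(exc).__name__}: {exc}", True
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": content,
        "is_error": is_error,
    }
```

Returning errors as `tool_result` content instead of raising lets the model see what went wrong and choose a different action on the next turn.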

Layer 2 — Verification Loops

Honestly, this is simultaneously the most important and the most frequently omitted layer. Just because an agent declares "task complete" doesn't mean it actually is.

The tendency for agents to declare completion without actually verifying the result is called "Victory Declaration Bias" in practice. This was the thing that burned me the most when I first deployed agents. You get situations where an agent reports a failed task as "successful" and proceeds to the next step. The PEV (Plan-Execute-Verify) pattern is the core structure that resolves this.

python
# PEV pattern implementation example
async def pev_loop(task: str, max_retries: int = 3) -> dict:
    plan = await agent.plan(task)        # 1. Create a plan
    result = None                        # stays None if max_retries is 0
 
    for attempt in range(max_retries):
        result = await agent.execute(plan)   # 2. Execute
 
        # Delegate verification to an independent function, not the agent itself
        verification = await verifier.check(
            task=task,
            plan=plan,
            result=result
        )                                     # 3. Verify
 
        if verification.passed:
            return {"status": "success", "result": result}
 
        # On verification failure, incorporate feedback, revise the plan, and retry
        plan = await agent.replan(
            original_task=task,
            failed_result=result,
            feedback=verification.feedback
        )
 
    return {"status": "failed", "last_result": result}

Phase Gate: An architectural pattern that forces progression through each stage only after the previous stage's verification passes. Prevents agents from skipping intermediate results and generating only final output.
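
A phase gate can be sketched in a few lines. This is one possible shape, not a standard implementation: `Phase` and `run_with_phase_gates` are names I made up, and the lambdas stand in for real agent steps and deterministic checks.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Phase:
    name: str
    run: Callable[[Any], Any]       # produces this phase's output from the previous one
    verify: Callable[[Any], bool]   # deterministic check on that output

def run_with_phase_gates(phases: list[Phase], initial: Any) -> dict:
    """Run phases in order and stop at the first phase whose gate fails."""
    value = initial
    completed: list[str] = []
    for phase in phases:
        value = phase.run(value)
        if not phase.verify(value):
            return {"status": "failed", "failed_phase": phase.name, "completed": completed}
        completed.append(phase.name)
    return {"status": "success", "result": value, "completed": completed}

# Toy usage: the "execute" phase reports failing tests, so "report" never runs.
phases = [
    Phase("plan", lambda task: {"task": task, "steps": ["edit", "test"]},
          lambda out: len(out["steps"]) > 0),
    Phase("execute", lambda plan: {**plan, "tests_passed": False},
          lambda out: out["tests_passed"]),
    Phase("report", lambda out: "done", lambda out: True),
]
outcome = run_with_phase_gates(phases, "fix login bug")
```

The point of the structure is that a later phase cannot run on unverified input, so the agent has no path to a final report that skips a failed intermediate step.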

Layer 3 — Context & Memory

An LLM's context window is finite. In long tasks or multi-session agents, previous task content, decision history, and current goal state must be persistently managed somewhere.

A common way to organize this is to split memory into three tiers. When I first applied this split, the trickiest part was deciding how to define medium-term memory. Here's the guideline I use in practice:

  • Short-term memory: Conversation history within the current session. Maintained directly within the context window.
  • Medium-term memory: Intermediate results of the current task, list of executed tools, partial artifacts. Discarded at task end or promoted to long-term.
  • Long-term memory: User preferences, project conventions, recurring domain knowledge. Persists across sessions.

python
# Hierarchical memory management example
# mem0: Python library that provides long-term memory for agents (pip install mem0ai)
from mem0 import Memory
 
memory = Memory()
 
class StatefulAgent:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.session_history = []  # Short-term: maintained only within the session
 
    async def process(self, message: str) -> str:
        # Retrieve context relevant to the current message from long-term memory
        long_term = memory.search(
            query=message,
            user_id=self.user_id,
            limit=5
        )
 
        # Synthesize short-term + long-term memory to build the system prompt
        context = self._build_context(
            session=self.session_history,
            long_term=long_term
        )
 
        response = await llm.complete(system=context, user=message)
 
        # Update short-term memory
        self.session_history.append({"role": "user", "content": message})
        self.session_history.append({"role": "assistant", "content": response})
 
        # Save important information (project conventions, user preferences, etc.) to long-term memory
        memory.add(
            messages=[
                {"role": "user", "content": message},
                {"role": "assistant", "content": response}
            ],
            user_id=self.user_id
        )
 
        return response
 
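    def _build_context(self, session: list, long_term: list) -> str:
        # Assumes `long_term` is a list of dicts with a "memory" key; some mem0
        # versions wrap the hits as {"results": [...]}, so unwrap that first if needed.
        memory_context = "\n".join(f"- {m['memory']}" for m in long_term)
        # Short-term memory: include the most recent turns of this session
        recent_turns = "\n".join(f"{t['role']}: {t['content']}" for t in session[-10:])
        return f"Past memories:\n{memory_context}\n\nRecent conversation:\n{recent_turns}"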
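
The example above covers the short-term and long-term tiers but not the medium-term one. Below is a minimal sketch of task-scoped memory that is either discarded or promoted when the task ends. `TaskMemory` is a hypothetical class of mine, and a plain dict stands in for the long-term store (in the example above that role is played by mem0).

```python
from dataclasses import dataclass, field

@dataclass
class TaskMemory:
    """Medium-term memory: lives for one task, then is discarded or promoted."""
    task_id: str
    tool_calls: list[str] = field(default_factory=list)
    notes: dict[str, str] = field(default_factory=dict)
    promoted: set[str] = field(default_factory=set)

    def record_tool(self, name: str) -> None:
        self.tool_calls.append(name)

    def note(self, key: str, value: str, promote: bool = False) -> None:
        # promote=True marks a note as worth keeping after the task ends
        self.notes[key] = value
        if promote:
            self.promoted.add(key)

    def finish(self, long_term: dict[str, str]) -> None:
        """Copy promoted notes into long-term storage, then drop everything."""
        for key in self.promoted:
            long_term[key] = self.notes[key]
        self.tool_calls.clear()
        self.notes.clear()
        self.promoted.clear()
```

Making promotion an explicit flag keeps scratch results out of long-term memory by default, which is the failure mode I ran into most often.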

Layer 4 — Guardrails

Cases where AI-generated code introduces security vulnerabilities continue to be reported (arXiv 2511.07669). Actions like an agent deleting files, transmitting data externally, or changing configurations are hard to reverse. Guardrails are the layer that places mandatory checkpoints on these high-risk actions.

python
# High-risk action guardrail implementation
from enum import Enum
from typing import Callable
 
class RiskLevel(Enum):
    LOW      = "low"
    MEDIUM   = "medium"
    HIGH     = "high"
    CRITICAL = "critical"
 
# It's best to define risk level and policy together at the time of tool registration
ACTION_POLICIES = {
    "read_file":         RiskLevel.LOW,
    "write_file":        RiskLevel.MEDIUM,
    "delete_file":       RiskLevel.CRITICAL,
    "send_http_request": RiskLevel.HIGH,
    "run_shell_command": RiskLevel.HIGH,
}
 
async def guarded_execute(
    action: str,
    params: dict,
    human_approval_fn: Callable
) -> dict:
    risk = ACTION_POLICIES.get(action, RiskLevel.HIGH)
 
    if risk == RiskLevel.CRITICAL:
        # CRITICAL actions require explicit human approval
        approved = await human_approval_fn(
            action=action,
            params=params,
            reason=f"Risk level CRITICAL: approval required to execute {action}"
        )
        if not approved:
            return {"status": "rejected", "reason": "Not approved by user"}
 
    elif risk == RiskLevel.HIGH:
        # HIGH can be auto-approved after a policy check
        if not policy_check(action, params):  # policy check helper
            return {"status": "blocked", "reason": "Policy violation"}
 
    return await execute_action(action, params)  # actual tool execution
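
`policy_check` is left undefined above. One way it could look is sketched here, under assumptions of my own: the parameter names `url` and `command`, the host allowlist, and the forbidden-token list are all illustrative, not part of any library.

```python
from urllib.parse import urlparse

# Assumed policy data; real values would come from configuration.
ALLOWED_HOSTS = {"api.internal.example.com"}
FORBIDDEN_SHELL_TOKENS = ("rm -rf", "sudo", "| sh", "mkfs")

def policy_check(action: str, params: dict) -> bool:
    """Return True only when a HIGH-risk action stays inside the allowed policy."""
    if action == "send_http_request":
        host = urlparse(params.get("url", "")).hostname
        return host in ALLOWED_HOSTS
    if action == "run_shell_command":
        command = params.get("command", "")
        return not any(token in command for token in FORBIDDEN_SHELL_TOKENS)
    # Unknown HIGH-risk actions are denied by default.
    return False
```

The HTTP rule is an allowlist and the shell rule is a denylist. A denylist is easy to bypass, so for shell access an allowlist of exact commands is the safer choice once you know what the agent actually needs.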

In the healthcare domain, this goes a step further. When the model outputs a medication care plan, the guardrail layer immediately cross-validates drug interactions against a deterministic medical dictionary tool. Rather than trusting the model's judgment, it cross-references against a verifiable external source.
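
The cross-validation idea can be sketched as a pure lookup that runs on the model's output. The table below is an illustrative placeholder with a single well-known pair, not a clinical reference, and `check_care_plan` is a hypothetical name. A real system would query a maintained interaction database.

```python
from itertools import combinations

# Illustrative placeholder table, NOT a clinical reference.
KNOWN_INTERACTIONS: dict[frozenset[str], str] = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
}

def check_care_plan(medications: list[str]) -> list[str]:
    """Return a warning for every medication pair found in the lookup table."""
    normalized = sorted({m.strip().lower() for m in medications})
    warnings = []
    for a, b in combinations(normalized, 2):
        reason = KNOWN_INTERACTIONS.get(frozenset({a, b}))
        if reason:
            warnings.append(f"{a} + {b}: {reason}")
    return warnings
```

Because the check is deterministic, the same plan always produces the same warnings, and the guardrail can block the output without asking the model for a second opinion.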

Layer 5 — Observability

When an agent fails in production, logs alone aren't enough to understand "why it failed." You need to be able to trace which tool was called at which layer, how many tokens were used, and where the loop broke.

I once suffered greatly trying to add tracing after the fact. Once the architecture gets tangled, just figuring out where to attach what becomes its own job. It's much better to include it in your first agent.

python
# Agent tracing with LangSmith
# LangSmith: observability tool for tracking agent execution flow, token usage, and failure points
# pip install langsmith
from langsmith import traceable
 
@traceable(name="agent-main-loop", tags=["production"])
async def agent_loop(task: str) -> str:
    """@traceable automatically traces the entire function execution."""
    plan = await create_plan(task)
 
    results = []
    for step in plan.steps:
        result = await execute_step(step)
        results.append(result)
 
    return synthesize_results(results)
 
# Applying it to individual steps records each step's inputs, outputs, and latency
# (token counts show up on traced LLM calls rather than on plain tool steps)
@traceable(name="execute-step")
async def execute_step(step) -> dict:
    result = await run_tool(step.tool, step.params)  # actual tool call
    return {"step_type": step.type, "result": result}
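
To make it concrete what a trace span holds, here is a dependency-free stand-in for a tracing decorator. It is not how LangSmith is implemented. `traced` and `TRACE_LOG` are hypothetical names, and the list stands in for a tracing backend.

```python
import asyncio
import functools
import time

TRACE_LOG: list[dict] = []  # stand-in for a tracing backend

def traced(name: str):
    """Record the name, duration, and outcome of each call to an async function."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            span = {"name": name, "status": "ok", "error": None}
            try:
                return await fn(*args, **kwargs)
            except Exception as exc:
                span["status"], span["error"] = "error", repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed steps are never missing
                span["duration_s"] = time.perf_counter() - start
                TRACE_LOG.append(span)
        return wrapper
    return decorator

@traced("demo-step")
async def demo_step(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2

result = asyncio.run(demo_step(3))
```

The `finally` block is the important part: a span is written even when the step raises, which is what lets you see where a loop broke.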

In an enterprise data pipeline case, it was reported that 65% of AI failures due to context drift, schema mismatches, and state degradation were blocked at the harness layer (Milvus Blog). For that blocking to be possible, observability must come first.


Practical Application

Example 1: Anthropic 3-Agent Coding Harness

This is the 3-agent pattern reportedly used internally at Anthropic. A short brief is transformed into a full product spec, code is implemented, and verification follows automatically.

python
import asyncio
from anthropic import AsyncAnthropic
 
# Async client, so the awaited calls below do not block the event loop
client = AsyncAnthropic()
 
async def planner_agent(brief: str) -> dict:
    """Short brief → complete product spec (returns JSON)"""
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        system="""You are a product architect.
Expand the simple requirements provided into a technical spec.
Always return in JSON format:
{"features": [...], "api_contracts": {...}, "test_criteria": [...]}""",
        messages=[{"role": "user", "content": brief}]
    )
    return parse_spec(response.content[0].text)  # JSON parsing helper
 
async def generator_agent(spec: dict, sprint_idx: int) -> dict:
    """Spec → sprint-unit code implementation"""
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        system="You are a senior engineer. Implement code according to the specification.",
        messages=[{
            "role": "user",
            "content": f"Implement Sprint {sprint_idx}:\n{spec['features'][sprint_idx]}"
        }]
    )
    return {"code": response.content[0].text, "sprint": sprint_idx}
 
async def evaluator_agent(code: dict, test_criteria: list) -> dict:
    """Run tests, analyze results, and determine pass/fail"""
    # run_tests: helper that calls actual test tools like pytest / Playwright
    test_results = await run_tests(code["code"], test_criteria)
 
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        system=(
            "Analyze the test results. Start your reply with exactly "
            "'VERDICT: PASS' or 'VERDICT: FAIL' on the first line, then give specific feedback."
        ),
        messages=[{"role": "user", "content": f"Test results:\n{test_results}"}]
    )
    verdict = response.content[0].text
    return {
        # Check a fixed prefix; a bare substring test would also match "did not PASS"
        "passed": verdict.strip().upper().startswith("VERDICT: PASS"),
        "feedback": verdict
    }
 
async def three_agent_harness(brief: str) -> str:
    spec = await planner_agent(brief)      # Planning
 
    final_code = []
    for sprint_idx in range(len(spec["features"])):
        retries = 0
        while retries < 3:
            code = await generator_agent(spec, sprint_idx)   # Implementation
            evaluation = await evaluator_agent(              # Layer 2: verification loop
                code, spec["test_criteria"]
            )
 
            if evaluation["passed"]:
                final_code.append(code)
                break
 
            # Incorporate feedback into the spec for use in the next attempt
            spec["features"][sprint_idx] += f"\nFeedback: {evaluation['feedback']}"
            retries += 1
        else:
            # All attempts failed: stop instead of silently dropping the sprint
            raise RuntimeError(f"Sprint {sprint_idx} failed verification after 3 attempts")
 
    return assemble_final_code(final_code)  # Assemble per-sprint code into final deliverable
 
# Entry point: asyncio.run(three_agent_harness(brief))

| Component | Primary Harness Layer | Role |
| --- | --- | --- |
| planner_agent | 1 (Tool Orchestration) | Converts requirements → actionable spec |
| generator_agent | 1, 3 (Tool + Memory) | Sprint implementation, maintains previous feedback context |
| evaluator_agent | 2 (Verification Loop) | Independent verification: does not self-verify its own output |
| while retries < 3 | 2 (Verification Loop) | Automatic retry with feedback incorporated on verification failure |
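
`parse_spec` is only named in the code above. A possible version is sketched here, assuming the planner may wrap its JSON in extra prose. The required keys come from the planner's system prompt, and everything else is my own choice.

```python
import json
import re

REQUIRED_KEYS = ("features", "api_contracts", "test_criteria")

def parse_spec(text: str) -> dict:
    """Extract a JSON object from model output and validate its top-level keys."""
    # Models often surround JSON with prose, so take the span
    # from the first "{" to the last "}" and parse only that.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    spec = json.loads(match.group(0))
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError(f"spec is missing keys: {missing}")
    return spec
```

Failing loudly at this point matters: a spec without `test_criteria` would otherwise reach the evaluator and turn verification into a no-op.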

Example 2: Enterprise Coding Agent — Full 5-Layer Integration

A TypeScript setup applying all five layers to an AI agent that creates and reviews PRs. A scenario frequently encountered in practice.

typescript
// Full 5-layer harness setup in TypeScript
import Anthropic from "@anthropic-ai/sdk";
import { Client as LangSmithClient } from "langsmith";
import { traceable } from "langsmith/traceable"; // Layer 5: Observability
import { Memory } from "mem0ai";                 // Layer 3: Context & Memory
 
const client = new Anthropic();
const langsmith = new LangSmithClient();
const memory = new Memory();
 
// Layer 4: Guardrails — list of high-risk operations to block
const BLOCKED_OPERATIONS = ["force_push", "delete_branch_main", "drop_table"];
 
// Wrapping with traceable() automatically records all executions to LangSmith
const codingAgentHarness = traceable(
  async (task: string, repoContext: string) => {
    // Layer 3: Load project long-term memory
    const projectMemory = await memory.search({ query: task, limit: 10 });
 
    // Layer 4: Pre-task risk check
    const isBlocked = BLOCKED_OPERATIONS.some(op => task.includes(op));
    if (isBlocked) {
      return { status: "blocked", reason: "Policy-violating operation detected" };
    }
 
    // Layer 1: Tool definitions (Tool Orchestration)
    const tools: Anthropic.Tool[] = [
      {
        name: "read_file",
        description: "Reads and returns the contents of a file",
        input_schema: {
          type: "object" as const,
          properties: {
            path: { type: "string", description: "Path of the file to read" }
          },
          required: ["path"]
        }
      },
      {
        name: "write_file",
        description: "Writes content to a file",
        input_schema: {
          type: "object" as const,
          properties: {
            path:    { type: "string", description: "Path of the file to write" },
            content: { type: "string", description: "Content to write to the file" }
          },
          required: ["path", "content"]
        }
      },
      {
        name: "run_tests",
        description: "Runs the test suite and returns the results",
        input_schema: {
          type: "object" as const,
          properties: {
            test_path: { type: "string", description: "Path to the test directory" }
          },
          required: ["test_path"]
        }
      }
    ];
 
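    let passed = false;
    let attempts = 0;
    let feedback = ""; // verifier output carried into the next attempt
 
    // Layer 2: PEV verification loop
    while (!passed && attempts < 3) {
      const response = await client.messages.create({
        model: "claude-opus-4-7",
        max_tokens: 8192,
        tools,
        system: buildSystemPrompt(projectMemory, repoContext), // inject memory
        messages: [{
          role: "user",
          content: feedback
            ? `${task}\n\nPrevious attempt failed verification:\n${feedback}`
            : task
        }]
      });
 
      // Run the tool_use blocks the model requested.
      // executeToolCalls is a hypothetical helper, like buildSystemPrompt and runTestSuite.
      await executeToolCalls(response.content);
 
      // Verify with an independent function, not the agent itself
      const testResult = await runTestSuite();
      passed = testResult.passed;
      feedback = JSON.stringify(testResult); // feed the raw result back on retry
      attempts++;
    }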
 
    // Layer 3: Save task result to long-term memory
    await memory.add({ task, result: "completed", attempts });
 
    return { status: passed ? "success" : "failed", attempts };
  },
  { name: "coding-agent", tags: ["production"] }
);

Pros and Cons Analysis

The most painfully memorable lesson was token explosion. I enthusiastically added verification loops early on, then was shocked to see a simple code change consuming four times the tokens. It's realistic to lighten the verification loop for low-risk tasks.

Advantages

| Item | Details |
| --- | --- |
| Reliability | In a reality where 88% of AI agent projects fall short of production, the 5-layer harness is the most proven means of closing that gap |
| Model independence | The harness provides a consistent interface, minimizing changes to agent logic when switching from GPT-4 to Claude |
| Policy centralization | Security, compliance, and cost control concentrated in a single harness layer for ease of auditing and management |
| Per-layer debugging | Failure root causes can be traced to a specific layer, reducing debugging cost |
| Potential for autonomous evolution | Evolving toward using observability data as feedback to automatically improve the harness itself (Agentic Harness Engineering) (arXiv 2604.25850) |

Disadvantages and Caveats

| Item | Details | Mitigation |
| --- | --- | --- |
| Performance overhead | Verification and retry policies can increase runtime and token consumption by up to 2x | Lighten or skip the verification loop for low-risk tasks |
| Token explosion | Token usage can increase by an order of magnitude or more when switching to autonomous loops | Context compression, summarization strategies, cost budget limits |
| Complexity | Correctly designing and testing all five layers requires significant engineering investment | Incremental adoption. The order 5→4→2 is effective in practice |
| Data quality limits | 27% of agent failures are data quality issues; the harness alone cannot solve them (Milvus Blog) | Maintain data pipeline quality in parallel |
| Victory Declaration Bias | Agents tend to declare task completion without verification | Separate verification into an independent agent or deterministic function |
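
The "cost budget limits" mitigation can be as small as a counter that is charged after every model call. This is a sketch with hypothetical names (`TokenBudget`, `BudgetExceeded`). In a real loop the two numbers would come from the usage figures the API returns with each response.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than it was allotted."""

class TokenBudget:
    """Hard cap on tokens spent by one task, checked after every model call."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> int:
        """Add one call's usage, raise once the cap is crossed, return tokens left."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")
        return self.max_tokens - self.used
```

Putting `charge` inside the retry loop turns a runaway loop into a bounded failure that shows up in your traces instead of on your invoice.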

Context Drift: The phenomenon where the initial goal or state gradually degrades over the course of a long agent execution. Especially severe when the memory layer is absent.
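
One common defence against context drift is to pin the goal at the top of every request and compress older turns. Here is a rough sketch. `compact_history` is a hypothetical helper, and the "summary" is a naive truncation standing in for a real summarization call.

```python
def compact_history(goal: str, history: list[dict], keep_recent: int = 4) -> list[dict]:
    """Pin the goal, collapse older turns into one summary, keep recent turns verbatim."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]

    messages = [{"role": "user", "content": f"Goal (do not lose sight of this): {goal}"}]
    if older:
        # Placeholder summary: a real harness would ask a model to summarize `older`
        summary = "; ".join(turn["content"][:40] for turn in older)
        messages.append({"role": "user", "content": f"Summary of earlier steps: {summary}"})
    return messages + recent
```

Because the goal is re-inserted on every call, it cannot scroll out of the context window no matter how long the task runs.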

Most Common Mistakes in Practice

  1. Delegating verification to the agent itself — When an agent verifies its own output, bias is introduced. I didn't fully grasp this principle until I actually witnessed an agent passing its own bugs. It's best to separate verification into a dedicated agent or a deterministic function (test execution, schema validation, etc.).

  2. Registering high-risk tools without guardrails — The "attach everything first, restrict later" approach is dangerous in production. Defining the risk level and policy together at the time of tool registration is far safer in the long run.

  3. Adding observability last — Tracing and cost tracking are very difficult to retrofit. Once the architecture gets tangled, just figuring out where to put @traceable becomes its own job. Including Layer 5 from the very first agent is ultimately faster.
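
For mistake 2, one way to make "risk level at registration time" hard to forget is a decorator that cannot be called without it. This is a sketch, and `register_tool`, `TOOL_REGISTRY`, and `risk_of` are hypothetical names of mine.

```python
from enum import Enum
from typing import Callable

class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

TOOL_REGISTRY: dict[str, tuple[Callable, RiskLevel]] = {}

def register_tool(risk: RiskLevel):
    """Decorator that refuses to register a tool without an explicit risk level."""
    if not isinstance(risk, RiskLevel):
        raise TypeError("register_tool requires a RiskLevel")
    def decorator(fn: Callable) -> Callable:
        TOOL_REGISTRY[fn.__name__] = (fn, risk)
        return fn
    return decorator

@register_tool(RiskLevel.LOW)
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as handle:
        return handle.read()

@register_tool(RiskLevel.CRITICAL)
def delete_file(path: str) -> None:
    # Deliberately not implemented in this sketch
    raise NotImplementedError("deletion must go through the approval gate")

def risk_of(tool_name: str) -> RiskLevel:
    # Unregistered tools are treated as HIGH, matching the default in guarded_execute
    entry = TOOL_REGISTRY.get(tool_name)
    return entry[1] if entry else RiskLevel.HIGH
```

With this in place, the policy table and the tool list cannot drift apart, because there is only one place where a tool gets declared.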


Closing Thoughts

Harness engineering is the infrastructure that elevates AI agents from "demos that seem like they might work" to "products that actually work." After reading this post, if the question "how many layers does this agent have right now?" naturally comes to mind when you look at an agent again, that's enough. You don't need to perfectly implement all five layers at once. Building them up one by one, starting with what you can do right now, is more than enough to make a significant difference.

Three steps you can start right now:

  1. Turn on observability first: Attaching LangSmith or Langfuse to an existing agent takes just a few lines of code. For LangSmith, run pip install langsmith, set the LANGSMITH_TRACING and LANGSMITH_API_KEY environment variables, and add @traceable to the agent entry point. That is enough to start seeing each run's steps, latency, and failure points, including intermediate results that were silently failing.

  2. Add a guardrail to your most dangerous tool — Pick one tool that's hard to reverse—file deletion, external API calls—assign it a risk level, and add an approval gate. You'll immediately feel the accident-prevention effect.

  3. Apply a PEV loop in one place — Adding code that verifies the result with an independent function (test execution, schema validation, etc.) after agent execution will immediately show you just how different completion declarations are from actual completion.


References

  • What Is Harness Engineering for AI Agents? | Milvus Blog
  • Harness Engineering: Making AI Coding Agents Work in 2026 | Faros.ai
  • What Is Harness Engineering AI? The Definitive 2026 Guide | Atlan
  • AI Harness Engineering: The Layer That Makes Your LLM Applications Actually Work | Pinggy
  • Harness Engineering for AI Coding Agents | Augment Code
  • Agentic Harness Engineering: Observability-Driven Automatic Evolution | arXiv 2604.25850
  • Making LLMs Reliable: A Five-Layer Architecture for High-Stakes Decisions | arXiv 2511.07669
  • Agent Harness Engineering: The Rise of the AI Control Plane | Medium
  • GitHub: awesome-harness-engineering | ai-boost
#AIAgents #HarnessEngineering #LLM #MCP #MultiAgent #Observability #Guardrails #PEVPattern #ContextMemory #Python