DEV BAK - TECH BLOG
AI

LLM Agent Output Validation: Why Hallucinations Pass JSON Schema and How to Design a 3-Layer Defense

Once you put an LLM-based agent into production, you'll encounter strange failures sooner than you expect. Digging through the logs, you'll find that JSON parsing succeeded and schema validation passed, yet a downstream service throws an error complaining about a bad value. At first I kept asking myself, "The schema matched, so why?", only to discover that hallucinated values were sitting perfectly intact inside structurally valid JSON.

These "silent failures" are the hardest type to catch in an agent pipeline. They look like successes on the surface, stay invisible to monitoring, and surface as customer complaints only after real user traffic hits. Stacking validation layers has substantial practical value: a single regex-based filter catches only about 60–70% of LLM output contamination, whereas research (Kalvium Labs, 2025) shows that a multi-layer approach combining an LLM-based classifier with output validation raises the block rate to 99.1%.

Structurally valid JSON does not mean correct output. The real challenge of agent output validation is designing an architecture that systematically closes this gap — across three layers: structure, semantics, and rules.
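To make the gap concrete, here is a minimal sketch using only the standard library. The field names, the KNOWN_ORDER_IDS set, and the sample payload are hypothetical, invented for illustration; the structural check is a stand-in for a real JSON Schema or Pydantic validator.

```python
import json

# Hypothetical ground truth that lives outside the model (assumption for this sketch)
KNOWN_ORDER_IDS = {"ORD-1001", "ORD-1002"}

def structurally_valid(payload: dict) -> bool:
    """Stand-in for schema validation: checks keys and types only."""
    amount = payload.get("refund_amount")
    return (
        isinstance(payload.get("order_id"), str)
        and isinstance(amount, (int, float))
        and not isinstance(amount, bool)
    )

def semantically_valid(payload: dict) -> bool:
    """Checks the value against something other than the model's own output."""
    return payload["order_id"] in KNOWN_ORDER_IDS

# A plausible-looking but hallucinated order ID inside well-formed JSON
raw = '{"order_id": "ORD-9999", "refund_amount": 42.5}'
payload = json.loads(raw)

structure_ok = structurally_valid(payload)   # Shape and types match the schema
semantics_ok = semantically_valid(payload)   # The ID does not exist in the known set
```

The payload parses and matches the expected shape, so a structure-only pipeline would hand it downstream; only the lookup against external data notices that the order does not exist.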


Core Concepts

Validation Is Not a Single-Point Event

If you think of output validation as "parse after receiving a response," you'll stop at layer 1. In practice, validation must occur at three points in the pipeline.

Pre-generation: Explicitly inject a JSON schema into the prompt and, where possible, enable structured output features to reduce the probability of format errors from the model in the first place.

python
import json
from pydantic import BaseModel
from anthropic import Anthropic
 
class AnalysisResult(BaseModel):
    summary: str
    confidence: float
    sources: list[str]
 
client = Anthropic()
 
# Directly inject the JSON schema into the system prompt
schema_json = json.dumps(AnalysisResult.model_json_schema(), indent=2)
system_prompt = f"""Always return only responses that exactly follow this JSON schema:
{schema_json}
 
Never include any text outside the JSON object."""
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "Analyze the latest AI trends"}]
)

Post-generation: JSON parsing, schema validation, and automatic retry on validation failure. This is the domain of layer 1.
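As a rough illustration of that post-generation step, the sketch below parses, validates, and retries with the failure reason fed back into the next attempt. It uses only the standard library; validate_summary, generate_with_retry, and the call_model callable are hypothetical names for this example, not part of any SDK.

```python
import json
from typing import Callable

class ValidationFailure(Exception):
    """Raised when the output still fails validation after all attempts."""

def validate_summary(payload: object) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    if not isinstance(payload.get("summary"), str):
        problems.append("summary must be a string")
    confidence = payload.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence must be a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be between 0 and 1")
    return problems

def generate_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    feedback = ""
    problems: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems = [f"invalid JSON: {exc.msg}"]
        else:
            problems = validate_summary(payload)
            if not problems:
                return payload
        # Feed the concrete failure reasons back into the next attempt
        feedback = "\n\nYour previous reply was rejected: " + "; ".join(problems)
    # The cap is what keeps a contradictory prompt from looping forever
    raise ValidationFailure(f"no valid output after {max_attempts} attempts: {problems}")
```

Passing any function that maps a prompt string to a reply string as call_model is enough to exercise the loop, which also makes it easy to test with canned replies.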

Pre-consumption: Before passing output downstream, ask "is this content accurate?" — the semantic validation layer. Patterns like LLM-as-Judge and multi-verifier voting belong here.

Why are three points necessary? Pre-generation constraints alone can only catch some format errors. Post-generation checks see structure but not semantics. Only with pre-consumption validation can you answer the question "is the content correct?"

The 3-Layer Defense Structure

┌─────────────────────────────────────────────┐
│  Layer 3: Rules Validation                  │  ← Domain policies, external fact-check APIs
├─────────────────────────────────────────────┤
│  Layer 2: Semantic Validation               │  ← LLM-as-Judge, multi-verifier voting
├─────────────────────────────────────────────┤
│  Layer 1: Structural Validation             │  ← Pydantic, JSON Schema, retry loop
└─────────────────────────────────────────────┘
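One way to read the diagram is as a chain of checks that runs cheapest-first and stops at the first failure. The sketch below is an assumption-laden illustration: the check functions, the judge_score field, and the "guarantee language" policy are all hypothetical placeholders for real layer implementations.

```python
from typing import Callable, Optional

# Each check returns None on success or a human-readable failure reason
Check = Callable[[dict], Optional[str]]

def check_structure(output: dict) -> Optional[str]:
    # Layer 1: shape and types only
    if not isinstance(output.get("answer"), str):
        return "answer must be a string"
    return None

def check_semantics(output: dict) -> Optional[str]:
    # Layer 2: placeholder for an LLM-as-Judge or verifier vote;
    # a hypothetical judge score is assumed to be attached upstream
    if output.get("judge_score", 0) < 7:
        return "judge score below threshold"
    return None

def check_rules(output: dict) -> Optional[str]:
    # Layer 3: deterministic domain policy, evaluated in code
    if "guaranteed" in output["answer"].lower():
        return "policy forbids guarantee language"
    return None

LAYERS: list[tuple[str, Check]] = [
    ("structure", check_structure),
    ("semantics", check_semantics),
    ("rules", check_rules),
]

def validate(output: dict) -> tuple[bool, str]:
    """Run the layers in order and stop at the first failure."""
    for name, check in LAYERS:
        reason = check(output)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "ok"
```

Returning the layer name with the reason keeps the failure log informative: it shows at a glance which layer is doing the blocking.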

Until recently, most teams stopped at layer 1. A ZenML analysis of 1,200 production deployments found that "schema validation only" still accounted for the majority of cases. Then, as failures of the "schema passed but content was wrong" variety recurred, interest in layers 2 and 3 grew rapidly.

A few notable trends are also emerging. The Reflection Pattern, one of Andrew Ng's four core agent patterns, is gaining renewed attention as a key mechanism for improving output quality: GPT-4's HumanEval score rose from 80% with a single call to 91% with Reflection applied. A paper published on arXiv in February 2025 (2502.20379) introduced Multi-Agent Verification (MAV), which runs multiple independent verifiers in parallel and reaches a final verdict by majority vote, as a new axis of test-time scaling. Cross-verification across multiple models dilutes single-judge bias.


Practical Application

Now it's time to translate theory into code. We'll start from layer 1 and build upward.

Example 1: Schema-First Enforcement + Automatic Retry (Layer 1)

This is the first pattern to adopt. Using Pydantic AI or Instructor, you get schema definition and a retry loop almost for free. Pydantic AI provides the agent runtime itself, which is advantageous when handling complex agent logic together, while Instructor patches existing OpenAI/Anthropic clients in a lightweight way, letting you bolt it onto existing code with minimal changes. Honestly, this alone handles the vast majority of structural errors.

python
# Assumes a pydantic-ai release that exposes `output_type` and `result.output`
# (older releases named these `result_type` and `result.data`)
from pydantic import BaseModel, field_validator
from pydantic_ai import Agent

class ResearchSummary(BaseModel):
    title: str
    key_points: list[str]
    confidence: float

    @field_validator("confidence")
    @classmethod
    def check_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return v

# Initialize at module level: building the agent inside a function repeats the setup cost on every call
research_agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=ResearchSummary,   # Passes the schema to the model and validates the reply
    retries=3,                     # On validation failure, re-queries with the reason for failure
)

async def summarize(topic: str) -> ResearchSummary:
    result = await research_agent.run(topic)
    return result.output

Specifying a Pydantic model as output_type hands the schema to the model automatically and, if the response does not match, re-queries with the reason for failure. Enforcing the confidence range with @field_validator sits at roughly layer 1.5, somewhere between structural and semantic validation.

Example 2: Reflection (Self-Critique Loop, Layer 2)

This pattern separates generation from review. Asking "did I do well?" inside the same prompt yields a noticeably weaker review than having a separate critic agent work through a checklist.

python
from pydantic_ai import Agent
 
# Separating models is the key — the same model shares the same biases
generator = Agent("anthropic:claude-sonnet-4-6")
critic = Agent("anthropic:claude-opus-4-8")   # A more powerful model is recommended for the Judge
 
async def reflection_loop(task: str, max_rounds: int = 3) -> str:
    draft = (await generator.run(task)).output
 
    for _ in range(max_rounds):
        critique = await critic.run(
            f"Review the following draft against each of these criteria:\n"
            f"① Factual errors ② Logical leaps ③ Missing evidence\n\n"
            f"Draft:\n{draft}\n\n"
            f"If there are no issues, return 'APPROVED'. Otherwise, return specific revision instructions."
        )
 
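        # Match the start of the reply so text such as "NOT APPROVED" does not end the loop
        if critique.output.strip().upper().startswith("APPROVED"):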
            break
 
        draft = (await generator.run(
            f"Original task: {task}\n\n"
            f"Issues with the previous draft:\n{critique.output}\n\n"
            f"Write an improved version that addresses the issues above."
        )).output
 
    return draft

The instructions for the critique step must be specific. Early on I used vague instructions like "find what's wrong," and the critic agent repeatedly gave feedback on tone and style rather than content errors. Providing a checklist format like "① factual errors ② logical leaps ③ missing evidence" produces much sharper feedback. Using different models for the generator and critic also matters — the same model shares the same biases, so when the main agent is wrong, the critic may miss it for the same reason.

Example 3: LLM-as-Judge + Threshold Gate (Layer 2)

Another core layer 2 pattern. Use a powerful Judge model to score the quality of output, and block the pipeline when it falls below the threshold.

python
from pydantic import BaseModel, Field
from pydantic_ai import Agent

class JudgeVerdict(BaseModel):
    score: int = Field(ge=0, le=10)   # Reject out-of-range scores at the schema level
    reason: str
    passed: bool
 
class OutputValidationError(Exception):
    pass
 
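# Manage as a singleton: creating the agent inside a function repeats the initialization cost on every call
# `output_type` is the current pydantic-ai parameter name (older releases called it `result_type`)
judge_agent = Agent("anthropic:claude-opus-4-8", output_type=JudgeVerdict)
main_agent = Agent("anthropic:claude-sonnet-4-6")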
 
async def judge_output(output: str, criteria: str) -> JudgeVerdict:
    verdict = await judge_agent.run(
        f"Evaluate the following output against the criteria: '{criteria}'.\n\nOutput:\n{output}"
    )
    return verdict.output
 
async def validated_pipeline(task: str) -> str:
    result = await main_agent.run(task)
 
    verdict = await judge_output(
        result.output,
        criteria="factual accuracy, logical completeness, relevance to topic"
    )
 
    if not verdict.passed or verdict.score < 7:
        raise OutputValidationError(
            f"Quality threshold not met (score={verdict.score}): {verdict.reason}"
        )
 
    return result.output

Opus serves as the Judge for the sake of verdict accuracy. If the main agent and the Judge use the same model, a "collusion error" can occur, where both share the same biases. When first adopting this pattern, log only the score and leave the gate (the if not verdict.passed check) disabled, so that you can establish a baseline of current output quality. Identifying which categories score low before setting the threshold significantly reduces false positives.

LLM-as-Judge: A pattern where an LLM evaluates the output of another LLM. It enables fast, automated quality assessment, but carries the risk of "collusion errors" where two models share the same biases. Using different providers or different model combinations is recommended when possible.
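The baseline-first rollout described above can be sketched as a gate with an enforce switch. This is a minimal illustration using the standard library; Verdict, QualityGateError, and apply_gate are hypothetical names, and the threshold of 7 simply mirrors the example above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("judge_gate")

@dataclass
class Verdict:
    score: int      # 0-10, mirroring the JudgeVerdict model
    reason: str

class QualityGateError(Exception):
    pass

def apply_gate(verdict: Verdict, threshold: int = 7, enforce: bool = False) -> bool:
    """Log every verdict; block only when enforce=True."""
    passed = verdict.score >= threshold
    logger.info(
        "judge score=%d passed=%s reason=%s", verdict.score, passed, verdict.reason
    )
    if enforce and not passed:
        raise QualityGateError(
            f"score {verdict.score} below threshold {threshold}: {verdict.reason}"
        )
    return passed
```

Running with enforce=False for a while produces the score distribution needed to pick a threshold; flipping the flag later turns the same code path into a hard gate.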

Example 4: Multi-Agent Verification (MAV, Layer 2)

A pattern that dilutes the limitations of a single Judge by using multiple independent verifiers. The core of MAV is not simply adding more models — it's using models with different biases. Running the same model three times just repeats the same mistake three times.

python
import asyncio
from pydantic import BaseModel
from pydantic_ai import Agent
 
class VerifierResult(BaseModel):
    refuted: bool
    reason: str
 
# Different model/provider combinations for bias diversity: this is the real value of MAV
VERIFIER_MODELS = [
    "anthropic:claude-sonnet-4-6",
    "anthropic:claude-haiku-4-5",   # Even within the same provider, different models have different characteristics
    "openai:gpt-4o-mini",           # Using a different provider maximizes bias diversity
]

# Build the verifiers once at module level, consistent with the earlier examples
# (`output_type` was named `result_type` in older pydantic-ai releases)
VERIFIERS = [Agent(model, output_type=VerifierResult) for model in VERIFIER_MODELS]

async def multi_verifier_vote(claim: str) -> bool:
    async def verify(verifier: Agent) -> bool:
        result = await verifier.run(
            f"If the following claim contains any errors or factual inconsistencies that can be refuted, "
            f"answer refuted=True. If unsure, refuted=True is the safe choice.\n\n"
            f"Claim: {claim}"
        )
        return result.output.refuted

    # Run all verifiers concurrently; an exception from any one of them propagates to the caller
    votes = await asyncio.gather(*(verify(v) for v in VERIFIERS))
    refuted_count = sum(votes)

    # Accept only if strictly fewer than half of the verifiers refuted the claim
    return refuted_count < len(VERIFIERS) / 2

It's good practice to set the number of verifiers to an odd number (3 or 5) to prevent ties. For high-risk domains (medical, financial, legal), you may set the criteria conservatively to "fail if even one verifier refutes."
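The two decision policies mentioned here, majority vote and "fail if even one verifier refutes", can be separated from the model calls and tested on their own. The function below is a hypothetical sketch; the policy names are made up for this example.

```python
def output_accepted(refuted_votes: list[bool], policy: str = "majority") -> bool:
    """Turn verifier votes (True = refuted) into an accept/reject decision."""
    if not refuted_votes:
        # No verifier answered: fail closed rather than accept unverified output
        return False
    refuted = sum(refuted_votes)
    if policy == "majority":
        # Accept only if strictly fewer than half refuted; a tie rejects
        return refuted * 2 < len(refuted_votes)
    if policy == "unanimous":
        # High-risk domains: a single refutation is enough to reject
        return refuted == 0
    raise ValueError(f"unknown policy: {policy}")
```

With an even number of verifiers, a tie is treated as a rejection, which is the conservative reading of the majority rule.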

Example 5: Domain Rules Validation (Layer 3)

Layer 3 often appears only in diagrams, left as a vague term like "Automated Reasoning," but in practice it's a fairly concrete pattern. There are extreme cases that require formal verification with SMT solvers such as Z3, but most situations are well served by a rule engine derived from domain policies.

python
from dataclasses import dataclass
import httpx
 
class PolicyViolationError(Exception):
    pass
 
@dataclass
class MedicalSummary:
    diagnosis: str
    recommended_dosage_mg: float
    patient_age: int
 
# Domain policy rules — deterministically validated in code, not by an LLM
def validate_medical_rules(summary: MedicalSummary) -> None:
    # Rule 1: Do not exceed the adult dosage cap for pediatric patients (under 18)
    if summary.patient_age < 18 and summary.recommended_dosage_mg > 500:
        raise PolicyViolationError(
            f"Prescribing {summary.recommended_dosage_mg}mg to a pediatric patient violates policy"
        )
 
    # Rule 2: Verify the diagnosis exists in an external database
    if not _verify_diagnosis_exists(summary.diagnosis):
        raise PolicyViolationError(
            f"'{summary.diagnosis}' is not a diagnosis found in the verified disease database"
        )
 
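def _verify_diagnosis_exists(diagnosis: str) -> bool:
    # Query the public ICD-10-CM search API from NLM Clinical Tables
    response = httpx.get(
        "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search",
        # sf=code,name asks the API to search diagnosis names as well as codes
        params={"terms": diagnosis, "sf": "code,name", "maxList": 1},
        timeout=5.0,
    )
    response.raise_for_status()   # Surface HTTP errors instead of parsing an error body
    # The first element of the response array is the total number of matches
    return response.json()[0] > 0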

The key to layer 3 is that you don't ask the LLM "is this rule correct?" — you check it directly in code. This approach supplements areas where LLMs can be wrong with deterministic rules, making auditing far easier as well.
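To illustrate the auditing point, rules can be kept as data with stable identifiers, so that every violation is reported rather than only the first. This is a sketch under assumptions: the Rule class, the rule IDs, and the second rule are hypothetical, and the 500 mg cap is taken from the example above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    violated: Callable[[dict], bool]

# Hypothetical rule table; IDs and the second rule are invented for illustration
RULES = [
    Rule(
        "PED-DOSE-001",
        "Pediatric dosage must not exceed 500 mg",
        lambda s: s["patient_age"] < 18 and s["recommended_dosage_mg"] > 500,
    ),
    Rule(
        "DOSE-POS-002",
        "Dosage must be positive",
        lambda s: s["recommended_dosage_mg"] <= 0,
    ),
]

def audit(summary: dict) -> list[str]:
    """Return the IDs of every violated rule instead of stopping at the first."""
    return [rule.rule_id for rule in RULES if rule.violated(summary)]
```

Logging the returned IDs alongside the model output gives a deterministic audit trail that does not depend on any LLM verdict.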


Pros and Cons

Now that we've looked at the concepts and code, it's time to map out the realistic tradeoffs.

Advantages

Item | Description
Hallucination suppression | Combining structural + semantic validation dramatically reduces error rates compared to schema enforcement alone
Pipeline stability | Blocks upstream errors before they propagate to downstream stages
Automatic recovery | Retry loops handle transient generation failures without human intervention
Auditability | Logging the verdict reason makes failure analysis and regulatory compliance far easier
Security hardening | Regex + LLM classifier combination raises prompt injection block rate to 99.1%

Disadvantages and Caveats

Item | Description | Mitigation
Increased latency | Each additional Reflection round requires at least one more LLM call; P99 latency can spike sharply | Set per-layer timeouts, use async parallel processing, define latency tolerance in advance
Risk of cost explosion | Without a retry cap, validation loops can repeat indefinitely, so token costs grow without bound | max_retries + fallback policy are mandatory design requirements
False positive problem | LLM-as-Judge may pass content that is internally consistent but factually wrong | Introduce MAV pattern, add domain-specific validation layers
Intermediate step error propagation | 17% of agent failures involve step repetition and 14% involve reasoning-action mismatches; neither is detectable from the final output alone | Log per-step validation, use agent tracing tools like Arize Phoenix
Schema hallucination | Silent failure where hallucinated values exist inside valid JSON | Design the semantic validation layer as mandatory, not optional
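Per-step validation can start very small. As one hedged example, the sketch below flags step repetition in an agent trace; the trace format, a list of (tool, arguments) pairs, and the repeated_steps helper are assumptions made for illustration.

```python
from collections import Counter

def repeated_steps(
    trace: list[tuple[str, str]], max_repeats: int = 2
) -> list[tuple[str, str]]:
    """Return (tool, arguments) pairs that occur more than max_repeats times."""
    counts = Counter(trace)
    return [step for step, n in counts.items() if n > max_repeats]
```

Running a check like this after each step, rather than once at the end, lets the pipeline stop a looping agent before it spends more tokens.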

What I've felt most acutely after adopting this firsthand is the latency problem. "95% accuracy + 5-second latency" is often a less usable system in real service than "90% accuracy + 100ms latency." I recommend defining your P99 latency tolerance before adding validation layers.
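A latency tolerance becomes enforceable once each layer runs under an explicit time budget. The sketch below uses asyncio.wait_for from the standard library; run_with_budget and its fail_open flag are hypothetical names, and whether a slow check should pass or block is a domain decision this example does not settle.

```python
import asyncio
from typing import Awaitable, Callable

async def run_with_budget(
    check: Callable[[], Awaitable[bool]],
    timeout_s: float,
    fail_open: bool,
) -> bool:
    """Run one validation layer under a time budget."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # fail_open=True lets the output through when the check is too slow;
        # fail_open=False blocks it
        return fail_open
```

A low-risk summarization feature might fail open on a slow Judge, while a medical or financial flow would more plausibly fail closed.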

Silent failure: A state where schema validation passes but the actual values are wrong. It is characterized by errors not surfacing immediately, only being discovered downstream. It is one of the most dangerous error types in an agent pipeline.

The Most Common Mistakes in Practice

  1. Not setting max_retries: If you build a retry loop, you must also build a ceiling for it. If the validation criteria are too strict or the prompt contains a contradiction, the loop won't stop and token costs will explode.

  2. Using the same model for both the Judge and the main agent: The same model shares the same biases. When the main agent is wrong, the Judge may fail to detect it for the same reason. Use a different model if possible, ideally from a different provider.

  3. Deploying to production with only layer 1: "Let's just do schema validation for now and add more later" usually means "later" never comes. Structural validation alone cannot prevent silent failures, and those failures only surface after real user traffic arrives.


Closing Thoughts

The core of agent output validation is designing for the fact that structurally valid JSON does not mean correct output. Only when all three layers (structure, semantics, and rules) are in place can you truly call it production-grade.

Three steps you can start right now:

  1. Introduce layer 1 first: After pip install pydantic-ai or pip install instructor, try replacing the response type of your existing agent with a Pydantic model. A single retries=3 automatically handles a significant portion of structural errors.

  2. Attach a Judge agent to the end of the pipeline: Try applying the judge_output() function from the example above to just your single most important agent. Don't set a gate at first — just log the score to establish a baseline of current output quality. Identifying which categories score low before setting the threshold significantly reduces false positives.

  3. Add Guardrails AI or Lakera Guard: Once you have structural and semantic validation in place, you can add a security layer. Lakera Guard blocks prompt injection and PII with under 35ms of overhead, making it especially worth adopting for agents that accept external user input.

How many layers deep is your agent built right now?


References

  • Building an AI Agent Evaluation Pipeline: 2026 Methodology
  • 5 Agent Design Patterns Every Developer Needs to Know in 2026
  • Stop LLMs from Lying: Build Self-Correcting Agents with the Reflection Pattern
  • LLM Guardrails in Production: Input, Output, and Runtime Checks
  • Mastering LLM Guardrails: Complete 2026 Guide
  • GitHub - guardrails-ai/guardrails
  • Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (arXiv 2502.20379)
  • When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs (arXiv 2508.02994)
  • How to build LLM-as-a-Judge evaluators that hold up in production — Arize AI
  • Structured Output Isn't Reliable Output — Rotascale
  • AI Agent Guardrails & Output Validation in 2026: Tools, Patterns & Best Practices
  • Self-reflection enhances large language models towards substantial academic response — npj Artificial Intelligence
  • What 1,200 Production Deployments Reveal About LLMOps in 2025 — ZenML