DEV BAK - 기술블로그

Cutting Long-Horizon Agent Costs by 60–90%: Caching, Compression, and Routing Strategies

I still remember the shock of receiving that first bill after putting an AI agent into production. A simple chatbot would have been predictable, but agents were different. Fixing a single bug could stack up tens of thousands of tokens of context, and if it failed, all that cost was gone and you started over from scratch. A single run exceeding ten times the expected cost was routine.

At first I thought, "Can't we just switch to a cheaper model?" In practice, the first thing to drop was output quality, not the bill. The real problem wasn't model selection; it was design. This post breaks down where costs explode structurally, then walks through four strategies (prompt caching, context compression, model routing, and checkpointing) with code for each. Cuts of 60–90% are achievable while maintaining performance; the reasoning and implementation are covered in each strategy's section.

By the end, you'll spot things you can apply today. We'll go in order: from caching, which delivers immediate results with a single line of code, to checkpointing, which eliminates the failure cost of long-running agents at the source.


Where Costs Explode

Three Cost Dimensions

A long-horizon agent is an autonomous system that cycles through planning, reasoning, tool calls, and self-correction over minutes, hours, or even days. This characteristic causes the cost structure to behave entirely differently from ordinary LLM API calls.

When people talk about agent costs, they often think only of "token costs" — but in practice, three dimensions are intertwined.

| Cost Type | Cause | Growth Pattern |
| --- | --- | --- |
| Token cost | Input/output token API charges | Linear |
| Context cost | Accumulation of conversation history and tool outputs | Non-linear growth¹ |
| Failure cost | Full retry when a mid-run failure occurs | Super-linear (failure rate grows faster than task length) |

¹ The API charge itself scales linearly with token count. However, as context grows, the attention computation complexity inside the model increases as O(n²), and more importantly, the probability of retries and failures due to context drift rises alongside it — causing indirect costs to grow non-linearly.

Context cost is especially dangerous because, left unchecked, context drift sets in.
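
To see why accumulated context dominates the bill even though the per-token price is flat, consider a minimal sketch. It assumes (hypothetically) that every step appends a fixed number of tokens and that the full history is resent as input on each call, with no caching or compression. The function name and the numbers are illustrative only.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int, base_tokens: int = 0) -> int:
    """Total input tokens billed when every step resends the full history."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        context += tokens_per_step  # a new message or tool output is appended
        total += context            # the whole context is sent as input again
    return total


# Hypothetical workload: each step adds 2,000 tokens of tool output
for steps in (10, 20, 40):
    print(steps, cumulative_input_tokens(steps, tokens_per_step=2_000))
```

Under these assumptions the total is `tokens_per_step * steps * (steps + 1) / 2`, so doubling the number of steps roughly quadruples the billed input tokens.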

Context Drift vs. Context Exhaustion

  • Context exhaustion: The context window fills up physically. Even a 200K window can eventually be filled.
  • Context drift: Even with window space remaining, stale or irrelevant content accumulates and quietly degrades the model's reasoning quality.

Drift arrives much more quietly, and is far more dangerous.

Failure cost is also an easy trap to overlook. According to the Trajectory Reduction research, when task duration doubles, the failure rate quadruples, which is roughly quadratic growth in task length. There is also data showing that agent success rates begin to decline noticeably after roughly 35 minutes of runtime, which means that simply increasing the context window size cannot solve this problem on its own.
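
The effect of full retries on the bill can be sketched with a simple expected-value calculation. This is a worst-case model under stated assumptions: every attempt fails independently with the same probability, a failed attempt costs as much as a successful one (it fails at the very end), and the agent restarts from scratch until it succeeds. The dollar figures and failure rates are hypothetical.

```python
def expected_cost_full_retry(cost_per_attempt: float, failure_rate: float) -> float:
    """Expected total cost when a failed run restarts from scratch.

    The number of attempts until the first success follows a geometric
    distribution, so the expected attempt count is 1 / (1 - failure_rate).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return cost_per_attempt / (1.0 - failure_rate)


# Hypothetical: a task twice as long costs twice as much per attempt
# and fails four times as often.
short_task = expected_cost_full_retry(cost_per_attempt=2.0, failure_rate=0.10)
long_task = expected_cost_full_retry(cost_per_attempt=4.0, failure_rate=0.40)
print(f"short: ${short_task:.2f}, long: ${long_task:.2f}")
```

In this toy model the longer task costs about three times as much in expectation, not twice, because the retry multiplier grows along with the per-attempt cost.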


4 Core Cost-Reduction Strategies

Strategy 1: Prompt Caching — The Easiest Savings to Capture

A KV cache (key-value cache) is the structure in which a transformer stores the attention keys and values it has already computed for earlier tokens, so that they are not recomputed for every new token. Prompt caching exposes the same idea at the API level: major LLM providers reuse the cached computation for a prompt prefix they have seen recently and bill those tokens at a discount. Honestly, if you're not using this, you're essentially throwing money away.

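For Anthropic Claude, a cache hit makes input token costs 90% cheaper than the standard price ($0.30/M vs. $3.00/M). Writing the cache is not free, though: the call that creates the entry is billed at a premium over the standard input price, so caching pays off only when the prefix is reused before the entry expires. OpenAI automatically caches sufficiently long prefixes without any configuration, and cache-hit tokens are billed at 50% of the standard price (this means the unit price of hit tokens is halved, not that average savings are 50%). Discounts differ by model, so it is worth checking the current pricing pages.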
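
The following back-of-the-envelope sketch shows how those unit prices translate into a bill. The base and cache-read prices are the figures quoted above. The cache-write price of 1.25x the base price is an assumption to verify against current pricing, as is the simplification that every call after the first arrives before the cache entry expires. The token counts are hypothetical.

```python
BASE_INPUT = 3.00   # USD per million input tokens (figure quoted above)
CACHE_READ = 0.30   # USD per million cached tokens read (figure quoted above)
CACHE_WRITE = 3.75  # assumption: 1.25x base price for writing the cache entry
PER_MILLION = 1_000_000


def input_cost(prefix_tokens: int, dynamic_tokens: int, calls: int, cached: bool) -> float:
    """Input-token cost in USD for `calls` requests (calls >= 1) sharing one static prefix."""
    if not cached:
        return calls * (prefix_tokens + dynamic_tokens) * BASE_INPUT / PER_MILLION
    # First call writes the cache; later calls read it (assumes no expiry in between).
    first = (prefix_tokens * CACHE_WRITE + dynamic_tokens * BASE_INPUT) / PER_MILLION
    later = (calls - 1) * (prefix_tokens * CACHE_READ + dynamic_tokens * BASE_INPUT) / PER_MILLION
    return first + later


# Hypothetical workload: 20K-token guidelines, 1K-token request, 50 reviews
plain = input_cost(20_000, 1_000, calls=50, cached=False)
with_cache = input_cost(20_000, 1_000, calls=50, cached=True)
print(f"no cache: ${plain:.3f}, cached: ${with_cache:.3f}, saved: {1 - with_cache / plain:.0%}")
```

The overall saving lands below 90% because the dynamic part of each request is always billed at the full price and the first call pays the write premium.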

When our team first applied this, the daily cost of a code review agent dropped to less than half — overnight. The implementation was surprisingly simple.

python
import anthropic
 
client = anthropic.Anthropic()
 
LONG_CODING_GUIDELINES = "..."  # coding guidelines spanning thousands of tokens
user_code = "..."  # the code to review; this dynamic part is never cached
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a specialized code review agent.\n" + LONG_CODING_GUIDELINES,
            "cache_control": {"type": "ephemeral"}  # designate this block for caching
        }
    ],
    messages=[
        {"role": "user", "content": f"Please review the following code:\n{user_code}"}
    ]
)
 
# Check whether the cache was hit
usage = response.usage
print(f"Cache hit tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")

Key Design Point: Caches match on an exact prefix, so the prompt must be structured as static prefix (system prompt, tool definitions, long documents) + dynamic part (user input, conversation history). If anything dynamic comes before the static content, the prefix changes on every call and the cache never hits.
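
One way to enforce the static-then-dynamic layout is to build every request through a single helper and fingerprint the static part. The sketch below is illustrative: `build_request` and `prefix_fingerprint` are hypothetical helper names, and the returned dictionary only mirrors the `system` and `messages` arguments used in the example above.

```python
import hashlib
import json


def build_request(static_system: str, history: list[dict], user_input: str) -> dict:
    """Place static content in the cached system block and dynamic content after it."""
    return {
        "system": [{
            "type": "text",
            "text": static_system,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [*history, {"role": "user", "content": user_input}],
    }


def prefix_fingerprint(request: dict) -> str:
    """Hash of the static prefix; it should be identical across calls."""
    payload = json.dumps(request["system"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = build_request("You are a reviewer.", [], "Review file A")
second = build_request("You are a reviewer.", [], "Review file B")
print(prefix_fingerprint(first) == prefix_fingerprint(second))
```

Logging the fingerprint alongside the cache counters makes it easy to spot a prefix that changes between calls, for example because a timestamp or a user name was interpolated into the system prompt.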

Once caching handles static costs, it's time to address the context that accumulates dynamically.


Strategy 2: Context Compression — The Core Defense Against Drift

To prevent the context window from ballooning in long-running agents, periodic compression is necessary. AgentDiet (a research framework that removes unnecessary steps from agent execution trajectories) demonstrated 39–60% reduction in input tokens, and ACON (a research approach that iteratively improves compression strategies by comparing successful and failed trajectories) also achieved 26–54% reductions in peak tokens.

The pattern most commonly used in practice is a rolling summary.

python
from typing import List, Dict
import anthropic
 
class ContextManager:
    def __init__(self, max_tokens: int = 50_000, summary_threshold: int = 40_000):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.messages: List[Dict] = []
        self.summary: str = ""
        self.client = anthropic.AsyncAnthropic()  # async client: awaited calls do not block the event loop
 
    async def add_message(self, role: str, content: str, token_count: int):
        self.messages.append({
            "role": role,
            "content": content,
            "tokens": token_count
        })
 
        current_total = sum(m["tokens"] for m in self.messages)
 
        if current_total > self.summary_threshold:
            await self._compress_old_messages()
 
    async def _compress_old_messages(self):
        cutoff = len(self.messages) // 2
        if cutoff == 0:
            return  # a single oversized message: nothing older to summarize
        to_summarize = self.messages[:cutoff]
 
        conversation_text = "\n".join(
            f"[{m['role']}]: {m['content']}" for m in to_summarize
        )
 
        # Use a cheap, small model for compression itself: this is the key savings point
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Please concisely summarize the key information from the following conversation:\n\n{conversation_text}"
            }]
        )
        new_summary = response.content[0].text
 
        self.summary = f"{self.summary}\n\n{new_summary}".strip()
        self.messages = self.messages[cutoff:]
 
    def get_context(self) -> List[Dict]:
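        context = []
        if self.summary:
            # The Messages API accepts only "user" and "assistant" roles inside `messages`
            # (system text belongs in the top-level `system` parameter), so the summary
            # is passed as a leading user turn.
            context.append({
                "role": "user",
                "content": f"[Previous Conversation Summary]\n{self.summary}"
            })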
        context.extend(
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        )
        return context

Because the compression step itself incurs LLM call costs, the key is to use a cheap, small model for summarization rather than an expensive frontier model. This single pattern produces a noticeable reduction in context-related costs.
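
Whether a compression pass pays for itself can be estimated before running it. The sketch below is a rough model under stated assumptions: all prices are hypothetical placeholders in USD per million tokens, the tokens removed would otherwise be resent at the main model's full input price on every remaining step, and prompt caching is ignored.

```python
PER_MILLION = 1_000_000


def compression_net_savings(
    old_tokens: int,
    summary_tokens: int,
    remaining_steps: int,
    main_input_price: float,
    small_input_price: float,
    small_output_price: float,
) -> float:
    """Net USD saved by replacing `old_tokens` of history with a summary."""
    # One-off cost: the small model reads the old history and writes the summary
    summarize_cost = (
        old_tokens * small_input_price + summary_tokens * small_output_price
    ) / PER_MILLION
    # Recurring saving: the main model no longer re-reads the removed tokens
    saved_per_step = (old_tokens - summary_tokens) * main_input_price / PER_MILLION
    return remaining_steps * saved_per_step - summarize_cost


# Hypothetical prices: main model 3.00 in; small model 1.00 in / 5.00 out
for steps in (0, 1, 10):
    net = compression_net_savings(20_000, 1_000, steps, 3.00, 1.00, 5.00)
    print(f"{steps} remaining steps: net {net:+.3f} USD")
```

With these numbers compression breaks even after a single further step. The picture changes when the history being replaced was already served from the prompt cache: rewriting it alters the prefix, so the next call has to write a new cache entry, and the saving per step is smaller than this model suggests.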

Once compression keeps accumulation in check, let's move on to selecting the right model for each task.


Strategy 3: Model Routing — The Largest Cost Reduction

At first I thought, "Can't we just use Claude Opus or GPT-4 for everything?" — but in practice that turns out to be a surprisingly expensive assumption. Production experience shows that roughly 85% of enterprise queries can be handled adequately by lower-cost models. Using Opus for simple summaries, translations, or format conversions is like paying taxi fare to ride a bus.

python
from enum import Enum
 
class TaskComplexity(Enum):
    SIMPLE = "simple"      # Haiku-tier — translation, summarization, format conversion
    MODERATE = "moderate"  # Sonnet-tier — general analysis, code explanation
    COMPLEX = "complex"    # Opus-tier — architecture design, security analysis
 
class ModelRouter:
    ROUTING_RULES = {
        TaskComplexity.SIMPLE: "claude-haiku-4-5-20251001",
        TaskComplexity.MODERATE: "claude-sonnet-4-6",
        TaskComplexity.COMPLEX: "claude-opus-4-8",
    }
 
    def classify_task(self, task: str, context_tokens: int) -> TaskComplexity:
        # Large context likely indicates a complex task
        if context_tokens > 30_000:
            return TaskComplexity.COMPLEX
 
        # Keyword-based complexity assessment (a dedicated classifier is recommended at scale)
        complex_signals = ["architecture design", "security vulnerability", "performance optimization", "refactoring"]
        simple_signals = ["translation", "summarization", "format conversion", "simple question"]
 
        task_lower = task.lower()  # the signals are lowercase, so match case-insensitively
        if any(signal in task_lower for signal in complex_signals):
            return TaskComplexity.COMPLEX
        if any(signal in task_lower for signal in simple_signals):
            return TaskComplexity.SIMPLE
        return TaskComplexity.MODERATE
 
    def get_model(self, task: str, context_tokens: int) -> str:
        complexity = self.classify_task(task, context_tokens)
        return self.ROUTING_RULES[complexity]
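
The keyword rules above return a label with no notion of how sure they are. A minimal sketch of a confidence-aware variant follows; it escalates one tier when the classifier is unsure, since a misrouted complex task costs more in quality than an over-provisioned simple one costs in money. The tier labels, keyword lists, confidence values, and threshold are all hypothetical and would need tuning on real traffic.

```python
from dataclasses import dataclass

TIERS = ["small", "medium", "large"]  # hypothetical labels; map them to real model IDs


@dataclass
class Classification:
    tier: str
    confidence: float


def classify(task: str) -> Classification:
    """Toy keyword classifier that also reports how sure it is."""
    text = task.lower()
    complex_hits = sum(s in text for s in ("architecture", "security", "refactor"))
    simple_hits = sum(s in text for s in ("translate", "summarize", "format"))
    if complex_hits and simple_hits:
        return Classification("medium", 0.4)  # conflicting signals: low confidence
    if complex_hits:
        return Classification("large", 0.9)
    if simple_hits:
        return Classification("small", 0.9)
    return Classification("medium", 0.7)      # no signal: default tier


def route(task: str, min_confidence: float = 0.7) -> str:
    """Return the tier to use, escalating one level when confidence is low."""
    result = classify(task)
    if result.confidence >= min_confidence:
        return result.tier
    index = TIERS.index(result.tier)
    return TIERS[min(index + 1, len(TIERS) - 1)]


for task in ("Translate this README", "Summarize the security audit", "Explain this function"):
    print(task, "->", route(task))
```

A task with conflicting signals, such as a summary of a security audit, is bumped from the medium tier to the large one instead of being trusted to the default.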

Expanding this into a hierarchical multi-agent structure is even more powerful: the orchestrator (strategy decisions) uses a large model, while workers (actual execution) use small models. The OPTIMA research (a training framework for optimizing multi-agent conversation trajectories) reported achieving massive cost reductions while retaining 97.7% of performance.

Even if implementing the routing logic feels tedious, the ROI becomes overwhelming at scale. Once model selection is sorted, let's move on to the final strategy that eliminates failure costs at the source.


Strategy 4: Checkpointing — The Safety Net That Stops Failure Costs

There is a particularly painful situation in long-running agents: a task that ran for 40 minutes fails at the very end and has to restart from scratch. Checkpointing is essential infrastructure to prevent this.

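typescript
import Redis from "ioredis";
 
// Assumes a Redis server reachable with the ioredis defaults (localhost:6379)
const redis = new Redis();
 
interface AgentCheckpoint {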
  taskId: string;
  step: number;
  state: Record<string, unknown>;
  completedActions: string[];
  timestamp: number;
}
 
const MAX_STEPS = 100;
 
class CheckpointedAgent {
  private checkpointInterval = 5; // save every 5 steps
 
  async execute(task: string, taskId: string): Promise<void> {
    const checkpoint = await this.loadCheckpoint(taskId);
    const startStep = checkpoint?.step ?? 0;
    let state: Record<string, unknown> = checkpoint?.state ?? this.initState(task);
 
    console.log(`Resuming from step ${startStep}`);
 
    for (let step = startStep; step < MAX_STEPS; step++) {
      state = await this.executeStep(step, state);
 
      // Save after every 5th completed step (steps are 0-indexed, hence step + 1)
      if ((step + 1) % this.checkpointInterval === 0) {
        await this.saveCheckpoint({
          taskId,
          step: step + 1,
          state,
          completedActions: (state.actions as string[]) ?? [],
          timestamp: Date.now()
        });
      }
 
      if (this.isComplete(state)) break;
    }
  }
 
  private async saveCheckpoint(cp: AgentCheckpoint): Promise<void> {
    // 3600 s TTL: raise this for tasks that can run longer than an hour
    await redis.setex(`checkpoint:${cp.taskId}`, 3600, JSON.stringify(cp));
  }
 
  private async loadCheckpoint(taskId: string): Promise<AgentCheckpoint | null> {
    const data = await redis.get(`checkpoint:${taskId}`);
    return data ? JSON.parse(data) : null;
  }
 
  private initState(task: string): Record<string, unknown> {
    return { task, actions: [], status: "started" };
  }
 
  private async executeStep(
    step: number,
    state: Record<string, unknown>
  ): Promise<Record<string, unknown>> {
    // implement actual step execution logic
    return state;
  }
 
  private isComplete(state: Record<string, unknown>): boolean {
    return state.status === "done";
  }
}

Real-World Application

Example 1: Software Engineering Automation Agent

A workflow that reads a codebase → diagnoses a bug → applies a fix → runs tests → iteratively corrects on failure is a canonical example of a long-horizon agent. Because compile logs and test results accumulate continuously, context management is central. This is a situation encountered frequently in practice, and the pattern below applies three optimizations simultaneously.

python
import anthropic
from dataclasses import dataclass, field
 
@dataclass
class BugFixAgent:
    # The calls below are awaited, so the async client is required
    client: anthropic.AsyncAnthropic = field(default_factory=anthropic.AsyncAnthropic)
    context_manager: ContextManager = field(default_factory=ContextManager)
    router: ModelRouter = field(default_factory=ModelRouter)
 
    SYSTEM_PROMPT = """You are a senior software engineer.
    You specialize in diagnosing and fixing bugs.
    You always write tests and maintain code quality."""
 
    async def fix_bug(self, bug_report: str, codebase_context: str) -> str:
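        # [Optimization 1] Select model based on task complexity (model routing).
        # len() counts characters, not tokens; ~4 characters per token is a rough
        # estimate for English text.
        analysis_model = self.router.get_model("bug analysis", len(codebase_context) // 4)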
 
        analysis = await self.client.messages.create(
            model=analysis_model,
            max_tokens=2048,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}  # [Optimization 2] cache the codebase
            }],
            messages=[{
                "role": "user",
                "content": f"Bug report: {bug_report}\n\nPlease analyze the root cause."
            }]
        )
 
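        # [Optimization 3] Switch to a large model for complex fixes
        # Caches are scoped per model: this call reuses the cached prefix only when
        # analysis_model is the same model. Otherwise it writes its own cache entry,
        # which pays off on later fix iterations with this model.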
        fix = await self.client.messages.create(
            model="claude-opus-4-8",
            max_tokens=4096,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[
                {"role": "user", "content": f"Bug report: {bug_report}"},
                {"role": "assistant", "content": analysis.content[0].text},
                {"role": "user", "content": "Based on the analysis above, please write the fix."}
            ]
        )
 
        return fix.content[0].text

Here is how the three optimizations fit together:

| Optimization Point | Code Location | Savings Mechanism |
| --- | --- | --- |
| System prompt caching | cache_control: ephemeral | Up to 90% lower price on cached input tokens in repeated calls |
| Codebase context caching | Same system block across analysis→fix steps | Second call gets a cache hit when both steps run on the same model (caches are per model) |
| Per-step model routing | analysis_model vs claude-opus-4-8 | Router-selected model for analysis, large model for the fix |

Example 2: Deep Research Agent

In a workflow of web search → document reading → synthesis report writing, how you manage repeated search results makes all the difference in cost. There is a case where context optimization alone cut daily costs from $150 to $45 ($38,000/year saved) for an automated workload running at 1,000 queries/day.

python
from typing import AsyncGenerator
import anthropic
 
class ResearchAgent:
    def __init__(self):
        # Async client, created per instance, so awaited calls do not block the event loop
        self.client = anthropic.AsyncAnthropic()
        self.context_manager = ContextManager(
            max_tokens=80_000,
            summary_threshold=60_000
        )
 
    async def research(self, query: str) -> AsyncGenerator[str, None]:
        search_results = []
 
        async for result in self._search(query):
            # Summarize search results immediately: store only summaries, not full text, in context
            summary = await self._summarize_result(result, query)
            search_results.append(summary)
 
            # Token count estimate: approximate value for English
            # For non-Latin scripts, word count based on spaces diverges significantly from actual token count;
            # using tiktoken or the API's token counter is recommended
            estimated_tokens = int(len(summary.split()) * 1.5)
 
            await self.context_manager.add_message(
                role="user",  # the Messages API has no "tool" role; tool output travels in user turns
                content=f"[Search Result Summary]\n{summary}",
                token_count=estimated_tokens
            )
 
            yield f"In progress: {len(search_results)} results collected"
 
        # Use a large model for the final report
        report = await self._synthesize(
            query=query,
            context=self.context_manager.get_context()
        )
 
        yield report
 
    async def _summarize_result(self, result: str, query: str) -> str:
        # use self.client, not a global variable
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",  # use a cheap small model for summarization
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Summarize only what is relevant to the query '{query}' in 3 sentences:\n{result[:2000]}"
            }]
        )
        return response.content[0].text
 
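    async def _search(self, query: str) -> AsyncGenerator[str, None]:
        # implement actual search logic and yield the text of each result
        raise NotImplementedError
        yield ""  # unreachable; its presence makes this stub an async generator
 
    async def _synthesize(self, query: str, context: list) -> str:
        # implement final report generation logic
        raise NotImplementedError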

Trade-off Analysis

Each strategy has a different implementation difficulty, and the point at which its effect appears also differs. This table should help you decide where to start.

ROI Comparison by Strategy

| Strategy | Implementation Difficulty | Immediate Effect | Savings Scale |
| --- | --- | --- | --- |
| Prompt caching | Low (one line of code) | Immediate | 50–90% on cache-hit tokens |
| Context compression | Medium | Effect grows as context accumulates | 39–60% of input tokens |
| Model routing | Medium–High | Immediate | 60–90% of total cost |
| Checkpointing | High | Effective for long-running tasks | Full retry cost |

Implementation Trade-offs

| Item | Risk | Mitigation |
| --- | --- | --- |
| Cache miss | Cache does not work at all if the dynamic part precedes the static part | Enforce a static→dynamic prompt structure by design |
| Information loss during compression | Higher compression ratios risk losing important context | Validate compression quality by comparing successful vs. failed trajectories, as in the ACON approach |
| Routing error | Sending a complex task to a small model causes a sharp quality drop | Fall back to a higher-tier model when classification confidence is low |
| Cost unpredictability | Branching logic and retries cause non-linear token usage growth | Token budget caps and automatic circuit-breaker safeguards are essential |
| State serialization difficulty | Complex agent state can be difficult to serialize into checkpoints in some cases | Design a serializable state structure from the start |
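
The token budget cap and circuit breaker named in the table can be as small as a counter that every model call reports to. The sketch below is one possible shape under stated assumptions: the class and exception names are hypothetical, and the limits are placeholders to be set from measured per-task costs.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task has used up its token or retry allowance."""


class TokenBudget:
    def __init__(self, max_tokens: int, max_failures: int):
        self.max_tokens = max_tokens
        self.max_failures = max_failures
        self.used = 0
        self.failures = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record the usage of one model call and trip when the cap is passed."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.used}/{self.max_tokens}")

    def record_failure(self) -> None:
        """Record one failed step and trip after too many of them."""
        self.failures += 1
        if self.failures > self.max_failures:
            raise BudgetExceeded(f"too many failed steps: {self.failures}")


budget = TokenBudget(max_tokens=200_000, max_failures=3)
budget.charge(input_tokens=12_000, output_tokens=800)
print(f"used {budget.used} of {budget.max_tokens} tokens")
```

Calling `charge` with the usage figures returned by each API response turns a runaway retry loop into a single caught exception, at which point the agent can save a checkpoint and hand the task to a human.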

These four strategies are each effective on their own, but applying them together multiplies the impact. That said, trying to implement all of them at once raises complexity, so it's more practical to apply them one at a time in the order: caching → compression → routing → checkpointing.

The Most Common Mistakes in Practice

  1. Designing prompts without considering cache structure: A structure that prepends user input before the system prompt completely neutralizes the cache. If your cache hit rate is 0%, this is almost always the culprit. It's worth considering the static/dynamic separation from the very beginning of prompt design.

  2. Expanding context indefinitely without compression: "We have a 200K context window, so it'll be fine" invites both drift and a cost explosion. Recall that success rates start to decline after roughly 35 minutes of runtime, and a larger window does not prevent that.

  3. Applying frontier models to every task: Simple summarization, translation, and format conversion rarely need an Opus-tier model. A keyword router like the one in Strategy 3 is enough to start, and a dedicated classifier can replace it at scale.


Closing Thoughts

The core of long-horizon agent cost optimization is not choosing cheaper models — it is designing when not to use expensive ones.

Three steps you can start right now:

  1. Measure your current cache hit rate first: For Anthropic, you can log response.usage.cache_read_input_tokens; for OpenAI, usage.prompt_tokens_details.cached_tokens. A hit rate of 0% is very likely a prompt structure issue.

  2. Add one line of cache_control: It's worth wrapping your system prompt and static context (documents, codebase) with cache_control: {"type": "ephemeral"}. This is the highest-ROI first step — a single line of code that can cut repeat call costs by up to 90%.

  3. Attach a rolling summary: You can add a ContextManager that compresses old messages with a small model once context exceeds a certain threshold (e.g., 40K tokens). Based on the pattern introduced above, a good starting point is to apply it to just your longest-running session and observe the change in cost.
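
For the first step, the hit rate can be computed from the usage counters named above. The sketch takes plain dictionaries so it runs without an SDK; with a real response object the same fields are read as attributes. It assumes that Anthropic's `input_tokens` counts only the uncached part of the prompt, while OpenAI's `prompt_tokens` already includes the cached tokens. Both assumptions should be checked against the current API references.

```python
def anthropic_cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache (Anthropic-style usage fields)."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0  # assumption: excludes cached tokens
    total = read + created + uncached
    return read / total if total else 0.0


def openai_cache_hit_rate(usage: dict) -> float:
    """Share of prompt tokens served from cache (OpenAI-style usage fields)."""
    prompt = usage.get("prompt_tokens") or 0   # assumption: includes cached tokens
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt if prompt else 0.0


sample = {"input_tokens": 1_000, "cache_read_input_tokens": 9_000, "cache_creation_input_tokens": 0}
print(f"hit rate: {anthropic_cache_hit_rate(sample):.0%}")
```

Tracking this number per endpoint over a day is usually enough to tell a prompt-structure problem (hit rate stuck at zero) from an expiry problem (hit rate that collapses when traffic is sparse).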


References

  • ACON: Optimizing Context Compression for Long-horizon LLM Agents | arXiv
  • Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks | arXiv
  • Reducing Cost of LLM Agents with Trajectory Reduction | arXiv
  • Long-Horizon Agents Are Here. Full Autopilot Isn't | DEV Community
  • Long-horizon agents explained: Hype, reality, engineering lessons | EPAM Insights
  • Context Engineering in 2025: The Complete Guide to AI Agent Optimization | Mem0 Blog
  • AI Agent Context Compression: Strategies for Long-Running Sessions | Zylos Research
  • Prompt Caching for Anthropic and OpenAI Models | DigitalOcean Blog
  • LangChain Cost Optimization: Agent Execution Cost Analysis and Reduction
  • Long Horizon Document Agents | LlamaIndex Blog
  • Context Window Management Strategies for Long-Context AI Agents | Maxim AI
  • Long-Running AI Agents and Task Decomposition 2026 | Zylos Research