Cutting Long-Horizon Agent Costs by 60–90%: Caching, Compression, and Routing Strategies

I still remember the shock of receiving that first bill after putting an AI agent into production. A simple chatbot would have been predictable, but agents were different. Fixing a single bug could stack up tens of thousands of tokens of context, and if it failed, all that cost was gone and you started over from scratch. A single run exceeding ten times the expected cost was routine.

At first I thought, "Can't we just switch to a cheaper model?" In practice, the first thing to drop was output quality. The real problem wasn't model selection; it was design. This post breaks down where costs explode structurally, then walks through four strategies (prompt caching, context compression, model routing, and checkpointing) with working code. Combined, they can cut costs by 60–90% without giving up quality; each section covers both the reasoning and the implementation.

By the end, you'll spot things you can apply today. We'll go in order: from caching, which delivers immediate results with a single line of code, to checkpointing, which eliminates the failure cost of long-running agents at the source.

Where Costs Explode

Three Cost Dimensions

A long-horizon agent is an autonomous system that cycles through planning, reasoning, tool calls, and self-correction over minutes, hours, or even days. This characteristic causes the cost structure to behave entirely differently from ordinary LLM API calls.

When people talk about agent costs, they often think only of "token costs" — but in practice, three dimensions are intertwined.

Cost Type	Cause	Growth Pattern
Token cost	Input/output token API charges	Linear
Context cost	Accumulation of conversation history and tool outputs	Non-linear growth¹
Failure cost	Full retry when a mid-run failure occurs	Super-linear (roughly quadratic in task duration)

¹ The API charge itself scales linearly with token count. However, as context grows, the attention computation complexity inside the model increases as O(n²), and more importantly, the probability of retries and failures due to context drift rises alongside it — causing indirect costs to grow non-linearly.
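To make the footnote concrete: an agent loop resends its whole history on every step, so even though each individual call is billed linearly, the total input billed over a run grows roughly with the square of the step count. The sketch below is illustrative only. It assumes a hypothetical fixed `tokens_per_step` appended at every step and no caching or compression; `cumulative_input_tokens` is a name made up for this example.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int, base_prompt: int = 0) -> int:
    """Total input tokens billed across a run when every call resends the full history.

    Assumption (illustrative only): each step appends a fixed number of tokens
    and nothing is cached, summarized, or dropped.
    """
    total = 0
    context = base_prompt
    for _ in range(steps):
        context += tokens_per_step  # the history grows by one step's worth of tokens
        total += context            # the whole context is sent (and billed) again
    return total


if __name__ == "__main__":
    for steps in (10, 20, 40):
        print(steps, cumulative_input_tokens(steps, tokens_per_step=2_000))
```

Under these assumptions, doubling the number of steps roughly quadruples the total input billed, which is why caching and compression matter more the longer a run gets.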

Context cost is especially dangerous because, left unchecked, context drift sets in.

Context Drift vs. Context Exhaustion

Context exhaustion: The context window fills up physically. Even a 200K window can eventually be filled.

Context drift: Even with window space remaining, stale or irrelevant content accumulates and quietly degrades the model's reasoning quality.

Drift arrives much more quietly, and is far more dangerous.

Failure cost is also an easy trap to overlook. According to the Trajectory Reduction research, when task duration doubles, the failure rate quadruples, which is roughly quadratic growth. There is also data showing that agent success rates begin to decline noticeably after roughly 35 minutes, so simply enlarging the context window cannot solve this problem on its own.

4 Core Cost-Reduction Strategies

Strategy 1: Prompt Caching — The Easiest Savings to Capture

A KV cache (key-value cache) stores a transformer's intermediate attention results, the keys and values for tokens it has already processed, in GPU memory so they don't have to be recomputed. Major LLM providers now expose this at the API level as prompt caching. Honestly, if you're not using it, you're essentially throwing money away.

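For Anthropic Claude, cache-hit input tokens are billed about 90% below the standard price ($0.30/M vs. $3.00/M in this example). Writing to the cache carries a surcharge (25% over the base input price for the default five-minute cache at the time of writing), so the savings start with the second call that reuses the prefix. OpenAI caches sufficiently long prompt prefixes automatically, without any configuration, and bills cache-hit tokens at a discount: 50% of the standard price when the feature launched, with the exact rate depending on the model. Note that this halves the unit price of hit tokens; it does not mean average savings are 50%.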
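The arithmetic behind those numbers fits in a few lines. The $3.00/M base price and $0.30/M cache-read price come from the paragraph above. The $3.75/M cache-write price (a 25% premium over base input) is an assumption to check against the current pricing page, and `prefix_tokens` and `calls` are hypothetical workload parameters.

```python
BASE_PER_M = 3.00         # standard input price, USD per million tokens (from the text)
CACHE_READ_PER_M = 0.30   # cache-hit price (from the text)
CACHE_WRITE_PER_M = 3.75  # assumed: cache writes billed at 1.25x the base price


def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """USD spent on the static prefix over `calls` requests.

    Simplification: with caching, the first call writes the cache and every
    later call hits it (that is, calls arrive before the cache entry expires).
    """
    millions = prefix_tokens / 1_000_000
    if not cached or calls == 0:
        return millions * BASE_PER_M * calls
    return millions * (CACHE_WRITE_PER_M + CACHE_READ_PER_M * (calls - 1))


if __name__ == "__main__":
    for calls in (1, 2, 10, 100):
        plain = prefix_cost(20_000, calls, cached=False)
        cached = prefix_cost(20_000, calls, cached=True)
        print(f"{calls:>3} calls: ${plain:.4f} uncached vs ${cached:.4f} cached")
```

Under these assumptions a single call is slightly more expensive with caching because of the write premium. The saving appears from the second call onward and approaches 90% of the prefix cost as the number of calls grows.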

When our team first applied this, the daily cost of a code review agent dropped to less than half — overnight. The implementation was surprisingly simple.

python

import anthropic
 
client = anthropic.Anthropic()
 
LONG_CODING_GUIDELINES = "..."  # coding guidelines spanning thousands of tokens
user_code = "..."  # placeholder for the code under review
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a specialized code review agent.\n" + LONG_CODING_GUIDELINES,
            "cache_control": {"type": "ephemeral"}  # designate this block for caching
        }
    ],
    messages=[
        {"role": "user", "content": f"Please review the following code:\n{user_code}"}
    ]
)
 
# Check whether the cache was hit
usage = response.usage
print(f"Cache hit tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")

Key Design Point: Caching matches on an exact prompt prefix, so the prompt must be structured as static prefix (system prompt, tool definitions, long documents) + dynamic part (user input, conversation history). If dynamic content comes first, the prefix changes on every call and the cache never hits.
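One way to enforce that ordering is to assemble every request through a single helper, so dynamic content can never be placed ahead of the static prefix. This is a minimal sketch, not an official pattern: `build_request` and `static_blocks` are hypothetical names, and the returned dict is only meant to be unpacked into a call such as `client.messages.create(model=..., max_tokens=..., **request)`.

```python
from typing import Any


def build_request(
    static_blocks: list[str],
    history: list[dict[str, str]],
    user_input: str,
) -> dict[str, Any]:
    """Assemble request kwargs with the static prefix first and dynamic content last.

    `static_blocks` (role description, guidelines, reference documents) must be
    identical from call to call for the prefix to be reusable.
    """
    system: list[dict[str, Any]] = [{"type": "text", "text": text} for text in static_blocks]
    if system:
        # Mark the end of the static prefix; everything up to here is cacheable.
        system[-1]["cache_control"] = {"type": "ephemeral"}
    # Dynamic content (history, then the new user turn) always comes after the prefix.
    messages = [*history, {"role": "user", "content": user_input}]
    return {"system": system, "messages": messages}


if __name__ == "__main__":
    first = build_request(["You are a code reviewer.", "GUIDELINES..."], [], "Review file A")
    second = build_request(["You are a code reviewer.", "GUIDELINES..."], [], "Review file B")
    print(first["system"] == second["system"])  # identical prefix across calls
```

Keeping the static blocks in one place also makes accidental edits to the prefix, such as a timestamp in the system prompt, easier to spot.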

Once caching handles static costs, it's time to address the context that accumulates dynamically.

Strategy 2: Context Compression — The Core Defense Against Drift

To prevent the context window from ballooning in long-running agents, periodic compression is necessary. AgentDiet (a research framework that removes unnecessary steps from agent execution trajectories) demonstrated 39–60% reduction in input tokens, and ACON (a research approach that iteratively improves compression strategies by comparing successful and failed trajectories) also achieved 26–54% reductions in peak tokens.

The pattern most commonly used in practice is a rolling summary.

python

from typing import Dict, List

import anthropic


class ContextManager:
    def __init__(self, max_tokens: int = 50_000, summary_threshold: int = 40_000):
        self.max_tokens = max_tokens  # overall budget; not enforced in this sketch
        self.summary_threshold = summary_threshold  # compress once stored messages exceed this
        self.messages: List[Dict] = []
        self.summary: str = ""
        # Async client: the methods below are coroutines, so the API call must be awaited
        self.client = anthropic.AsyncAnthropic()

    async def add_message(self, role: str, content: str, token_count: int):
        self.messages.append({
            "role": role,
            "content": content,
            "tokens": token_count
        })

        current_total = sum(m["tokens"] for m in self.messages)

        if current_total > self.summary_threshold:
            await self._compress_old_messages()

    async def _compress_old_messages(self):
        # Summarize the older half and keep the newer half verbatim
        cutoff = len(self.messages) // 2
        if cutoff == 0:
            return  # a single message cannot be split further
        to_summarize = self.messages[:cutoff]

        conversation_text = "\n".join(
            f"[{m['role']}]: {m['content']}" for m in to_summarize
        )

        # Use a cheap, small model for the compression itself; this is the key savings point
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Please concisely summarize the key information from the following conversation:\n\n{conversation_text}"
            }]
        )
        new_summary = response.content[0].text

        self.summary = f"{self.summary}\n\n{new_summary}".strip()
        self.messages = self.messages[cutoff:]

    def get_context(self) -> List[Dict]:
        context = []
        if self.summary:
            # The Messages API accepts only "user" and "assistant" roles in `messages`;
            # alternatively, pass the summary through the top-level `system` parameter.
            context.append({
                "role": "user",
                "content": f"[Previous Conversation Summary]\n{self.summary}"
            })
        context.extend(
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        )
        return context

Because the compression step itself incurs LLM call costs, the key is to use a cheap, small model for summarization rather than an expensive frontier model. This single pattern produces a noticeable reduction in context-related costs.
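Before spending real API calls, the bounding behavior of a rolling summary can be checked offline with a stand-in summarizer. The sketch below is a simplified, dependency-free variant of the same idea, not the class above. `rolling_compress` and `fake_summarizer` are hypothetical names, and the flat 200-token summary size is an invented number.

```python
from typing import Callable

Message = tuple[str, int]  # (text, token_count)


def rolling_compress(
    messages: list[Message],
    threshold: int,
    summarize: Callable[[list[str]], Message],
) -> list[Message]:
    """Replace the older half of `messages` with a summary while the total exceeds `threshold`.

    `summarize` stands in for the small-model call and returns (summary_text, summary_tokens).
    """
    while len(messages) > 1 and sum(t for _, t in messages) > threshold:
        cutoff = len(messages) // 2
        summary = summarize([text for text, _ in messages[:cutoff]])
        compressed = [summary, *messages[cutoff:]]
        if sum(t for _, t in compressed) >= sum(t for _, t in messages):
            break  # the summary did not shrink anything; stop instead of looping forever
        messages = compressed
    return messages


def fake_summarizer(texts: list[str]) -> Message:
    # Stand-in for an LLM call: pretend every summary costs a flat 200 tokens.
    return (f"[summary of {len(texts)} messages]", 200)


if __name__ == "__main__":
    history: list[Message] = []
    for step in range(50):
        history.append((f"tool output {step}", 3_000))
        history = rolling_compress(history, threshold=40_000, summarize=fake_summarizer)
    print(len(history), sum(t for _, t in history))
```

In this simulation the stored context never exceeds the threshold after a compression pass, whereas 50 uncompressed steps would accumulate 150,000 tokens. What the simulation cannot show is information loss, which still has to be evaluated on real trajectories.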

Once compression keeps accumulation in check, let's move on to selecting the right model for each task.

Strategy 3: Model Routing — The Largest Cost Reduction

At first I thought, "Can't we just use Claude Opus or GPT-4 for everything?" — but in practice that turns out to be a surprisingly expensive assumption. Production experience shows that roughly 85% of enterprise queries can be handled adequately by lower-cost models. Using Opus for simple summaries, translations, or format conversions is like paying taxi fare to ride a bus.
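The size of the saving depends on the traffic mix and the price gap between tiers, and a quick blended-cost calculation shows how. In this sketch the relative prices are invented placeholders (the large model is the 1.0 baseline), and the 60/25/15 split is a hypothetical mix consistent with roughly 85% of traffic going to lower-cost models. Substitute your own measured shares and current list prices.

```python
def blended_cost(share_by_tier: dict[str, float], price_by_tier: dict[str, float]) -> float:
    """Average cost per unit of traffic, given the share of traffic routed to each tier."""
    if abs(sum(share_by_tier.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price_by_tier[tier] for tier, share in share_by_tier.items())


if __name__ == "__main__":
    # Hypothetical relative prices, NOT real list prices.
    prices = {"small": 0.07, "medium": 0.2, "large": 1.0}
    everything_large = blended_cost({"large": 1.0}, prices)
    routed = blended_cost({"small": 0.60, "medium": 0.25, "large": 0.15}, prices)
    print(f"routed mix costs {routed / everything_large:.0%} of the all-large baseline")
```

With these made-up numbers the routed mix costs roughly a quarter of the all-large baseline. Most of the remaining cost comes from the 15% of traffic that still needs the large model.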

python

from enum import Enum
 
class TaskComplexity(Enum):
    SIMPLE = "simple"      # Haiku-tier — translation, summarization, format conversion
    MODERATE = "moderate"  # Sonnet-tier — general analysis, code explanation
    COMPLEX = "complex"    # Opus-tier — architecture design, security analysis
 
class ModelRouter:
    ROUTING_RULES = {
        TaskComplexity.SIMPLE: "claude-haiku-4-5-20251001",
        TaskComplexity.MODERATE: "claude-sonnet-4-6",
        TaskComplexity.COMPLEX: "claude-opus-4-8",
    }
 
    def classify_task(self, task: str, context_tokens: int) -> TaskComplexity:
        # Large context likely indicates a complex task
        if context_tokens > 30_000:
            return TaskComplexity.COMPLEX
 
        # Keyword-based complexity assessment (a dedicated classifier is recommended at scale)
        complex_signals = ["architecture design", "security vulnerability", "performance optimization", "refactoring"]
        simple_signals = ["translation", "summarization", "format conversion", "simple question"]
 
        if any(signal in task for signal in complex_signals):
            return TaskComplexity.COMPLEX
        if any(signal in task for signal in simple_signals):
            return TaskComplexity.SIMPLE
        return TaskComplexity.MODERATE
 
    def get_model(self, task: str, context_tokens: int) -> str:
        complexity = self.classify_task(task, context_tokens)
        return self.ROUTING_RULES[complexity]

Expanding this into a hierarchical multi-agent structure is even more powerful: the orchestrator (strategy decisions) uses a large model, while workers (actual execution) use small models. The OPTIMA research (a training framework for optimizing multi-agent conversation trajectories) reported achieving massive cost reductions while retaining 97.7% of performance.
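A keyword router like the one above will sometimes send a hard task to a small model. One common hedge is escalation: try the cheapest tier first and move up only when a quality check fails. The following is a sketch under assumptions, not part of any SDK. `run_with_escalation`, `call_model`, and `is_acceptable` are hypothetical names, and the two callables stand in for a real API call and a real check such as passing tests or schema validation.

```python
from typing import Callable, Sequence


def run_with_escalation(
    task: str,
    tiers: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try each model tier from cheapest to most capable; return (model_used, answer)."""
    if not tiers:
        raise ValueError("at least one model tier is required")
    answer = ""
    for model in tiers:
        answer = call_model(model, task)
        if is_acceptable(answer):
            return model, answer
    # Nothing passed the check: fall back to the most capable tier's attempt.
    return tiers[-1], answer


if __name__ == "__main__":
    def fake_call(model: str, task: str) -> str:
        # Pretend only the larger models manage to produce a usable answer.
        return "ok: " + task if model != "small" else ""

    print(run_with_escalation("refactor module", ["small", "medium", "large"], fake_call, bool))
```

Escalation trades some wasted small-model calls for protection against routing errors, so it pays off when the quality check is cheap compared with a large-model call.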

Even if implementing the routing logic feels tedious, the ROI becomes overwhelming at scale. Once model selection is sorted, let's move on to the final strategy that eliminates failure costs at the source.

Strategy 4: Checkpointing — The Safety Net That Stops Failure Costs

There is a particularly painful situation in long-running agents: a task that ran for 40 minutes fails at the very end and has to restart from scratch. Checkpointing is essential infrastructure to prevent this.
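How much checkpointing saves can be estimated with a simple probability model. Assume, purely for illustration, that every step fails independently with probability `p_fail`, that a failure without checkpoints restarts the run from step zero, and that saving and loading checkpoints is free. The expected number of steps executed until `n` steps succeed in a row is then

$$E[\text{steps}] = \frac{(1 - p)^{-n} - 1}{p}$$

which is the standard result for the expected number of trials needed to see `n` consecutive successes. The function names below are hypothetical.

```python
def expected_steps_restart(n_steps: int, p_fail: float) -> float:
    """Expected steps executed until n_steps succeed in a row, restarting from zero on failure."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    if p_fail == 0.0:
        return float(n_steps)
    q = 1.0 - p_fail  # per-step success probability
    return (q ** -n_steps - 1.0) / p_fail


def expected_steps_checkpointed(n_steps: int, p_fail: float, interval: int) -> float:
    """Same model, but a failure only rolls back to the most recent checkpoint."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    full_segments, remainder = divmod(n_steps, interval)
    total = full_segments * expected_steps_restart(interval, p_fail)
    if remainder:
        total += expected_steps_restart(remainder, p_fail)
    return total


if __name__ == "__main__":
    n, p = 100, 0.02
    print(f"restart from scratch: {expected_steps_restart(n, p):.0f} expected steps")
    print(f"checkpoint every 5:   {expected_steps_checkpointed(n, p, 5):.0f} expected steps")
```

Under these assumptions a 100-step task with a 2% per-step failure rate needs a bit over 300 expected steps without checkpoints and a bit over 100 with a checkpoint every 5 steps. Real failures are rarely independent, so treat the numbers as an order-of-magnitude guide.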

typescript

import Redis from "ioredis";

// Assumption: the ioredis client with a reachable Redis instance (defaults to localhost:6379)
const redis = new Redis();

interface AgentCheckpoint {
  taskId: string;
  step: number; // index of the next step to run after resuming
  state: Record<string, unknown>;
  completedActions: string[];
  timestamp: number;
}

const MAX_STEPS = 100;

class CheckpointedAgent {
  private checkpointInterval = 5; // save every 5 steps

  async execute(task: string, taskId: string): Promise<void> {
    const checkpoint = await this.loadCheckpoint(taskId);
    const startStep = checkpoint?.step ?? 0;
    let state: Record<string, unknown> = checkpoint?.state ?? this.initState(task);

    console.log(checkpoint ? `Resuming from step ${startStep}` : "Starting from step 0");

    for (let step = startStep; step < MAX_STEPS; step++) {
      state = await this.executeStep(step, state);
      const done = this.isComplete(state);

      // Save after every `checkpointInterval` completed steps, and once more on completion
      if ((step + 1) % this.checkpointInterval === 0 || done) {
        await this.saveCheckpoint({
          taskId,
          step: step + 1,
          state,
          completedActions: (state.actions as string[]) ?? [],
          timestamp: Date.now()
        });
      }

      if (done) break;
    }
  }
 
  private async saveCheckpoint(cp: AgentCheckpoint): Promise<void> {
    await redis.setex(`checkpoint:${cp.taskId}`, 3600, JSON.stringify(cp));
  }
 
  private async loadCheckpoint(taskId: string): Promise<AgentCheckpoint | null> {
    const data = await redis.get(`checkpoint:${taskId}`);
    return data ? JSON.parse(data) : null;
  }
 
  private initState(task: string): Record<string, unknown> {
    return { task, actions: [], status: "started" };
  }
 
  private async executeStep(
    step: number,
    state: Record<string, unknown>
  ): Promise<Record<string, unknown>> {
    // implement actual step execution logic
    return state;
  }
 
  private isComplete(state: Record<string, unknown>): boolean {
    return state.status === "done";
  }
}

Real-World Application

Example 1: Software Engineering Automation Agent

A workflow that reads a codebase → diagnoses a bug → applies a fix → runs tests → iteratively corrects on failure is a canonical example of a long-horizon agent. Because compile logs and test results accumulate continuously, context management is central. This is a situation encountered frequently in practice, and the pattern below applies three optimizations simultaneously.

python

import anthropic
from dataclasses import dataclass, field

# ContextManager and ModelRouter are the classes defined in the strategy sections above


@dataclass
class BugFixAgent:
    # Async client, because the calls below are awaited
    client: anthropic.AsyncAnthropic = field(default_factory=anthropic.AsyncAnthropic)
    context_manager: ContextManager = field(default_factory=ContextManager)
    router: ModelRouter = field(default_factory=ModelRouter)

    SYSTEM_PROMPT = """You are a senior software engineer.
    You specialize in diagnosing and fixing bugs.
    You always write tests and maintain code quality."""

    async def fix_bug(self, bug_report: str, codebase_context: str) -> str:
        # [Optimization 1] Select model based on task complexity (model routing).
        # The router expects a token count, so convert characters to a rough estimate
        # (about 4 characters per token for English text; a heuristic, not an exact count).
        estimated_tokens = len(codebase_context) // 4
        analysis_model = self.router.get_model("bug analysis", estimated_tokens)

        analysis = await self.client.messages.create(
            model=analysis_model,
            max_tokens=2048,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}  # [Optimization 2] cache the codebase
            }],
            messages=[{
                "role": "user",
                "content": f"Bug report: {bug_report}\n\nPlease analyze the root cause."
            }]
        )

        # [Optimization 3] Switch to a large model for complex fixes.
        # Cached prefixes are not shared between models: this call reuses the cache
        # only if analysis_model is the same model. Otherwise it writes a new entry
        # that later calls to this model (for example, retries) can hit.
        fix = await self.client.messages.create(
            model="claude-opus-4-8",
            max_tokens=4096,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[
                {"role": "user", "content": f"Bug report: {bug_report}"},
                {"role": "assistant", "content": analysis.content[0].text},
                {"role": "user", "content": "Based on the analysis above, please write the fix."}
            ]
        )

        return fix.content[0].text

Here is how the three optimizations fit together:

Optimization Point	Code Location	Savings Mechanism
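System prompt caching	`cache_control: ephemeral`	Cached input tokens billed about 90% lower on repeated calls
Codebase context caching	Same system across analysis→fix steps	Second call gets a cache hit only when both steps use the same model; otherwise each model builds its own cache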
Per-step model routing	`analysis_model` vs `claude-opus-4-8`	Small model for analysis, large model for fix

Example 2: Deep Research Agent

In a workflow of web search → document reading → synthesis report writing, how you manage repeated search results makes all the difference in cost. There is a case where context optimization alone cut daily costs from $150 to $45 ($38,000/year saved) for an automated workload running at 1,000 queries/day.

python

from typing import AsyncGenerator
import anthropic

# ContextManager is the class defined in the context compression section above


class ResearchAgent:
    def __init__(self):
        # Async client, because the summarization call below is awaited
        self.client = anthropic.AsyncAnthropic()
        self.context_manager = ContextManager(
            max_tokens=80_000,
            summary_threshold=60_000
        )

    async def research(self, query: str) -> AsyncGenerator[str, None]:
        search_results = []

        async for result in self._search(query):
            # Summarize search results immediately; store only summaries, not full text, in context
            summary = await self._summarize_result(result, query)
            search_results.append(summary)

            # Token count estimate: an approximate value for English.
            # For non-Latin scripts, word count based on spaces diverges significantly from actual token count;
            # using tiktoken or the API's token counter is recommended
            estimated_tokens = int(len(summary.split()) * 1.5)

            await self.context_manager.add_message(
                role="user",  # the Messages API accepts only "user" and "assistant" roles
                content=f"[Search Result Summary]\n{summary}",
                token_count=estimated_tokens
            )

            yield f"In progress: {len(search_results)} results collected"

        # Use a large model for the final report
        report = await self._synthesize(
            query=query,
            context=self.context_manager.get_context()
        )

        yield report

    async def _summarize_result(self, result: str, query: str) -> str:
        # use self.client, not a global variable
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",  # use a cheap small model for summarization
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Summarize only what is relevant to the query '{query}' in 3 sentences:\n{result[:2000]}"
            }]
        )
        return response.content[0].text

    async def _search(self, query: str) -> AsyncGenerator[str, None]:
        # implement actual search logic; this stub yields nothing
        return
        yield  # unreachable, but makes this an async generator so `async for` works

    async def _synthesize(self, query: str, context: list) -> str:
        # implement final report generation logic
        raise NotImplementedError("final report generation is not implemented in this sketch")

Trade-off Analysis

Each strategy has a different implementation difficulty, and the point at which its effect appears also differs. This table should help you decide where to start.

ROI Comparison by Strategy

Strategy	Implementation Difficulty	Immediate Effect	Savings Scale
Prompt caching	Low (one line of code)	Immediate	50–90% on cache-hit tokens
Context compression	Medium	Effect grows as context accumulates	39–60% of input tokens
Model routing	Medium–High	Immediate	60–90% of total cost
Checkpointing	High	Effective for long-running tasks	Full retry cost

Implementation Trade-offs

Item	Risk	Mitigation
Cache miss	Cache does not work at all if the dynamic part precedes the static part	Enforce a static→dynamic prompt structure by design
Information loss during compression	Higher compression ratios risk losing important context	Validate compression quality by comparing successful vs. failed trajectories, as in the ACON approach
Routing error	Sending a complex task to a small model causes a sharp quality drop	Fall back to a higher-tier model when classification confidence is low
Cost unpredictability	Branching logic and retries cause non-linear token usage growth	Token budget caps and automatic circuit-breaker safeguards are essential
State serialization difficulty	Complex agent state can be difficult to serialize into checkpoints in some cases	Design a serializable state structure from the start
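The token budget cap from the table can be as small as a counter that raises once a run overspends, which gives the agent loop a single place to stop runaway retries. This is a minimal sketch: `TokenBudget` and `BudgetExceeded` are hypothetical names, and the per-call token counts are assumed to come from the provider's usage fields.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run tries to spend more tokens than it was allotted."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record the usage of one model call and trip the breaker if over budget."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.used)


if __name__ == "__main__":
    budget = TokenBudget(max_tokens=10_000)
    try:
        for _ in range(5):
            budget.charge(input_tokens=3_000, output_tokens=500)  # one simulated call
    except BudgetExceeded as exc:
        print(f"circuit breaker tripped: {exc}")
```

Catching `BudgetExceeded` at the top of the agent loop is also a natural place to write a final checkpoint before stopping.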

These four strategies are each effective on their own, but applying them together multiplies the impact. That said, trying to implement all of them at once raises complexity, so it's more practical to apply them one at a time in the order: caching → compression → routing → checkpointing.

The Most Common Mistakes in Practice

Designing prompts without considering cache structure: Placing user input or other dynamic content ahead of the static system prompt defeats the cache entirely. If your cache hit rate is 0%, this is almost always the culprit. Separate static and dynamic content from the very beginning of prompt design.
Expanding context indefinitely without compression: "We have a 200K context window, so it'll be fine" invites both drift and a cost explosion. Recall that success rates begin to decline noticeably after roughly 35 minutes.
Applying frontier models to every task: Simple summarization, translation, and format conversion rarely need Opus. Routing logic is tedious to build, but it pays for itself quickly at scale.

Closing Thoughts

The core of long-horizon agent cost optimization is not choosing cheaper models — it is designing when not to use expensive ones.

Three steps you can start right now:

Measure your current cache hit rate first: For Anthropic, you can log response.usage.cache_read_input_tokens; for OpenAI, usage.prompt_tokens_details.cached_tokens. A hit rate of 0% is very likely a prompt structure issue.
Add one line of cache_control: Mark the end of your system prompt and static context (documents, codebase) with cache_control: {"type": "ephemeral"}. This is the highest-ROI first step: a single line of code that can cut the cost of cached input tokens on repeat calls by up to 90%.
Attach a rolling summary: You can add a ContextManager that compresses old messages with a small model once context exceeds a certain threshold (e.g., 40K tokens). Based on the pattern introduced above, a good starting point is to apply it to just your longest-running session and observe the change in cost.
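For the first step, the hit rate can be computed per response from the usage counters. The helper below is a sketch with a hypothetical name, `cache_hit_rate`. It assumes the uncached, cache-read, and cache-creation counts are reported separately, as with the Anthropic fields named above. For OpenAI, where the cached count is a subset of `prompt_tokens`, pass `prompt_tokens - cached_tokens` as `input_tokens`.

```python
def cache_hit_rate(input_tokens: int, cache_read_tokens: int, cache_creation_tokens: int = 0) -> float:
    """Fraction of prompt tokens served from cache for one response.

    Assumes the three counts do not overlap (uncached, read from cache, written to cache).
    """
    total = input_tokens + cache_read_tokens + cache_creation_tokens
    return cache_read_tokens / total if total else 0.0


if __name__ == "__main__":
    # Example counts as they might appear in a usage object (made-up numbers).
    print(f"first call:  {cache_hit_rate(150, 0, 12_000):.0%}")
    print(f"second call: {cache_hit_rate(150, 12_000, 0):.0%}")
```

Logging this value per request and averaging it per agent makes a structural problem easy to spot: a rate that stays at 0% after the first call points to a changing prefix.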


Cutting Long-Horizon Agent Costs by 60–90%: Caching, Compression, and Routing Strategies

Where Costs Explode

Three Cost Dimensions

When people talk about agent costs, they often think only of "token costs" — but in practice, three dimensions are intertwined.

Cost Type	Cause	Growth Pattern
Token cost	Input/output token API charges	Linear
Context cost	Accumulation of conversation history and tool outputs	Non-linear growth¹
Failure cost	Full retry when a mid-run failure occurs	Exponential

¹ The API charge itself scales linearly with token count. However, as context grows, the attention computation complexity inside the model increases as O(n²), and more importantly, the probability of retries and failures due to context drift rises alongside it — causing indirect costs to grow non-linearly.

Context cost is especially dangerous because, left unchecked, context drift sets in.

Context Drift vs. Context Exhaustion

Context exhaustion: The context window fills up physically. Even a 200K window can eventually be filled.

Context drift: Even with window space remaining, stale or irrelevant content accumulates and quietly degrades the model's reasoning quality.

Drift arrives much more quietly, and is far more dangerous.

4 Core Cost-Reduction Strategies

Strategy 1: Prompt Caching — The Easiest Savings to Capture

When our team first applied this, the daily cost of a code review agent dropped to less than half — overnight. The implementation was surprisingly simple.

python

import anthropic
 
client = anthropic.Anthropic()
 
LONG_CODING_GUIDELINES = "..."  # coding guidelines spanning thousands of tokens
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a specialized code review agent.\n" + LONG_CODING_GUIDELINES,
            "cache_control": {"type": "ephemeral"}  # designate this block for caching
        }
    ],
    messages=[
        {"role": "user", "content": f"Please review the following code:\n{user_code}"}
    ]
)
 
# Check whether the cache was hit
usage = response.usage
print(f"Cache hit tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")

Key Design Point: For caching to be effective, the prompt must be structured as static prefix (system prompt, tool definitions, long documents) + dynamic part (user input, conversation history). If the dynamic part comes first, the cache will not work at all.

Once caching handles static costs, it's time to address the context that accumulates dynamically.

Strategy 2: Context Compression — The Core Defense Against Drift

The pattern most commonly used in practice is a rolling summary.

python

from typing import List, Dict
import anthropic
 
class ContextManager:
    def __init__(self, max_tokens: int = 50_000, summary_threshold: int = 40_000):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.messages: List[Dict] = []
        self.summary: str = ""
        self.client = anthropic.Anthropic()
 
    async def add_message(self, role: str, content: str, token_count: int):
        self.messages.append({
            "role": role,
            "content": content,
            "tokens": token_count
        })
 
        current_total = sum(m["tokens"] for m in self.messages)
 
        if current_total > self.summary_threshold:
            await self._compress_old_messages()
 
    async def _compress_old_messages(self):
        cutoff = len(self.messages) // 2
        to_summarize = self.messages[:cutoff]
 
        conversation_text = "\n".join(
            f"[{m['role']}]: {m['content']}" for m in to_summarize
        )
 
        # Use a cheap, small model for compression itself — this is the key savings point
        response = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Please concisely summarize the key information from the following conversation:\n\n{conversation_text}"
            }]
        )
        new_summary = response.content[0].text
 
        self.summary = f"{self.summary}\n\n{new_summary}".strip()
        self.messages = self.messages[cutoff:]
 
    def get_context(self) -> List[Dict]:
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"[Previous Conversation Summary]\n{self.summary}"
            })
        context.extend(
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        )
        return context

Once compression keeps accumulation in check, let's move on to selecting the right model for each task.

Strategy 3: Model Routing — The Largest Cost Reduction

python

from enum import Enum
 
class TaskComplexity(Enum):
    SIMPLE = "simple"      # Haiku-tier — translation, summarization, format conversion
    MODERATE = "moderate"  # Sonnet-tier — general analysis, code explanation
    COMPLEX = "complex"    # Opus-tier — architecture design, security analysis
 
class ModelRouter:
    ROUTING_RULES = {
        TaskComplexity.SIMPLE: "claude-haiku-4-5-20251001",
        TaskComplexity.MODERATE: "claude-sonnet-4-6",
        TaskComplexity.COMPLEX: "claude-opus-4-8",
    }
 
    def classify_task(self, task: str, context_tokens: int) -> TaskComplexity:
        # Large context likely indicates a complex task
        if context_tokens > 30_000:
            return TaskComplexity.COMPLEX
 
        # Keyword-based complexity assessment (a dedicated classifier is recommended at scale)
        complex_signals = ["architecture design", "security vulnerability", "performance optimization", "refactoring"]
        simple_signals = ["translation", "summarization", "format conversion", "simple question"]
 
        if any(signal in task for signal in complex_signals):
            return TaskComplexity.COMPLEX
        if any(signal in task for signal in simple_signals):
            return TaskComplexity.SIMPLE
        return TaskComplexity.MODERATE
 
    def get_model(self, task: str, context_tokens: int) -> str:
        complexity = self.classify_task(task, context_tokens)
        return self.ROUTING_RULES[complexity]

Strategy 4: Checkpointing — The Safety Net That Stops Failure Costs

typescript

interface AgentCheckpoint {
  taskId: string;
  step: number;
  state: Record<string, unknown>;
  completedActions: string[];
  timestamp: number;
}
 
const MAX_STEPS = 100;
 
class CheckpointedAgent {
  private checkpointInterval = 5; // save every 5 steps
 
  async execute(task: string, taskId: string): Promise<void> {
    const checkpoint = await this.loadCheckpoint(taskId);
    const startStep = checkpoint?.step ?? 0;
    let state: Record<string, unknown> = checkpoint?.state ?? this.initState(task);
 
    console.log(`Resuming from step ${startStep}`);
 
    for (let step = startStep; step < MAX_STEPS; step++) {
      state = await this.executeStep(step, state);
 
      if (step % this.checkpointInterval === 0) {
        await this.saveCheckpoint({
          taskId,
          step: step + 1,
          state,
          completedActions: (state.actions as string[]) ?? [],
          timestamp: Date.now()
        });
      }
 
      if (this.isComplete(state)) break;
    }
  }
 
  private async saveCheckpoint(cp: AgentCheckpoint): Promise<void> {
    await redis.setex(`checkpoint:${cp.taskId}`, 3600, JSON.stringify(cp));
  }
 
  private async loadCheckpoint(taskId: string): Promise<AgentCheckpoint | null> {
    const data = await redis.get(`checkpoint:${taskId}`);
    return data ? JSON.parse(data) : null;
  }
 
  private initState(task: string): Record<string, unknown> {
    return { task, actions: [], status: "started" };
  }
 
  private async executeStep(
    step: number,
    state: Record<string, unknown>
  ): Promise<Record<string, unknown>> {
    // implement actual step execution logic
    return state;
  }
 
  private isComplete(state: Record<string, unknown>): boolean {
    return state.status === "done";
  }
}

Real-World Application

Example 1: Software Engineering Automation Agent

python

import anthropic
from dataclasses import dataclass, field
 
@dataclass
class BugFixAgent:
    client: anthropic.Anthropic = field(default_factory=anthropic.Anthropic)
    context_manager: ContextManager = field(default_factory=ContextManager)
    router: ModelRouter = field(default_factory=ModelRouter)
 
    SYSTEM_PROMPT = """You are a senior software engineer.
    You specialize in diagnosing and fixing bugs.
    You always write tests and maintain code quality."""
 
    async def fix_bug(self, bug_report: str, codebase_context: str) -> str:
        # [Optimization 1] Select model based on task complexity — model routing
        analysis_model = self.router.get_model("bug analysis", len(codebase_context))
 
        analysis = await self.client.messages.create(
            model=analysis_model,
            max_tokens=2048,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}  # [Optimization 2] cache the codebase
            }],
            messages=[{
                "role": "user",
                "content": f"Bug report: {bug_report}\n\nPlease analyze the root cause."
            }]
        )
 
        # [Optimization 3] Switch to a large model for complex fixes
        # Same codebase_context → cache hit occurs (reusing the caching benefit)
        fix = await self.client.messages.create(
            model="claude-opus-4-8",
            max_tokens=4096,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[
                {"role": "user", "content": f"Bug report: {bug_report}"},
                {"role": "assistant", "content": analysis.content[0].text},
                {"role": "user", "content": "Based on the analysis above, please write the fix."}
            ]
        )
 
        return fix.content[0].text

Here is how the three optimizations fit together:

Optimization Point	Code Location	Savings Mechanism
System prompt caching	`cache_control: ephemeral`	Up to 90% off cached input tokens on repeated calls
Codebase context caching	Same system across analysis→fix steps	Second call gets a cache hit when both steps run on the same model (caches are not shared across models)
Per-step model routing	`analysis_model` vs `claude-opus-4-8`	Small model for analysis, large model for fix
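
The 90% figure is an upper bound on tokens served from cache, not on the whole bill. As a rough sketch of how the savings build up, the snippet below assumes cache writes are billed at 1.25x the base input price and cache reads at 0.1x (the multipliers Anthropic has published for its 5-minute cache; check current pricing before relying on them). The function name and parameters are illustrative and not part of any SDK.

```python
def cached_input_savings(calls: int,
                         write_multiplier: float = 1.25,
                         read_multiplier: float = 0.10) -> float:
    """Fraction of prefix input cost saved over `calls` requests sharing one cached prefix.

    Assumes the first call writes the cache and every later call hits it
    before the cache expires. Both multipliers are assumptions; verify them
    against your provider's current pricing.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    uncached = float(calls)  # every call pays the full input price
    cached = write_multiplier + (calls - 1) * read_multiplier
    return 1.0 - cached / uncached


for n in (1, 2, 10, 50):
    print(f"{n:>2} calls: {cached_input_savings(n):.1%} saved on the cached prefix")
```

Under these assumptions a single call costs more than skipping the cache, two calls save about a third, and the savings approach 90% only as the number of cache hits grows.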

Example 2: Deep Research Agent

python

from typing import AsyncGenerator
import anthropic

class ResearchAgent:
    def __init__(self):
        # Async client created per instance, so awaited calls do not block the event loop
        self.client = anthropic.AsyncAnthropic()
        self.context_manager = ContextManager(
            max_tokens=80_000,
            summary_threshold=60_000
        )

    async def research(self, query: str) -> AsyncGenerator[str, None]:
        search_results = []

        async for result in self._search(query):
            # Summarize each search result immediately: only summaries, not full text, enter the context
            summary = await self._summarize_result(result, query)
            search_results.append(summary)

            # Rough token estimate (about 1.5 tokens per whitespace-separated word), reasonable for English.
            # For non-Latin scripts, space-based word counts diverge significantly from actual token counts;
            # using tiktoken or the API's token counter is recommended
            estimated_tokens = int(len(summary.split()) * 1.5)

            await self.context_manager.add_message(
                role="tool",
                content=f"[Search Result Summary]\n{summary}",
                token_count=estimated_tokens
            )

            yield f"In progress: {len(search_results)} results collected"

        # Use a large model for the final report
        report = await self._synthesize(
            query=query,
            context=self.context_manager.get_context()
        )

        yield report

    async def _summarize_result(self, result: str, query: str) -> str:
        # Use self.client rather than a global variable
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",  # use a cheap small model for summarization
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Summarize only what is relevant to the query '{query}' in 3 sentences:\n{result[:2000]}"
            }]
        )
        return response.content[0].text

    async def _search(self, query: str) -> AsyncGenerator[str, None]:
        # Placeholder: implement actual search logic and yield each raw result
        raise NotImplementedError("implement actual search logic")
        yield ""  # unreachable, but makes this an async generator so `async for` works

    async def _synthesize(self, query: str, context: list) -> str:
        # Placeholder: implement final report generation logic
        raise NotImplementedError("implement final report generation logic")
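
For readers less familiar with async generators, here is a minimal sketch of how a caller might consume `research()`. `FakeResearchAgent` is a hypothetical stand-in that needs no API key or `ContextManager`; with a real `ResearchAgent` the consuming loop should look the same.

```python
import asyncio
from typing import AsyncIterator


class FakeResearchAgent:
    """Hypothetical stand-in for ResearchAgent: yields progress lines, then a report."""

    async def research(self, query: str) -> AsyncIterator[str]:
        for i in range(1, 4):
            await asyncio.sleep(0)  # stands in for awaiting a search/summarize step
            yield f"In progress: {i} results collected"
        yield f"Report for: {query}"


async def run(query: str) -> list[str]:
    agent = FakeResearchAgent()
    updates = []
    async for update in agent.research(query):
        updates.append(update)  # in a real app: stream to a UI or a log
    return updates


updates = asyncio.run(run("prompt caching"))
print(updates[-1])  # the last item yielded is the final report
```

Because progress messages and the final report travel through the same stream, the caller has to treat the last item as the report; a tagged tuple or a small dataclass would make that distinction explicit.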

Trade-off Analysis

Each strategy differs in implementation difficulty and in how soon its effect shows up. The table below should help you decide where to start.

ROI Comparison by Strategy

Strategy	Implementation Difficulty	When the Effect Appears	Savings Scale
Prompt caching	Low (one line of code)	Immediately	50–90% on cache-hit tokens
Context compression	Medium	Grows as context accumulates	39–60% of input tokens
Model routing	Medium–High	Immediately	60–90% of total cost
Checkpointing	High	On long-running tasks	Up to the full retry cost

Implementation Trade-offs

Item	Risk	Mitigation
Cache miss	Cache does not work at all if the dynamic part precedes the static part	Enforce a static→dynamic prompt structure by design
Information loss during compression	Higher compression ratios risk losing important context	Validate compression quality by comparing successful vs. failed trajectories, as in the ACON approach
Routing error	Sending a complex task to a small model causes a sharp quality drop	Fall back to a higher-tier model when classification confidence is low
Cost unpredictability	Branching logic and retries cause non-linear token usage growth	Token budget caps and automatic circuit-breaker safeguards are essential
State serialization difficulty	Complex agent state can be hard to serialize into checkpoints	Design a serializable state structure from the start
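
The token budget cap and circuit breaker mentioned in the table can be quite small. The following is a minimal sketch under stated assumptions: `TokenBudget` and `BudgetExceeded` are hypothetical names, and the per-step token counts are made up where a real agent would read them from the API response's usage data.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its token budget."""


@dataclass
class TokenBudget:
    max_tokens: int  # hard cap for the whole run
    used: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> int:
        """Record usage from one call; trip the breaker once the cap is exceeded."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used} tokens, budget is {self.max_tokens}"
            )
        return self.max_tokens - self.used  # remaining budget


budget = TokenBudget(max_tokens=10_000)
try:
    for step in range(10):
        # Made-up numbers; a real agent would take them from the response usage
        remaining = budget.charge(input_tokens=3_000, output_tokens=500)
        print(f"step {step}: {remaining} tokens left")
except BudgetExceeded as exc:
    print(f"stopping agent loop: {exc}")
```

This version trips only after the overspending call has been paid for. Checking an estimate before each call would be stricter, at the price of needing a token estimate up front.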

The Most Common Mistakes in Practice

Designing prompts without considering cache structure: A structure that places user input ahead of the static system prompt completely neutralizes the cache. If your cache hit rate is 0%, this is almost always the culprit. It's worth considering the static/dynamic separation from the very beginning of prompt design.
Expanding context indefinitely without compression: "We have a 200K context window, so it'll be fine" invites both drift and a cost explosion. Recall that performance degradation sets in after roughly the 35-minute mark.
Applying frontier models to every task: Using Opus for simple summarization, translation, or format conversion is like paying taxi fare to ride a bus. Even if implementing the routing logic feels tedious, the ROI is overwhelming at scale.
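
To make the routing point concrete, here is a toy sketch of routing with a confidence-based fallback to the larger model. Everything in it is an assumption for illustration: the tier names are placeholders, and `classify` is a keyword heuristic standing in for whatever classifier you actually use (often a small LLM).

```python
from typing import Tuple

# Placeholder tier names; substitute the model IDs you actually use.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

SIMPLE_KEYWORDS = ("summarize", "translate", "reformat", "convert")


def classify(task: str) -> Tuple[str, float]:
    """Toy classifier returning (label, confidence)."""
    text = task.lower()
    hits = sum(1 for kw in SIMPLE_KEYWORDS if kw in text)
    if hits >= 1 and len(text) < 200:
        return "simple", 0.9
    if hits >= 1:
        return "simple", 0.5  # keyword matched, but the task is long: less sure
    return "complex", 0.8


def route(task: str, min_confidence: float = 0.7) -> str:
    """Send a task to the small model only when it is confidently simple."""
    label, confidence = classify(task)
    if label == "simple" and confidence >= min_confidence:
        return SMALL_MODEL
    return LARGE_MODEL  # complex, or not sure: prefer quality over savings


print(route("Summarize this changelog"))
print(route("Fix the race condition in the scheduler"))
```

The asymmetry is deliberate: a misrouted simple task only costs extra money, while a misrouted complex task costs quality and often a retry.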

Closing Thoughts

The core of long-horizon agent cost optimization is not choosing cheaper models — it is designing when not to use expensive ones.

Three steps you can start right now:

Measure your current cache hit rate first: For Anthropic, you can log response.usage.cache_read_input_tokens; for OpenAI, usage.prompt_tokens_details.cached_tokens. A hit rate of 0% is very likely a prompt structure issue.
Add one line of cache_control: It's worth wrapping your system prompt and static context (documents, codebase) with cache_control: {"type": "ephemeral"}. This is the highest-ROI first step: a single line of code that can cut the cost of the cached input tokens on repeat calls by up to 90%.
Attach a rolling summary: You can add a ContextManager that compresses old messages with a small model once context exceeds a certain threshold (e.g., 40K tokens). Based on the pattern introduced above, a good starting point is to apply it to just your longest-running session and observe the change in cost.
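
For the first step, a hit rate can be derived from the usage fields alone. The sketch below is a hypothetical helper whose parameter names mirror the fields on Anthropic's `response.usage`, where `input_tokens` counts only the uncached part of the prompt. The numbers in the example are illustrative, not measurements. For OpenAI the denominator would likely differ, since `prompt_tokens` is understood to already include the cached tokens.

```python
def cache_hit_rate(input_tokens: int,
                   cache_creation_input_tokens: int = 0,
                   cache_read_input_tokens: int = 0) -> float:
    """Share of prompt tokens that were served from cache.

    Pass the values from one response, or sums over many responses.
    """
    total = input_tokens + cache_creation_input_tokens + cache_read_input_tokens
    return cache_read_input_tokens / total if total else 0.0


# Illustrative numbers: the first call writes the cache, the second reads it
first_call = cache_hit_rate(input_tokens=120, cache_creation_input_tokens=8_000)
second_call = cache_hit_rate(input_tokens=150, cache_read_input_tokens=8_000)
print(f"first call: {first_call:.0%}, second call: {second_call:.0%}")
```

A rate that stays at 0% across repeated calls with the same static prefix points to a prompt structure problem rather than a pricing one.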

