Cutting Long-Horizon Agent Costs by 60–90%: Caching, Compression, and Routing Strategies
I still remember the shock of receiving that first bill after putting an AI agent into production. A simple chatbot would have been predictable, but agents were different. Fixing a single bug could stack up tens of thousands of tokens of context, and if it failed, all that cost was gone and you started over from scratch. A single run exceeding ten times the expected cost was routine.
At first I thought, "Can't we just switch to a cheaper model?", but in practice the first thing that changed was a noticeable drop in performance. The real problem wasn't model selection; it was design. This post breaks down where costs explode structurally, then walks through four strategies (prompt caching, context compression, model routing, and checkpointing) with real code. Cuts of 60–90% are achievable while maintaining performance; the reasoning and implementation for each strategy are covered in its own section.
By the end, you'll spot things you can apply today. We'll go in order: from caching, which delivers immediate results with a single line of code, to checkpointing, which eliminates the failure cost of long-running agents at the source.
Where Costs Explode
Three Cost Dimensions
A long-horizon agent is an autonomous system that cycles through planning, reasoning, tool calls, and self-correction over minutes, hours, or even days. This characteristic causes the cost structure to behave entirely differently from ordinary LLM API calls.
When people talk about agent costs, they often think only of "token costs" — but in practice, three dimensions are intertwined.
| Cost Type | Cause | Growth Pattern |
|---|---|---|
| Token cost | Input/output token API charges | Linear |
| Context cost | Accumulation of conversation history and tool outputs | Non-linear growth¹ |
| Failure cost | Full retry when a mid-run failure occurs | Super-linear (failure rate roughly quadruples when task duration doubles) |
¹ The API charge itself scales linearly with token count. However, as context grows, the attention computation complexity inside the model increases as O(n²), and more importantly, the probability of retries and failures due to context drift rises alongside it — causing indirect costs to grow non-linearly.
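A rough way to see the non-linear growth: in a typical agent loop every step resends the accumulated history, so even though each individual call is billed linearly, the total billed input across a run grows roughly quadratically with the number of steps. The sketch below assumes a fixed number of tokens is appended per step; the function name and all numbers are illustrative, not measurements.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens billed across a run when every step resends the full history."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # this step's request carries everything so far
        context += tokens_per_step  # tool output and model reply are appended
    return total


short_run = cumulative_input_tokens(steps=10, base_tokens=2_000, tokens_per_step=1_500)
long_run = cumulative_input_tokens(steps=20, base_tokens=2_000, tokens_per_step=1_500)

print(short_run)  # 87500
print(long_run)   # 325000
print(round(long_run / short_run, 2))  # 3.71: twice the steps, well over twice the tokens
```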
Context cost is especially dangerous because, left unchecked, context drift sets in.
Context Drift vs. Context Exhaustion
- Context exhaustion: The context window fills up physically. Even a 200K window can eventually be filled.
- Context drift: Even with window space remaining, stale or irrelevant content accumulates and quietly degrades the model's reasoning quality.
Drift arrives much more quietly, and is far more dangerous.
Failure cost is also an easy trap to overlook. According to the Trajectory Reduction research, when task duration doubles, the failure rate quadruples. There is also data showing that agent success rates begin to decline noticeably after roughly 35 minutes — which means simply increasing the context window size cannot solve this problem on its own.
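To make the failure-cost dimension concrete, here is a back-of-the-envelope sketch. It assumes every failed attempt costs a full run and that attempts are independent, so the number of attempts follows a geometric distribution; both are simplifying assumptions, and the dollar figures are made up for illustration.

```python
def expected_cost_full_restart(run_cost_usd: float, success_prob: float) -> float:
    """Expected spend until the first success when each failure restarts from scratch.

    With independent attempts the expected number of attempts is 1 / success_prob.
    """
    if not 0 < success_prob <= 1:
        raise ValueError("success_prob must be in (0, 1]")
    return run_cost_usd / success_prob


# Same $2 run, two different reliability levels
print(round(expected_cost_full_restart(2.00, 0.8), 2))  # 2.5
print(round(expected_cost_full_restart(2.00, 0.4), 2))  # 5.0
```

Halving the success rate doubles the expected bill, before counting the extra tokens a longer, more failure-prone run consumes in the first place.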
4 Core Cost-Reduction Strategies
Strategy 1: Prompt Caching — The Easiest Savings to Capture
A KV cache (Key-Value Cache) is a structure in which transformer models store intermediate attention computation results in GPU memory to avoid recomputation. Major LLM providers now expose this mechanism at the API level as prompt caching. Honestly, if you're not using this, you're essentially throwing money away.
For Anthropic Claude, a cache hit makes input token costs 90% cheaper than the standard price ($0.30/M vs. $3.00/M). OpenAI automatically caches prefixes without any configuration, and cache-hit tokens are billed at 50% of the standard price (this means the unit price of hit tokens is halved, not that average savings are 50%).
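The difference between "unit price of hit tokens" and "average savings" is easier to see with numbers. The sketch below uses the $3.00/M and $0.30/M figures quoted above; the $3.75/M cache-write price (a 25% premium on the first call) is my assumption about Anthropic's pricing and should be checked against the current price list. It also assumes every later call arrives while the cache entry is still alive.

```python
def input_cost_usd(
    prefix_tokens: int,
    dynamic_tokens: int,
    calls: int,
    base_per_m: float = 3.00,   # standard input price per million tokens
    read_per_m: float = 0.30,   # cache-hit price per million tokens
    write_per_m: float = 3.75,  # ASSUMPTION: cache-write price per million tokens
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for `calls` requests sharing one static prefix."""
    uncached = calls * (prefix_tokens + dynamic_tokens) * base_per_m / 1e6
    cached = (
        prefix_tokens * write_per_m                   # first call writes the cache
        + (calls - 1) * prefix_tokens * read_per_m    # later calls read it
        + calls * dynamic_tokens * base_per_m         # dynamic part is never cached
    ) / 1e6
    return uncached, cached


uncached, cached = input_cost_usd(prefix_tokens=10_000, dynamic_tokens=500, calls=100)
print(f"uncached: ${uncached:.4f}")  # uncached: $3.1500
print(f"cached:   ${cached:.4f}")    # cached:   $0.4845
print(f"savings:  {1 - cached / uncached:.1%}")  # savings:  84.6%
```

The hit tokens are 90% cheaper, but the overall saving lands a little lower because the dynamic part and the initial cache write are still billed at or above the standard rate.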
When our team first applied this, the daily cost of a code review agent dropped to less than half — overnight. The implementation was surprisingly simple.
```python
import anthropic

client = anthropic.Anthropic()

LONG_CODING_GUIDELINES = "..."  # coding guidelines spanning thousands of tokens
user_code = "..."  # placeholder: the code to be reviewed

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a specialized code review agent.\n" + LONG_CODING_GUIDELINES,
            "cache_control": {"type": "ephemeral"},  # designate this block for caching
        }
    ],
    messages=[
        {"role": "user", "content": f"Please review the following code:\n{user_code}"}
    ],
)

# Check whether the cache was hit
usage = response.usage
print(f"Cache hit tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
```

Key Design Point: For caching to be effective, the prompt must be structured as static prefix (system prompt, tool definitions, long documents) + dynamic part (user input, conversation history). If the dynamic part comes first, the cache will not work at all.
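One way to keep the static-first rule honest is to fingerprint the static prefix and log it with every request: if the fingerprint changes between calls, something dynamic has leaked into the prefix. This is a minimal sketch; `build_prompt` and `prefix_fingerprint` are hypothetical helper names, not part of any SDK.

```python
import hashlib


def build_prompt(static_parts: list[str], dynamic_parts: list[str]) -> tuple[str, str]:
    """Return (prefix, suffix) with all static material placed before anything dynamic."""
    prefix = "\n\n".join(static_parts)
    suffix = "\n\n".join(dynamic_parts)
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short, stable hash of the prefix; log it per request to spot cache-breaking changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


static = ["You are a specialized code review agent.", "<long coding guidelines>"]

stable_a, _ = build_prompt(static, ["Please review file A"])
stable_b, _ = build_prompt(static, ["Please review file B"])
print(prefix_fingerprint(stable_a) == prefix_fingerprint(stable_b))  # True

# A timestamp placed in front of the static text changes the prefix on every call
leaky, _ = build_prompt(["Current time: 12:00:01"] + static, ["Please review file A"])
print(prefix_fingerprint(leaky) == prefix_fingerprint(stable_a))  # False
```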
Once caching handles static costs, it's time to address the context that accumulates dynamically.
Strategy 2: Context Compression — The Core Defense Against Drift
To prevent the context window from ballooning in long-running agents, periodic compression is necessary. AgentDiet (a research framework that removes unnecessary steps from agent execution trajectories) demonstrated 39–60% reduction in input tokens, and ACON (a research approach that iteratively improves compression strategies by comparing successful and failed trajectories) also achieved 26–54% reductions in peak tokens.
The pattern most commonly used in practice is a rolling summary.
```python
from typing import Dict, List

import anthropic


class ContextManager:
    def __init__(self, max_tokens: int = 50_000, summary_threshold: int = 40_000):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.messages: List[Dict] = []
        self.summary: str = ""
        # Async client, so the awaited calls below do not block the event loop
        self.client = anthropic.AsyncAnthropic()

    async def add_message(self, role: str, content: str, token_count: int):
        self.messages.append({
            "role": role,
            "content": content,
            "tokens": token_count,
        })

        current_total = sum(m["tokens"] for m in self.messages)
        if current_total > self.summary_threshold:
            await self._compress_old_messages()

    async def _compress_old_messages(self):
        # Summarize the older half of the history and keep the newer half verbatim
        cutoff = len(self.messages) // 2
        to_summarize = self.messages[:cutoff]

        conversation_text = "\n".join(
            f"[{m['role']}]: {m['content']}" for m in to_summarize
        )

        # Use a cheap, small model for compression itself: this is the key savings point
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Please concisely summarize the key information from the following conversation:\n\n{conversation_text}"
            }],
        )

        new_summary = response.content[0].text
        self.summary = f"{self.summary}\n\n{new_summary}".strip()
        self.messages = self.messages[cutoff:]

    def get_context(self) -> List[Dict]:
        context = []
        if self.summary:
            # The Messages API has no "system" role inside `messages`, so the
            # summary goes in as a user turn (or lift it into the top-level
            # `system` parameter instead).
            context.append({
                "role": "user",
                "content": f"[Previous Conversation Summary]\n{self.summary}"
            })
        context.extend(
            {"role": m["role"], "content": m["content"]}
            for m in self.messages
        )
        return context
```

Because the compression step itself incurs LLM call costs, the key is to use a cheap, small model for summarization rather than an expensive frontier model. This single pattern produces a noticeable reduction in context-related costs.
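Whether a given compression pass pays for itself depends on how many more steps will carry the shortened context. The sketch below is a rough break-even estimate; the per-million-token prices are placeholder assumptions, and the function name is hypothetical.

```python
def compression_net_savings_usd(
    tokens_removed: int,
    remaining_steps: int,
    summarizer_input_tokens: int,
    summarizer_output_tokens: int,
    main_input_per_m: float = 3.00,    # ASSUMPTION: main model input price
    small_input_per_m: float = 1.00,   # ASSUMPTION: summarizer input price
    small_output_per_m: float = 5.00,  # ASSUMPTION: summarizer output price
) -> float:
    """Input cost avoided on future steps minus the one-off cost of summarizing."""
    saved = tokens_removed * remaining_steps * main_input_per_m / 1e6
    spent = (
        summarizer_input_tokens * small_input_per_m
        + summarizer_output_tokens * small_output_per_m
    ) / 1e6
    return saved - spent


# 20K tokens of history summarized into 1K, with 10 agent steps still to go
print(round(compression_net_savings_usd(19_000, 10, 20_000, 1_000), 3))  # 0.545

# The same compression right before the final answer is a small net loss
print(round(compression_net_savings_usd(19_000, 0, 20_000, 1_000), 3))   # -0.025
```

This estimate ignores prompt caching: rewriting the history changes the prefix, so tokens after the cut point have to be cached again, which shifts the break-even point somewhat.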
Once compression keeps accumulation in check, let's move on to selecting the right model for each task.
Strategy 3: Model Routing — The Largest Cost Reduction
At first I thought, "Can't we just use Claude Opus or GPT-4 for everything?" — but in practice that turns out to be a surprisingly expensive assumption. Production experience shows that roughly 85% of enterprise queries can be handled adequately by lower-cost models. Using Opus for simple summaries, translations, or format conversions is like paying taxi fare to ride a bus.
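To get a feel for what that 85% figure implies, here is a blended-price calculation. The tier prices are illustrative placeholders, not quoted vendor prices, and the sketch assumes the 85/15 split holds for token volume as well as for query count.

```python
def blended_price_per_m(mix: dict[str, float], prices_per_m: dict[str, float]) -> float:
    """Average price per million tokens for a traffic mix spread across model tiers."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * prices_per_m[tier] for tier, share in mix.items())


prices = {"small": 1.00, "large": 15.00}  # ASSUMPTION: illustrative $/M input tokens

all_large = blended_price_per_m({"large": 1.0}, prices)
routed = blended_price_per_m({"small": 0.85, "large": 0.15}, prices)

print(f"all large: ${all_large:.2f}/M")          # all large: $15.00/M
print(f"routed:    ${routed:.2f}/M")             # routed:    $3.10/M
print(f"reduction: {1 - routed / all_large:.0%}")  # reduction: 79%
```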
```python
from enum import Enum


class TaskComplexity(Enum):
    SIMPLE = "simple"      # Haiku-tier: translation, summarization, format conversion
    MODERATE = "moderate"  # Sonnet-tier: general analysis, code explanation
    COMPLEX = "complex"    # Opus-tier: architecture design, security analysis


class ModelRouter:
    ROUTING_RULES = {
        TaskComplexity.SIMPLE: "claude-haiku-4-5-20251001",
        TaskComplexity.MODERATE: "claude-sonnet-4-6",
        TaskComplexity.COMPLEX: "claude-opus-4-8",
    }

    def classify_task(self, task: str, context_tokens: int) -> TaskComplexity:
        # Large context likely indicates a complex task
        if context_tokens > 30_000:
            return TaskComplexity.COMPLEX

        # Keyword-based complexity assessment (a dedicated classifier is recommended at scale)
        complex_signals = ["architecture design", "security vulnerability", "performance optimization", "refactoring"]
        simple_signals = ["translation", "summarization", "format conversion", "simple question"]

        task_lower = task.lower()  # match keywords case-insensitively
        if any(signal in task_lower for signal in complex_signals):
            return TaskComplexity.COMPLEX
        if any(signal in task_lower for signal in simple_signals):
            return TaskComplexity.SIMPLE
        return TaskComplexity.MODERATE

    def get_model(self, task: str, context_tokens: int) -> str:
        complexity = self.classify_task(task, context_tokens)
        return self.ROUTING_RULES[complexity]
```

Expanding this into a hierarchical multi-agent structure is even more powerful: the orchestrator (strategy decisions) uses a large model, while workers (actual execution) use small models. The OPTIMA research (a training framework for optimizing multi-agent conversation trajectories) reported achieving massive cost reductions while retaining 97.7% of performance.
Even if implementing the routing logic feels tedious, the ROI becomes overwhelming at scale. Once model selection is sorted, let's move on to the final strategy that eliminates failure costs at the source.
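A keyword router will sometimes guess too low. One common way to limit the damage is to start on the cheapest tier and escalate only when the answer fails a cheap check. The sketch below is one possible shape for that; `call_model` and `is_acceptable` are hypothetical callables you would supply (an API wrapper and a validator such as "does the JSON parse" or "do the tests pass").

```python
from typing import Callable, Sequence


def run_with_escalation(
    task: str,
    tiers: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try models from cheapest to most capable; return (model, answer) for the first pass."""
    if not tiers:
        raise ValueError("at least one model tier is required")
    answer = ""
    for model in tiers:
        answer = call_model(model, task)
        if is_acceptable(answer):
            return model, answer
    # Nothing passed the check: return the most capable tier's answer as best effort
    return tiers[-1], answer


# Stand-in for a real API call, so the sketch runs offline
def fake_call(model: str, task: str) -> str:
    return "" if model == "small" else f"{model} handled: {task}"


model, answer = run_with_escalation(
    "explain this stack trace",
    ["small", "medium", "large"],
    fake_call,
    is_acceptable=lambda text: bool(text.strip()),
)
print(model)   # medium
print(answer)  # medium handled: explain this stack trace
```

Escalation is not free: a failed cheap attempt is still billed, so this works best when the check is reliable and most tasks pass on the first tier.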
Strategy 4: Checkpointing — The Safety Net That Stops Failure Costs
There is a particularly painful situation in long-running agents: a task that ran for 40 minutes fails at the very end and has to restart from scratch. Checkpointing is essential infrastructure to prevent this.
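To put a number on that scenario, the sketch below counts how many completed steps have to be redone after a crash, with and without checkpoints. It assumes a checkpoint is written after every `checkpoint_interval` completed steps and that steps cost roughly the same; the function name is hypothetical.

```python
from typing import Optional


def steps_lost_on_failure(failed_step: int, checkpoint_interval: Optional[int]) -> int:
    """Completed steps that must be redone if the run dies while executing `failed_step`.

    `failed_step` is 0-based, so steps 0..failed_step-1 had already completed.
    """
    if checkpoint_interval is None:
        return failed_step  # no checkpoints: everything completed so far is lost
    return failed_step % checkpoint_interval  # only work since the last checkpoint


print(steps_lost_on_failure(97, None))  # 97
print(steps_lost_on_failure(97, 5))     # 2
```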
```typescript
import Redis from "ioredis"; // assumption: any Redis client exposing get/setex works here

const redis = new Redis();

interface AgentCheckpoint {
  taskId: string;
  step: number; // next step to execute on resume
  state: Record<string, unknown>;
  completedActions: string[];
  timestamp: number;
}

const MAX_STEPS = 100;

class CheckpointedAgent {
  private checkpointInterval = 5; // save every 5 steps

  async execute(task: string, taskId: string): Promise<void> {
    const checkpoint = await this.loadCheckpoint(taskId);
    const startStep = checkpoint?.step ?? 0;
    let state: Record<string, unknown> = checkpoint?.state ?? this.initState(task);

    console.log(checkpoint ? `Resuming from step ${startStep}` : "Starting from step 0");

    for (let step = startStep; step < MAX_STEPS; step++) {
      state = await this.executeStep(step, state);

      // Save after every `checkpointInterval` completed steps
      if ((step + 1) % this.checkpointInterval === 0) {
        await this.saveCheckpoint({
          taskId,
          step: step + 1,
          state,
          completedActions: (state.actions as string[]) ?? [],
          timestamp: Date.now()
        });
      }

      if (this.isComplete(state)) break;
    }
  }

  private async saveCheckpoint(cp: AgentCheckpoint): Promise<void> {
    // Expire after one hour so abandoned tasks do not pile up
    await redis.setex(`checkpoint:${cp.taskId}`, 3600, JSON.stringify(cp));
  }

  private async loadCheckpoint(taskId: string): Promise<AgentCheckpoint | null> {
    const data = await redis.get(`checkpoint:${taskId}`);
    return data ? JSON.parse(data) : null;
  }

  private initState(task: string): Record<string, unknown> {
    return { task, actions: [], status: "started" };
  }

  private async executeStep(
    step: number,
    state: Record<string, unknown>
  ): Promise<Record<string, unknown>> {
    // implement actual step execution logic
    return state;
  }

  private isComplete(state: Record<string, unknown>): boolean {
    return state.status === "done";
  }
}
```

Real-World Application
Example 1: Software Engineering Automation Agent
A workflow that reads a codebase → diagnoses a bug → applies a fix → runs tests → iteratively corrects on failure is a canonical example of a long-horizon agent. Because compile logs and test results accumulate continuously, context management is central. This is a situation encountered frequently in practice, and the pattern below applies three optimizations simultaneously.
```python
import anthropic
from dataclasses import dataclass, field


@dataclass
class BugFixAgent:
    # Async client, because the calls below are awaited
    client: anthropic.AsyncAnthropic = field(default_factory=anthropic.AsyncAnthropic)
    context_manager: ContextManager = field(default_factory=ContextManager)
    router: ModelRouter = field(default_factory=ModelRouter)

    SYSTEM_PROMPT = """You are a senior software engineer.
You specialize in diagnosing and fixing bugs.
You always write tests and maintain code quality."""

    async def fix_bug(self, bug_report: str, codebase_context: str) -> str:
        # [Optimization 1] Select model based on task complexity (model routing).
        # The router expects a token count; ~4 characters per token is only a
        # rough estimate for English text and code.
        analysis_model = self.router.get_model("bug analysis", len(codebase_context) // 4)

        analysis = await self.client.messages.create(
            model=analysis_model,
            max_tokens=2048,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}  # [Optimization 2] cache the codebase
            }],
            messages=[{
                "role": "user",
                "content": f"Bug report: {bug_report}\n\nPlease analyze the root cause."
            }]
        )

        # [Optimization 3] Switch to a large model for complex fixes.
        # The cached prefix is reused only when this call runs on the same model
        # as the analysis call (a prompt cache is not shared across models);
        # otherwise this call writes its own cache entry.
        fix = await self.client.messages.create(
            model="claude-opus-4-8",
            max_tokens=4096,
            system=[{
                "type": "text",
                "text": self.SYSTEM_PROMPT + "\n\n" + codebase_context,
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[
                {"role": "user", "content": f"Bug report: {bug_report}"},
                {"role": "assistant", "content": analysis.content[0].text},
                {"role": "user", "content": "Based on the analysis above, please write the fix."}
            ]
        )

        return fix.content[0].text
```

Here is how the three optimizations fit together:
| Optimization Point | Code Location | Savings Mechanism |
|---|---|---|
| System prompt caching | `cache_control: ephemeral` | 90% lower input price on cached tokens for repeated calls |
| Codebase context caching | Same system across analysis→fix steps | Second call also gets a cache hit, provided both steps run on the same model |
| Per-step model routing | `analysis_model` vs `claude-opus-4-8` | Small model for analysis, large model for fix |
Example 2: Deep Research Agent
In a workflow of web search → document reading → synthesis report writing, how you manage repeated search results makes all the difference in cost. There is a case where context optimization alone cut daily costs from $150 to $45 ($38,000/year saved) for an automated workload running at 1,000 queries/day.
```python
from typing import AsyncGenerator

import anthropic


class ResearchAgent:
    def __init__(self):
        # One async client per agent instance, since the calls below are awaited
        self.client = anthropic.AsyncAnthropic()
        self.context_manager = ContextManager(
            max_tokens=80_000,
            summary_threshold=60_000
        )

    async def research(self, query: str) -> AsyncGenerator[str, None]:
        search_results = []

        async for result in self._search(query):
            # Summarize search results immediately: store only summaries, not full text, in context
            summary = await self._summarize_result(result, query)
            search_results.append(summary)

            # Token count estimate (approximate value for English).
            # For non-Latin scripts, word count based on spaces diverges significantly
            # from the actual token count; using tiktoken or the API's token counter
            # is recommended.
            estimated_tokens = int(len(summary.split()) * 1.5)

            # The Messages API accepts only "user" and "assistant" roles,
            # so tool results are stored as user turns
            await self.context_manager.add_message(
                role="user",
                content=f"[Search Result Summary]\n{summary}",
                token_count=estimated_tokens
            )

            yield f"In progress: {len(search_results)} results collected"

        # Use a large model for the final report
        report = await self._synthesize(
            query=query,
            context=self.context_manager.get_context()
        )
        yield report

    async def _summarize_result(self, result: str, query: str) -> str:
        # Use self.client, not a global variable
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",  # use a cheap small model for summarization
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Summarize only what is relevant to the query '{query}' in 3 sentences:\n{result[:2000]}"
            }]
        )
        return response.content[0].text

    async def _search(self, query: str) -> AsyncGenerator[str, None]:
        # implement actual search logic; the loop below is a placeholder that
        # yields nothing while keeping this method an async generator
        for result in ():
            yield result

    async def _synthesize(self, query: str, context: list) -> str:
        raise NotImplementedError("implement final report generation logic")
```

Trade-off Analysis
Each strategy has a different implementation difficulty, and the point at which its effect appears also differs. This table should help you decide where to start.
ROI Comparison by Strategy
| Strategy | Implementation Difficulty | Immediate Effect | Savings Scale |
|---|---|---|---|
| Prompt caching | Low (one line of code) | Immediate | 50–90% on cache-hit tokens |
| Context compression | Medium | Effect grows as context accumulates | 39–60% of input tokens |
| Model routing | Medium–High | Immediate | 60–90% of total cost |
| Checkpointing | High | Effective for long-running tasks | Full retry cost |
Implementation Trade-offs
| Item | Risk | Mitigation |
|---|---|---|
| Cache miss | Cache does not work at all if the dynamic part precedes the static part | Enforce a static→dynamic prompt structure by design |
| Information loss during compression | Higher compression ratios risk losing important context | Validate compression quality by comparing successful vs. failed trajectories, as in the ACON approach |
| Routing error | Sending a complex task to a small model causes a sharp quality drop | Fall back to a higher-tier model when classification confidence is low |
| Cost unpredictability | Branching logic and retries cause non-linear token usage growth | Token budget caps and automatic circuit-breaker safeguards are essential |
| State serialization difficulty | Complex agent state can be difficult to serialize into checkpoints in some cases | Design a serializable state structure from the start |
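The token-budget cap and circuit breaker mentioned in the table can be very small. The sketch below is one possible shape, not a reference implementation; `RunGuard` and `BudgetExceeded` are hypothetical names, and the thresholds are arbitrary.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run should stop instead of spending more."""


class RunGuard:
    def __init__(self, max_total_tokens: int, max_consecutive_failures: int = 3):
        self.max_total_tokens = max_total_tokens
        self.max_consecutive_failures = max_consecutive_failures
        self.used_tokens = 0
        self.consecutive_failures = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one call's usage and stop the run once the cap is crossed."""
        self.used_tokens += input_tokens + output_tokens
        if self.used_tokens > self.max_total_tokens:
            raise BudgetExceeded(
                f"used {self.used_tokens} tokens, cap is {self.max_total_tokens}"
            )

    def record_step(self, succeeded: bool) -> None:
        """Trip the breaker after too many failed steps in a row."""
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise BudgetExceeded(
                f"{self.consecutive_failures} consecutive failed steps"
            )


guard = RunGuard(max_total_tokens=10_000)
guard.charge(input_tokens=4_000, output_tokens=500)
try:
    guard.charge(input_tokens=6_000, output_tokens=500)
except BudgetExceeded as exc:
    print(f"stopped: {exc}")  # stopped: used 11000 tokens, cap is 10000
```

In an agent loop, `charge` would be called with the usage numbers from each API response and `record_step` after each tool call or validation, so a runaway retry loop stops itself.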
These four strategies are each effective on their own, but applying them together multiplies the impact. That said, trying to implement all of them at once raises complexity, so it's more practical to apply them one at a time in the order: caching → compression → routing → checkpointing.
The Most Common Mistakes in Practice
- Designing prompts without considering cache structure: A structure that prepends user input before the system prompt completely neutralizes the cache. If your cache hit rate is 0%, this is almost always the culprit. It's worth considering the static/dynamic separation from the very beginning of prompt design.
- Expanding context indefinitely without compression: "We have a 200K context window, so it'll be fine" invites both drift and a cost explosion. It's worth remembering that performance degradation starts after the roughly 35-minute threshold.
- Applying frontier models to every task: Using Opus for simple summarization, translation, or format conversion is like paying taxi fare to ride a bus. Even if implementing the routing logic feels tedious, the ROI is overwhelming at scale.
Closing Thoughts
The core of long-horizon agent cost optimization is not choosing cheaper models — it is designing when not to use expensive ones.
Three steps you can start right now:
- Measure your current cache hit rate first: For Anthropic, you can log `response.usage.cache_read_input_tokens`; for OpenAI, `usage.prompt_tokens_details.cached_tokens`. A hit rate of 0% is very likely a prompt structure issue.
- Add one line of `cache_control`: It's worth wrapping your system prompt and static context (documents, codebase) with `cache_control: {"type": "ephemeral"}`. This is the highest-ROI first step: a single line of code that can cut the input cost of repeat calls by up to 90%.
- Attach a rolling summary: You can add a `ContextManager` that compresses old messages with a small model once context exceeds a certain threshold (e.g., 40K tokens). Based on the pattern introduced above, a good starting point is to apply it to just your longest-running session and observe the change in cost.
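For the first step, a small helper can turn the usage fields into a hit rate. This sketch relies on my understanding that, in Anthropic responses, `input_tokens` counts only the uncached portion, so the total input is the sum of the three fields; treat that as an assumption and verify it against the API documentation. A `SimpleNamespace` stands in for a real `response.usage` object so the example runs offline.

```python
from types import SimpleNamespace


def anthropic_cache_hit_rate(usage) -> float:
    """Share of input tokens served from cache for one response (0.0 to 1.0)."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0  # ASSUMPTION: excludes cached tokens
    total = read + created + uncached
    return read / total if total else 0.0


first_call = SimpleNamespace(
    input_tokens=500, cache_creation_input_tokens=9_500, cache_read_input_tokens=0
)
later_call = SimpleNamespace(
    input_tokens=500, cache_creation_input_tokens=0, cache_read_input_tokens=9_500
)

print(anthropic_cache_hit_rate(first_call))  # 0.0
print(anthropic_cache_hit_rate(later_call))  # 0.95
```

A first call that shows 0% is expected, since it only writes the cache; it is a persistent 0% on later calls that points to a prompt structure problem.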
References
- ACON: Optimizing Context Compression for Long-horizon LLM Agents | arXiv
- Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks | arXiv
- Reducing Cost of LLM Agents with Trajectory Reduction | arXiv
- Long-Horizon Agents Are Here. Full Autopilot Isn't | DEV Community
- Long-horizon agents explained: Hype, reality, engineering lessons | EPAM Insights
- Context Engineering in 2025: The Complete Guide to AI Agent Optimization | Mem0 Blog
- AI Agent Context Compression: Strategies for Long-Running Sessions | Zylos Research
- Prompt Caching for Anthropic and OpenAI Models | DigitalOcean Blog
- LangChain Cost Optimization: Agent Execution Cost Analysis and Reduction
- Long Horizon Document Agents | LlamaIndex Blog
- Context Window Management Strategies for Long-Context AI Agents | Maxim AI
- Long-Running AI Agents and Task Decomposition 2026 | Zylos Research