DEV BAK - TECH BLOG

Why AI Agent LLM Costs Explode and Strategies to Cut Them by 60–80%

If you've ever deployed an agentic AI system in production, you've probably stared at a bill in disbelief at least once. "I only sent a few prompts — why is it this expensive?" I initially thought simply reducing token count would solve it, but that turned out not to be the case.

The core problem is that costs in agentic workflows do not scale linearly with raw token consumption. To handle a single user request, an agent internally chains multiple LLM calls in sequence: Planning → Tool Calling → Self-correction → Synthesis. Each call re-sends the context accumulated so far, so costs grow non-linearly as this execution trajectory gets longer or branches. With GPT-4o, a simple Q&A costs roughly $0.001–$0.004, but a multi-agent system can spend $0.30–$1.50 or more on the same task.

This article breaks down why dynamic workflow cost structures are built this way, and examines the principles behind real-world cases that achieved 60–80% cost reductions. Once you understand the new cost metric of "dollars per task" and model cascade routing, you'll design budgets for agent systems in a completely different way.


Core Concepts

Why Costs Come Out 10x Higher Than Expected — The Three Layers of Cost Structure

Agent workflow costs decompose into three major layers.

| Variable | Description | Share or multiplier |
| --- | --- | --- |
| Input tokens | Context + system prompt repetition accumulated per call | 60–70% of total |
| Output tokens | Reasoning results + tool call spec generation | 5x the per-unit price of input |
| Orchestration overhead | Agent loops, retries, multi-agent coordination | 4–15x vs. single call |

What many people miss here is the repeated cost of input tokens. An agent with a 2,000-token system prompt running 10–20 turns burns 20,000–40,000 tokens on that prompt alone with no meaningful processing — money out the door for nothing.
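The arithmetic is easy to check. A minimal sketch, assuming the system prompt is re-sent in full on every turn with no prompt caching, and using a hypothetical input price of $3 per million tokens (substitute your provider's actual rate):

```python
def repeated_prompt_tokens(prompt_tokens: int, turns: int) -> int:
    """Tokens spent re-sending the same system prompt on every turn."""
    return prompt_tokens * turns


def repeated_prompt_cost(prompt_tokens: int, turns: int, usd_per_million: float) -> float:
    """Dollar cost of that repetition at a given input price."""
    return repeated_prompt_tokens(prompt_tokens, turns) * usd_per_million / 1_000_000


# Hypothetical price, for illustration only.
ASSUMED_INPUT_PRICE = 3.0  # USD per 1M input tokens

for turns in (10, 20):
    tokens = repeated_prompt_tokens(2_000, turns)
    cost = repeated_prompt_cost(2_000, turns, ASSUMED_INPUT_PRICE)
    print(f"{turns} turns: {tokens:,} tokens, ${cost:.2f} per task")
```

The per-task figure looks small, but it is paid on every task and it grows with both prompt size and turn count.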

Orchestration Overhead: The process by which an agent decides its next action, selects a tool, and evaluates results itself consumes LLM calls. It's not simply "using a tool once" — even "deciding which tool to use and why" costs tokens.

Why Costs Explode as Agent Count Grows

When 4 agents in an AutoGen GroupChat each speak once per round for 5 rounds, a simple calculation yields at minimum 20 LLM calls. In practice, automatic speaker selection adds a call before each turn to determine the next speaker, pushing the count even higher. Every call also re-reads the growing shared transcript, so when the number of agents grows linearly, coordination costs grow faster than linearly.

python
import autogen
 
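# Relying on the implicit default hides the cost ceiling from reviewers.
# (user_proxy, planner, coder, and reviewer are agents defined elsewhere.)
groupchat = autogen.GroupChat(
    agents=[user_proxy, planner, coder, reviewer],
    messages=[],
    # max_round not set → falls back to the library default (10 in AutoGen 0.2);
    # raising it without a termination condition lets the loop run far longer
)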
 
# Recommended: explicitly declare an upper bound at the architecture level
groupchat_safe = autogen.GroupChat(
    agents=[user_proxy, planner, coder, reviewer],
    messages=[],
    max_round=10,
)

Real-world production cases confirmed that a GroupChat without a round cap incurs 3–5x the cost of a single-call workflow.
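A back-of-the-envelope model makes the growth visible. This is a sketch under my own simplifying assumptions, not a measurement: every agent speaks once per round, each reply is optionally preceded by one speaker-selection call, each message adds a fixed 500 tokens, and every call re-reads the whole transcript so far. The function name and parameters are hypothetical.

```python
def groupchat_estimate(agents: int, rounds: int, tokens_per_message: int = 500,
                       speaker_selection: bool = True) -> dict:
    """Estimate LLM calls and cumulative input tokens for a round-robin group chat."""
    calls = 0
    input_tokens = 0
    transcript_tokens = 0  # size of the shared history so far
    for _ in range(agents * rounds):
        if speaker_selection:
            calls += 1
            input_tokens += transcript_tokens    # the selector reads the history
        calls += 1
        input_tokens += transcript_tokens        # the speaker reads the history
        transcript_tokens += tokens_per_message  # the reply is appended
    return {"calls": calls, "input_tokens": input_tokens}


print(groupchat_estimate(agents=4, rounds=5))
print(groupchat_estimate(agents=8, rounds=5))
```

In this simplified model, doubling the agent count from 4 to 8 doubles the number of calls but roughly quadruples the input tokens, because each added turn is read again by every later call.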

Why You Should Think in "Dollars Per Task" Instead of Token Unit Price

Focusing only on raw token unit price ($/M tokens) is a trap. To be honest, I initially thought "GPT-3.5 is 20x cheaper than GPT-4, so just run everything on 3.5" — but failure rates climbed, retry costs exploded, and it ended up being more expensive overall.

$/successful workflow step: Cost per successful workflow step. You need to include the retry cost of failed steps to see the true TCO (Total Cost of Ownership). The simple $/M tokens metric hides failure rates.
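One way to see how failure rates hide behind unit price is to compute expected spend per successful task. The sketch below assumes independent step outcomes and that any failed step forces a full rerun of the task, which is a deliberately pessimistic simplification. The prices and success rates are hypothetical, chosen only to illustrate the crossover.

```python
def cost_per_successful_task(cost_per_step: float, step_success_rate: float,
                             steps: int) -> float:
    """Expected spend per successful task when any failed step forces a full rerun.

    Simplification: every attempt pays for all steps, and steps succeed independently.
    """
    if not 0.0 < step_success_rate <= 1.0:
        raise ValueError("step_success_rate must be in (0, 1]")
    task_success_rate = step_success_rate ** steps
    return (cost_per_step * steps) / task_success_rate


# Hypothetical numbers: the cheap model is 20x cheaper per step but less reliable.
cheap = cost_per_successful_task(cost_per_step=0.001, step_success_rate=0.70, steps=10)
frontier = cost_per_successful_task(cost_per_step=0.02, step_success_rate=0.98, steps=10)
print(f"cheap model:    ${cheap:.3f} per successful task")
print(f"frontier model: ${frontier:.3f} per successful task")
```

Under these assumptions the cheaper model ends up costing more per successful task, even though its per-step price is 20x lower.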

This is the same context behind Galileo's Agent Leaderboard v2 (as of mid-2025) starting to measure average action completion rate, tool selection quality, and average cost per session in a unified way.

Model Cascade Routing — Automatically Branching by Complexity

A cascade architecture that routes tasks to model tiers based on complexity is currently the most effective cost optimization pattern. The price gap between the cheapest low-cost models ($0.10–$1/M tokens) and frontier models ($15–$30+/M tokens) reaches 150x or more.

python
from litellm import completion
 
def cascade_route(task: str, context: dict) -> str:
    complexity = estimate_complexity(task)  # 0.0 ~ 1.0
 
    if complexity < 0.3:
        model = "claude-haiku-4-5"      # Simple classification, format conversion, summarization
    elif complexity < 0.7:
        model = "claude-sonnet-4-6"     # Code generation, multi-step reasoning
    else:
        model = "claude-opus-4-8"       # Complex architecture design, ambiguous requirements analysis
 
    response = completion(
        model=model,
        messages=[{"role": "user", "content": task}],
        metadata={"cost_tag": f"complexity_{complexity:.1f}"}  # For cost tracking
    )
    return response.choices[0].message.content

How estimate_complexity is implemented determines the entire quality of routing. You can start with a simple approach based on token count or keywords, or maintain a separate lightweight classification model. Defining the 0–1 scale criteria is the heart of this architecture.
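As a starting point, the keyword-and-length approach could look like the sketch below. The keyword lists, weights, and the 200-word normalization are all hypothetical placeholders that would need tuning against your own labeled tasks. Only the function name and the 0.3 and 0.7 thresholds come from the routing code above.

```python
import re

# Hypothetical keyword lists; tune them against your own labeled tasks.
HARD_HINTS = ("architecture", "design", "trade-off", "ambiguous", "refactor")
MEDIUM_HINTS = ("implement", "debug", "write code", "step by step", "analyze")


def estimate_complexity(task: str) -> float:
    """Return a rough 0.0-1.0 complexity score from length and keyword cues."""
    text = task.lower()
    words = len(re.findall(r"\w+", text))
    length_score = min(words / 200, 1.0) * 0.3   # up to 0.3 from prompt length
    if any(hint in text for hint in HARD_HINTS):
        keyword_score = 0.7                      # lands in the top tier
    elif any(hint in text for hint in MEDIUM_HINTS):
        keyword_score = 0.3                      # lands in the middle tier
    else:
        keyword_score = 0.0
    return min(length_score + keyword_score, 1.0)


for sample in ("Convert this date to ISO format",
               "Implement a function that parses CSV",
               "Design the architecture for a multi-region system"):
    print(f"{estimate_complexity(sample):.2f}  {sample}")
```

A heuristic like this is cheap to run but easy to fool, so it is worth logging the score next to the outcome of each task and replacing it with a trained classifier once enough labeled data exists.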

Compared to naive routing (using a single model regardless of complexity), cascade routing simultaneously achieves 14% additional performance improvement and 60–80% cost reduction on complex benchmarks.


Practical Application

Example 1: Reducing Input Tokens by 40–60% Through Agent Trajectory Compression

This is a pattern confirmed by the AgentDiet (2509.23586) paper on arXiv. When an agent accumulates tool call results verbatim, by later calls the early intermediate outputs fill the context window. The detailed contents of already-completed subtasks are often unnecessary for the final answer.

python
class TrajectoryCompressor:
    def __init__(self, max_tokens: int = 4000):
        # Target context budget; kept for callers to check against, not enforced below
        self.max_tokens = max_tokens

    def compress(self, trajectory: list[dict]) -> list[dict]:
        """Replace completed steps with their summaries; pass other steps through."""
        compressed = []
        for step in trajectory:
            # .get() avoids a KeyError on steps that carry no status field
            if step.get("status") == "completed":
                # step["summary"] is a summary pre-generated by a separate LLM call.
                # The summary generation incurs its own cost, but the tokens saved
                # in subsequent calls far outweigh it, so overall cost decreases.
                compressed.append({
                    "role": "assistant",
                    "content": f"[Completed] {step['summary']}",
                    "status": "compressed"
                })
            else:
                compressed.append(step)
        return compressed

| Metric | Before Compression | After Compression | Reduction |
| --- | --- | --- | --- |
| Input tokens | 100% | 40–60% | 39.9–59.7% |
| Total compute cost | 100% | 64–79% | 21.1–35.9% |
| Agent performance | Baseline | Equivalent | No loss |

Example 2: Cutting Repeated Planning Costs in Half with Plan Caching

If you're repeatedly processing the same kind of task — for example, running a PR review agent hundreds of times a day — plan caching (Agentic Plan Caching, arXiv 2506.14852) is the fastest option. It reuses previously successful planning results as "test-time memory."

When I first applied this pattern, I was able to see results immediately with a surprisingly simple implementation. The key is building a cache key based on task type and context structure.

python
import hashlib
import json
 
class AgentPlanCache:
    def __init__(self):
        self._cache: dict[str, dict] = {}
 
    def _task_fingerprint(self, task_type: str, context_keys: list[str]) -> str:
        payload = json.dumps({"type": task_type, "keys": sorted(context_keys)})
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
 
    def get_plan(self, task_type: str, context_keys: list[str]) -> dict | None:
        key = self._task_fingerprint(task_type, context_keys)
        return self._cache.get(key)
 
    def store_plan(self, task_type: str, context_keys: list[str], plan: dict, score: float):
        if score >= 0.95:  # Cache only plans with a success rate of 95% or higher
            key = self._task_fingerprint(task_type, context_keys)
            self._cache[key] = {"plan": plan, "score": score}
 
# Reported in the paper (not measured on this sketch): 50.31% reduction in serving cost,
# 27.28% reduction in latency, 96.61% of optimal performance retained

Example 3: Prompt Caching Strategy — What You Cache Is What Matters

The comparative experiment in arXiv 2601.06007 tested three strategies. All three can be compared using the same API call structure.

python
import anthropic
 
client = anthropic.Anthropic()
 
# Strategy 1: Full context caching
# Simple to implement, but if any part of the context changes, the entire cache is invalidated
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": full_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_task}]
)
 
# Strategy 2: System prompt caching only — the most versatile and suitable for most situations
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": system_prompt,           # Cache only the fixed system prompt
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_task}]  # Exclude dynamic parts from caching
)
 
# Strategy 3: Cache excluding dynamic tool results
# Cache the system prompt and fixed instructions; exclude frequently changing tool call results
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": static_instructions, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": dynamic_tool_results}  # Excluded from cache scope
        ]
    }]
)

| Strategy | Cost Reduction | Best Fit |
| --- | --- | --- |
| Full context caching | 41–55% | Static tasks, batch processing |
| System prompt caching only | 55–70% | General-purpose agents, most situations |
| Cache excluding dynamic tool results | 65–80% | Multi-turn agents, tool-dependent workflows |

All three strategies consistently measured 41–80% cost reductions across all major providers (Anthropic, OpenAI, Google).
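To reason about when caching a prefix pays off, a small break-even calculation helps. The sketch below assumes cache writes are billed at 1.25x and cache reads at 0.1x the base input price. Those are the multipliers Anthropic has published for its default short-lived cache, but treat them as assumptions and check current pricing for your provider. It also assumes every call after the first hits the cache, and it covers only the cached prefix, not output tokens or dynamic content.

```python
def cached_prefix_cost_ratio(calls: int, write_multiplier: float = 1.25,
                             read_multiplier: float = 0.10) -> float:
    """Cost of a cached prefix over `calls` requests, relative to sending it uncached.

    The first call writes the cache; every later call is assumed to read it.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    cached = write_multiplier + (calls - 1) * read_multiplier
    uncached = float(calls)
    return cached / uncached


for n in (1, 2, 20):
    print(f"{n:>2} calls: prefix costs {cached_prefix_cost_ratio(n):.2%} of the uncached price")
```

A single call costs more with caching than without, so caching only helps for prefixes that are actually reused. The overall savings in the table are smaller than the prefix-only ratio because uncached tokens still make up part of every bill.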

Example 4: State-Based Cost Control with LangGraph

LangGraph's state machine structure is well-suited for reducing repeated LLM calls by 40–50%. Because already-processed states are tracked explicitly, unnecessary re-evaluation is prevented. In particular, treating token budget as part of state lets cost control logic blend naturally into the workflow graph.

python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator
 
class AgentState(TypedDict):
    task: str
    plan: list[str]
    completed_steps: Annotated[list[str], operator.add]
    token_budget: int   # Manage cost ceiling as state
    tokens_used: int
 
def check_budget(state: AgentState) -> str:
    if state["tokens_used"] >= state["token_budget"]:
        return "budget_exceeded"
    if not state["plan"]:
        return END
    return "execute_step"
 
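# plan_task, execute_step, and synthesize_result are node functions defined elsewhere
builder = StateGraph(AgentState)
builder.add_node("planner", plan_task)
builder.add_node("executor", execute_step)
builder.add_node("synthesizer", synthesize_result)

builder.set_entry_point("planner")
builder.add_edge("planner", "executor")

# Embed cost control branching into the graph structure
builder.add_conditional_edges("executor", check_budget, {
    "execute_step": "executor",
    "budget_exceeded": "synthesizer",   # Terminate with partial result when budget is exceeded
    END: "synthesizer"                  # Plan finished: synthesize the final answer
})
builder.add_edge("synthesizer", END)

graph = builder.compile()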
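The budget check only works if the executor node keeps tokens_used up to date. Here is one possible shape for that node, written as plain Python so it runs without LangGraph. `run_step` is a hypothetical stand-in for the real LLM call, the fixed 300-token cost is invented for illustration, and the loop at the bottom mimics by hand what the graph's conditional edge and list reducer would do.

```python
from typing import TypedDict


class BudgetState(TypedDict):
    plan: list[str]
    completed_steps: list[str]
    token_budget: int
    tokens_used: int


def run_step(step: str) -> tuple[str, int]:
    """Hypothetical stand-in for an LLM call: returns (result, tokens consumed)."""
    return f"done: {step}", 300


def execute_step(state: BudgetState) -> dict:
    """Run the next planned step and return only the state fields that changed."""
    step, *remaining = state["plan"]
    _result, tokens = run_step(step)
    return {
        "plan": remaining,
        "completed_steps": [step],  # a list reducer would append this
        "tokens_used": state["tokens_used"] + tokens,
    }


def route(state: BudgetState) -> str:
    """Same decision as check_budget, using plain strings."""
    if state["tokens_used"] >= state["token_budget"]:
        return "budget_exceeded"
    return "execute_step" if state["plan"] else "done"


state: BudgetState = {"plan": ["a", "b", "c", "d"], "completed_steps": [],
                      "token_budget": 700, "tokens_used": 0}
while True:
    update = execute_step(state)
    state = {**state, **update,
             "completed_steps": state["completed_steps"] + update["completed_steps"]}
    if route(state) != "execute_step":
        break
print(state)
```

Because the check runs after each step, the run can overshoot the budget by up to one step's worth of tokens. If a hard ceiling matters, check the remaining budget before calling the model instead.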

Pros and Cons Analysis

Advantages

| Item | Description |
| --- | --- |
| Cost scales with complexity | Replaces the static workflow's cost ∝ volume structure with cost ∝ complexity, eliminating wasteful charges for simple tasks |
| Leveraging model tiers | Exploiting the 150x unit price gap achieves 60–80% savings without quality loss |
| Domain-specific efficiency | Compared to general-purpose LLMs, domain-specific agents maintain higher accuracy while costing 4.4–10.8x less |
| Proactive cost control | Applying gateway-level policies prevents after-the-fact billing shock |
Proactive cost control Applying gateway-level policies prevents after-the-fact billing shock

Disadvantages and Caveats

| Item | Description | Mitigation |
| --- | --- | --- |
| System prompt repetition | 2K-token prompt wastes 40K tokens in a 20-turn loop | Prompt caching or prompt reduction |
| Context explosion | 80–120K token context can form within 2–3 weeks | Embed dynamic truncation policy at design time |
| Silent cost increases | Detection is delayed without observability tooling | Adopt AgentOps/Langfuse ($200–$1,500/month) |
| Multi-agent overhead | Coordination cost multiplier of 4–15x as agent count grows | Include cost model when deciding on number of agents |
| Routing misclassification risk | Wrong model tier assignment → quality degradation or unnecessary high cost | Set classification model evaluation metrics based on $/successful step |
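For the context explosion row, a dynamic truncation policy can start as simply as a sliding window. The sketch below keeps the system message and as many of the most recent messages as fit in a token budget. The function name is hypothetical, and the default token counter is a crude 4-characters-per-token approximation that should be swapped for your model's real tokenizer.

```python
def truncate_history(messages: list[dict], max_tokens: int,
                     count_tokens=lambda text: max(1, len(text) // 4)) -> list[dict]:
    """Keep the system message plus the most recent messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for message in reversed(rest):            # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break                             # older messages no longer fit
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))      # restore chronological order


history = [
    {"role": "system", "content": "s" * 40},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
]
trimmed = truncate_history(history, max_tokens=220)
print([m["content"][0] for m in trimmed])
```

Dropping old turns outright loses information, so in practice this is often combined with summarizing the dropped span, as in the trajectory compression example earlier.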

KV Cache (Key-Value Cache): A mechanism by which transformer models store computed results for previously processed tokens to skip recomputation on identical context. Inference engines like vLLM and SGLang reduce repeated context costs at the infrastructure level through KV cache optimization.

TCO (Total Cost of Ownership): Total cost including not just token costs but also retry costs, orchestration overhead, observability tool costs, and engineering time.

The Most Common Mistakes in Practice

  1. Comparing unit prices while ignoring success rates: Routing to a cheaper model but seeing failure rates rise until retry costs exceed the cost of calling a frontier model directly is more common than you'd think.
  2. Deferring context truncation policy until later: Starting with "it's just testing, so it's fine" means that within 2–3 weeks of hitting production, context exceeds 100K and latency and costs explode simultaneously.
  3. Attempting optimization without observability: Adding caching or swapping models without visibility into where tokens are leaking makes it impossible to know whether the change had any effect. Tools like Langfuse or AgentOps are an investment, not a cost.

Closing Thoughts

The cost problem in agentic workflows is not about "using a cheaper model" — it's about designing cost control structures into the entire execution path.

Three steps you can start on right now:

  1. Measure per-step token consumption in your current workflow. If you're using LangChain/LangGraph, pip install langfuse and adding a single callback line lets you immediately see which step consumes the most tokens. Even outside that stack, you can get the same visibility with AgentOps or an OpenTelemetry export.
  2. Start by applying system prompt caching. With the Anthropic API, adding a cache_control: {"type": "ephemeral"} field to the system prompt content block cuts repeated system prompt costs by 55–70%. Implementation difficulty is low and the effect is immediate.
  3. Shift your performance metric from "accuracy" to "accuracy per dollar." If you've been evaluating models by accuracy alone, measure cost on the same dataset and compare by $/successful step — you'll often discover an optimal model combination that contradicts your intuition.

Applying just these three changes will produce a meaningful difference within the first month for most teams. Bookmark this and try applying one per sprint.


References

  • Reducing Cost of LLM Agents with Trajectory Reduction | arXiv
  • Don't Break the Cache: Evaluation of Prompt Caching for Long-Horizon Agentic Tasks | arXiv
  • Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents | arXiv
  • LLM Cost Optimization for Agent Workflows: A Practical Guide | DEV Community
  • The Hidden Costs of Agentic AI: Why 40% of Projects Fail Before Production | Galileo
  • AI Cost Optimization: A Practical Guide for 2026 | TrueFoundry
  • Dynamic Routing for Multi-Agent AI Workflows | TDCommons
  • Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective | arXiv
  • Agentic AI Costs More Than You Budgeted | DataRobot