Reducing LLM Agent Pipeline Token Costs by 50%: A Practical Comparison of Summary Agent vs. Chunk Injection vs. Prompt Caching
When we first deployed our 10-step agent pipeline to production, we were being charged $0.80 per request. At 1,000 requests a day, that came to $24,000 a month. The pipeline itself ran well, but the API costs were threatening the business model. When we dug into the cause, the answer was simple: every request was resending the entire previous conversation, system prompt, and search results.
This article is a practical guide for developers running multi-step agents on LangChain, LlamaIndex, or direct API calls. It compares three key strategies for context compression, each with runnable code: summary agents, chunk injection, and prompt caching. Combining these three strategies hierarchically can reduce token costs by 50–90%.
Key Concepts
Why Context Explodes in the Agent Pipeline
LLM APIs charge for every input token on every request. When an agent executes 10 steps and the result of step 1 is accumulated into the input of step 2, the result of step 2 into the input of step 3, and so on, the input for the final step becomes the sum of all previous steps. Three structural problems overlap here.
| Cause | Explanation | Risk |
|---|---|---|
| Accumulated Context | The entire previous conversation is resent on every request | Cost skyrockets in proportion to the number of steps |
| Context Window Limit | Agent fails if model's maximum tokens are exceeded | Service interruption |
| Lost in the Middle | As context lengthens, focus on key information decreases | Quality degradation |
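To see how quickly accumulation compounds, here is a minimal sketch in pure Python. The 500-tokens-per-step figure is a hypothetical number for illustration, not from the pipeline above:

```python
# Naive accumulation: every step resends all previous steps' content.
# Assumes each step adds ~500 tokens (hypothetical figure for illustration).
def total_billed_input_tokens(steps: int, tokens_per_step: int = 500) -> int:
    total = 0
    context = 0  # tokens carried into the current step
    for _ in range(steps):
        context += tokens_per_step  # this step's new content joins the context
        total += context            # the entire context is billed as input
    return total

print(total_billed_input_tokens(10))      # 500 + 1000 + ... + 5000 = 27500
print(total_billed_input_tokens(1) * 10)  # 5000 if each step were independent
```

The billed total grows quadratically with the number of steps, while independent steps grow only linearly. That gap is exactly what the three strategies below attack.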
Context compression is not just simple text reduction. It is an agent memory management strategy that includes selective chunk dropping, semantics-preserving summarization, and KV cache reuse.
Three Strategies at a Glance
| Strategy | Key Idea | Cost Reduction | Implementation Complexity | Risk of Information Loss |
|---|---|---|---|---|
| Summary Agent | Periodically replace conversation history with summaries | Up to 90% | Low | High |
| Chunk Injection | Selectively inject only query-related chunks from the vector DB | 40%+ | High | Low |
| Prompt Caching | Server-side reuse of recurring static contexts | 50–90% | Very Low | None |
Practical Application
Example 1: Summary Agent — Sliding Window Summary
When the conversation reaches a threshold (e.g., 8,000 tokens), a separate LLM call compresses the history down to its essentials. A 6,000-token history can be compressed to a 500-token summary, roughly 90%.
```python
from anthropic import Anthropic

client = Anthropic()

def summarize_history(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Compress accumulated messages into a summary once they exceed the threshold."""
    # Rough token estimate (in a real implementation, count precisely with tiktoken etc.)
    estimated_tokens = sum(len(m["content"].split()) * 1.3 for m in messages)
    if estimated_tokens < max_tokens:
        return messages

    # Keep the 2 most recent messages; summarize the rest
    recent = messages[-2:]
    to_summarize = messages[:-2]

    summary_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a conversation summarization expert. Summarize concisely, preserving only key facts, decisions, and context.",
        messages=[
            {
                "role": "user",
                "content": f"Summarize the following conversation in 500 tokens or less:\n\n{str(to_summarize)}"
            }
        ]
    )
    summary_text = summary_response.content[0].text

    # Replace the summarized portion with the summary as the first message
    return [
        {"role": "user", "content": f"[Summary of previous conversation]\n{summary_text}"},
        {"role": "assistant", "content": "I have the summarized context. Continuing."},
        *recent
    ]

# Example agent loop (pipeline_steps: your list of step prompts)
messages = []
for step in pipeline_steps:
    messages = summarize_history(messages)  # compression check before each step
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=messages + [{"role": "user", "content": step}]
    )
    messages.append({"role": "user", "content": step})
    messages.append({"role": "assistant", "content": response.content[0].text})
```

| Aspect | Result |
|---|---|
| 6,000 Token History → Summary | ~500 Tokens (92% Compressed) |
| Additional LLM call cost | Summary 1 call ≈ $0.002 (Sonnet) |
| Loss of details | 5–20% depending on summary quality |
Example 2: Chunk Injection — Layered RAG Compression
We use LangChain's ContextualCompressionRetriever. Applying Selection (low-cost filter) before Extraction (high-cost LLM) balances cost and accuracy.
```python
from langchain_anthropic import ChatAnthropic
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    LLMChainExtractor,
    LLMChainFilter,
)
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

llm = ChatAnthropic(model="claude-sonnet-4-6")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)

# Stage 1: cheaply drop documents that are entirely irrelevant
filter_compressor = LLMChainFilter.from_llm(llm)

# Stage 2: extract only query-relevant sentences from what remains
# (expensive, but it runs only after filtering)
extractor_compressor = LLMChainExtractor.from_llm(llm)

# Layering: filter first, extract second
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[filter_compressor, extractor_compressor]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Usage example
query = "how to apply prompt caching"
compressed_docs = compression_retriever.get_relevant_documents(query)
# 10 documents retrieved -> only query-relevant core sentences returned
# (40-60% token reduction on average)
```

If you want direct control over the chunk selection criteria, add a similarity-score-based filter.
```python
from langchain_core.documents import Document

def filter_by_similarity(docs_with_scores: list[tuple], threshold: float = 0.75) -> list[Document]:
    """Drop documents below a cosine-similarity threshold (can be combined with BM25 scores)."""
    # Note: some stores (e.g., Chroma) return a distance where LOWER is better;
    # invert the comparison accordingly for those stores.
    return [doc for doc, score in docs_with_scores if score >= threshold]

# Retrieve with similarity scores from the vector store
docs_with_scores = vectorstore.similarity_search_with_score(query, k=10)
filtered_docs = filter_by_similarity(docs_with_scores, threshold=0.75)
```

Example 3: Prompt Caching — Anthropic cache_control
Prompt caching is most powerful when you repeatedly send the same system prompt or long documents. Cache writes cost 1.25× the base input rate, but cache reads cost only 0.1×, so the write premium is recovered with a single reuse.
```python
import anthropic

client = anthropic.Anthropic()

# Cache a long system prompt or document (a minimum of 1,024 tokens is required)
SYSTEM_PROMPT = """You are a code review expert...
[detailed guidelines of 2,000+ tokens]
"""

def analyze_with_cache(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # 5-minute TTL cache
            }
        ],
        messages=[{"role": "user", "content": f"Please review the following code:\n{code_snippet}"}]
    )

    # Check the cache hit rate
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    print(f"Regular input tokens: {usage.input_tokens}")

    cache_hit_rate = usage.cache_read_input_tokens / (
        usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens
    ) * 100
    print(f"Cache hit rate: {cache_hit_rate:.1f}%")

    return response.content[0].text

# Repeated calls with the same system prompt -- cache hits from the second request on
for code in code_snippets:
    result = analyze_with_cache(code)
```

The same applies in TypeScript: caching can be added to existing code with minimal modification.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Cache a ~2,000-token system prompt
const CACHED_SYSTEM = `You are a technical documentation expert...
[detailed guidelines -- minimum 1,024 tokens]`;

async function generateWithCache(userPrompt: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: CACHED_SYSTEM,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userPrompt }],
  });

  const usage = response.usage as any;
  console.log(`Cache read: ${usage.cache_read_input_tokens ?? 0} tokens`);

  return response.content[0].type === "text" ? response.content[0].text : "";
}
```

Actual cost reduction cases:
| Scenario | Before Caching | After Caching | Savings Rate |
|---|---|---|---|
| 50 PDF Iterations | $3.00 | $0.15 | 95% |
| 6-person team, 8 weeks of Claude usage | $2,400/month | $680/month | 72% |
| 2,000-token system prompt, 10 calls/hour | Baseline | 1/8 cost | 87.5% |
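The break-even arithmetic follows directly from the published multipliers (1.25× for cache writes, 0.1× for cache reads). A minimal sketch, assuming one cache write followed by cache hits within the TTL, and applying only to the cached portion of the input:

```python
def cached_cost_ratio(n_calls: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of n calls with caching (1 write + n-1 reads) relative to n uncached calls."""
    if n_calls < 1:
        raise ValueError("n_calls must be >= 1")
    return (write_mult + (n_calls - 1) * read_mult) / n_calls

print(f"{cached_cost_ratio(1):.3f}")   # 1.250 -- a single call costs 25% extra
print(f"{cached_cost_ratio(2):.3f}")   # 0.675 -- already cheaper after one reuse
print(f"{cached_cost_ratio(50):.3f}")  # 0.123 -- approaches the 0.1 floor
```

This is why a cached prefix pays for itself on the very first reuse, and why the savings asymptotically approach 90% as the hit count grows.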
Pros and Cons Analysis
Advantages
| Strategy | Advantages |
|---|---|
| Summary Agent | Intuitive implementation, fundamentally resolves context window issues, maintains natural conversation flow |
| Chunk Injection | Maintains accuracy (includes only necessary information), capable of processing large-scale documents, can be combined with other strategies |
| Prompt Caching | Maximum cost efficiency (up to 90%), latency reduced by up to 85%, minimal code changes |
Disadvantages and Precautions
| Strategy | Caveat | Mitigation |
|---|---|---|
| Summary Agent | Risk of permanently losing details | Store the original history in separate storage; extract important facts in a structured form |
| Summary Agent | Summarization requires additional LLM calls | Use a small model (Haiku); raise the threshold to reduce call frequency |
| Chunk Injection | Vector DB operational overhead | Use a managed service (Pinecone, Weaviate Cloud) |
| Chunk Injection | Indirect prompt injection vulnerability | Treat retrieved chunks as untrusted input and sandbox them within trust boundaries |
| Prompt Caching | Requires at least 1,024 tokens | For shorter prompts, combine with semantic caching (e.g., Redis) |
| Prompt Caching | Any context change invalidates the cache | Move frequently changing parts to the end of the prompt |
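The last mitigation (stable content first, volatile content last) can be sketched as a request builder. `build_request`, `STATIC_GUIDELINES`, and the date field are hypothetical names for illustration, but the ordering principle is real: everything before the `cache_control` breakpoint must be byte-identical across calls to get a hit.

```python
STATIC_GUIDELINES = "You are a code review expert... [long, stable guidelines]"

def build_request(user_query: str, today: str) -> dict:
    """Order the prompt so the stable prefix is cacheable and volatile parts come last."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            {
                # Stable prefix: identical on every call, so it can be cached
                "type": "text",
                "text": STATIC_GUIDELINES,
                "cache_control": {"type": "ephemeral"},
            },
            {
                # Volatile suffix: changes per call, placed AFTER the cache breakpoint
                "type": "text",
                "text": f"Today's date: {today}",
            },
        ],
        "messages": [{"role": "user", "content": user_query}],
    }

req = build_request("Review this diff", "2025-01-15")
assert "cache_control" in req["system"][0]      # cached prefix
assert "cache_control" not in req["system"][1]  # volatile part stays uncached
```

If the date were embedded inside the first system block instead, every new day would rewrite the cached prefix and force a fresh cache write.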
The Most Common Mistakes in Practice
- Introducing all three strategies simultaneously: it becomes impossible to measure which strategy contributed to the cost reduction. Apply them sequentially, one at a time, and measure the effect of each stage with the `usage` metrics before moving on to the next.
- Fixing chunk sizes to a single value: small chunks (512 tokens) work well for code, while large chunks (1,500 tokens) work well for narrative documents; no single chunk size is optimal for every domain. Use `RecursiveCharacterTextSplitter` with separate settings per content type.
- Not monitoring the cache hit rate: if you do not track `usage.cache_read_input_tokens`, you cannot tell whether the cache is actually working. A hit rate below 50% usually means frequently changing parts sit before the cache target, so review the prompt structure.
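The chunk-size point above can be sketched as a per-content-type configuration. The `CHUNK_CONFIG` values mirror the figures in the bullet and are starting points to tune, not universal constants; they would feed something like `RecursiveCharacterTextSplitter(chunk_size=..., chunk_overlap=...)`:

```python
# Per-content-type chunking config (figures from the bullet above; tune per corpus).
CHUNK_CONFIG = {
    "code":      {"chunk_size": 512,  "chunk_overlap": 64},
    "narrative": {"chunk_size": 1500, "chunk_overlap": 150},
}

def chunk_config(content_type: str) -> dict:
    """Pick a chunking config by content type, defaulting to the narrative settings."""
    return CHUNK_CONFIG.get(content_type, CHUNK_CONFIG["narrative"])

print(chunk_config("code"))  # {'chunk_size': 512, 'chunk_overlap': 64}
```

Keeping the numbers in one config table also makes it easy to A/B-test chunk sizes per corpus instead of hard-coding them at each call site.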
In Conclusion
The token cost issue in agent pipelines is resolved through context management strategies, not model replacement. Summary agents are easy to implement but risk information loss; chunk injection offers high accuracy but entails infrastructure complexity; and prompt caching is the most cost-effective but only applies to fixed contexts. In production, a hierarchical combination of the three makes a 50–90% reduction realistically achievable.
3 Steps to Start Right Now:
- This week: apply prompt caching. Add a single line, `cache_control: {"type": "ephemeral"}`, to the system prompt of your existing agents and measure hit rates with `usage.cache_read_input_tokens`. Minimal code changes, immediate cost savings.
- Next week: add a summary agent. Apply summary compression with a small model (claude-haiku-4-5) when conversation history exceeds 8,000 tokens. Adjust the threshold after evaluating summary quality on a week of data.
- This month: introduce chunk injection. Apply `ContextualCompressionRetriever` to agents that include document search. A similarity threshold (0.75) and a layered pipeline deliver both accuracy and cost savings.
Next Post: Anthropic’s Official Context Engineering Framework — How to Apply the Write / Select / Compress / Isolate 4 Strategies to Real-World Multi-Agent Systems