Reducing LLM Agent Pipeline Token Costs by 50%: A Practical Comparison of Summary Agent vs. Chunk Injection vs. Prompt Caching
When we first deployed our 10-step agent pipeline to production, we were being charged $0.80 per request. At 1,000 requests a day, that came to $24,000 a month. The pipeline itself ran well, but the API costs were threatening the business model. When we dug into the cause, the answer was simple: every request was resending the entire previous conversation, system prompt, and search results.
This article is a practical guide for developers running multi-step agents on LangChain, LlamaIndex, or direct API calls. It compares three key strategies for context compression, each with runnable code: summary agents, chunk injection, and prompt caching. Combining these three strategies hierarchically can reduce token costs by 50–90%.
Key Concepts
Why Context Explodes in the Agent Pipeline
LLM APIs charge for every input token on every request. When an agent executes 10 steps and the result of step 1 is accumulated into the input of step 2, the result of step 2 into the input of step 3, and so on, the input for the final step becomes the sum of all previous steps. Three structural problems overlap here.
| Cause | Explanation | Risk |
|---|---|---|
| Accumulated Context | The entire previous conversation is resent on every request | Cost skyrockets in proportion to the number of steps |
| Context Window Limit | Agent fails if model's maximum tokens are exceeded | Service interruption |
| Lost in the Middle | As context lengthens, focus on key information decreases | Quality degradation |
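To see how quickly accumulation compounds, here is a minimal sketch in pure Python. The 500-tokens-per-step figure is a hypothetical number for illustration, not from the pipeline above:

```python
# Naive accumulation: every step resends all previous steps' content.
# Assumes each step adds ~500 tokens (hypothetical figure for illustration).
def total_billed_input_tokens(steps: int, tokens_per_step: int = 500) -> int:
    total = 0
    context = 0  # tokens carried into the current step
    for _ in range(steps):
        context += tokens_per_step  # this step's new content joins the context
        total += context            # the entire context is billed as input
    return total

print(total_billed_input_tokens(10))      # 500 + 1000 + ... + 5000 = 27500
print(total_billed_input_tokens(1) * 10)  # 5000 if each step were independent
```

The billed total grows quadratically with the number of steps, while independent steps grow only linearly. That gap is exactly what the three strategies below attack.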
Context compression is not just simple text reduction. It is an agent memory management strategy that includes selective chunk dropping, semantics-preserving summarization, and KV cache reuse.
Three Strategies at a Glance
| Strategy | Key Idea | Cost Reduction | Implementation Complexity | Risk of Information Loss |
|---|---|---|---|---|
| Summary Agent | Periodically replace conversation history with summaries | Up to 90% | Low | High |
| Chunk Injection | Selectively inject only query-related chunks from the vector DB | 40%+ | High | Low |
| Prompt Caching | Server-side reuse of recurring static contexts | 50–90% | Very Low | None |
Practical Application
Example 1: Summary Agent — Sliding Window Summary
When the conversation reaches a threshold (e.g., 8,000 tokens), a separate LLM call compresses the history down to its essentials. A 6,000-token history can be compressed to a 500-token summary, roughly 90%.
```python
from anthropic import Anthropic

client = Anthropic()

def summarize_history(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Compress accumulated messages into a summary once they exceed the threshold."""
    # Rough token estimate (in a real implementation, count precisely with tiktoken etc.)
    estimated_tokens = sum(len(m["content"].split()) * 1.3 for m in messages)
    if estimated_tokens < max_tokens:
        return messages

    # Keep the 2 most recent messages; summarize the rest
    recent = messages[-2:]
    to_summarize = messages[:-2]

    summary_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a conversation summarization expert. Summarize concisely, preserving only key facts, decisions, and context.",
        messages=[
            {
                "role": "user",
                "content": f"Summarize the following conversation in 500 tokens or less:\n\n{str(to_summarize)}"
            }
        ]
    )
    summary_text = summary_response.content[0].text

    # Replace the summarized portion with the summary as the first message
    return [
        {"role": "user", "content": f"[Summary of previous conversation]\n{summary_text}"},
        {"role": "assistant", "content": "I have the summarized context. Continuing."},
        *recent
    ]

# Example agent loop (pipeline_steps: your list of step prompts)
messages = []
for step in pipeline_steps:
    messages = summarize_history(messages)  # compression check before each step
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=messages + [{"role": "user", "content": step}]
    )
    messages.append({"role": "user", "content": step})
    messages.append({"role": "assistant", "content": response.content[0].text})
```

| Aspect | Result |
|---|---|
| 6,000 Token History → Summary | ~500 Tokens (92% Compressed) |
| Additional LLM call cost | Summary 1 call ≈ $0.002 (Sonnet) |
| Loss of details | 5–20% depending on summary quality |
Example 2: Chunk Injection — Layered RAG Compression
We use LangChain's ContextualCompressionRetriever. Applying Selection (low-cost filter) before Extraction (high-cost LLM) balances cost and accuracy.
```python
from langchain_anthropic import ChatAnthropic
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    LLMChainExtractor,
    LLMChainFilter,
)
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

llm = ChatAnthropic(model="claude-sonnet-4-6")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)

# Stage 1: cheaply drop documents that are entirely irrelevant
filter_compressor = LLMChainFilter.from_llm(llm)

# Stage 2: extract only query-relevant sentences from what remains
# (expensive, but it runs only after filtering)
extractor_compressor = LLMChainExtractor.from_llm(llm)

# Layering: filter first, extract second
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[filter_compressor, extractor_compressor]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Usage example
query = "how to apply prompt caching"
compressed_docs = compression_retriever.get_relevant_documents(query)
# 10 documents retrieved -> only query-relevant core sentences returned
# (40-60% token reduction on average)
```

If you want direct control over the chunk selection criteria, add a similarity-score-based filter.
```python
from langchain_core.documents import Document

def filter_by_similarity(docs_with_scores: list[tuple], threshold: float = 0.75) -> list[Document]:
    """Drop documents below a cosine-similarity threshold (can be combined with BM25 scores)."""
    # Note: some stores (e.g., Chroma) return a distance where LOWER is better;
    # invert the comparison accordingly for those stores.
    return [doc for doc, score in docs_with_scores if score >= threshold]

# Retrieve with similarity scores from the vector store
docs_with_scores = vectorstore.similarity_search_with_score(query, k=10)
filtered_docs = filter_by_similarity(docs_with_scores, threshold=0.75)
```

Example 3: Prompt Caching — Anthropic cache_control
Prompt caching is most powerful when you repeatedly send the same system prompt or long documents. Cache writes cost 1.25× the base input rate, but cache reads cost only 0.1×, so the write premium is recovered with a single reuse.
```python
import anthropic

client = anthropic.Anthropic()

# Cache a long system prompt or document (a minimum of 1,024 tokens is required)
SYSTEM_PROMPT = """You are a code review expert...
[detailed guidelines of 2,000+ tokens]
"""

def analyze_with_cache(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # 5-minute TTL cache
            }
        ],
        messages=[{"role": "user", "content": f"Please review the following code:\n{code_snippet}"}]
    )

    # Check the cache hit rate
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    print(f"Regular input tokens: {usage.input_tokens}")

    cache_hit_rate = usage.cache_read_input_tokens / (
        usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens
    ) * 100
    print(f"Cache hit rate: {cache_hit_rate:.1f}%")

    return response.content[0].text

# Repeated calls with the same system prompt -- cache hits from the second request on
for code in code_snippets:
    result = analyze_with_cache(code)
```

The same applies in TypeScript: caching can be added to existing code with minimal modification.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Cache a ~2,000-token system prompt
const CACHED_SYSTEM = `You are a technical documentation expert...
[detailed guidelines -- minimum 1,024 tokens]`;

async function generateWithCache(userPrompt: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: CACHED_SYSTEM,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userPrompt }],
  });

  const usage = response.usage as any;
  console.log(`Cache read: ${usage.cache_read_input_tokens ?? 0} tokens`);

  return response.content[0].type === "text" ? response.content[0].text : "";
}
```

Actual cost reduction cases:
| Scenario | Before Caching | After Caching | Savings Rate |
|---|---|---|---|
| 50 PDF Iterations | $3.00 | $0.15 | 95% |
| 6-person team, 8 weeks of Claude usage | $2,400/month | $680/month | 72% |
| 2,000-token system prompt, 10 calls/hour | Baseline | 1/8 cost | 87.5% |
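The break-even arithmetic follows directly from the published multipliers (1.25× for cache writes, 0.1× for cache reads). A minimal sketch, assuming one cache write followed by cache hits within the TTL, and applying only to the cached portion of the input:

```python
def cached_cost_ratio(n_calls: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of n calls with caching (1 write + n-1 reads) relative to n uncached calls."""
    if n_calls < 1:
        raise ValueError("n_calls must be >= 1")
    return (write_mult + (n_calls - 1) * read_mult) / n_calls

print(f"{cached_cost_ratio(1):.3f}")   # 1.250 -- a single call costs 25% extra
print(f"{cached_cost_ratio(2):.3f}")   # 0.675 -- already cheaper after one reuse
print(f"{cached_cost_ratio(50):.3f}")  # 0.123 -- approaches the 0.1 floor
```

This is why a cached prefix pays for itself on the very first reuse, and why the savings asymptotically approach 90% as the hit count grows.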
Pros and Cons Analysis
Advantages
| Strategy | Advantages |
|---|---|
| Summary Agent | Intuitive implementation, fundamentally resolves context window issues, maintains natural conversation flow |
| Chunk Injection | Maintains accuracy (includes only necessary information), capable of processing large-scale documents, can be combined with other strategies |
| Prompt Caching | Maximum cost efficiency (up to 90%), latency reduced by up to 85%, minimal code changes |
Disadvantages and Precautions
| Strategy | Caveat | Mitigation |
|---|---|---|
| Summary Agent | Risk of permanently losing details | Store the original history in separate storage; extract important facts in a structured form |
| Summary Agent | Summarization requires additional LLM calls | Use a small model (Haiku); raise the threshold to reduce call frequency |
| Chunk Injection | Vector DB operational overhead | Use a managed service (Pinecone, Weaviate Cloud) |
| Chunk Injection | Indirect prompt injection vulnerability | Treat retrieved chunks as untrusted input and sandbox them within trust boundaries |
| Prompt Caching | Requires at least 1,024 tokens | For shorter prompts, combine with semantic caching (e.g., Redis) |
| Prompt Caching | Any context change invalidates the cache | Move frequently changing parts to the end of the prompt |
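The last mitigation (stable content first, volatile content last) can be sketched as a request builder. `build_request`, `STATIC_GUIDELINES`, and the date field are hypothetical names for illustration, but the ordering principle is real: everything before the `cache_control` breakpoint must be byte-identical across calls to get a hit.

```python
STATIC_GUIDELINES = "You are a code review expert... [long, stable guidelines]"

def build_request(user_query: str, today: str) -> dict:
    """Order the prompt so the stable prefix is cacheable and volatile parts come last."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            {
                # Stable prefix: identical on every call, so it can be cached
                "type": "text",
                "text": STATIC_GUIDELINES,
                "cache_control": {"type": "ephemeral"},
            },
            {
                # Volatile suffix: changes per call, placed AFTER the cache breakpoint
                "type": "text",
                "text": f"Today's date: {today}",
            },
        ],
        "messages": [{"role": "user", "content": user_query}],
    }

req = build_request("Review this diff", "2025-01-15")
assert "cache_control" in req["system"][0]      # cached prefix
assert "cache_control" not in req["system"][1]  # volatile part stays uncached
```

If the date were embedded inside the first system block instead, every new day would rewrite the cached prefix and force a fresh cache write.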
The Most Common Mistakes in Practice
- Introducing all three strategies simultaneously: it becomes impossible to measure which strategy contributed to the cost reduction. Apply them sequentially, one at a time, and measure the effect of each stage with the `usage` metrics before moving on to the next.
- Fixing chunk sizes to a single value: small chunks (512 tokens) work well for code, while large chunks (1,500 tokens) work well for narrative documents; no single chunk size is optimal for every domain. Use `RecursiveCharacterTextSplitter` with separate settings per content type.
- Not monitoring the cache hit rate: if you do not track `usage.cache_read_input_tokens`, you cannot tell whether the cache is actually working. A hit rate below 50% usually means frequently changing parts sit before the cache target, so review the prompt structure.
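The chunk-size point above can be sketched as a per-content-type configuration. The `CHUNK_CONFIG` values mirror the figures in the bullet and are starting points to tune, not universal constants; they would feed something like `RecursiveCharacterTextSplitter(chunk_size=..., chunk_overlap=...)`:

```python
# Per-content-type chunking config (figures from the bullet above; tune per corpus).
CHUNK_CONFIG = {
    "code":      {"chunk_size": 512,  "chunk_overlap": 64},
    "narrative": {"chunk_size": 1500, "chunk_overlap": 150},
}

def chunk_config(content_type: str) -> dict:
    """Pick a chunking config by content type, defaulting to the narrative settings."""
    return CHUNK_CONFIG.get(content_type, CHUNK_CONFIG["narrative"])

print(chunk_config("code"))  # {'chunk_size': 512, 'chunk_overlap': 64}
```

Keeping the numbers in one config table also makes it easy to A/B-test chunk sizes per corpus instead of hard-coding them at each call site.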
In Conclusion
The token cost issue in agent pipelines is resolved through context management strategies, not model replacement. Summary agents are easy to implement but risk information loss; chunk injection offers high accuracy but entails infrastructure complexity; and prompt caching is the most cost-effective but only applies to fixed contexts. In production, a hierarchical combination of the three makes a 50–90% reduction realistically achievable.
3 Steps to Start Right Now:
- This week: apply prompt caching. Add a single line, `cache_control: {"type": "ephemeral"}`, to the system prompt of your existing agents and measure hit rates with `usage.cache_read_input_tokens`. Minimal code changes, immediate cost savings.
- Next week: add a summary agent. Apply summary compression with a small model (claude-haiku-4-5) when conversation history exceeds 8,000 tokens. Adjust the threshold after evaluating summary quality on a week of data.
- This month: introduce chunk injection. Apply `ContextualCompressionRetriever` to agents that include document search. A similarity threshold (0.75) and a layered pipeline deliver both accuracy and cost savings.
Next Post: Anthropic’s Official Context Engineering Framework — How to Apply the Write / Select / Compress / Isolate 4 Strategies to Real-World Multi-Agent Systems