DEV BAK - TECH BLOG

vLLM APC vs SGLang RadixAttention: KV Cache Architecture Differences and Workload-Based Selection Criteria

When running an LLM inference server, you'll eventually hit the question: "Why is prefill so slow?" In workloads where the beginning of the prompt repeats — like multi-turn chatbots or RAG pipelines — it quickly becomes apparent that recomputing the same context on every request is a noticeable waste.

At first I thought, "Aren't vLLM's Automatic Prefix Caching (APC) and SGLang's RadixAttention both just prefix caching?" and didn't give it much thought. But when I actually looked at the internals, the underlying philosophy differed at every level, from data structure choices to eviction policies. Those differences translate into meaningful performance gaps depending on workload characteristics. Particula's 2026 benchmark (Llama 3, H100) reported 16,200 tok/s for SGLang versus 12,500 tok/s for vLLM, roughly a 29% difference, and in high-concurrency scenarios the gap can approach twofold.

The intended audience for this post is backend and ML engineers who operate or are evaluating LLM inference servers. After reading this, you'll understand the caching implementation principles and internal data structure differences between the two frameworks, and be able to make a data-driven decision about which is more advantageous for your serving environment.


Selection criteria at a glance

  • Multi-turn conversation / RAG / agents: SGLang RadixAttention — the longer the shared prefix and the higher the concurrency, the greater the advantage
  • Multi-LoRA / non-NVIDIA hardware / multi-tenant security: vLLM APC — strengths in ecosystem maturity and hardware compatibility
  • Independent batches with less than 60% prefix overlap: Both approaches offer limited benefit — analyze your traffic before adopting either

Core Concepts

To briefly cover KV cache: the attention layer in a Transformer references the Key/Value matrices of previous tokens when processing each new token. KV cache stores and reuses the K/V matrices of already-computed tokens. Prefix caching is the extended concept of reusing these K/V matrices across different requests.
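To make the saving concrete, here is a minimal counting sketch. It assumes a simplified model in which each decode step costs one K/V computation per token it has to process (per layer), and it ignores everything else the model does. The function name kv_projections is my own, not part of any framework.

```python
def kv_projections(num_tokens: int, use_cache: bool) -> int:
    """Count per-layer K/V computations while decoding num_tokens one at a time."""
    total = 0
    for step in range(1, num_tokens + 1):
        # Without a cache, step t recomputes K/V for all t tokens seen so far;
        # with a cache, only the newest token's K/V is computed and appended.
        total += 1 if use_cache else step
    return total


print(kv_projections(100, use_cache=False))  # 5050
print(kv_projections(100, use_cache=True))   # 100
```

The gap grows quadratically with sequence length, which is why the cache is standard within a single request. Prefix caching applies the same idea across requests.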

LLM inference has two phases. Prefill processes the entire input prompt at once to generate the first token, while decode sequentially generates one token at a time, using the previous token as input. Prefix caching reduces the cost of the prefill phase.

vLLM APC: Managing Blocks with a Hash Table

vLLM's Automatic Prefix Caching (APC) sits naturally on top of PagedAttention, which manages the KV cache by dividing GPU memory into pages. It splits the KV cache into fixed-size blocks (16 tokens by default), assigns a hash key to each block, and manages everything through a single global hash table.

When a new request arrives, the runtime slices the token sequence into blocks and computes their hashes. If a hash is in the table — cache hit, reuse without prefill. If not — cache miss, compute and insert into the table.

Each block's hash is computed from the hash of the preceding block (which already encodes every block before it) together with the token IDs of the current block. This guarantees "position-dependent" keys: even identical 16-token sequences produce different keys if what precedes them differs. Starting with vLLM v0.11, the default hash function switched to SHA-256, which makes the accidental collisions possible in earlier versions practically negligible (related discussion: GitHub Issue #16016).
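The chaining and lookup described above can be sketched in a few lines. This is a simplified illustration, not vLLM's actual implementation: the string payload format, the use of a plain set as the "global hash table", and the helper names block_hashes and count_cached_blocks are all assumptions of mine.

```python
import hashlib

BLOCK_SIZE = 16  # block size used throughout this post


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one chained hash per full block of token_ids."""
    hashes: list[str] = []
    parent = ""  # the first block has no parent
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        # Each key covers the parent hash plus this block's tokens,
        # so it depends on everything that came before.
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def count_cached_blocks(token_ids: list[int], table: set[str]) -> int:
    """Count leading blocks already in the table, then register all blocks."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in table:
            break  # first miss ends the reusable prefix
        hits += 1
    table.update(hashes)
    return hits


table: set[str] = set()
shared = list(range(32))        # two full blocks of shared prefix
req_a = shared + [100] * 16     # shared prefix + one unique block
req_b = shared + [200] * 16
print(count_cached_blocks(req_a, table))  # 0: cold cache
print(count_cached_blocks(req_b, table))  # 2: both shared blocks hit
```

Because the parent hash is folded into every key, a lookup never has to compare token sequences directly: matching the hash of block N implies that blocks 0 through N-1 matched as well.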

python
# Enable vLLM APC (v0.11 or later recommended)
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
    block_size=16,  # Minimum cache unit: 16-token block
)
 
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
 
system_prompt = "You are a code review expert. " * 50  # Long system prompt
 
outputs = llm.generate([
    system_prompt + "Please review this Python code: def foo(): pass",
    system_prompt + "Please review this Python code: def bar(): return 1",
], sampling_params)
# From the second request onward, system_prompt blocks are reused from cache

There is one key constraint here. Only full blocks are cached, so with 16-token blocks, a prefix whose length is not a multiple of 16 leaves its final incomplete block uncached. If the prefix is exactly 100 tokens, only 6 blocks (96 tokens) are cached and the remaining 4 tokens are recomputed every time. The loss is at most 15 tokens per request, which seems trivial, but on a high-traffic server, getting into the habit of sizing your system prompt to a multiple of 16 tokens recovers that repeated work.
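The arithmetic behind the 100-token example is simple enough to check in code. This is only a sketch; the helper names cacheable_split and padding_to_align are hypothetical.

```python
def cacheable_split(prefix_len: int, block_size: int = 16) -> tuple[int, int, int]:
    """Return (full_blocks, cached_tokens, recomputed_tokens) for a prefix."""
    full_blocks = prefix_len // block_size
    cached = full_blocks * block_size
    return full_blocks, cached, prefix_len - cached


def padding_to_align(prefix_len: int, block_size: int = 16) -> int:
    """Extra tokens needed to reach the next block boundary."""
    return (-prefix_len) % block_size


print(cacheable_split(100))   # (6, 96, 4)
print(padding_to_align(100))  # 12
```

Count tokens with the model's own tokenizer after the chat template is applied, since the template adds tokens in front of the system prompt and shifts the block boundaries.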


SGLang RadixAttention: Unifying All Requests in a Single Tree

SGLang's approach is bolder. It keeps the KV pages of running requests, and of finished requests that have not yet been evicted, in a single Radix Tree (compressed prefix trie), so that any later request can reuse them.

Unlike a regular Trie, a Radix Tree compresses common prefixes into a single node. Where a regular Trie uses one node per character, a Radix Tree bundles the entire common portion into a single node. SGLang applies this principle directly to the KV cache, designing it so that "a group of common prefix tokens = one node + corresponding KV pages."

text
Request A: [System Prompt] + [User: "Please review this Python code"]
Request B: [System Prompt] + [User: "Please review this JavaScript code"]
Request C: [System Prompt] + [User: "Please review this Python code"] + [Assistant Response] + [User: "Explain further"]
 
Radix Tree:
root
└── [System Prompt KV Pages]  ← shared by A, B, C
    ├── [User: "Please review this Python code"]  ← shared by A, C
    │   └── [Assistant Response] → [User: "Explain further"]  ← C only
    └── [User: "Please review this JavaScript code"]  ← B only

When a new request arrives, the runtime starts at the tree root and searches for the longest prefix match. The KV pages of matched nodes are reused immediately, and prefill is performed only for the remaining tokens.
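Here is a minimal sketch of that longest-prefix match over token IDs. It is an illustration of the data structure only, under my own assumptions: the class and method names are hypothetical, and a real implementation such as SGLang's also tracks KV page indices, reference counts, and access times on each node, all of which are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tokens: list[int] = field(default_factory=list)  # edge label leading into this node
    children: dict[int, "Node"] = field(default_factory=dict)  # keyed by first token of each child


def _common_len(a: list[int], b: list[int]) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixTree:
    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already stored in the tree."""
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.tokens, tokens[matched:])
            matched += n
            if n < len(child.tokens):
                break  # diverged in the middle of an edge
            node = child
        return matched

    def insert(self, tokens: list[int]) -> None:
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens=tokens[i:])
                return
            n = _common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                # Split the edge: keep the shared part, push the rest down
                tail = Node(tokens=child.tokens[n:], children=child.children)
                child.tokens = child.tokens[:n]
                child.children = {tail.tokens[0]: tail}
            node, i = child, i + n


tree = RadixTree()
system = list(range(100))        # stand-in for system prompt token IDs
tree.insert(system + [1, 2, 3])  # request A
print(tree.match_prefix(system + [1, 2, 9]))  # 102: system prompt plus two shared user tokens
print(tree.match_prefix([500, 501]))          # 0: nothing shared
```

Note that the match stops at token granularity (102 tokens here), not at a block boundary. Inserting the second request would split the existing edge at that point, which is how the shared system prompt ends up as its own node.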

The most convenient aspect of SGLang is that RadixAttention is built in by default. It activates automatically just by launching the server, with no additional flags. Because it supports variable-length nodes, arbitrary-length prefix matching is possible without block-level alignment.

python
# SGLang - OpenAI-compatible API (currently recommended approach)
# Server: python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)
 
SYSTEM_PROMPT = "You are a code review expert. ..."
 
# RadixAttention operates automatically inside the server — no client-side configuration needed
for user_code in ["def foo(): pass", "def bar(): return 1"]:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Please review this Python code: {user_code}"}
        ],
        max_tokens=512
    )
    print(response.choices[0].message.content)
# From the second request onward, the SYSTEM_PROMPT portion is automatically reused from the radix tree

The Core Differences at a Glance

| Item | vLLM APC | SGLang RadixAttention |
| --- | --- | --- |
| Data structure | Global hash table | Radix tree |
| Minimum cache unit | Fixed-size block (16 tokens by default) | Variable-length node (arbitrary-length prefix matching) |
| Prefix detection method | Block hash comparison | Tree longest-prefix matching |
| Partial block caching | Not supported | Supported |
| Eviction policy | LRU over blocks with ref count 0; on ties, tail blocks of the longest prefix are evicted first | Recursive LRU eviction of leaf nodes |
| Automatic prefix discovery across requests | Block-aligned only (a block is reused only when its hash matches exactly) | Supported at any length (identified by tree traversal) |
| Enabled by default | On by default in the V1 engine; older releases require the --enable-prefix-caching flag | Built-in by default |
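The "recursive LRU leaf eviction" row is easier to picture with a small sketch. This is my own simplified model, not SGLang's code: the CacheNode class, its fields, and the evict function are hypothetical, and real eviction frees GPU pages rather than returning names.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class CacheNode:
    name: str
    num_tokens: int
    last_access: float
    ref_count: int = 0  # > 0 means a running request still uses this node
    parent: Optional["CacheNode"] = None
    children: list["CacheNode"] = field(default_factory=list)

    def add_child(self, child: "CacheNode") -> "CacheNode":
        child.parent = self
        self.children.append(child)
        return child


def leaves(node: CacheNode) -> list[CacheNode]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def evict(root: CacheNode, tokens_needed: int) -> list[str]:
    """Evict unreferenced leaves, least recently used first, until enough is freed."""
    heap = [(n.last_access, id(n), n) for n in leaves(root) if n is not root]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < tokens_needed:
        _, _, node = heapq.heappop(heap)
        if node.ref_count > 0:
            continue  # still in use, cannot be evicted
        parent = node.parent
        parent.children.remove(node)
        freed += node.num_tokens
        evicted.append(node.name)
        # A parent that just lost its last child becomes a leaf candidate itself
        if parent is not root and not parent.children:
            heapq.heappush(heap, (parent.last_access, id(parent), parent))
    return evicted


root = CacheNode("root", 0, 0.0)
system = root.add_child(CacheNode("system", 650, last_access=9.0))
python_q = system.add_child(CacheNode("python", 10, last_access=5.0))
system.add_child(CacheNode("javascript", 10, last_access=2.0))
python_q.add_child(CacheNode("follow-up", 10, last_access=3.0))
print(evict(root, tokens_needed=30))  # ['javascript', 'follow-up', 'python']
```

Evicting leaves first means the shared system prompt node is the last thing to go, since it only becomes a leaf after every branch below it has been removed.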

Practical Application

Example 1: Multi-Turn Chatbot — Where SGLang Shines

Imagine building a customer support chatbot. The system prompt contains hundreds of tokens of company policies, product information, and service guidelines. Without caching, as a user continues the conversation, the entire growing context must be prefilled from scratch every turn. By the 10th turn, the accumulated prefill cost is dramatically higher than turn 1.

python
# SGLang multi-turn chatbot — OpenAI-compatible API approach
from openai import OpenAI
 
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
 
SYSTEM_PROMPT = """
You are a customer support specialist for TechCorp.
[Company Policy ~200 tokens...]
[Product Information ~300 tokens...]
[Service Guidelines ~150 tokens...]
"""  # ~650 tokens total
 
def chat(history: list, user_input: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_input})
 
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=messages,
        max_tokens=256
    )
    return response.choices[0].message.content
 
# As conversation history accumulates, the range reused from the radix tree grows automatically
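history = []
user_messages = [  # sample inputs for illustration; replace with real user turns
    "My order hasn't arrived yet.",
    "Can I get a refund instead?",
]
for user_input in user_messages: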
    response_text = chat(history, user_input)
    history.extend([
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": response_text}
    ])

| Conversation Turn | KV Pages Reused | Tokens Subject to Prefill |
| --- | --- | --- |
| Turn 1 | 0% | Everything (system prompt + user message) |
| Turn 2 | System prompt + turn 1 history | New user message only |
| Turn 5 | System prompt + turns 1–4 history | New user message only |
| Turn N | Entire accumulated history | New user message only |

The more concurrent users sharing the same system prompt, the more the system prompt node in the radix tree is shared. In this scenario, a cache hit rate of 75–95% is a realistic target.
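To get a feel for the numbers, here is a back-of-the-envelope sketch for a single conversation. The 650-token system prompt comes from the example above. The 30-token user messages and 120-token replies are assumptions of mine, and the model is idealized: it assumes every previously seen token is still in cache and that the first turn starts cold.

```python
def prefill_tokens(system_len: int, turn_lens: list[tuple[int, int]], cached: bool) -> list[int]:
    """Tokens that must be prefilled at each turn.

    turn_lens holds (user_tokens, assistant_tokens) per turn.
    """
    per_turn, history = [], system_len
    for turn, (user_len, assistant_len) in enumerate(turn_lens):
        if cached and turn > 0:
            per_turn.append(user_len)            # only the new user message
        else:
            per_turn.append(history + user_len)  # whole context from scratch
        history += user_len + assistant_len
    return per_turn


turns = [(30, 120)] * 5  # hypothetical: 30-token questions, 120-token answers
no_cache = prefill_tokens(650, turns, cached=False)
with_cache = prefill_tokens(650, turns, cached=True)
print(sum(no_cache), sum(with_cache))  # 4900 800
```

Under these assumptions roughly 84% of the prefill work disappears over five turns, which is in line with the hit-rate range quoted above. Real traffic will land lower whenever eviction or template differences break the prefix.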


Example 2: RAG Pipeline — Reusing Document Context

In RAG, retrieved document chunks occupy the front of the prompt. It's common for the same documents to be retrieved repeatedly across different questions — a pattern that pairs well with prefix caching.

python
# RAG + SGLang: cache-friendly prompt construction
from openai import OpenAI
 
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
 
def rag_query(retrieved_docs: list[str], question: str) -> str:
    # Place retrieved documents at the front of the system message — maximizes cache reuse
    context = "\n\n".join([f"Document {i+1}: {doc}" for i, doc in enumerate(retrieved_docs)])
 
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": f"Please answer the question based on the following documents:\n\n{context}"},
            {"role": "user", "content": question}
        ],
        max_tokens=512
    )
    return response.choices[0].message.content
 
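# fetch_document() is a function you implement to suit your environment.
# The stub, doc_b, and questions below are placeholders for illustration only.
def fetch_document(doc_id: str) -> str:
    return f"(contents of {doc_id})"

popular_doc = fetch_document("python-docs-chapter3")
doc_b = fetch_document("python-docs-chapter4")  # hypothetical second document
questions = ["What is a list comprehension?", "How do generators work?"]

results = [
    rag_query(retrieved_docs=[popular_doc, doc_b], question=q)
    for q in questions  # The popular_doc portion is automatically reused from the radix tree
]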

Cache-friendly prompt design: Frequently repeated context (system prompts, documents, tool definitions) should always be placed at the front of the prompt. If variable content like usernames or timestamps comes first, the entire long system prompt that follows will be a cache miss.
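A quick way to see the effect of ordering is to compare how much two prompts share from the start. The sketch below counts characters as a rough stand-in for tokens; the prompt templates and helper names are made up for illustration.

```python
import os

SYSTEM = "You are a support assistant. Follow the policy below. " * 20


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


def variable_first(user: str) -> str:
    return f"User: {user}\n{SYSTEM}"  # request-specific content up front


def fixed_first(user: str) -> str:
    return f"{SYSTEM}\nUser: {user}"  # stable content up front


bad = shared_prefix_len(variable_first("alice"), variable_first("bob"))
good = shared_prefix_len(fixed_first("alice"), fixed_first("bob"))
print(f"variable first: {bad} shared chars, fixed first: {good} shared chars")
```

With the username up front, the shared prefix ends after the six characters of "User: ", so the long system prompt that follows can never be reused. With the system prompt first, the whole of it is shared.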


Example 3: Multi-LoRA Serving — Where vLLM Excels

If you're running a platform where multiple teams use their own fine-tuned models, the story changes. vLLM's independent block structure is a natural fit for managing KV caches separately per LoRA adapter.

python
# vLLM Multi-LoRA serving
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
 
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    enable_prefix_caching=True,
    max_lora_rank=64,
)
 
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
 
lora_requests = {
    "team_a": LoRARequest("team_a_lora", 1, "/models/lora/team_a"),
    "team_b": LoRARequest("team_b_lora", 2, "/models/lora/team_b"),
}
 
outputs = llm.generate(
    ["Team A's task", "Team B's task"],
    sampling_params,
    lora_request=[lora_requests["team_a"], lora_requests["team_b"]]
)

SGLang also supports Multi-LoRA, but vLLM's maturity and flexibility stand out in this area.


Example 4: Cache Isolation in Multi-Tenant Environments (vLLM v0.11+)

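When processing requests from multiple customers on a SaaS platform, cache isolation is a security requirement. If tenants share cached blocks, customer B can tell from a faster response (a cache hit) that customer A already sent the same prefix, so customer A's cached prompts must never be reusable by customer B's requests.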

python
# vLLM Cache Salting for per-tenant cache isolation
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)
 
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
 
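def serve_tenant_request(tenant_id: str, prompt: str):
    # The salt travels with the prompt as a `cache_salt` field and is mixed into
    # the first block hash, so every later block hash changes as well
    return llm.generate(
        {"prompt": prompt, "cache_salt": f"tenant:{tenant_id}"},
        sampling_params,
    )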
 
# Even identical prompts won't share cache if tenants differ
shared_prompt = "Common system prompt..."
response_a = serve_tenant_request("tenant_001", shared_prompt)
response_b = serve_tenant_request("tenant_002", shared_prompt)
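Why does salting only the first block isolate the entire prompt? Because block hashes are chained, a different starting value changes every hash after it. The sketch below illustrates this with a simplified hashing scheme of my own (string payloads, SHA-256 via hashlib); it is not vLLM's internal format.

```python
import hashlib


def salted_block_hashes(token_ids: list[int], salt: str = "", block_size: int = 16) -> list[str]:
    """Chained block hashes where the salt seeds the first block only."""
    hashes: list[str] = []
    parent = f"salt={salt}"  # the salt enters only through the first block's parent
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


prompt_tokens = list(range(48))  # three full blocks
tenant_1 = salted_block_hashes(prompt_tokens, salt="tenant:tenant_001")
tenant_2 = salted_block_hashes(prompt_tokens, salt="tenant:tenant_002")
print(any(a == b for a, b in zip(tenant_1, tenant_2)))  # False: no block is shared
print(tenant_1 == salted_block_hashes(prompt_tokens, salt="tenant:tenant_001"))  # True
```

The trade-off is that tenants no longer share a common system prompt either, so each tenant pays its own cold-start prefill.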

Pros and Cons Analysis

Honestly, a table of pros and cons alone tends to strip out the context. It helps to read the explanations alongside each framework's limitations so you understand when they'll actually trip you up in practice.

vLLM APC

vLLM's strength is versatility. Hardware compatibility, ecosystem maturity, and Multi-LoRA support all favor vLLM.

| Item | Details |
| --- | --- |
| Broad hardware support | Runs on NVIDIA GPUs as well as TPU, Trainium, Gaudi, and other accelerators |
| Mature ecosystem | ~3x more contributors than SGLang, full OpenAI API compatibility |
| Multi-LoRA serving | Independent block structure makes per-adapter cache management straightforward |
| Security isolation | Per-tenant cache isolation via cache salting (v0.11+) |
| Low overhead | Cache hit check via hash lookup only, no tree maintenance |

Three constraints most commonly cause friction in production.

| Item | Details | Mitigation |
| --- | --- | --- |
| Block-level caching constraint | Incomplete final block is dropped if prefix length isn't a multiple of 16 | Design system prompt token count to be a multiple of 16 |
| Concurrency ceiling | Throughput degrades faster than SGLang under high concurrency | Consider SGLang for multi-turn, high-concurrency workloads |
| Hash collision risk in old versions | Pre-v0.11 uses Python's built-in hash(), which can cause incorrect cache reuse on collision | Always use v0.11 or later |

SGLang RadixAttention

SGLang's strengths are precision of cache reuse and stability under high concurrency.

| Item | Details |
| --- | --- |
| Variable-length prefix caching | Maximum prefix reuse for arbitrary lengths without block alignment |
| Automatic prefix discovery | Automatically identifies shareable prefixes from traffic patterns without explicit configuration |
| High-concurrency stability | Particula benchmark (Llama 3, H100): throughput remains stable under high concurrency |
| Structured output optimization | Speed advantage for JSON/schema-constrained decoding via compressed FSM |
| Higher throughput | ~29% higher throughput than vLLM under equivalent conditions (16,200 vs 12,500 tok/s) |

SGLang's limitations are primarily around ecosystem and overhead.

| Item | Details | Mitigation |
| --- | --- | --- |
| NVIDIA-centric ecosystem | Non-NVIDIA hardware support is more limited than vLLM | vLLM is the practical choice for non-NVIDIA environments |
| Tree maintenance overhead | Radix tree insertion, search, and eviction incur additional CPU and memory cost | vLLM is more advantageous for independent batch processing without shared prefixes |
| Disadvantage for independent batches | If there are no common prefixes across traffic, the tree provides no benefit | Measure prefix overlap before deciding to adopt |

The Most Common Mistakes in Practice

  1. Adopting prefix caching without measuring the cache hit rate: Applying prefix caching to a workload with low prefix overlap only adds overhead. Check the actual prefix distribution in your traffic before adopting.

  2. Placing variable content at the front of the prompt: When request-specific information like usernames, timestamps, or session IDs appears at the front of the prompt, the entire long system prompt that follows becomes a cache miss. The key is fixed content first, variable content last.

  3. Continuing to use vLLM versions prior to v0.11: Older APC uses Python's built-in hash() function, which carries a collision risk. If two different prompts share the same hash key, the wrong KV cache gets reused, which can lead to subtle output errors.


Closing Thoughts

The prefix caching in these two frameworks starts from "same goal, different data structures" — SGLang pays the tree maintenance overhead in exchange for token-level reuse and automatic prefix discovery via a radix tree, while vLLM accepts block-level constraints in exchange for simplicity and hardware compatibility via a hash table.

When your workload characteristics are clear, the choice is clear too. For workloads with long shared prefixes and high concurrency — multi-turn conversation, RAG, agent frameworks — SGLang offers tangible advantages. For multi-LoRA, non-NVIDIA hardware, or multi-tenant environments requiring security isolation, vLLM is the more natural choice.

Three steps you can take right now:

  1. Start by measuring the prefix overlap in your own workload. You can get a quick picture by sampling 100 recent requests as shown below. If the result is 0.6 or higher, both approaches should deliver meaningful gains.
python
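from collections import Counter

def measure_prefix_overlap(prompts: list[str], prefix_len: int = 200) -> float:
    """Share of prompts that start with the single most common prefix.

    Note: prefix_len counts characters here, not tokens.
    """
    if not prompts:
        return 0.0
    prefixes = [p[:prefix_len] for p in prompts]
    most_common_count = Counter(prefixes).most_common(1)[0][1]
    return most_common_count / len(prompts)

# recent_prompts: ~100 prompts sampled from your request logs (sample data shown here)
recent_prompts = ["You are a helpful assistant. " * 10 + q for q in ("Hi", "Help me", "Thanks")]
overlap_rate = measure_prefix_overlap(recent_prompts)
print(f"Prefix overlap rate: {overlap_rate:.1%}")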

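If you're already using vLLM, you can check your current hit rate right away via the prefix cache metrics on the /metrics endpoint (Prometheus format). Exact metric names vary by vLLM version, so search the output for prefix_cache; recent releases expose hit and query counters (for example vllm:prefix_cache_hits and vllm:prefix_cache_queries) whose ratio is the hit rate.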

  2. If you're using vLLM, upgrade to v0.11 or later and make sure prefix caching is on. Recent releases enable it by default; on a release where it is off, a single flag turns it on: vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching. Padding your system prompt token count to a multiple of 16 should push your cache hit rate even higher.

  3. For multi-turn or RAG workloads, run an A/B test with SGLang. Launch the server with python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct and RadixAttention will be active by default. Measuring prefill latency and throughput differences under the same traffic will give you the data to decide which fits your environment better.
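For the A/B test, time to first token (TTFT) is a convenient proxy for prefill latency, because the first streamed chunk cannot arrive until prefill finishes. Below is a minimal measurement sketch. The helper names are mine, and fake_backend is a stand-in that simulates a server with time.sleep so the example runs without any GPU.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk(start_stream: Callable[[], Iterable[object]]) -> float:
    """Seconds from sending a request until the first streamed chunk arrives."""
    start = time.perf_counter()
    for _ in start_stream():
        break  # stop after the first chunk
    return time.perf_counter() - start


def median_ttft(start_stream: Callable[[], Iterable[object]], runs: int = 5) -> float:
    return statistics.median(time_to_first_chunk(start_stream) for _ in range(runs))


def fake_backend(prefill_seconds: float) -> Callable[[], Iterable[str]]:
    """Stand-in for a streaming server: sleeps for the 'prefill', then yields tokens."""
    def start_stream() -> Iterable[str]:
        time.sleep(prefill_seconds)
        yield from ("Hello", " world")
    return start_stream


results = {
    "backend_a": median_ttft(fake_backend(0.05)),
    "backend_b": median_ttft(fake_backend(0.01)),
}
for name, seconds in results.items():
    print(f"{name}: median TTFT {seconds * 1000:.1f} ms")
```

To measure real servers, replace fake_backend with a function that calls client.chat.completions.create(..., stream=True) against each server's base URL, and replay the same recorded prompts against both. Run a warm-up pass first so you compare warm-cache behavior rather than cold starts.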


References

  • Automatic Prefix Caching | vLLM Official Documentation
  • vLLM APC Implementation Details (v0.6.1)
  • RFC: Cache Salting for Secure Prefix Caching in vLLM | GitHub Issue #16016
  • Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Blog
  • Efficient Execution of Structured Language Model Programs | arXiv:2312.07104
  • RadixAttention Concept Documentation | SGLang Official
  • Prefix Caching — SGLang vs vLLM: Token-Level Radix Tree vs Block-Level Hashing | Medium
  • When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse | Runpod
  • SGLang vs vLLM in 2026: Benchmarks, Architecture, and When to Use Each | Particula
  • KV Cache Management and Prefix Caching | vLLM DeepWiki
  • KV-Cache Wins: From Prefix Caching in vLLM to Distributed Scheduling with llm-d
