vLLM APC vs SGLang RadixAttention: KV Cache Architecture Differences and Workload-Based Selection Criteria
When running an LLM inference server, you'll eventually hit the question: "Why is prefill so slow?" In workloads where the beginning of the prompt repeats — like multi-turn chatbots or RAG pipelines — it quickly becomes apparent that recomputing the same context on every request is a noticeable waste.
Both vLLM and SGLang attack this with prefix caching: vLLM through Automatic Prefix Caching (APC) and SGLang through RadixAttention. At first I thought, "Aren't they both just prefix caching?" and didn't give it much thought. But when I actually looked at the internals, the fundamental philosophy differed, from data structure choices to eviction policies. And those differences translate into meaningful performance gaps depending on workload characteristics. Particula's 2026 benchmark (Llama 3, H100) showed SGLang roughly 29% ahead in throughput (16,200 vs 12,500 tok/s), and in high-concurrency scenarios the gap can approach twofold.
The intended audience for this post is backend and ML engineers who operate or are evaluating LLM inference servers. After reading this, you'll understand the caching implementation principles and internal data structure differences between the two frameworks, and be able to make a data-driven decision about which is more advantageous for your serving environment.
Selection criteria at a glance
- Multi-turn conversation / RAG / agents: SGLang RadixAttention — the longer the shared prefix and the higher the concurrency, the greater the advantage
- Multi-LoRA / non-NVIDIA hardware / multi-tenant security: vLLM APC — strengths in ecosystem maturity and hardware compatibility
- Independent batches with less than 60% prefix overlap: Both approaches offer limited benefit — analyze your traffic before adopting either
Core Concepts
To briefly cover KV cache: the attention layer in a Transformer references the Key/Value (K/V) vectors of all previous tokens when processing each new token. The KV cache stores the K/V vectors of already-computed tokens so they are not recomputed at every step. Prefix caching extends this idea by reusing those cached K/V entries across different requests that begin with the same tokens.
LLM inference has two phases. Prefill processes the entire input prompt at once to generate the first token, while decode sequentially generates one token at a time, using the previous token as input. Prefix caching reduces the cost of the prefill phase.
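To make the KV cache idea concrete, here is a toy single-head attention in pure Python. It is only a sketch: `project` is a hypothetical stand-in for the learned K/V projections, and the "embeddings" are made-up two-dimensional vectors. The point is that appending each token's K/V to a cache once gives the same result as recomputing K/V for the whole sequence.

```python
import math

def project(x, scale):
    # Toy stand-in for a learned K or V projection (hypothetical, not real model weights)
    return [scale * xi for xi in x]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query over all keys/values
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def last_output_without_cache(embeddings):
    # Recompute K/V for every token in the sequence
    keys = [project(x, 0.5) for x in embeddings]
    values = [project(x, 2.0) for x in embeddings]
    return attend(embeddings[-1], keys, values)

def last_output_with_cache(embeddings):
    # Compute K/V once per token, append to the cache, and reuse it afterwards
    cached_keys, cached_values = [], []
    output = None
    for x in embeddings:
        cached_keys.append(project(x, 0.5))
        cached_values.append(project(x, 2.0))
        output = attend(x, cached_keys, cached_values)
    return output

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(last_output_with_cache(embeddings))
print(last_output_without_cache(embeddings))
```

Prefix caching applies the same reuse across requests: if two prompts start with the same tokens, the cached K/V entries for that shared part can serve both.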
vLLM APC: Managing Blocks with a Hash Table
vLLM's Automatic Prefix Caching (APC) sits naturally on top of PagedAttention, which manages the KV cache by dividing GPU memory into pages. It splits the KV cache into fixed-size blocks (16 tokens by default), assigns a hash key to each block, and manages everything through a single global hash table.
When a new request arrives, the runtime slices the token sequence into blocks and computes their hashes. If a hash is in the table — cache hit, reuse without prefill. If not — cache miss, compute and insert into the table.
Each block's hash is computed from the hash of the preceding block together with the token IDs of the current block, so every key transitively encodes the entire prefix before it. This guarantees "position-dependent" keys: even identical 16-token sequences produce different keys if what precedes them differs. Starting with vLLM v0.11, the default hash function switched to SHA-256, which makes the collision risk present in earlier versions negligible in practice (related discussion: GitHub Issue #16016).
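The lookup flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not vLLM's actual implementation: `block_hashes`, `cached_prefix_len`, and `store_blocks` are hypothetical names, the table is a plain dict, and the byte layout fed to SHA-256 is invented for the example.

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    # Chain hashing: each key covers the parent hash plus the current block's tokens
    hashes, parent = [], b""
    usable = len(token_ids) - len(token_ids) % block_size  # drop the incomplete tail
    for start in range(0, usable, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def cached_prefix_len(token_ids, table, block_size=BLOCK_SIZE):
    # Count leading blocks already in the table; stop at the first miss
    hit_blocks = 0
    for key in block_hashes(token_ids, block_size):
        if key not in table:
            break
        hit_blocks += 1
    return hit_blocks * block_size

def store_blocks(token_ids, table, block_size=BLOCK_SIZE):
    # Stand-in for "compute KV and insert into the table"
    for index, key in enumerate(block_hashes(token_ids, block_size)):
        table.setdefault(key, f"kv-block-{index}")

table = {}
shared = list(range(40))                 # 40 shared prefix tokens
request_a = shared + [100, 101]
request_b = shared + list(range(200, 210))
print(cached_prefix_len(request_a, table))  # cold cache
store_blocks(request_a, table)
print(cached_prefix_len(request_b, table))  # only whole shared blocks are hits
```

Because each key includes its parent's hash, the same 16 tokens appearing after a different prefix produce a different key, which is the position dependence described above.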
# Enable vLLM APC (v0.11 or later recommended)
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
    block_size=16,  # Minimum cache unit: 16-token block
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
system_prompt = "You are a code review expert. " * 50  # Long system prompt
outputs = llm.generate([
    system_prompt + "Please review this Python code: def foo(): pass",
    system_prompt + "Please review this Python code: def bar(): return 1",
], sampling_params)
# From the second request onward, system_prompt blocks are reused from cache
There is one key constraint here. Because caching works on whole blocks (16 tokens in this configuration), the final incomplete block cannot be cached when the prefix length is not a multiple of 16. If the prefix is exactly 100 tokens, only 6 blocks (96 tokens) are cached and the remaining 4 tokens are recomputed every time. It seems trivial, but getting into the habit of designing your system prompt token count to be a multiple of 16 can make a measurable difference in cache hit rates.
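The block-alignment arithmetic is easy to check before touching a prompt. The helper names below are hypothetical; the sketch simply reproduces the 100-token example and reports how many tokens of padding would reach the next block boundary.

```python
def cacheable_split(prefix_len, block_size=16):
    # Returns (tokens covered by full blocks, leftover tokens recomputed every time)
    cached = (prefix_len // block_size) * block_size
    return cached, prefix_len - cached

def padding_to_next_block(prefix_len, block_size=16):
    # Tokens to add so the prefix ends exactly on a block boundary
    return -prefix_len % block_size

print(cacheable_split(100))         # (96, 4)
print(padding_to_next_block(100))   # 12
print(padding_to_next_block(96))    # 0
```

Count tokens on the fully rendered prompt (after the chat template is applied), since that is the sequence the server actually hashes.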
SGLang RadixAttention: Unifying All Requests in a Single Tree
SGLang's approach is bolder. It manages the KV pages of every request the server handles, in-flight and recently finished alike, in a single radix tree (a compressed prefix trie).
Unlike a regular Trie, a Radix Tree compresses common prefixes into a single node. Where a regular Trie uses one node per character, a Radix Tree bundles the entire common portion into a single node. SGLang applies this principle directly to the KV cache, designing it so that "a group of common prefix tokens = one node + corresponding KV pages."
Request A: [System Prompt] + [User: "Please review this Python code"]
Request B: [System Prompt] + [User: "Please review this JavaScript code"]
Request C: [System Prompt] + [User: "Please review this Python code"] + [Assistant Response] + [User: "Explain further"]
Radix Tree:
root
└── [System Prompt KV Pages]  ← shared by A, B, C
    ├── [User: "Please review this Python code"]  ← shared by A, C
    │   └── [Assistant Response] → [User: "Explain further"]  ← C only
    └── [User: "Please review this JavaScript code"]  ← B only
When a new request arrives, the runtime starts at the tree root and searches for the longest prefix match. The KV pages of matched nodes are reused immediately, and prefill is performed only for the remaining tokens.
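A minimal radix tree over token IDs shows how longest-prefix matching and node splitting work. This is a sketch under simplifying assumptions, not SGLang's implementation: `Node` and `RadixTree` are hypothetical classes, nodes hold only token runs (no KV pages, reference counts, or timestamps), and the token IDs are made up.

```python
def common_len(a, b):
    # Length of the shared leading run of two token sequences
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Node:
    def __init__(self, tokens=()):
        self.tokens = tuple(tokens)  # run of token IDs stored on this node
        self.children = {}           # first token of a child's run -> child

class RadixTree:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        # Number of leading tokens already present in the tree
        tokens, node, matched = tuple(tokens), self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = common_len(child.tokens, tokens[matched:])
            matched += n
            if n < len(child.tokens):  # diverged in the middle of a node
                break
            node = child
        return matched

    def insert(self, tokens):
        tokens, node, i = tuple(tokens), self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            n = common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                # Split: keep the shared run on `child`, move the rest to a new tail node
                tail = Node(child.tokens[n:])
                tail.children = child.children
                child.tokens = child.tokens[:n]
                child.children = {tail.tokens[0]: tail}
            i += n
            node = child

system = (1, 2, 3, 4)                    # stand-in for system prompt tokens
request_a = system + (10, 11)
request_b = system + (20, 21)
tree = RadixTree()
tree.insert(request_a)
print(tree.match_prefix(request_b))      # only the system prompt is shared
tree.insert(request_b)                   # splits the node at the divergence point
print(tree.match_prefix(request_a + (30, 31)))
```

The split on insert is what lets a match end at an arbitrary token rather than at a block boundary.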
The most convenient aspect of SGLang is that RadixAttention is built in by default. It activates automatically just by launching the server, with no additional flags. Because it supports variable-length nodes, arbitrary-length prefix matching is possible without block-level alignment.
# SGLang - OpenAI-compatible API (currently recommended approach)
# Server: python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
SYSTEM_PROMPT = "You are a code review expert. ..."
# RadixAttention operates automatically inside the server; no client-side configuration needed
for user_code in ["def foo(): pass", "def bar(): return 1"]:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Please review this Python code: {user_code}"},
        ],
        max_tokens=512,
    )
    print(response.choices[0].message.content)
# From the second request onward, the SYSTEM_PROMPT portion is automatically reused from the radix tree
The Core Differences at a Glance
| Item | vLLM APC | SGLang RadixAttention |
|---|---|---|
| Data structure | Global hash table | Radix tree |
| Minimum cache unit | Fixed 16-token block | Variable-length node (arbitrary-length prefix matching) |
| Prefix detection method | Block hash comparison | Tree longest-prefix matching |
| Partial block caching | Not supported | Supported |
| Eviction policy | LRU (ref count=0) + longer prefix preservation | Recursive LRU leaf node eviction |
| Automatic prefix discovery across requests | Block granularity only (a hit requires an identical block hash chain) | Token granularity (automatic identification via tree traversal) |
| Enabled by default | Requires --enable-prefix-caching flag | Built-in by default |
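The "recursive LRU leaf node eviction" entry in the table can be sketched as follows. This is a simplified illustration rather than SGLang's code: `CacheNode` and `evict` are hypothetical names, timestamps are plain integers, and a node with a nonzero `ref_count` is assumed to be in use by a running request and therefore not evictable.

```python
import heapq

class CacheNode:
    def __init__(self, name, num_tokens, last_access, parent=None):
        self.name = name
        self.num_tokens = num_tokens
        self.last_access = last_access
        self.ref_count = 0          # > 0 means a running request still uses this node
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def evict(root, tokens_needed):
    # Pop unreferenced leaves in least-recently-used order until enough is freed
    heap = [(n.last_access, id(n), n) for n in descendants(root)
            if not n.children and n.ref_count == 0]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < tokens_needed:
        _, _, node = heapq.heappop(heap)
        parent = node.parent
        parent.children.remove(node)
        freed += node.num_tokens
        evicted.append(node.name)
        # "Recursive" part: a parent that just became a leaf is now a candidate too
        if parent is not root and not parent.children and parent.ref_count == 0:
            heapq.heappush(heap, (parent.last_access, id(parent), parent))
    return evicted

root = CacheNode("root", 0, 0)
system = CacheNode("system", 100, last_access=5, parent=root)
CacheNode("chat_a", 20, last_access=1, parent=system)
CacheNode("chat_b", 30, last_access=2, parent=system)
CacheNode("other", 50, last_access=3, parent=root)
print(evict(root, 60))
```

Evicting leaves first means a shared prefix such as the system prompt outlives the individual conversations branching from it.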
Practical Application
Example 1: Multi-Turn Chatbot — Where SGLang Shines
Imagine building a customer support chatbot. The system prompt contains hundreds of tokens of company policies, product information, and service guidelines. Without caching, as a user continues the conversation, the entire growing context must be prefilled from scratch every turn. By the 10th turn, the accumulated prefill cost is dramatically higher than turn 1.
# SGLang multi-turn chatbot using the OpenAI-compatible API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
SYSTEM_PROMPT = """
You are a customer support specialist for TechCorp.
[Company Policy ~200 tokens...]
[Product Information ~300 tokens...]
[Service Guidelines ~150 tokens...]
"""  # ~650 tokens total
def chat(history: list, user_input: str) -> str:
    # Resend the full conversation each turn; the server reuses the cached prefix
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=messages,
        max_tokens=256,
    )
    return response.choices[0].message.content
# As conversation history accumulates, the range reused from the radix tree grows automatically
user_messages = ["My order hasn't arrived.", "Can I get a refund?"]  # Example inputs; replace with real user turns
history = []
for user_input in user_messages:
    response_text = chat(history, user_input)
    history.extend([
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": response_text},
    ])
| Conversation Turn | KV Pages Reused | Tokens Subject to Prefill |
|---|---|---|
| Turn 1 | 0% | Everything (system prompt + user message) |
| Turn 2 | System prompt + turn 1 history | New user message only |
| Turn 5 | System prompt + turns 1–4 history | New user message only |
| Turn N | Entire accumulated history | New user message only |
The more concurrent users sharing the same system prompt, the more the system prompt node in the radix tree is shared. In this scenario, a cache hit rate of 75–95% is a realistic target.
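A back-of-the-envelope model makes the per-turn savings in the table tangible. It is an idealized sketch: `prefill_tokens_per_turn` is a hypothetical helper, the token counts (650-token system prompt, 30-token user messages, 100-token replies) are illustrative, and it assumes nothing is evicted between turns and that the re-sent history tokenizes identically.

```python
def prefill_tokens_per_turn(system_tokens, turns, prefix_cache):
    """Tokens that must be prefilled on each turn of one conversation.

    turns: list of (user_tokens, assistant_tokens) pairs, one per turn.
    """
    history = system_tokens  # tokens already in the conversation before this turn
    costs = []
    for turn_index, (user_tokens, assistant_tokens) in enumerate(turns):
        prompt_len = history + user_tokens
        # With an ideal cache, everything before the new user message is reused
        # from turn 2 onward; turn 1 is a cold start.
        reused = history if (prefix_cache and turn_index > 0) else 0
        costs.append(prompt_len - reused)
        history = prompt_len + assistant_tokens
    return costs

turns = [(30, 100)] * 5
cold = prefill_tokens_per_turn(650, turns, prefix_cache=False)
warm = prefill_tokens_per_turn(650, turns, prefix_cache=True)
print("without cache:", cold)
print("with cache:   ", warm)
print(f"prefill tokens saved: {1 - sum(warm) / sum(cold):.0%}")
```

In this toy setup a single five-turn conversation already avoids most of its prefill work; with many users sharing the system prompt, even turn 1 would reuse the system prompt portion.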
Example 2: RAG Pipeline — Reusing Document Context
In RAG, retrieved document chunks occupy the front of the prompt. It's common for the same documents to be retrieved repeatedly across different questions — a pattern that pairs well with prefix caching.
# RAG + SGLang: cache-friendly prompt construction
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
def rag_query(retrieved_docs: list[str], question: str) -> str:
    # Place retrieved documents at the front of the system message to maximize cache reuse
    context = "\n\n".join(f"Document {i+1}: {doc}" for i, doc in enumerate(retrieved_docs))
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": f"Please answer the question based on the following documents:\n\n{context}"},
            {"role": "user", "content": question},
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content
# fetch_document(), doc_b, and questions are placeholders you supply to suit your environment
popular_doc = fetch_document("python-docs-chapter3")
results = [
    rag_query(retrieved_docs=[popular_doc, doc_b], question=q)
    for q in questions  # The popular_doc portion is automatically reused from the radix tree
]
Cache-friendly prompt design: Frequently repeated context (system prompts, documents, tool definitions) should always be placed at the front of the prompt. If variable content like usernames or timestamps comes first, the entire long system prompt that follows will be a cache miss.
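The effect of prompt ordering can be checked with a tiny comparison. In this sketch the "tokens" are just made-up strings and `build_prompt` is a hypothetical helper; the shared prefix length stands in for what a prefix cache could reuse between two requests.

```python
def shared_prefix_len(a, b):
    # Number of leading tokens two prompts have in common
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_prompt(fixed, variable, fixed_first):
    return fixed + variable if fixed_first else variable + fixed

system_prompt = [f"policy_{i}" for i in range(100)]   # 100 stand-in tokens
alice = ["user=alice", "ts=1700000001"]
bob = ["user=bob", "ts=1700000002"]

good = shared_prefix_len(build_prompt(system_prompt, alice, True),
                         build_prompt(system_prompt, bob, True))
bad = shared_prefix_len(build_prompt(system_prompt, alice, False),
                        build_prompt(system_prompt, bob, False))
print(f"fixed content first: {good} reusable tokens")
print(f"variable content first: {bad} reusable tokens")
```

A single differing token at position 0 is enough to make everything after it unreusable, regardless of which framework serves the request.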
Example 3: Multi-LoRA Serving — Where vLLM Excels
If you're running a platform where multiple teams use their own fine-tuned models, the story changes. vLLM's independent block structure is a natural fit for managing KV caches separately per LoRA adapter.
# vLLM Multi-LoRA serving
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    enable_prefix_caching=True,
    max_lora_rank=64,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
# LoRARequest(name, unique integer ID, path to adapter weights)
lora_requests = {
    "team_a": LoRARequest("team_a_lora", 1, "/models/lora/team_a"),
    "team_b": LoRARequest("team_b_lora", 2, "/models/lora/team_b"),
}
outputs = llm.generate(
    ["Team A's task", "Team B's task"],
    sampling_params,
    lora_request=[lora_requests["team_a"], lora_requests["team_b"]],  # One adapter per prompt
)
SGLang also supports Multi-LoRA, but vLLM's maturity and flexibility stand out in this area.
Example 4: Cache Isolation in Multi-Tenant Environments (vLLM v0.11+)
When processing requests from multiple customers on a SaaS platform, cache isolation is a security requirement. Customer A's prompts must not influence customer B's cache.
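Conceptually, salting works by mixing a per-tenant value into the start of the block hash chain, so every block key downstream changes as well. The sketch below illustrates that idea only; `salted_block_hashes` is a hypothetical function, and the hashing layout is an assumption rather than vLLM's actual scheme.

```python
import hashlib

def salted_block_hashes(token_ids, salt="", block_size=16):
    # The salt seeds the chain, so it influences the first block hash and,
    # through chaining, every block hash after it.
    parent = hashlib.sha256(salt.encode()).digest()
    hashes = []
    usable = len(token_ids) - len(token_ids) % block_size
    for start in range(0, usable, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

prompt_tokens = list(range(48))  # identical prompt for both tenants
tenant_1 = salted_block_hashes(prompt_tokens, salt="tenant:tenant_001")
tenant_2 = salted_block_hashes(prompt_tokens, salt="tenant:tenant_002")
print(set(tenant_1).isdisjoint(tenant_2))  # no block key is shared across tenants
print(tenant_1 == salted_block_hashes(prompt_tokens, salt="tenant:tenant_001"))
```

Requests within one tenant still hit each other's cache because they share the salt; only cross-tenant reuse is cut off.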
# vLLM Cache Salting for per-tenant cache isolation
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
def serve_tenant_request(tenant_id: str, prompt: str):
    # Assumption: the salt is passed per request as a cache_salt field in the prompt dict
    # (the OpenAI-compatible server takes a cache_salt request field).
    # Verify the field name against the vLLM version you run.
    return llm.generate(
        {"prompt": prompt, "cache_salt": f"tenant:{tenant_id}"},  # Salt injection changes the first block hash
        sampling_params,
    )
# Even identical prompts won't share cache if tenants differ
shared_prompt = "Common system prompt..."
response_a = serve_tenant_request("tenant_001", shared_prompt)
response_b = serve_tenant_request("tenant_002", shared_prompt)
Pros and Cons Analysis
Honestly, a table of pros and cons alone tends to strip out the context. It helps to read the explanations alongside each framework's limitations so you understand when they'll actually trip you up in practice.
vLLM APC
vLLM's strength is versatility. Hardware compatibility, ecosystem maturity, and Multi-LoRA support all favor vLLM.
| Item | Details |
|---|---|
| Broad hardware support | Wide range of accelerators: NVIDIA, TPU, Trainium, Gaudi, and more |
| Mature ecosystem | ~3x more contributors than SGLang, full OpenAI API compatibility |
| Multi-LoRA serving | Independent block structure makes per-adapter cache management straightforward |
| Security isolation | Per-tenant cache isolation via cache salting (v0.11+) |
| Low overhead | Cache hit check via hash lookup only, no tree maintenance |
Three constraints most commonly cause friction in production.
| Item | Details | Mitigation |
|---|---|---|
| Block-level caching constraint | Incomplete final block is dropped if prefix length isn't a multiple of 16 | Design system prompt token count to be a multiple of 16 |
| Concurrency ceiling | Throughput degrades faster than SGLang under high concurrency | Consider SGLang for multi-turn, high-concurrency workloads |
| Hash collision risk in old versions | Pre-v0.11 uses Python's built-in hash(), which can cause incorrect cache reuse on collision | Always use v0.11 or later |
SGLang RadixAttention
SGLang's strengths are precision of cache reuse and stability under high concurrency.
| Item | Details |
|---|---|
| Variable-length prefix caching | Maximum prefix reuse for arbitrary lengths without block alignment |
| Automatic prefix discovery | Automatically identifies shareable prefixes from traffic patterns without explicit configuration |
| High-concurrency stability | Particula benchmark (Llama 3, H100) — throughput remains stable under high concurrency |
| Structured output optimization | Speed advantage for JSON/schema-constrained decoding via compressed FSM |
| Higher throughput | ~29% higher throughput than vLLM under equivalent conditions (16,200 vs 12,500 tok/s) |
SGLang's limitations are primarily around ecosystem and overhead.
| Item | Details | Mitigation |
|---|---|---|
| NVIDIA-centric ecosystem | Non-NVIDIA hardware support is more limited than vLLM | vLLM is the practical choice for non-NVIDIA environments |
| Tree maintenance overhead | Radix tree insertion, search, and eviction incur additional CPU and memory cost | vLLM is more advantageous for independent batch processing without shared prefixes |
| Disadvantage for independent batches | If there are no common prefixes across traffic, the tree provides no benefit | Measure prefix overlap before deciding to adopt |
The Most Common Mistakes in Practice
- Adopting prefix caching without measuring the cache hit rate: Applying prefix caching to a workload with low prefix overlap only adds overhead. Check the actual prefix distribution in your traffic before adopting.
- Placing variable content at the front of the prompt: When request-specific information like usernames, timestamps, or session IDs appears at the front of the prompt, the entire long system prompt that follows becomes a cache miss. The key is fixed content first, variable content last.
- Continuing to use vLLM versions prior to v0.11: Older APC uses Python's built-in hash() function, which carries a collision risk. If two different prompts share the same hash key, the wrong KV cache gets reused, which can lead to subtle output errors.
Closing Thoughts
The prefix caching in these two frameworks starts from "same goal, different data structures" — SGLang pays the tree maintenance overhead in exchange for token-level reuse and automatic prefix discovery via a radix tree, while vLLM accepts block-level constraints in exchange for simplicity and hardware compatibility via a hash table.
When your workload characteristics are clear, the choice is clear too. For workloads with long shared prefixes and high concurrency — multi-turn conversation, RAG, agent frameworks — SGLang offers tangible advantages. For multi-LoRA, non-NVIDIA hardware, or multi-tenant environments requiring security isolation, vLLM is the more natural choice.
Three steps you can take right now:
- Start by measuring the prefix overlap in your own workload. You can get a quick picture by sampling 100 recent requests as shown below. If the result is 0.6 or higher, both approaches should deliver meaningful gains.
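from collections import Counter
def measure_prefix_overlap(prompts: list[str], prefix_len: int = 200) -> float:
    # Share of prompts whose first prefix_len characters match the most common prefix
    prefixes = [p[:prefix_len] for p in prompts]
    most_common_count = Counter(prefixes).most_common(1)[0][1]
    return most_common_count / len(prompts)
# recent_prompts: a list of prompt strings sampled from your own request logs
overlap_rate = measure_prefix_overlap(recent_prompts)
print(f"Prefix overlap rate: {overlap_rate:.1%}")
If you're already using vLLM, you can check your current hit rate immediately via the vllm:prefix_cache_hit_rate metric on the /metrics endpoint (Prometheus format). Metric names have changed between vLLM releases, so confirm the exact name in your version's /metrics output.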
- If you're using vLLM, try adding the --enable-prefix-caching flag and upgrading to v0.11 or later. It takes a single command: vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching. Padding your system prompt token count to a multiple of 16 should push your cache hit rate even higher.
- For multi-turn or RAG workloads, run an A/B test with SGLang. Launch the server with python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct and RadixAttention will be active by default. Measuring prefill latency and throughput differences under the same traffic will give you the data to decide which fits your environment better.
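The character-based overlap check in the first step only looks at one fixed-length prefix. A slightly closer proxy for a real cache hit rate is to replay sampled prompts in arrival order and count how many tokens each one shares with the best-matching earlier prompt. The sketch below assumes an unbounded cache and already-tokenized prompts; `estimated_reuse_rate` is a hypothetical helper, and the quadratic scan is only meant for samples of a few hundred requests.

```python
def common_len(a, b):
    # Length of the shared leading run of two token sequences
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def estimated_reuse_rate(tokenized_prompts):
    # Fraction of all prompt tokens that an ideal, never-evicting prefix cache
    # could have served, replaying the prompts in order.
    seen, reused, total = [], 0, 0
    for tokens in tokenized_prompts:
        reused += max((common_len(tokens, earlier) for earlier in seen), default=0)
        total += len(tokens)
        seen.append(tokens)
    return reused / total if total else 0.0

sample = [
    [1, 2, 3, 4],   # cold start
    [1, 2, 3, 5],   # shares 3 tokens with the first prompt
    [9, 9],         # unrelated
]
print(f"Estimated reuse rate: {estimated_reuse_rate(sample):.1%}")
```

To apply it to real traffic, tokenize the sampled prompts with the same tokenizer and chat template the server uses, since reuse happens at the token level.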
References
- Automatic Prefix Caching | vLLM Official Documentation
- vLLM APC Implementation Details (v0.6.1)
- RFC: Cache Salting for Secure Prefix Caching in vLLM | GitHub Issue #16016
- Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Blog
- Efficient Execution of Structured Language Model Programs | arXiv:2312.07104
- RadixAttention Concept Documentation | SGLang Official
- Prefix Caching — SGLang vs vLLM: Token-Level Radix Tree vs Block-Level Hashing | Medium
- When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse | Runpod
- SGLang vs vLLM in 2026: Benchmarks, Architecture, and When to Use Each | Particula
- KV Cache Management and Prefix Caching | vLLM DeepWiki
- KV-Cache Wins: From Prefix Caching in vLLM to Distributed Scheduling with llm-d