vLLM APC vs SGLang RadixAttention: KV Cache Architecture Differences and Workload-Based Selection Criteria

When running an LLM inference server, you'll eventually hit the question: "Why is prefill so slow?" In workloads where the beginning of the prompt repeats — like multi-turn chatbots or RAG pipelines — it quickly becomes apparent that recomputing the same context on every request is a noticeable waste.

vLLM's Automatic Prefix Caching (APC) and SGLang's RadixAttention both address this. At first I thought, "Aren't they both just prefix caching?" and didn't give it much thought. But when I actually looked at the internals, the design philosophy differed all the way from data structure choices to eviction policies. And those differences translate into meaningful performance gaps depending on workload characteristics. Particula's 2026 benchmark (Llama 3, H100) showed SGLang roughly 29% ahead of vLLM (16,200 vs 12,500 tok/s), and in high-concurrency scenarios the gap can approach twofold.

The intended audience for this post is backend and ML engineers who operate or are evaluating LLM inference servers. After reading this, you'll understand the caching implementation principles and internal data structure differences between the two frameworks, and be able to make a data-driven decision about which is more advantageous for your serving environment.

Selection criteria at a glance

Multi-turn conversation / RAG / agents: SGLang RadixAttention — the longer the shared prefix and the higher the concurrency, the greater the advantage

Multi-LoRA / non-NVIDIA hardware / multi-tenant security: vLLM APC — strengths in ecosystem maturity and hardware compatibility

Independent batches with less than 60% prefix overlap: Both approaches offer limited benefit — analyze your traffic before adopting either

Core Concepts

To briefly cover KV cache: the attention layer in a Transformer references the Key/Value matrices of previous tokens when processing each new token. KV cache stores and reuses the K/V matrices of already-computed tokens. Prefix caching is the extended concept of reusing these K/V matrices across different requests.
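To get a feel for how much memory is at stake, the per-token KV footprint can be estimated from the model shape. The sketch below assumes a Llama-3-8B-style configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16); the function name is illustrative and not part of either framework.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: a K and a V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-style shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{per_token // 1024} KiB per token")
print(f"{per_token * 650 / 2**20:.2f} MiB for a 650-token system prompt")
```

Under these assumptions, every request that reuses a cached 650-token prefix avoids recomputing tens of megabytes of K/V state.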

LLM inference has two phases. Prefill processes the entire input prompt at once to generate the first token, while decode sequentially generates one token at a time, using the previous token as input. Prefix caching reduces the cost of the prefill phase.

vLLM APC: Managing Blocks with a Hash Table

vLLM's Automatic Prefix Caching (APC) sits naturally on top of PagedAttention, which manages the KV cache by dividing GPU memory into pages. It splits the KV cache into fixed-size blocks (16 tokens by default), assigns a hash key to each block, and manages everything through a single global hash table.

When a new request arrives, the runtime slices the token sequence into blocks and computes their hashes. If a hash is in the table — cache hit, reuse without prefill. If not — cache miss, compute and insert into the table.

Each block's hash is computed from the hash of the preceding block (which in turn covers everything before it) together with the token IDs of the current block. This guarantees "position-dependent" keys: even identical 16-token sequences produce different keys if what precedes them differs. Starting with vLLM v0.11, the default hash function switched to SHA-256, which makes accidental collisions practically negligible compared with earlier versions (related discussion: GitHub Issue #16016).
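The chaining described above can be sketched in a few lines. This is a simplified illustration using `hashlib`, not vLLM's actual implementation; the function name and the way tokens are serialized are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 16  # matches the block size used in this post

def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one chained hash per full block; a trailing partial block gets none."""
    hashes = []
    parent = b""  # the first block has no parent
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        h.update(parent)                                # everything that came before
        h.update(",".join(map(str, block)).encode())    # this block's tokens
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes

shared = list(range(32))                 # two full blocks of shared prefix
a = block_hashes(shared + [100] * 16)    # request A
b = block_hashes(shared + [200] * 16)    # request B
print(a[:2] == b[:2])   # shared prefix blocks have the same keys
print(a[2] == b[2])     # the third block differs

# Same 16 tokens, different predecessor: the keys differ
print(block_hashes([7] * 16)[0] == block_hashes([1] * 16 + [7] * 16)[1])
```

Because each key folds in its parent, a hit on block N implies the whole prefix up to and including block N matches.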

python

# Enable vLLM APC (v0.11 or later recommended)
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
    block_size=16,  # Minimum cache unit: 16-token block
)
 
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
 
system_prompt = "You are a code review expert. " * 50  # Long system prompt
 
outputs = llm.generate([
    system_prompt + "Please review this Python code: def foo(): pass",
    system_prompt + "Please review this Python code: def bar(): return 1",
], sampling_params)
# From the second request onward, system_prompt blocks are reused from cache

There is one key constraint here. Because caching works at block granularity (16 tokens by default), a trailing partial block cannot be cached. If the shared prefix is exactly 100 tokens, only 6 blocks (96 tokens) are cached and the remaining 4 tokens are recomputed on every request. The loss is at most 15 tokens per request, so it matters little for long prompts, but aligning a short, heavily reused system prompt to a multiple of 16 avoids it. Keep in mind that alignment is counted over the full tokenized sequence, including any chat-template tokens that precede the system prompt.
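The arithmetic is easy to check for your own prompt lengths. A minimal sketch, with a hypothetical helper name:

```python
def cacheable_tokens(prefix_len: int, block_size: int = 16) -> tuple[int, int]:
    """Split a shared prefix into (tokens covered by full blocks, leftover tokens)."""
    cached = (prefix_len // block_size) * block_size
    return cached, prefix_len - cached

for n in (96, 100, 650):
    cached, recomputed = cacheable_tokens(n)
    print(f"prefix={n}: cached={cached}, recomputed={recomputed}")
```

Feed it the token count reported by your tokenizer for the rendered prompt, not the character count of the raw string.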

SGLang RadixAttention: Unifying All Requests in a Single Tree

SGLang's approach is bolder. It manages the KV pages of every request the server handles, both in-flight requests and finished ones whose cache has not yet been evicted, in a single radix tree (a compressed prefix trie).

Unlike a regular Trie, a Radix Tree compresses common prefixes into a single node. Where a regular Trie uses one node per character, a Radix Tree bundles the entire common portion into a single node. SGLang applies this principle directly to the KV cache, designing it so that "a group of common prefix tokens = one node + corresponding KV pages."

text

Request A: [System Prompt] + [User: "Please review this Python code"]
Request B: [System Prompt] + [User: "Please review this JavaScript code"]
Request C: [System Prompt] + [User: "Please review this Python code"] + [Assistant Response] + [User: "Explain further"]
 
Radix Tree:
root
└── [System Prompt KV Pages]  ← shared by A, B, C
    ├── [User: "Please review this Python code"]  ← shared by A, C
    │   └── [Assistant Response] → [User: "Explain further"]  ← C only
    └── [User: "Please review this JavaScript code"]  ← B only

When a new request arrives, the runtime starts at the tree root and searches for the longest prefix match. The KV pages of matched nodes are reused immediately, and prefill is performed only for the remaining tokens.
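The lookup can be illustrated with a toy radix tree over token IDs. This is a sketch of the data structure only, under the assumption that each node would own the KV pages for its token run; it leaves out KV storage, reference counting, and eviction, and the class names are hypothetical rather than SGLang's.

```python
class Node:
    def __init__(self, tokens=()):
        self.tokens = tuple(tokens)  # edge label: a run of token IDs
        self.children = {}           # first token of a child's label -> child

def _common_len(a, b) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class RadixTree:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens) -> int:
        """Length of the longest stored prefix of `tokens`."""
        tokens, node, matched = tuple(tokens), self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.tokens, tokens[matched:])
            matched += n
            if n < len(child.tokens):
                break  # diverged in the middle of this node
            node = child
        return matched

    def insert(self, tokens) -> None:
        tokens, node, i = tuple(tokens), self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            n = _common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                # Split: the shared part stays in `child`, the rest moves down.
                rest = Node(child.tokens[n:])
                rest.children = child.children
                child.tokens = child.tokens[:n]
                child.children = {rest.tokens[0]: rest}
            i += n
            node = child

tree = RadixTree()
system = list(range(100))              # stand-in for system prompt token IDs
tree.insert(system + [1, 2, 3])        # request A
tree.insert(system + [1, 2, 9, 9])     # request B splits A's node after 102 tokens
print(tree.match_prefix(system + [1, 2, 3, 4, 5]))
print(tree.match_prefix(system + [7]))
```

Note that the second lookup matches exactly 100 tokens even though no node boundary sits there, which is the property that frees the tree from block alignment.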

The most convenient aspect of SGLang is that RadixAttention is built in by default. It activates automatically just by launching the server, with no additional flags. Because it supports variable-length nodes, arbitrary-length prefix matching is possible without block-level alignment.

python

# SGLang - OpenAI-compatible API (currently recommended approach)
# Server: python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)
 
SYSTEM_PROMPT = "You are a code review expert. ..."
 
# RadixAttention operates automatically inside the server — no client-side configuration needed
for user_code in ["def foo(): pass", "def bar(): return 1"]:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Please review this Python code: {user_code}"}
        ],
        max_tokens=512
    )
    print(response.choices[0].message.content)
# From the second request onward, the SYSTEM_PROMPT portion is automatically reused from the radix tree

The Core Differences at a Glance

Item	vLLM APC	SGLang RadixAttention
Data structure	Global hash table	Radix tree
Minimum cache unit	Fixed 16-token block	Variable-length node (arbitrary-length prefix matching)
Prefix detection method	Block hash comparison	Tree longest-prefix matching
Partial block caching	Not supported	Supported
Eviction policy	LRU (ref count=0) + longer prefix preservation	Recursive LRU leaf node eviction
Automatic prefix discovery across requests	Supported at block granularity (a hit requires a matching full-block hash)	Supported at token granularity (longest-prefix match via tree traversal)
Enabled by default	On by default in recent releases (V1 engine); older releases need `--enable-prefix-caching`	Built-in by default
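The "recursive LRU leaf node eviction" row deserves a concrete picture. The sketch below is an illustration of the idea rather than SGLang's code: only unreferenced leaves are candidates, the least recently used goes first, and a parent that loses its last child becomes a candidate in turn. All names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)  # identity-based equality; nodes reference each other
class Node:
    name: str
    num_tokens: int
    last_access: float
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    ref_count: int = 0  # > 0 means a running request still uses this node

def add_child(parent: Node, child: Node) -> Node:
    child.parent = parent
    parent.children.append(child)
    return child

def evict(root: Node, tokens_needed: int) -> list[str]:
    """Evict least-recently-used unreferenced leaves until enough tokens are freed."""
    def leaves(n):
        if not n.children and n is not root:
            yield n
        for c in n.children:
            yield from leaves(c)

    heap = [(n.last_access, id(n), n) for n in leaves(root) if n.ref_count == 0]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < tokens_needed:
        _, _, node = heapq.heappop(heap)
        parent = node.parent
        parent.children.remove(node)
        freed += node.num_tokens
        evicted.append(node.name)
        # A parent left without children becomes an eviction candidate itself.
        if parent is not root and not parent.children and parent.ref_count == 0:
            heapq.heappush(heap, (parent.last_access, id(parent), parent))
    return evicted

root = Node("root", 0, 0.0)
system = add_child(root, Node("system", 650, last_access=10.0))
add_child(system, Node("python-question", 20, last_access=5.0))
add_child(system, Node("js-question", 20, last_access=8.0))
print(evict(root, tokens_needed=30))
```

Evicting leaves first means a widely shared prefix such as the system prompt is the last thing to go.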

Practical Application

Example 1: Multi-Turn Chatbot — Where SGLang Shines

Imagine building a customer support chatbot. The system prompt contains hundreds of tokens of company policies, product information, and service guidelines. Without caching, as a user continues the conversation, the entire growing context must be prefilled from scratch every turn. By the 10th turn, the accumulated prefill cost is dramatically higher than turn 1.

python

# SGLang multi-turn chatbot — OpenAI-compatible API approach
from openai import OpenAI
 
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
 
SYSTEM_PROMPT = """
You are a customer support specialist for TechCorp.
[Company Policy ~200 tokens...]
[Product Information ~300 tokens...]
[Service Guidelines ~150 tokens...]
"""  # ~650 tokens total
 
def chat(history: list, user_input: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_input})
 
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=messages,
        max_tokens=256
    )
    return response.choices[0].message.content
 
# As conversation history accumulates, the range reused from the radix tree grows automatically
history = []
user_messages = ["My order hasn't arrived yet.", "Can I get a refund?"]  # sample inputs
for user_input in user_messages:
    response_text = chat(history, user_input)
    history.extend([
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": response_text}
    ])

Conversation Turn	KV Pages Reused	Tokens Subject to Prefill
Turn 1	None (cold cache)	Everything (system prompt + user message)
Turn 2	System prompt + turn 1 history	New user message only
Turn 5	System prompt + turns 1–4 history	New user message only
Turn N	Entire accumulated history	New user message only

The more concurrent users sharing the same system prompt, the more the system prompt node in the radix tree is shared. In this scenario, a cache hit rate of 75–95% is a realistic target.
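To put numbers on the table above, the following sketch counts prefill tokens for a single conversation with and without reuse. It is an idealized model: it assumes a cold cache on turn 1, that the assistant's generated tokens are also reusable on the next turn, and it ignores chat-template overhead and eviction. The token counts are made up for illustration.

```python
def prefill_tokens(system_len: int, turns: list[tuple[int, int]], cached: bool) -> list[int]:
    """Tokens prefilled on each turn; turns[i] = (user_tokens, assistant_tokens)."""
    per_turn, history = [], 0
    for i, (user, assistant) in enumerate(turns):
        if cached and i > 0:
            per_turn.append(user)  # everything before the new message is reused
        else:
            per_turn.append(system_len + history + user)
        history += user + assistant
    return per_turn

turns = [(30, 120)] * 5  # hypothetical: 30-token questions, 120-token answers
without = prefill_tokens(650, turns, cached=False)
with_cache = prefill_tokens(650, turns, cached=True)
print(without, sum(without))
print(with_cache, sum(with_cache))
print(f"prefill tokens avoided: {1 - sum(with_cache) / sum(without):.1%}")
```

The saving grows with conversation length, and it grows further once other users' conversations share the same system prompt so that even turn 1 starts warm.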

Example 2: RAG Pipeline — Reusing Document Context

In RAG, retrieved document chunks occupy the front of the prompt. It's common for the same documents to be retrieved repeatedly across different questions — a pattern that pairs well with prefix caching.

python

# RAG + SGLang: cache-friendly prompt construction
from openai import OpenAI
 
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
 
def rag_query(retrieved_docs: list[str], question: str) -> str:
    # Place retrieved documents at the front of the system message — maximizes cache reuse
    context = "\n\n".join([f"Document {i+1}: {doc}" for i, doc in enumerate(retrieved_docs)])
 
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": f"Please answer the question based on the following documents:\n\n{context}"},
            {"role": "user", "content": question}
        ],
        max_tokens=512
    )
    return response.choices[0].message.content
 
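# fetch_document() is a function you implement to suit your environment
popular_doc = fetch_document("python-docs-chapter3")
doc_b = fetch_document("python-docs-chapter4")  # hypothetical second document ID
questions = ["How do list comprehensions work?", "What is a generator?"]  # sample questions
 
# popular_doc comes first, so its tokens form the shared prefix across all calls
results = [
    rag_query(retrieved_docs=[popular_doc, doc_b], question=q)
    for q in questions  # The popular_doc portion is automatically reused from the radix tree
]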

Cache-friendly prompt design: Frequently repeated context (system prompts, documents, tool definitions) should always be placed at the front of the prompt. If variable content like usernames or timestamps comes first, the entire long system prompt that follows will be a cache miss.
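The effect of ordering is easy to demonstrate. The sketch below compares the shared prefix of two prompts under both layouts, using characters as a rough stand-in for tokens; the prompt text and helper names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a customer support specialist. " * 20  # fixed content

def variable_first(user: str) -> str:
    return f"User: {user}\n{SYSTEM}"

def fixed_first(user: str) -> str:
    return f"{SYSTEM}\nUser: {user}"

bad = common_prefix_len(variable_first("alice"), variable_first("bob"))
good = common_prefix_len(fixed_first("alice"), fixed_first("bob"))
print(f"variable first: {bad} shared chars; fixed first: {good} of {len(SYSTEM)} system chars")
```

With the username first, the two prompts diverge after a handful of characters and the entire system prompt has to be prefilled again for every user.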

Example 3: Multi-LoRA Serving — Where vLLM Excels

If you're running a platform where multiple teams use their own fine-tuned models, the story changes. vLLM's independent block structure is a natural fit for managing KV caches separately per LoRA adapter.

python

# vLLM Multi-LoRA serving
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
 
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    enable_prefix_caching=True,
    max_lora_rank=64,
)
 
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
 
lora_requests = {
    "team_a": LoRARequest("team_a_lora", 1, "/models/lora/team_a"),
    "team_b": LoRARequest("team_b_lora", 2, "/models/lora/team_b"),
}
 
outputs = llm.generate(
    ["Team A's task", "Team B's task"],
    sampling_params,
    lora_request=[lora_requests["team_a"], lora_requests["team_b"]]
)

SGLang also supports Multi-LoRA, but vLLM's maturity and flexibility stand out in this area.

Example 4: Cache Isolation in Multi-Tenant Environments (vLLM v0.11+)

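When processing requests from multiple customers on a SaaS platform, cache isolation is a security requirement. With a shared prefix cache, a request that hits blocks cached by another tenant returns measurably faster, so response timing can reveal whether a given prompt prefix has been seen before. Customer A's prompts must not be able to hit customer B's cache entries.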

python

# vLLM Cache Salting for per-tenant cache isolation
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)
 
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
 
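def serve_tenant_request(tenant_id: str, prompt: str):
    # The salt travels with the prompt as a per-request "cache_salt" field rather than
    # as a generate() keyword argument; it is mixed into the first block's hash.
    # Verify the field name against the docs for your installed vLLM version.
    return llm.generate(
        {"prompt": prompt, "cache_salt": f"tenant:{tenant_id}"},
        sampling_params,
    )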
 
# Even identical prompts won't share cache if tenants differ
shared_prompt = "Common system prompt..."
response_a = serve_tenant_request("tenant_001", shared_prompt)
response_b = serve_tenant_request("tenant_002", shared_prompt)
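
Why does salting only the first block isolate the whole prompt? Because block hashes are chained, a different first hash changes every hash after it. The sketch below illustrates this with `hashlib`; it is a simplified model with hypothetical names, not vLLM's implementation.

```python
import hashlib

def chained_hashes(token_ids: list[int], block_size: int = 16, salt: str = "") -> list[str]:
    """Chained block hashes where the salt is mixed into the first block only."""
    hashes, parent = [], b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        h = hashlib.sha256(parent)
        if start == 0:
            h.update(salt.encode())  # the salt enters here...
        h.update(repr(token_ids[start:start + block_size]).encode())
        parent = h.digest()          # ...and propagates through the chain
        hashes.append(parent.hex())
    return hashes

prompt = list(range(48))  # three full blocks, identical for both tenants
tenant_a = chained_hashes(prompt, salt="tenant:tenant_001")
tenant_b = chained_hashes(prompt, salt="tenant:tenant_002")
print(any(x == y for x, y in zip(tenant_a, tenant_b)))               # no block is shared
print(tenant_a == chained_hashes(prompt, salt="tenant:tenant_001"))  # same tenant still hits
```

The trade-off is that a system prompt shared by all tenants is now cached once per tenant, so isolation costs some memory and hit rate.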

Pros and Cons Analysis

Honestly, a table of pros and cons alone tends to strip out the context. It helps to read the explanations alongside each framework's limitations so you understand when they'll actually trip you up in practice.

vLLM APC

vLLM's strength is versatility. Hardware compatibility, ecosystem maturity, and Multi-LoRA support all favor vLLM.

Item	Details
Broad hardware support	Virtually all accelerators: TPU, Trainium, Gaudi, NVIDIA, and more
Mature ecosystem	~3x more contributors than SGLang, full OpenAI API compatibility
Multi-LoRA serving	Independent block structure makes per-adapter cache management straightforward
Security isolation	Per-tenant cache isolation via cache salting (v0.11+)
Low overhead	Cache hit check via hash lookup only, no tree maintenance

Three constraints most commonly cause friction in production.

Item	Details	Mitigation
Block-level caching constraint	Incomplete final block is dropped if prefix length isn't a multiple of 16	Design system prompt token count to be a multiple of 16
Concurrency ceiling	Throughput degrades faster than SGLang under high concurrency	Consider SGLang for multi-turn, high-concurrency workloads
Hash collision risk in old versions	Pre-v0.11 uses Python's built-in `hash()`, which can cause incorrect cache reuse on collision	Always use v0.11 or later

SGLang RadixAttention

SGLang's strengths are precision of cache reuse and stability under high concurrency.

Item	Details
Variable-length prefix caching	Maximum prefix reuse for arbitrary lengths without block alignment
Automatic prefix discovery	Automatically identifies shareable prefixes from traffic patterns without explicit configuration
High-concurrency stability	Particula benchmark (Llama 3, H100) — throughput remains stable under high concurrency
Structured output optimization	Speed advantage for JSON/schema-constrained decoding via compressed FSM
Higher throughput	~29% higher throughput than vLLM under equivalent conditions (16,200 vs 12,500 tok/s)

SGLang's limitations are primarily around ecosystem and overhead.

Item	Details	Mitigation
NVIDIA-centric ecosystem	Non-NVIDIA hardware support is more limited than vLLM	vLLM is the practical choice for non-NVIDIA environments
Tree maintenance overhead	Radix tree insertion, search, and eviction incur additional CPU and memory cost	vLLM is more advantageous for independent batch processing without shared prefixes
Disadvantage for independent batches	If there are no common prefixes across traffic, the tree provides no benefit	Measure prefix overlap before deciding to adopt

The Most Common Mistakes in Practice

Adopting prefix caching without measuring the cache hit rate: Applying prefix caching to a workload with low prefix overlap only adds overhead. Check the actual prefix distribution in your traffic before adopting.
Placing variable content at the front of the prompt: When request-specific information like usernames, timestamps, or session IDs appears at the front of the prompt, the entire long system prompt that follows becomes a cache miss. The key is fixed content first, variable content last.
Continuing to use vLLM versions prior to v0.11: Older APC uses Python's built-in hash() function, which carries a collision risk. If two different prompts share the same hash key, the wrong KV cache gets reused, which can lead to subtle output errors.

Closing Thoughts

The prefix caching in these two frameworks starts from "same goal, different data structures" — SGLang pays the tree maintenance overhead in exchange for token-level reuse and automatic prefix discovery via a radix tree, while vLLM accepts block-level constraints in exchange for simplicity and hardware compatibility via a hash table.

When your workload characteristics are clear, the choice is clear too. For workloads with long shared prefixes and high concurrency — multi-turn conversation, RAG, agent frameworks — SGLang offers tangible advantages. For multi-LoRA, non-NVIDIA hardware, or multi-tenant environments requiring security isolation, vLLM is the more natural choice.

Three steps you can take right now:

Start by measuring the prefix overlap in your own workload. You can get a quick picture by sampling 100 recent requests as shown below. If the result is 0.6 or higher, both approaches should deliver meaningful gains.

python

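from collections import Counter
 
def measure_prefix_overlap(prompts: list[str], prefix_len: int = 200) -> float:
    """Share of prompts whose first prefix_len characters equal the most common prefix."""
    if not prompts:
        return 0.0
    prefixes = [p[:prefix_len] for p in prompts]
    most_common_count = Counter(prefixes).most_common(1)[0][1]
    return most_common_count / len(prompts)
 
# recent_prompts: a sample of raw prompt strings pulled from your request logs
overlap_rate = measure_prefix_overlap(recent_prompts)
print(f"Prefix overlap rate: {overlap_rate:.1%}")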

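If you're already using vLLM, you can check your current hit rate via the prefix cache metrics on the /metrics endpoint (Prometheus format). The exact metric names vary by version: recent releases expose counters such as vllm:prefix_cache_queries and vllm:prefix_cache_hits (hit rate = hits / queries), while older releases exposed a gauge such as vllm:gpu_prefix_cache_hit_rate, so check the names your server actually reports.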
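If your server reports hit and query counters rather than a ready-made rate, the ratio can be computed from the scraped text. The sketch below parses a hard-coded sample; the metric names in it are assumptions for illustration and should be replaced with whatever your /metrics output actually contains.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Sum Prometheus samples by metric name, ignoring labels and comment lines."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals

# Assumed metric names and made-up values; substitute your server's real output.
SAMPLE = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="m"} 12000.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="m"} 9000.0
"""

metrics = parse_metrics(SAMPLE)
hit_rate = metrics["vllm:prefix_cache_hits_total"] / metrics["vllm:prefix_cache_queries_total"]
print(f"Prefix cache hit rate: {hit_rate:.1%}")
```

Counters accumulate since server start, so for a recent-traffic view take the difference between two scrapes before dividing.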

If you're using vLLM, upgrade to v0.11 or later and make sure prefix caching is on. Recent releases enable it by default; on older ones, add the flag explicitly with a single command: vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching. Aligning a short system prompt to a multiple of 16 tokens recovers the few tokens that would otherwise sit in an uncached partial block.
For multi-turn or RAG workloads, run an A/B test with SGLang. Launch the server with python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct and RadixAttention will be active by default. Measuring prefill latency and throughput differences under the same traffic will give you the data to decide which fits your environment better.



