vLLM vs SGLang Performance Comparison — Choosing an Inference Engine Through the Lens of 2026 KV Cache Architecture
Have you ever served an LLM yourself? At first it seemed like model selection was everything, but once you deploy to production, you quickly realize that the choice of inference engine is what truly determines performance and cost. Early on, I brushed it off with "just use vLLM, right?"—until a multi-turn RAG pipeline produced a monthly bill 2.3× higher than expected. That's when I seriously compared the two engines. It turned out the problem wasn't the model; the system prompt and document context were being recomputed from scratch on every single request.
Both vLLM and SGLang are open-source LLM inference engines that originated at UC Berkeley. They may look similar at a glance, but their internal KV cache management approaches are fundamentally different—leading to performance gaps of up to 6× depending on workload characteristics. This article will help you understand the core differences between the two engines and give you clear criteria to confidently choose the right one for your situation.
Let's walk through, one by one, why the two engines produce such different results on different workloads.
Core Concepts
PagedAttention vs RadixAttention — Two Ways of Viewing the KV Cache
The most fundamental difference between the two engines is how they manage the KV (Key-Value) cache. Honestly, if you understand just this part, the rest of the selection criteria follows naturally.
KV Cache (Key-Value Cache): A memory space that stores the Key and Value matrices computed when a Transformer model processes previous tokens. Reusing these avoids recomputing already-processed portions, significantly boosting throughput.
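To make the definition concrete, here is a minimal sketch of the idea, not how either engine implements it. The `ToyKVCache` class and its fake `_project` step are hypothetical stand-ins for the real per-layer Key/Value projections, and the sketch assumes each new call extends the same token sequence.

```python
class ToyKVCache:
    """Toy cache: stores one (key, value) pair per already-processed token."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.projections_computed = 0

    def _project(self, token_id):
        # Stand-in for the real K/V projection (a matrix multiply per layer).
        self.projections_computed += 1
        return token_id * 2, token_id * 3

    def extend(self, token_ids):
        """Compute K/V only for tokens beyond what is already cached.

        Assumes token_ids starts with the tokens that were cached earlier.
        """
        for token_id in token_ids[len(self.keys):]:
            key, value = self._project(token_id)
            self.keys.append(key)
            self.values.append(value)
        return len(self.keys)


cache = ToyKVCache()
cache.extend([11, 12, 13])      # prefill: three tokens are projected
cache.extend([11, 12, 13, 14])  # next step: only the new token is projected
print(cache.projections_computed)
```

Without the cache, the second call would have projected all four tokens again; with it, the total work is four projections instead of seven.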
vLLM's PagedAttention was inspired by virtual memory paging in operating systems. It divides GPU memory into fixed-size pages and allocates them non-contiguously. This greatly reduces memory fragmentation and lets requests of varying lengths share the same memory pool efficiently. In this basic scheme, however, each request receives its own pages, so content that is common across requests is recomputed unless prefix caching is enabled separately.
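The paging idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not vLLM's allocator: `PagedAllocator`, its method names, and the page size of 16 tokens are all hypothetical.

```python
class PagedAllocator:
    """Toy block table: maps each request to a list of physical page ids."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # request id -> physical page ids (any order)
        self.token_counts = {}   # request id -> tokens stored so far

    def append_tokens(self, request_id, num_tokens):
        """Grow a request by num_tokens, taking new pages only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        total = self.token_counts.get(request_id, 0) + num_tokens
        pages_needed = -(-total // self.page_size)  # ceiling division
        while len(table) < pages_needed:
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop(0))
        self.token_counts[request_id] = total
        return table

    def release(self, request_id):
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


alloc = PagedAllocator(num_pages=8, page_size=16)
alloc.append_tokens("A", 40)   # needs 3 pages
alloc.append_tokens("B", 20)   # needs 2 pages
alloc.release("A")             # A's pages go back to the pool
alloc.append_tokens("C", 50)   # needs 4 pages, which are not contiguous
print(alloc.block_tables)
```

Request C ends up on pages that are scattered across the pool, which is the point: a request never needs one contiguous region, so freed pages can be reused immediately.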
[vLLM - PagedAttention concept]
Request A: [Page1][Page2][Page3]
Request B: [Page4][Page5]
Request C: [Page6][Page7][Page8][Page9]
→ Each request is allocated independent memory pages
→ No shared prefix, every request computed from scratch
SGLang's RadixAttention, by contrast, manages the KV cache using a Radix Tree structure.
Radix Tree: A tree structure that shares common prefixes. It's analogous to storing "car", "card", and "care" in an alphabetical dictionary by keeping "car" once and branching only on "d" and "e". Applied to token sequences, a common leading token group is computed once and shared across multiple requests.
The key insight is that it automatically detects and reuses common prefixes (leading token sequences) across requests. If there's a repeating portion like a system prompt or document context, it is computed only once and all subsequent requests pull from that cached result.
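A minimal sketch of prefix matching follows. It uses a plain trie with one node per token rather than a compressed radix tree, and the `PrefixCache` class is a hypothetical illustration, not SGLang's data structure. It only counts how many leading tokens could be reused.

```python
class PrefixCache:
    """Token-level trie; each node is a dict mapping token -> child node."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        on_cached_path = True
        for token in tokens:
            if on_cached_path and token in node:
                matched += 1          # KV for this token could be reused
            else:
                on_cached_path = False
                node[token] = {}      # KV for this token must be computed
            node = node[token]
        return matched


prefix_cache = PrefixCache()
system = ["You", "are", "a", "helpful", "assistant", "[Document A]"]

first = prefix_cache.match_and_insert(system + ["Main", "argument?"])
second = prefix_cache.match_and_insert(system + ["Counter?"])
third = prefix_cache.match_and_insert(system + ["Main", "point?"])
print(first, second, third)
```

The first request finds nothing cached, the second reuses the whole shared context, and the third reuses the shared context plus one more token that it happens to have in common with the first request.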
[SGLang - RadixAttention tree structure]
System prompt: "You are a helpful assistant. [Document A]"
                      │ (computed once, stored in cache)
        ┌─────────────┼─────────────┐
    Request A     Request B     Request C
"Main argument?" "Counter?"   "Summarize?"
        │             │             │
 (extra tokens)  (extra tokens) (extra tokens)
 newly computed  newly computed newly computed
→ Common portion (system prompt + document) computed only once
→ Only each request's unique portion is newly processed
The Difference in Numbers
The benchmarks below were measured on an H100 GPU with the Llama-3.1-8B model, as of 2026. Note that the gap narrows to 3–5% for larger 70B+ models, so the interpretation varies by model size.
| Metric | vLLM | SGLang | Notes |
|---|---|---|---|
| Throughput (tok/s, H100 / 8B) | ~12,600 | ~16,200 | SGLang ~29% ahead |
| Prefix-sharing workloads | baseline | up to 6.4× | RadixAttention effect |
| TTFT p95 | baseline | 5–8% lower | Time to first token |
| High concurrency (100+ requests) | ~4,741 tok/s | No directly comparable figure (test environments differ) | vLLM C++ routing advantage |
| Model support breadth | Very wide | Relatively narrow | vLLM advantage |
| Hardware support | GPU, TPU, Trainium, Gaudi | GPU-centric | vLLM advantage |
The figure 16,200 tok/s on H100 may not be intuitive, but in practical terms it means that even with 100 simultaneous users, each gets 162 tok/s. If an average sentence is 20 tokens, that's over 8 sentences generated per second per user.
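The arithmetic behind that estimate, written out as a small sketch. It assumes the aggregate throughput is split evenly across users, which real schedulers do not guarantee, and the variable names are purely illustrative.

```python
aggregate_tok_per_s = 16_200   # SGLang figure from the table above
concurrent_users = 100
tokens_per_sentence = 20       # rough average assumed in the text

# Even split across users (an idealization).
per_user_tok_per_s = aggregate_tok_per_s / concurrent_users
sentences_per_second = per_user_tok_per_s / tokens_per_sentence

print(f"{per_user_tok_per_s:.0f} tok/s per user")
print(f"{sentences_per_second:.1f} sentences per second per user")
```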
TTFT (Time To First Token): The time from when a user sends a request to when the first token is returned. The key metric determining perceived responsiveness in streaming responses.
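A small sketch of how TTFT can be measured from any token stream. The helper `measure_ttft` and the `simulated_stream` generator are hypothetical; the generator stands in for a real streaming response so the example runs without a server.

```python
import time


def measure_ttft(token_stream, clock=time.perf_counter):
    """Return (seconds until first token, tokens per second) for one stream."""
    start = clock()
    first_token_at = None
    token_count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()  # first token observed
        token_count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    elapsed = end - start
    throughput = token_count / elapsed if elapsed > 0 else float("inf")
    return first_token_at - start, throughput


def simulated_stream():
    # Hypothetical stand-in for a streaming response: slow prefill, fast decode.
    time.sleep(0.05)
    for token in ["Hello", ",", " world"]:
        yield token
        time.sleep(0.01)


ttft, tokens_per_second = measure_ttft(simulated_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tokens_per_second:.0f} tok/s")
```

With the OpenAI SDK, the stream would be the iterator returned by `client.chat.completions.create(..., stream=True)`. Keep in mind that a streamed chunk does not always correspond to exactly one token, so the throughput figure from this approach is an approximation.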
The reason vLLM remains stable under high concurrency is that its internal routing is implemented in C++. In CPython, the GIL (Global Interpreter Lock) lets only one thread execute Python bytecode at a time, whereas C++ code has no such constraint, so even with 100+ concurrent requests, queue management itself doesn't become a bottleneck.
Practical Applications
Example 1: Using SGLang in a Multi-Turn RAG Pipeline
This is exactly the scenario that made me switch engines. In RAG pipelines, retrieved document context is often repeated across multiple user requests—the pattern of "answer questions based on this document." With vLLM, if 100 questions are asked about the same document, all 100 computations start from scratch.
# Start SGLang server (settings optimized for multi-turn RAG)
# python -m sglang.launch_server \
#   --model-path meta-llama/Llama-3.1-8B-Instruct \
#   --host 0.0.0.0 \
#   --port 30000 \
#   --mem-fraction-static 0.9 \
#   --max-prefill-tokens 16384
import openai

client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

SHARED_CONTEXT = (
    "You are a document analysis expert.\n\n"
    "[Reference Document]\n"
    "{document_content}"
)


def ask_question(document: str, question: str, history: list) -> str:
    # The system message is identical across calls, so its KV cache can be reused.
    messages = [
        {
            "role": "system",
            "content": SHARED_CONTEXT.format(document_content=document),
        },
        *history,
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        max_tokens=512,
    )
    return response.choices[0].message.content


# Actual document content (paste a long report, paper, etc. here)
sample_doc = (
    "The global AI market size in 2024 is estimated at approximately $200 billion. "
    "Key growth drivers are the popularization of generative AI services and cloud infrastructure expansion. "
    "North America holds 40% of the market, with the Asia-Pacific region showing notable growth."
)

history = []
questions = [
    "What is the main argument of this document?",
    "Are there any counterarguments?",
    "Summarize the conclusion.",
]
for question in questions:
    answer = ask_question(sample_doc, question, history)
    # Keep the conversation so each new turn extends an already-cached prefix.
    history.extend([
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ])
    print(f"Q: {question}\nA: {answer}\n")
| Code Point | Description |
|---|---|
| `SHARED_CONTEXT` | The portion common to all requests → automatically cached by RadixAttention |
| Accumulating `history` | Cache hit rate increases as the multi-turn context grows longer |
| Changing `base_url` | Use the OpenAI SDK as-is; just swap the URL |
| `mem-fraction-static 0.9` | Enlarges the cache pool; the larger it is, the greater the cache reuse benefit |
When 100 users repeatedly ask questions about the same document, the system prompt + document context is computed only once, and the rest is retrieved from cache. Cases have been reported where this pattern reduced GPU usage by approximately 30%.
Example 2: Using vLLM for Large-Scale Batch Document Classification
When the prompt structure is fixed but each document's content is completely different—i.e., workloads with no opportunity for cache reuse—vLLM shines.
# Start vLLM server (optimized for batch processing)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.95
import asyncio

import openai

client = openai.AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

CLASSIFICATION_PROMPT = (
    "Please classify the following text into one of the categories below.\n"
    "Categories: [Technology, Business, Sports, Entertainment, Politics]\n\n"
    "Text:\n{text}\n\n"
    "Answer with just one word for the category."
)


async def classify_document(text: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{
            "role": "user",
            "content": CLASSIFICATION_PROMPT.format(text=text),
        }],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content.strip()


async def batch_classify(documents: list[str]) -> list[str]:
    # Send all requests at once; the server batches them internally.
    tasks = [classify_document(doc) for doc in documents]
    return await asyncio.gather(*tasks)


# List of documents each with unique content (load from files in actual use)
documents = [
    "Samsung Electronics announced successful mass production of 3nm process chips.",
    "Son Heung-min scored a hat-trick to lead his team to victory.",
    "The Federal Reserve cut its benchmark interest rate by 0.25 percentage points.",
]

results = asyncio.run(batch_classify(documents))
for doc, label in zip(documents, results):
    print(f"[{label}] {doc[:30]}...")
| Code Point | Description |
|---|---|
| `tensor-parallel-size 4` | Distributes the 70B model across 4 GPUs, using vLLM's extensive parallelism support |
| `asyncio.gather` | Dispatches requests concurrently; vLLM's continuous batching groups them internally |
| `max_tokens=10` | Short output length → maximizes throughput |
| Unique context per request | No opportunity for cache reuse → PagedAttention's memory efficiency is key |
Example 3: Serving DeepSeek Models — SGLang as the De Facto Standard
If you're serving DeepSeek-R1 or DeepSeek-V3, SGLang is strongly recommended. SGLang has built-in dedicated optimizations for the DeepSeek family that demonstrate meaningful advantages over vLLM.
Three key optimizations to highlight:
- MLA (Multi-head Latent Attention): A KV cache compression technique developed by DeepSeek. It projects Keys and Values into a low-dimensional latent space to dramatically reduce cache memory.
- MoE (Mixture of Experts): An architecture where only some "expert" networks out of the model's total parameters are selectively activated per token. Maintains large model capacity while keeping actual computation low.
- Elastic Expert Parallelism: A strategy that dynamically redistributes MoE expert layers across GPUs. Even if a specific GPU fails, the workload is automatically rebalanced so serving is not interrupted.
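To illustrate the "only some experts are activated per token" idea from the MoE item above, here is a generic top-k gating sketch. It is a common textbook variant (softmax over the selected experts), not DeepSeek's exact router or SGLang's implementation, and `top_k_route` is a hypothetical name.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda index: router_logits[index],
        reverse=True,
    )
    chosen = ranked[:k]
    exps = [math.exp(router_logits[index]) for index in chosen]
    total = sum(exps)
    # Only the chosen experts run for this token; the rest are skipped.
    return {index: value / total for index, value in zip(chosen, exps)}


# One token, four experts, two of them activated.
routing = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
print(routing)
```

Here two of four experts do the work for this token, which is how an MoE model keeps per-token computation low while its total parameter count stays large.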
# Start DeepSeek-R1 server (SGLang recommended settings)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --trust-remote-code \
  --tp 8 \
  --dp 2 \
  --host 0.0.0.0 \
  --port 30000 \
  --reasoning-parser deepseek-r1
# MLA attention backend auto-enabled
# MoE fault tolerance via Elastic Expert Parallelism
Once the server is up, you can receive inference results in Python like this. A notable feature of DeepSeek-R1 is that it returns the reasoning process inside <think> tags.
import openai

client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{
        "role": "user",
        "content": (
            "Find and fix the bug in the following code:\n\n"
            "def factorial(n):\n"
            "    if n == 0:\n"
            "        return 0\n"
            "    return n * factorial(n-1)"
        ),
    }],
    max_tokens=2048,
    temperature=0.6,
)

message = response.choices[0].message
full_response = message.content or ""

# With --reasoning-parser enabled, the server may already return the reasoning
# in a separate field (assumed here to be `reasoning_content`). Otherwise,
# fall back to splitting on the <think> tags in the text.
reasoning = getattr(message, "reasoning_content", None)
answer = full_response
if reasoning is None and "<think>" in full_response and "</think>" in full_response:
    think_start = full_response.find("<think>") + len("<think>")
    think_end = full_response.find("</think>")
    reasoning = full_response[think_start:think_end].strip()        # between the tags
    answer = full_response[think_end + len("</think>"):].strip()    # after </think>

if reasoning:
    print(f"[Reasoning Process]\n{reasoning}\n")
    print(f"[Final Answer]\n{answer}")
else:
    print(full_response)
| Code Point | Description |
|---|---|
| `--tp 8 --dp 2` | Tensor parallel 8 + data parallel 2, recommended combination for DeepSeek-R1's scale |
| `--reasoning-parser deepseek-r1` | Enables <think> tag parsing, allowing separation of reasoning and answer |
| `think_end = find("</think>")` | Useful for separately logging or debugging R1's chain of reasoning |
Pros and Cons Analysis
Advantages
vLLM
| Item | Details |
|---|---|
| Model support breadth | Widest in the industry. Nearly all open-source models including Meta, Mistral, Cohere, Google |
| Hardware diversity | Supports NVIDIA GPU, TPU, AWS Trainium, Intel Gaudi |
| Community maturity | ~3× more contributors than SGLang; documentation and ecosystem are stable |
| OpenAI compatibility | Existing OpenAI SDK apps work as-is with just one URL change |
| High-concurrency stability | C++ routing maintains stable throughput even with 100+ simultaneous requests |
| Speculative Decoding | Supports various spec decode methods (EAGLE, MTP, etc.) for 1.3–2× speed gains |
SGLang
| Item | Details |
|---|---|
| Prefix-reuse workloads | Up to 6.4× performance and up to 30% GPU memory savings with repeated context |
| Raw throughput | 16,200 tok/s on H100 / 8B, ~29% ahead of vLLM |
| Structured output | FSM-based constrained decoding guarantees JSON and regex-formatted output |
| DeepSeek optimization | De facto standard serving engine for the DeepSeek model family |
| TTFT | 5–8% reduction in p95 time to first token |
| Cloud support | Official support on AWS, GCP, Azure, Oracle Cloud |
Constrained Decoding: A technique that forces model output to conform to a specific format (JSON schema, regex, etc.). Uses a finite state machine (FSM) to allow only valid tokens at each token generation step. Eliminates parsing errors at the source in function-calling and agent workflows.
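A toy sketch of FSM-based token masking. The transition table, the `toy_scores` "model", and the target pattern (an opening brace, one or more digits, a closing brace) are all hypothetical; real engines compile a JSON schema or regex into a much larger automaton over the tokenizer's vocabulary.

```python
DIGITS = "0123456789"

# FSM for the pattern: "{" then one or more digits then "}"
TRANSITIONS = {
    "start": {"{": "need_digit"},
    "need_digit": {digit: "in_number" for digit in DIGITS},
    "in_number": {**{digit: "in_number" for digit in DIGITS}, "}": "done"},
}


def toy_scores(prefix):
    # Hypothetical model that would prefer to emit free text ("hello").
    scores = {"hello": 9.0, "{": 1.0, "4": 3.0, "2": 2.0, "}": 0.5}
    if len(prefix) >= 3:
        scores["}"] = 5.0  # after a couple of digits it wants to close
    return scores


def constrained_greedy(score_fn, max_steps=10):
    """Greedy decoding where only tokens allowed by the FSM can be chosen."""
    state, output = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = TRANSITIONS[state]
        scores = score_fn(output)
        # Mask: tokens outside `allowed` are never considered.
        token = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        output.append(token)
        state = allowed[token]
    return "".join(output), state == "done"


text, is_valid = constrained_greedy(toy_scores)
print(text, is_valid)
```

The highest-scoring token overall ("hello") is never emitted because the FSM never allows it, so the output always matches the pattern.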
Speculative Decoding: A method where a small "draft model" predicts multiple tokens ahead, and the large model verifies them all at once. Improves processing speed by 1.3–2× while maintaining output quality.
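The draft-then-verify loop can be sketched as follows. This is the greedy-verification variant with toy stand-in models (`draft_next` and `target_next` are hypothetical functions over integer tokens). Production engines verify all drafted positions in a single batched forward pass and may use probabilistic acceptance instead.

```python
def target_next(sequence):
    # Toy "large model": counts upward modulo 10.
    return (sequence[-1] + 1) % 10


def draft_next(sequence):
    # Toy "draft model": agrees with the target except right after a 3.
    return 0 if sequence[-1] == 3 else (sequence[-1] + 1) % 10


def speculative_step(prefix, k=4):
    """Draft k tokens, keep the agreeing prefix, then add one target token."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(token)
    accepted.append(target_next(prefix + accepted))  # all accepted: bonus token
    return accepted


new_tokens = speculative_step([0], k=4)
print(new_tokens)
```

Every accepted token is one the target model would have produced anyway, which is why output quality is preserved; the speedup comes from confirming several tokens per verification step.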
Drawbacks and Caveats
vLLM
| Item | Details | Mitigation |
|---|---|---|
| Prefix-sharing limitations | SGLang is ~29% ahead on raw throughput and up to 6.4× ahead on fixed-context repeat workloads | Consider switching to SGLang for those workloads |
| Multi-turn cache efficiency | Recomputation occurs on every request with long conversation histories | Consider migrating to SGLang |
SGLang
To be honest, looking at this list of drawbacks makes SGLang seem inferior—but in practice, the perceived difference in DeepSeek serving or RAG workloads is quite significant.
| Item | Details | Mitigation |
|---|---|---|
| Model support breadth | Fewer supported models than vLLM | Verify your model is on the support list first |
| Hardware limitations | TPU, Trainium, Gaudi support is in early stages | No issue if you're on an NVIDIA GPU environment |
| Community size | Relatively smaller than vLLM | Major cloud providers officially supporting it since 2025; growing rapidly |
| Operational maturity | Rapid update cycle leads to frequent API changes | Recommended to pin versions and upgrade incrementally |
Which Engine to Choose — Decision Flow
If you've read the whole article and are thinking "so which one should I use?", the flow below may help.
Does my workload have a repeating common system prompt / document context?
│
├─ Yes → Does the fixed context account for 30%+ of the total prompt?
│        ├─ Yes → SGLang ✓ (maximizes RadixAttention effect)
│        └─ No → Are you using DeepSeek-family models?
│                 ├─ Yes → SGLang ✓ (built-in dedicated optimizations)
│                 └─ No → Both are similar; vLLM is easier to operate
│
└─ No → Are there 100+ concurrent requests?
         ├─ Yes → vLLM ✓ (C++ routing stability)
         └─ No → Mixed GPU/TPU environment or frequent model changes?
                  ├─ Yes → vLLM ✓ (hardware & model diversity)
                  └─ No → SGLang if throughput is priority,
                          vLLM if ecosystem stability is priority
The Most Common Mistakes in Practice
I've made similar mistakes myself, so I'm sharing them here.
- Choosing an engine without analyzing your workload: I also initially decided with "vLLM is well-known, so that's fine," and later had to rebuild my deployment pipeline from scratch when switching to multi-turn RAG. The right order is to first understand what patterns of requests will come in, then choose the engine.
- Judging benchmarks by general numbers rather than your own workload: SGLang's advantage is clear for 7B–8B models, but the gap shrinks to 3–5% for 70B+ large models. Rather than general benchmarks, I recommend directly measuring numbers that match your model size and request patterns. Reproducing your actual traffic pattern with `locust` or `wrk` and directly observing TTFT, throughput, and GPU utilization makes things much clearer.
- Overlooking high-concurrency scenarios: I missed this too, and making decisions based solely on raw throughput numbers can lead to surprises in production. In environments with 100+ concurrent requests, vLLM's C++ routing can be advantageous, so you need to factor in your expected concurrent user count.
Closing Thoughts
Your workload determines your engine. Start with vLLM for one-off batch jobs, and SGLang for repeated context or agent workloads.
Three steps you can take right now:
- Identify your workload pattern: Check how often a common prefix (system prompt, document context) is repeated across requests. If the fixed context accounts for 30%+ of the total prompt, SGLang is likely the better choice.
- Spin up both engines locally: Install each with `pip install vllm` and `pip install "sglang[all]"`, then run them in OpenAI-compatible API mode. Both engines work with your existing code with just a `base_url` change.
- Run benchmarks with your actual traffic pattern: Reproduce your real request patterns with `locust` or `wrk` and directly measure TTFT, throughput, and GPU utilization. Numbers tailored to your situation are far more valuable than general benchmarks.
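For the first step, a rough way to estimate how much of your traffic is a shared prefix is sketched below. The helper `shared_prefix_fraction` is hypothetical, and the example uses pre-tokenized integer lists; in practice you would tokenize logged prompts with your model's tokenizer first.

```python
def shared_prefix_fraction(prompts):
    """Fraction of all prompt tokens covered by the prefix common to every prompt."""
    if not prompts:
        return 0.0
    shared = 0
    for column in zip(*prompts):      # walk the prompts position by position
        if len(set(column)) != 1:     # first position where they diverge
            break
        shared += 1
    total_tokens = sum(len(prompt) for prompt in prompts)
    if total_tokens == 0:
        return 0.0
    return shared * len(prompts) / total_tokens


# Three logged prompts (as token ids) that share a 4-token context.
logged_prompts = [
    [1, 2, 3, 4, 9],
    [1, 2, 3, 4, 8, 7],
    [1, 2, 3, 4, 6],
]
fraction = shared_prefix_fraction(logged_prompts)
print(f"{fraction:.0%} of prompt tokens are shared prefix")
```

If this number is comfortably above the 30% rule of thumb for your real traffic, the prefix-reuse benefit is worth benchmarking first.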