Migrating LLM Inference from vLLM to SGLang: Why RAG and Multi-Turn Workloads See a 6x Throughput Difference
Honestly, my first reaction when I came across SGLang was "another new framework?" vLLM was working well enough, and touching a serving stack that's already running stably is always a burden. But while digging into performance bottlenecks in a pipeline processing tens of thousands of RAG requests per day, I looked at SGLang's RadixAttention benchmark numbers and thought, "If this is real, I need to tear apart our current stack."
One thing worth clarifying upfront: the "6x" figure in the title is achieved in RAG and multi-turn architectures where thousands of requests share the same document context. For single-turn workloads where every request is entirely new content, the two frameworks are nearly identical. Understanding which side your service is closer to comes first — this article was written to help you make that judgment. If you have experience with vLLM or are already running LLM serving in production, it's structured so you can follow along right through the code examples.
The key point is this: because both servers expose an OpenAI-compatible API, switching means starting a different server and pointing the client's base_url at it. The rest of the client code stays untouched. The barrier to adoption is low, so if you're currently serving RAG or multi-turn conversations, it's well worth running an experiment.
Core Concepts
Why the KV Cache Becomes a Bottleneck
When an LLM generates tokens, it accumulates the attention computation results for previously processed tokens (Key-Value cache, or KV cache) in GPU memory. This is to avoid recomputing the same content, but the problem is that as this cache grows heavy, GPU memory fills up quickly. In a RAG pipeline, each request carries thousands of tokens of document context, so how you manage this directly determines serving performance.
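To make "heavy" concrete, here is a back-of-the-envelope sketch of the KV cache footprint. The layer and head counts are assumptions taken from the published Llama-3.1-8B configuration, and the 4000-token context length is a hypothetical RAG document size, not a measurement from our pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for FP16/BF16


def kv_bytes_per_token(shape: AttentionShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_value


# Assumed values for Llama-3.1-8B: 32 layers, 8 KV heads, head dimension 128, FP16.
LLAMA_3_1_8B = AttentionShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_bytes_per_token(LLAMA_3_1_8B)
context_tokens = 4000  # hypothetical RAG context length
per_request_mib = per_token * context_tokens / 2**20

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_request_mib:.0f} MiB for a {context_tokens}-token context")
```

Under these assumptions a single request's context occupies about 500 MiB, so a few dozen concurrent requests that each recompute and store the same document would use tens of gigabytes for identical data.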
vLLM — The Industry Standard Built on PagedAttention
vLLM is a project that started at UC Berkeley, and it manages the GPU KV cache in page-sized units like an operating system's virtual memory, using a technique called PagedAttention. This reduces memory fragmentation and improves batching efficiency. It quickly became the industry standard after its 2023 debut and currently has the broadest model support and community ecosystem.
PagedAttention: A technique that divides and manages the KV cache in GPU memory into fixed-size pages. It allows more requests to be processed concurrently without memory waste.
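The paging idea can be sketched with a toy allocator. This is a simplified illustration and not vLLM's actual implementation: the class name, the method names, and the page size of 16 tokens are all assumptions made for the example.

```python
class PagedKVAllocator:
    """Toy allocator: maps each request to a list of fixed-size physical pages."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_tokens(self, request_id: str, num_tokens: int) -> None:
        table = self.block_tables.setdefault(request_id, [])
        new_len = self.lengths.get(request_id, 0) + num_tokens
        pages_needed = -(-new_len // self.page_size)  # ceiling division
        extra = pages_needed - len(table)
        if extra > len(self.free_pages):
            raise MemoryError("out of KV cache pages")
        # Pages do not need to be contiguous; any free page will do.
        for _ in range(extra):
            table.append(self.free_pages.pop())
        self.lengths[request_id] = new_len

    def release(self, request_id: str) -> None:
        # Finished requests hand their pages straight back to the pool.
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_pages=8, page_size=16)
alloc.append_tokens("req-a", 40)  # needs 3 pages
alloc.append_tokens("req-b", 16)  # needs 1 page
alloc.append_tokens("req-a", 10)  # 50 tokens total, so one more page
print(alloc.block_tables)
```

Because a request only ever wastes the unused tail of its last page, memory that would otherwise be reserved for the maximum sequence length stays available for other requests.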
SGLang — Pushing Prefix Reuse to the Extreme
SGLang (Structured Generation Language) is a framework built by the LMSYS group, and two core technologies create its differentiating edge.
RadixAttention: Manages the KV cache using a Radix Tree data structure. When multiple requests share the same prefix (system prompt, RAG context, conversation history), it automatically reuses the cache. In RAG pipelines, thousands of requests often prepend the same document context, so recomputing it every time is an obvious waste. vLLM also supports prefix caching, but SGLang builds its whole cache design around the Radix Tree: every incoming request is matched against the longest prefix already cached, which is why the gap shows up mainly in workloads with heavy prefix sharing.
Radix Tree: A tree data structure that shares common prefixes among strings. SGLang applies this to sharing KV cache for common prefix token sequences.
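The prefix-matching idea can be sketched with a plain token trie. This is a deliberately simplified stand-in, not SGLang's implementation: a real radix tree compresses single-child chains into one edge and also handles eviction, and the class and method names here are hypothetical.

```python
class PrefixNode:
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Uncompressed token trie used to find the longest already-cached prefix."""

    def __init__(self):
        self.root = PrefixNode()

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: from here on every node is new, so nothing else can match.
                child = node.children[tok] = PrefixNode()
            else:
                matched += 1
            node = child
        return matched


cache = PrefixCache()
shared_context = list(range(1000))  # stands in for the tokenized document

first = cache.match_and_insert(shared_context + [5001, 5002])
second = cache.match_and_insert(shared_context + [6001])
print(first, second)
```

The first request finds nothing cached, while the second one reuses all 1000 context tokens and only has to compute attention state for its own question.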
XGrammar-based Structured Output: For JSON/Regex/EBNF constrained decoding, two separate figures are quoted for XGrammar: grammar compilation that is 80x faster than standard methods, and roughly 3x higher JSON token decoding throughput than the standard approach. These measure different stages (compilation and decoding respectively), so it's worth not conflating them. The difference is significant in agent and tool-call pipelines where JSON responses must be enforced.
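As a rough sketch of what enforcing JSON looks like from the client side, the helper below builds a chat request that asks for schema-constrained output through the OpenAI-style response_format field. The schema, the helper names, and the assumption that your server version accepts the json_schema form are all illustrative, so check the structured output documentation for the release you run.

```python
import json

# Hypothetical schema for a RAG answer.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}


def build_structured_request(model: str, question: str) -> dict:
    """Keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": ANSWER_SCHEMA},
        },
    }


def parse_answer(raw: str) -> dict:
    """Parse the model output and check that the required keys are present."""
    data = json.loads(raw)
    missing = [key for key in ANSWER_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


request = build_structured_request("meta-llama/Llama-3.1-8B-Instruct", "Summarize the document.")
print(json.dumps(request["response_format"], indent=2))
```

Keeping the parse step even with constrained decoding is a cheap safeguard, for example against responses truncated by max_tokens.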
Comparing the Two Frameworks
The first thing that caught my eye in the table below was the TTFT row. TTFT (Time To First Token) is the time from when a request is sent until the first token of the response comes back. A drop from 103ms to 79ms (23%) looks small in absolute terms, but at p95 latency the difference was noticeable in practice.
| Item | vLLM | SGLang |
|---|---|---|
| KV cache management | PagedAttention | RadixAttention (prefix reuse) |
| Structured output | Basic support | XGrammar (~3x JSON throughput) |
| Model support range | Broader | Focused on major models |
| Hardware compatibility | Broad, including AMD ROCm | NVIDIA-centric, expanding |
| OpenAI API compatibility | Supported | Supported |
| First request latency | Consistent | Stable after warmup |
| TTFT (multi-turn p95) | 103ms | 79ms (23% improvement) |
| Official PyTorch integration | December 2024 | March 2025 |
Official PyTorch Integration (March 2025): SGLang was officially included in the PyTorch ecosystem. This signals recognition as an ecosystem member rather than just a third-party project, which makes long-term maintenance and continuity a more reasonable expectation.
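If you want to reproduce a TTFT row like the one in the table for your own traffic, a minimal sketch looks like this. The helper names are hypothetical, the percentile uses the simple nearest-rank definition, and the first streamed chunk is treated as an approximation of the first token.

```python
import math
import time
from typing import Callable, Iterable


def time_to_first_item(make_stream: Callable[[], Iterable],
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Seconds from issuing the request until the stream yields its first item."""
    start = clock()
    for _ in make_stream():  # the request is created inside the timed region
        return clock() - start
    raise ValueError("stream produced no items")


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile, for example pct=95 for p95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def relative_improvement(baseline_ms: float, candidate_ms: float) -> float:
    return (baseline_ms - candidate_ms) / baseline_ms


# The p95 figures from the table: 103ms on vLLM, 79ms on SGLang.
print(f"{relative_improvement(103, 79):.1%}")
```

With the OpenAI client, make_stream could be something like lambda: client.chat.completions.create(model=..., messages=..., stream=True), called once per sampled request before feeding the collected values into percentile.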
Practical Application
Here's the example flow upfront: drop-in replacement to verify behavior → Docker for production setup → optimize for RAG workloads → confirm numbers with benchmarks. Feel free to jump directly to whichever example fits your situation.
Example 1: Drop-in Replacement — Changing Only the Endpoint
The first thing I wanted to verify was "can you really avoid changing the code?" Trying it in practice, you only need to change the server startup command. If you're serving a large MoE-architecture model like DeepSeek V3, simply swapping the model with --model-path deepseek-ai/DeepSeek-V3 --tp 8 will automatically apply SGLang's MLA optimization kernels.
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
# Start SGLang server (same model, same API format)
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tp 2 \
--host 0.0.0.0 \
--port 30000
In the client code, only the base_url needs to change.
from openai import OpenAI
# Code that used vLLM — only this line needs to change
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
# When switching to SGLang
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY",
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
)
| Changed Item | vLLM | SGLang |
|---|---|---|
| Server launch module | vllm.entrypoints.openai.api_server | sglang.launch_server |
| TP argument name | --tensor-parallel-size | --tp |
| Default port | 8000 | 30000 (configurable) |
| Client changes | n/a | Modify base_url only |
Tensor Parallelism (TP): A method that distributes the model's weight matrices across multiple GPUs for parallel processing. --tp 4 splits the model across 4 GPUs. In SGLang, --tp is the most stable option for multi-GPU operation.
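To show what "distributing the weight matrices" means, here is a pure-Python sketch of a column-wise split. It only illustrates the arithmetic: real tensor parallelism runs each shard on its own GPU and combines the results with collective communication, and all function names here are made up for the example.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equal shards, one per GPU."""
    n = len(w[0])
    if n % parts:
        raise ValueError("column count must be divisible by the number of shards")
    width = n // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    # Each shard computes its slice of the output independently;
    # concatenating the slices plays the role of the all-gather step.
    out = []
    for shard in split_columns(w, parts):
        out.extend(matmul(x, shard))
    return out


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(matmul(x, w))
print(tensor_parallel_matmul(x, w, parts=2))
```

Both calls print the same vector, which is the point: each device holds only a fraction of the weights, yet the combined result matches the single-device computation.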
Example 2: Serving SGLang with Docker
In production environments, Docker is much more convenient. There's an official image, so you can bring it up without any environment setup.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<your-token>" \
--ipc=host \
lmsysorg/sglang:latest-runtime \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
--shm-size 32g prevents errors caused by insufficient shared memory during multi-GPU tensor parallel processing. I ran it without this option the first time and spent a long time troubleshooting. In environments without Docker, set the environment variable with export HF_TOKEN=<your-token> before starting the server.
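After the container starts, the model still needs time to download and load before it accepts traffic. A small readiness loop like the sketch below can gate deployment scripts. It assumes the OpenAI-compatible /v1/models route answers with HTTP 200 once the server is ready, and the function names, attempt count, and interval are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, timeout: float = 2.0) -> bool:
    """True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60, interval: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it succeeds or the attempts are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(lambda: models_endpoint_ok("http://localhost:30000/v1"))
    print("server ready" if ready else "server did not come up in time")
```

Passing the probe and the sleep function as parameters keeps the loop testable without a running server.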
Example 3: Maximizing RadixAttention Effect in a RAG Pipeline
In RAG workloads, thousands of requests share the same document context. The thousands of tokens of documents that go into SHARED_CONTEXT in the code below are the primary reuse target for RadixAttention. The more concurrent requests there are, and the longer the shared prefix, the wider the performance gap with vLLM grows.
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY",
)
SHARED_CONTEXT = """
You are a document analysis assistant.
[Document content — thousands of tokens of RAG context]
"""
async def query_with_rag(user_question: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SHARED_CONTEXT},  # target for prefix reuse
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content
async def main():
    # Questions asked from various angles about the same document: this pattern maximizes RadixAttention effectiveness
    questions = [
        "What is the central argument of this document?",
        "Please summarize the evidence the author presents.",
        "Summarize the content of the third paragraph in one sentence.",
        "What limitations are mentioned in this document?",
        "Please explain the conclusion in plain language.",
    ]
    # Send all questions concurrently so they can share the cached prefix
    results = await asyncio.gather(*[query_with_rag(q) for q in questions])
    return results
asyncio.run(main())
Example 4: Server Warmup and Benchmark Measurement
SGLang needs time to build up the RadixAttention cache during the first few requests. Measuring performance immediately after server startup can produce lower-than-actual numbers, so it's good practice to send a few dummy requests beforehand.
import asyncio
from openai import AsyncOpenAI
async def warmup_server(base_url: str, model: str, n: int = 5) -> None:
    client = AsyncOpenAI(base_url=base_url, api_key="EMPTY")
    # Build n tiny requests and send them concurrently
    tasks = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        for _ in range(n)
    ]
    await asyncio.gather(*tasks)
    print(f"Warmup complete ({n} requests)")
# Run after server startup, before actual traffic
asyncio.run(
    warmup_server("http://localhost:30000/v1", "meta-llama/Llama-3.1-8B-Instruct")
)
Benchmarks can be measured using SGLang's built-in bench_serving script.
# SGLang built-in benchmark
python -m sglang.bench_serving \
--backend sglang \
--host localhost \
--port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 200 \
--request-rate 10
# For comparison with vLLM — change only backend and port
python -m sglang.bench_serving \
--backend vllm \
--host localhost \
--port 8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 200 \
--request-rate 10
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| RAG/multi-turn throughput | Up to 6.4x improvement over vLLM (when prefix reuse rate is high) |
| Small model general throughput | ~29% advantage on H100 |
| DeepSeek/MoE inference | Built-in MLA optimization kernel, 3.1x faster than vLLM |
| Structured output | JSON decoding throughput ~3x vs. standard approach |
| TTFT (multi-turn p95) | vLLM 103ms → SGLang 79ms (23% improvement) |
| Drop-in replacement | No client code changes needed with OpenAI-compatible API |
| Kubernetes support | Official Helm Chart provided |
| Quantization | FP4/FP8/INT4/AWQ/GPTQ support |
| Ecosystem | Official PyTorch integration in March 2025 |
MLA (Multi-head Latent Attention): An attention variant introduced by DeepSeek that compresses the KV cache into low-dimensional latent vectors to reduce memory. SGLang has built-in kernel optimizations specialized for this architecture, outperforming vLLM in both memory efficiency and inference speed when combined with MoE architectures.
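The memory effect of the latent compression can be estimated with simple arithmetic. The dimensions below (128 heads of size 128, a 512-wide latent, a 64-wide positional key) are assumptions chosen to be in the neighborhood of DeepSeek-V3's published configuration, so treat the resulting ratio as an order-of-magnitude illustration.

```python
def mha_kv_values_per_token(num_heads: int, head_dim: int) -> int:
    # Standard multi-head attention caches a full key and value vector for every head.
    return 2 * num_heads * head_dim


def mla_kv_values_per_token(latent_dim: int, rope_dim: int) -> int:
    # MLA caches one compressed latent vector plus a small positional (RoPE) key.
    return latent_dim + rope_dim


# Assumed, illustrative dimensions (per layer, per token).
mha = mha_kv_values_per_token(num_heads=128, head_dim=128)
mla = mla_kv_values_per_token(latent_dim=512, rope_dim=64)
ratio = mha / mla

print(f"MHA: {mha} values, MLA: {mla} values, about {ratio:.0f}x smaller")
```

A cache that is tens of times smaller per token only pays off if the attention kernels work directly on the compressed form, which is where specialized MLA kernels come in.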
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Warmup required | Initial requests spend time loading RadixAttention cache | Send warmup requests after server startup (see Example 4) |
| DP instability | High performance variability when using --dp flag | Recommend using --tp (Tensor Parallelism) instead |
| Mixed-bit quantization | Mixed-bit quantization causes some conflicts | Use a single quantization scheme like FP8/AWQ |
| MoE gate layer | mlp.gate quantization kernel unsupported in some cases | Exclude that layer from quantization or check latest version |
| Large model gap | Performance gap narrows to 3–5% for 70B+ models | Judge based on workload characteristics for large models |
| Model support range | Non-mainstream architectures may be unsupported | Check official model list beforehand |
| Hardware compatibility | vLLM has the edge for AMD ROCm stability | Recommend keeping vLLM for AMD environments |
| Single-turn independent requests | No prefix reuse, almost no performance difference | Understand workload characteristics before deciding |
Data Parallelism (DP): A method that places identical model copies on multiple GPUs and distributes requests among them. In SGLang, using --dp currently shows high performance variability and is not recommended. In multi-GPU environments, it's better to start by tuning --tp first.
The Most Common Mistakes in Practice
- Judging by first-response latency without warmup: SGLang needs time to build up the RadixAttention cache during the first few requests. Concluding "no difference from vLLM" after just one request is a misjudgment. Run the warmup code from Example 4 first and look at numbers after sufficient traffic has passed through.
- Expecting 6x from single-turn independent request workloads: If every request is entirely new content, RadixAttention's prefix reuse won't occur. In this case, the performance of the two frameworks is nearly identical. It's worth first understanding how much prefix sharing occurs in your workload.
- Using the --dp flag instead of --tp: When utilizing multiple GPUs, Data Parallelism may feel more familiar, but --tp is currently much more stable in SGLang. For multi-GPU setups, it's recommended to tune --tp values before reaching for --dp.
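One way to check how much prefix sharing your workload has before migrating is to run a sample of logged prompts through a rough estimator like the one below. It is a sketch under simplifying assumptions: whitespace splitting stands in for the real tokenizer, the cache is treated as unbounded, and the function names are hypothetical.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_reuse_ratio(prompts: list[str]) -> float:
    """Fraction of prompt tokens that repeat a prefix of some earlier prompt."""
    seen: list[list[str]] = []
    reused = total = 0
    for prompt in prompts:
        tokens = prompt.split()
        total += len(tokens)
        # Best case: reuse the longest prefix shared with any earlier prompt.
        reused += max((common_prefix_len(tokens, prev) for prev in seen), default=0)
        seen.append(tokens)
    return reused / total if total else 0.0


document = "ctx " * 98  # stands in for a long shared RAG context
rag_like = [document + "question one", document + "question two"]
independent = ["tell me a joke", "translate this sentence"]

print(estimated_reuse_ratio(rag_like))
print(estimated_reuse_ratio(independent))
```

A ratio close to zero suggests the single-turn case where the two frameworks perform about the same, while a high ratio suggests the kind of workload where prefix reuse can pay off.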
Closing Thoughts
Migrating to SGLang is well worth trying if even one of four keywords applies to you: RAG, multi-turn, structured output, or DeepSeek-family models. Migration costs are low thanks to the OpenAI-compatible API, and if it doesn't fit, switching back to vLLM is just a matter of changing one endpoint. The fact that you can test it without much risk is its greatest advantage.
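To keep the switch back as cheap as described, the endpoint can be resolved from configuration instead of being hardcoded. The sketch below uses a hypothetical LLM_BACKEND environment variable and the default ports from the examples above. Neither the variable name nor the helper is part of either framework.

```python
import os
from typing import Mapping

# Default ports used in this article's examples.
BACKEND_URLS = {
    "vllm": "http://localhost:8000/v1",
    "sglang": "http://localhost:30000/v1",
}


def resolve_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Pick the serving endpoint from the (hypothetical) LLM_BACKEND variable."""
    backend = env.get("LLM_BACKEND", "vllm").lower()
    try:
        return BACKEND_URLS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None


print(resolve_base_url({"LLM_BACKEND": "sglang"}))
```

The client is then created with OpenAI(base_url=resolve_base_url(), api_key="EMPTY"), and rolling back becomes a configuration change rather than a code change.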
Three steps you can start right now:
- Install SGLang and run a local server: After pip install "sglang[all]", you can start the server with python -m sglang.launch_server --model-path <your current model> --tp <number of GPUs> --host 0.0.0.0 --port 30000. In environments without Docker, set your HuggingFace token with export HF_TOKEN=<your-token> before starting. Gotcha: In multi-GPU Docker environments, running without --shm-size 32g will produce shared memory errors.
- Change only base_url in your existing client to the SGLang port: Switch to http://localhost:30000/v1 and verify that existing tests still pass as-is. You can confirm whether it works while leaving all other code untouched. Gotcha: It's important not to judge performance based on just the first request without warmup.
- Compare throughput and TTFT with the same workload: Use python -m sglang.bench_serving to simulate your actual traffic patterns (whether RAG context is included, whether it's multi-turn). If the numbers improve meaningfully, moving on to Docker deployment is recommended. Gotcha: 70B+ large models see the gap narrow to 3–5%, so don't apply small model benchmarks to large model expectations.