DEV BAK - TECH BLOG
© 2026 DEV BAK - TECH BLOG. All rights reserved.

Migrating LLM Inference from vLLM to SGLang: Why RAG and Multi-Turn Workloads See a 6x Throughput Difference

Honestly, my first reaction when I came across SGLang was "another new framework?" vLLM was working well enough, and touching a serving stack that's already running stably is always a burden. But while digging into performance bottlenecks in a pipeline processing tens of thousands of RAG requests per day, I looked at SGLang's RadixAttention benchmark numbers and thought, "If this is real, I need to tear apart our current stack."

One thing worth clarifying upfront: the "6x" figure in the title is achieved in RAG and multi-turn architectures where thousands of requests share the same document context. For single-turn workloads where every request is entirely new content, the two frameworks are nearly identical. Understanding which side your service is closer to comes first — this article was written to help you make that judgment. If you have experience with vLLM or are already running LLM serving in production, it's structured so you can follow along right through the code examples.

The key point is this: because both servers expose an OpenAI-compatible API, switching comes down to pointing the client's base_url at the new server, and the rest of the client code stays untouched. The barrier to adoption is low, so if you're currently serving RAG or multi-turn conversations, it's well worth running an experiment.


Core Concepts

Why the KV Cache Becomes a Bottleneck

When an LLM generates tokens, it keeps the key and value tensors that the attention layers computed for every previously processed token (the Key-Value cache, or KV cache) in GPU memory. This avoids recomputing the same content at every step, but the cache grows with every token of every in-flight request, so GPU memory fills up quickly. In a RAG pipeline, each request carries thousands of tokens of document context, so how you manage this cache directly determines serving performance.
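To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is my assumption based on the published Llama-3.1-8B configuration, and the function name, context length, and request count are purely illustrative.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: every layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed shape for Llama-3.1-8B with FP16/BF16 cache values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

context_tokens = 4_000       # illustrative RAG context length
concurrent_requests = 64     # illustrative number of in-flight requests

per_request_mib = per_token * context_tokens / 2**20
total_gib = per_token * context_tokens * concurrent_requests / 2**30

print(f"{per_token // 1024} KiB per token")
print(f"{per_request_mib:.0f} MiB per request")
print(f"{total_gib:.1f} GiB for {concurrent_requests} requests")
```

Under these assumptions, every request's context costs hundreds of MiB on its own, which is why sharing the cache for identical context across requests matters so much.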

vLLM — The Industry Standard Built on PagedAttention

vLLM started at UC Berkeley. Using a technique called PagedAttention, it manages the GPU KV cache in page-sized units, much like an operating system's virtual memory, which reduces memory fragmentation and improves batching efficiency. It quickly became the industry standard after its 2023 debut and currently has the broadest model support and community ecosystem.

PagedAttention: A technique that splits the KV cache in GPU memory into fixed-size pages and allocates them on demand. Because memory is not reserved up front for each request's maximum length, less is wasted and more requests can be processed concurrently.
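The paging idea can be illustrated with a toy allocator. This is a minimal sketch of the concept only, not vLLM's actual implementation; the class name, method names, and the page size of 16 tokens are assumptions chosen for the example.

```python
class PagedKVAllocator:
    """Toy page allocator: maps each request's logical blocks to physical pages."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))          # physical page ids
        self.block_tables: dict[str, list[int]] = {}      # request id -> pages
        self.lengths: dict[str, int] = {}                 # request id -> tokens stored

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token; returns the physical page used."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:                  # no page yet, or last page full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[-1]

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(20):
    alloc.append_token("req-a")
pages_used = len(alloc.block_tables["req-a"])   # 20 tokens fit in 2 pages of 16
alloc.release("req-a")
```

The request only ever holds the pages it has actually filled, and those pages do not need to be contiguous, which is where the reduction in fragmentation comes from.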

SGLang — Pushing Prefix Reuse to the Extreme

SGLang (Structured Generation Language) is a framework built by the LMSYS group, and two core technologies create its differentiating edge.

RadixAttention: Manages the KV cache using a Radix Tree data structure. When multiple requests share the same prefix (system prompt, RAG context, conversation history), it automatically reuses the cached entries for that prefix instead of recomputing them. In RAG pipelines, thousands of requests often prepend the same document context, so recomputing it every time is an obvious waste. vLLM also offers prefix caching, but SGLang's radix tree keeps every cached sequence in one structure and finds the longest cached prefix of each new request automatically, which lets it reuse shared prefixes more aggressively.

Radix Tree: A compact prefix tree in which sequences that share a common prefix share the same path from the root. SGLang applies this to token sequences, so requests with a common prefix share the same KV cache entries.
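Here is a minimal sketch of the prefix-matching idea, using a plain token-level prefix tree. It is not SGLang's implementation: a real radix tree also merges single-child chains into one edge and handles eviction, both omitted here, and the class name and token ids are made up for illustration.

```python
class PrefixCache:
    """Toy token-level prefix tree standing in for a radix tree."""

    def __init__(self):
        self.root: dict = {}

    def match_prefix(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV entries are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV entries for this token sequence are now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system_and_docs = [101, 7, 7, 42, 9]           # stand-in ids for a shared RAG context
cache.insert(system_and_docs + [500, 501])     # first request: context + question A

second = system_and_docs + [600]               # second request: same context, question B
reused = cache.match_prefix(second)            # tokens that can skip prefill
to_compute = len(second) - reused              # tokens that still need prefill
```

The second request only has to prefill its own question, because the whole shared context is found along an existing path in the tree.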

XGrammar-based Structured Output: For JSON/Regex/EBNF constrained decoding, SGLang uses XGrammar. Two separate numbers are quoted for it, and they measure different stages: grammar compilation is 80x faster than standard methods, while actual JSON token decoding throughput is roughly 3x higher than the standard approach. It's worth not conflating the two. The difference is significant in agent and tool-call pipelines where JSON responses must be enforced.
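For reference, this is roughly what a schema-constrained request looks like from the client side. It is a sketch under assumptions: the response_format shape follows the OpenAI chat completions API, which I understand SGLang's OpenAI-compatible endpoint accepts, but check it against the version you run. The schema and the helper names build_json_request and ask_json are hypothetical.

```python
import json

# Hypothetical schema for a tool-call style answer.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}


def build_json_request(model: str, question: str) -> dict:
    """Keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": ANSWER_SCHEMA},
        },
    }


def ask_json(client, model: str, question: str) -> dict:
    """Send the constrained request and parse the JSON reply."""
    response = client.chat.completions.create(**build_json_request(model, question))
    return json.loads(response.choices[0].message.content)
```

Here client would be an openai.OpenAI instance pointed at the serving endpoint. Because decoding is constrained by the schema, the reply should parse with json.loads without retry logic.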

Comparing the Two Frameworks

The row that caught my eye first in this table was TTFT. TTFT (Time To First Token) is the time from when a request is sent until the first token of the response comes back. A 23% improvement (103ms to 79ms) looks small in absolute terms, but at p95 latency the difference was noticeable in practice.

| Item | vLLM | SGLang |
| --- | --- | --- |
| KV cache management | PagedAttention | RadixAttention (prefix reuse) |
| Structured output | Basic support | XGrammar (~3x JSON throughput) |
| Model support range | Broader | Focused on major models |
| Hardware compatibility | Broad, including AMD ROCm | NVIDIA-centric, expanding |
| OpenAI API compatibility | Supported | Supported |
| First request latency | Consistent | Stable after warmup |
| TTFT (multi-turn p95) | 103ms | 79ms (23% improvement) |
| Official PyTorch integration | - | March 2025 |

Official PyTorch Integration (March 2025): SGLang officially joined the PyTorch ecosystem. Being recognized as an ecosystem project rather than just a third-party project is a reasonable signal of long-term maintenance and continuity.
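If you want to check the TTFT row against your own service, a streaming request is enough. The sketch below is one way to do it, not the method behind the numbers in the table; measure_ttft and openai_ttft are illustrative names, and the chunk layout assumes the OpenAI Python client's streaming chat completions.

```python
import time
from typing import Callable, Iterable, Optional


def measure_ttft(start_stream: Callable[[], Iterable],
                 get_text: Callable[[object], object],
                 clock: Callable[[], float] = time.perf_counter) -> Optional[float]:
    """Seconds from issuing the request until the first non-empty chunk."""
    start = clock()                      # start the timer before the request is sent
    for chunk in start_stream():
        if get_text(chunk):              # skip role-only or empty chunks
            return clock() - start
    return None                          # stream ended without producing any text


def openai_ttft(client, model: str, prompt: str) -> Optional[float]:
    """TTFT for one streamed chat completion via an OpenAI-compatible client."""
    return measure_ttft(
        lambda: client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ),
        lambda chunk: chunk.choices and chunk.choices[0].delta.content,
    )
```

Passing the request as a callable keeps connection setup inside the timed region, and the injectable clock makes the helper easy to unit test.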


Practical Application

Here's the example flow upfront: drop-in replacement to verify behavior → Docker for production setup → optimize for RAG workloads → confirm numbers with benchmarks. Feel free to jump directly to whichever example fits your situation.

Example 1: Drop-in Replacement — Changing Only the Endpoint

The first thing I wanted to verify was "can you really avoid changing the code?" In practice, the only server-side change is the startup command. If you're serving a large MoE-architecture model like DeepSeek V3, launching with --model-path deepseek-ai/DeepSeek-V3 --tp 8 will automatically apply SGLang's MLA optimization kernels.

bash
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2
 
# Start SGLang server (same model, same API format)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2 \
  --host 0.0.0.0 \
  --port 30000

In the client code, only the base_url needs to change.

python
from openai import OpenAI
 
# Before: client pointed at vLLM (default port 8000)
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
 
# After: the same client pointed at SGLang; only base_url differs
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

| Changed Item | vLLM | SGLang |
| --- | --- | --- |
| Server launch module | vllm.entrypoints.openai.api_server | sglang.launch_server |
| TP argument name | --tensor-parallel-size | --tp |
| Default port | 8000 | 30000 (configurable) |
| Client changes | - | Modify base_url only |
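Since base_url is the only thing that changes, one option is to move it into configuration so that switching back and forth needs no code edit at all. This is a small sketch of that idea; the LLM_BACKEND variable name and the resolve_base_url helper are hypothetical, and the ports are the defaults used in this article.

```python
import os

# Hypothetical mapping; ports follow the defaults used in this article.
BACKEND_URLS = {
    "vllm": "http://localhost:8000/v1",
    "sglang": "http://localhost:30000/v1",
}


def resolve_base_url(env=os.environ) -> str:
    """Pick the serving endpoint from LLM_BACKEND (defaults to vLLM)."""
    backend = env.get("LLM_BACKEND", "vllm").lower()
    if backend not in BACKEND_URLS:
        raise ValueError(f"unknown LLM_BACKEND: {backend!r}")
    return BACKEND_URLS[backend]
```

The client is then created with OpenAI(base_url=resolve_base_url(), api_key="EMPTY"), and rolling back to vLLM is a matter of unsetting one environment variable.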

Tensor Parallelism (TP): A method that distributes the model's weight matrices across multiple GPUs for parallel processing. --tp 4 splits the model across 4 GPUs. In SGLang, --tp is the most stable option for multi-GPU operation.
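To show what "splitting the weight matrices" means, here is a tiny pure-Python simulation of a column-parallel matrix multiply. It is only a conceptual sketch: real tensor parallelism runs the shards on separate GPUs and combines results with collective communication, and all function names here are illustrative.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, tp):
    """Split w column-wise into tp equal shards, one per simulated GPU."""
    cols = len(w[0])
    assert cols % tp == 0, "column count must be divisible by tp"
    step = cols // tp
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(tp)]


def tensor_parallel_matmul(x, w, tp):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(w, tp):   # on real hardware these run in parallel
        out.extend(matmul(x, shard))     # stands in for gathering the partial outputs
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, tp=2)
```

Both paths produce the same output vector, but each simulated device only ever holds half of the weight matrix.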

Example 2: Serving SGLang with Docker

In production environments, Docker is much more convenient. There's an official image, so you can bring it up without any environment setup.

bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<your-token>" \
  --ipc=host \
  lmsysorg/sglang:latest-runtime \
  python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

--shm-size 32g prevents errors caused by insufficient shared memory during multi-GPU tensor parallel processing. I ran it without this option the first time and spent a long time troubleshooting. In environments without Docker, set the environment variable with export HF_TOKEN=<your-token> before starting the server.
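After the container starts, loading the model weights can take a while before the server answers requests. A small polling helper like the sketch below can gate warmup or real traffic on readiness. The /health path in the usage note is my assumption about the server's health endpoint, so confirm it for the SGLang version you run; http_ok and wait_until_ready are illustrative names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if a GET to url returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, attempts: int = 60, delay: float = 5.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll url until probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:       # no need to sleep after the last attempt
            sleep(delay)
    return False
```

A typical call would be wait_until_ready("http://localhost:30000/health"), run before sending any traffic to the container.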

Example 3: Maximizing RadixAttention Effect in a RAG Pipeline

In RAG workloads, thousands of requests share the same document context. The thousands of tokens of documents that go into SHARED_CONTEXT in the code below are the primary reuse target for RadixAttention. The more concurrent requests there are, and the longer the shared prefix, the wider the performance gap with vLLM grows.

python
import asyncio
from openai import AsyncOpenAI
 
client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
SHARED_CONTEXT = """
You are a document analysis assistant.
[Document content — thousands of tokens of RAG context]
"""
 
async def query_with_rag(user_question: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SHARED_CONTEXT},  # target for prefix reuse
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content
 
async def main():
    # Questions asked from various angles about the same document; this pattern maximizes RadixAttention effectiveness
    questions = [
        "What is the central argument of this document?",
        "Please summarize the evidence the author presents.",
        "Summarize the content of the third paragraph in one sentence.",
        "What limitations are mentioned in this document?",
        "Please explain the conclusion in plain language.",
    ]
    # Send all questions concurrently so the server can reuse the shared prefix across them
    results = await asyncio.gather(*[query_with_rag(q) for q in questions])
    return results

if __name__ == "__main__":
    for answer in asyncio.run(main()):
        print(answer)
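A quick bit of arithmetic shows why the gap widens with more requests and longer shared prefixes. The sketch below counts only prefill tokens under idealized assumptions (the shared prefix stays cached and is never evicted); the function name and the numbers are illustrative, not measurements.

```python
def prefill_tokens(num_requests: int, shared_prefix: int, unique_suffix: int,
                   prefix_cached: bool) -> int:
    """Tokens that must go through prefill for a batch of requests."""
    if prefix_cached:
        # The shared prefix is computed once; every request still pays for its own suffix.
        return shared_prefix + num_requests * unique_suffix
    return num_requests * (shared_prefix + unique_suffix)


# Illustrative numbers: 1,000 questions over one 4,000-token document, about 50 tokens each.
without_reuse = prefill_tokens(1_000, 4_000, 50, prefix_cached=False)
with_reuse = prefill_tokens(1_000, 4_000, 50, prefix_cached=True)
reduction = without_reuse / with_reuse

print(f"prefill tokens without reuse: {without_reuse:,}")
print(f"prefill tokens with reuse:    {with_reuse:,}")
print(f"prefill work reduced by a factor of {reduction:.0f}")
```

This is an upper bound on prefill savings only. Decoding the answers costs the same either way, so end-to-end throughput improves by a smaller factor than the prefill reduction suggests.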

Example 4: Server Warmup and Benchmark Measurement

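SGLang needs a few requests to reach steady state: the RadixAttention cache starts empty, and a shared prefix becomes reusable only after the first request that carries it has been processed. Measuring performance immediately after server startup can therefore produce numbers lower than steady-state performance, so it's good practice to send a few dummy requests beforehand.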

python
import asyncio
from openai import AsyncOpenAI
 
async def warmup_server(base_url: str, model: str, n: int = 5) -> None:
    client = AsyncOpenAI(base_url=base_url, api_key="EMPTY")
    tasks = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        for _ in range(n)
    ]
    await asyncio.gather(*tasks)
    print(f"Warmup complete ({n} requests)")
 
# Run after server startup, before actual traffic
asyncio.run(
    warmup_server("http://localhost:30000/v1", "meta-llama/Llama-3.1-8B-Instruct")
)

Benchmarks can be measured using SGLang's built-in bench_serving script.

bash
# SGLang built-in benchmark
python -m sglang.bench_serving \
  --backend sglang \
  --host localhost \
  --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 10
 
# For comparison with vLLM — change only backend and port
python -m sglang.bench_serving \
  --backend vllm \
  --host localhost \
  --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 10
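If you also collect your own latency samples alongside the benchmark script, tail percentiles such as p95 are more telling than averages. Below is a minimal nearest-rank percentile sketch; the sample values are made up for illustration and are not measured data.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative TTFT samples in milliseconds (not measured data).
ttft_ms = [72, 75, 78, 80, 81, 83, 85, 90, 110, 240]
p50 = percentile(ttft_ms, 50)
p95 = percentile(ttft_ms, 95)

print(f"p50 = {p50} ms, p95 = {p95} ms")
```

A single slow outlier barely moves the median but dominates p95, which is why the two frameworks can look similar on average and still feel different at the tail.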

Pros and Cons Analysis

Advantages

| Item | Details |
| --- | --- |
| RAG/multi-turn throughput | Up to 6.4x improvement over vLLM (when prefix reuse rate is high) |
| Small model general throughput | ~29% advantage on H100 |
| DeepSeek/MoE inference | Built-in MLA optimization kernel, 3.1x faster than vLLM |
| Structured output | JSON decoding throughput ~3x vs. standard approach |
| TTFT (multi-turn p95) | vLLM 103ms → SGLang 79ms (23% improvement) |
| Drop-in replacement | No client code changes needed with OpenAI-compatible API |
| Kubernetes support | Official Helm Chart provided |
| Quantization | FP4/FP8/INT4/AWQ/GPTQ support |
| Ecosystem | Official PyTorch integration in March 2025 |

MLA (Multi-head Latent Attention): An attention variant introduced by DeepSeek that compresses the KV cache into low-dimensional latent vectors to reduce memory. SGLang has built-in kernel optimizations specialized for this architecture, outperforming vLLM in both memory efficiency and inference speed when combined with MoE architectures.
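To get a feel for how much MLA shrinks the cache, compare the number of values cached per token per layer. The shape below (128 heads of dimension 128, latent rank 512, 64-dimensional positional key) is my assumption based on the published DeepSeek-V3 configuration, and the full multi-head baseline is a hypothetical comparison point rather than a real model variant.

```python
def mha_cache_values_per_token(num_heads: int, head_dim: int) -> int:
    """Values cached per token per layer when full keys and values are stored."""
    return 2 * num_heads * head_dim


def mla_cache_values_per_token(kv_lora_rank: int, rope_head_dim: int) -> int:
    """Values cached per token per layer with MLA: one compressed latent
    vector plus a small positional (RoPE) key shared across heads."""
    return kv_lora_rank + rope_head_dim


# Assumed DeepSeek-V3 shape: 128 heads, 128-dim heads, latent rank 512, 64-dim RoPE key.
mha = mha_cache_values_per_token(num_heads=128, head_dim=128)
mla = mla_cache_values_per_token(kv_lora_rank=512, rope_head_dim=64)
ratio = mha / mla

print(f"full multi-head cache: {mha} values per token per layer")
print(f"MLA cache:             {mla} values per token per layer")
print(f"roughly {ratio:.0f}x smaller")
```

A smaller cache per token means more concurrent requests fit in the same GPU memory, but the compressed layout needs attention kernels written for it, which is where framework-specific MLA kernels come in.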

Disadvantages and Caveats

| Item | Details | Mitigation |
| --- | --- | --- |
| Warmup required | Initial requests spend time loading RadixAttention cache | Send warmup requests after server startup (see Example 4) |
| DP instability | High performance variability when using --dp flag | Recommend using --tp (Tensor Parallelism) instead |
| Mixed-bit quantization | Mixed-bit quantization causes some conflicts | Use a single quantization scheme like FP8/AWQ |
| MoE gate layer | mlp.gate quantization kernel unsupported in some cases | Exclude that layer from quantization or check latest version |
| Large model gap | Performance gap narrows to 3–5% for 70B+ models | Judge based on workload characteristics for large models |
| Model support range | Non-mainstream architectures may be unsupported | Check official model list beforehand |
| Hardware compatibility | vLLM has the edge for AMD ROCm stability | Recommend keeping vLLM for AMD environments |
| Single-turn independent requests | No prefix reuse, almost no performance difference | Understand workload characteristics before deciding |

Data Parallelism (DP): A method that places identical model copies on multiple GPUs and distributes requests among them. In SGLang, using --dp currently shows high performance variability and is not recommended. In multi-GPU environments, it's better to start by tuning --tp first.

The Most Common Mistakes in Practice

  1. Judging by first-response latency without warmup — SGLang needs time to build up the RadixAttention cache during the first few requests. Concluding "no difference from vLLM" after just one request is a misjudgment. Run the warmup code from Example 4 first and look at numbers after sufficient traffic has passed through.

  2. Expecting 6x from single-turn independent request workloads — If every request is entirely new content, RadixAttention's prefix reuse won't occur. In this case, the performance of the two frameworks is nearly identical. It's worth first understanding how much prefix sharing occurs in your workload.

  3. Using the --dp flag instead of --tp — When utilizing multiple GPUs, Data Parallelism may feel more familiar, but --tp is currently much more stable in SGLang. For multi-GPU setups, it's recommended to tune --tp values before reaching for --dp.
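For the second mistake, you can estimate how much prefix sharing your workload has from a sample of logged prompts before touching the serving stack. The sketch below is a rough estimate under stated assumptions: whitespace splitting stands in for the model's tokenizer, the cache is treated as unbounded, and the function names and sample prompts are illustrative.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_reuse_ratio(prompts: list[str]) -> float:
    """Fraction of prompt tokens that repeat a prefix of an earlier prompt.

    Whitespace splitting stands in for the model's tokenizer, and the cache
    is assumed to be unbounded (nothing is ever evicted).
    """
    seen: list[list[str]] = []
    reused = total = 0
    for prompt in prompts:
        tokens = prompt.split()
        reused += max((common_prefix_len(tokens, old) for old in seen), default=0)
        total += len(tokens)
        seen.append(tokens)
    return reused / total if total else 0.0


sample = [
    "DOC alpha beta gamma Q1",
    "DOC alpha beta gamma Q2",
    "unrelated single turn request",
]
ratio = prefix_reuse_ratio(sample)
print(f"estimated prefix reuse: {ratio:.0%}")
```

A ratio near zero suggests the workload is closer to the single-turn case, where the two frameworks perform about the same. The pairwise comparison is quadratic in the number of prompts, so run it on a sample rather than a full day of logs.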


Closing Thoughts

Migrating to SGLang is well worth trying if even one of four keywords applies to you: RAG, multi-turn, structured output, or DeepSeek-family models. Migration costs are low thanks to the OpenAI-compatible API, and if it doesn't fit, switching back to vLLM is just a matter of changing one endpoint. The fact that you can test it without much risk is its greatest advantage.

Three steps you can start right now:

  1. Install SGLang and run a local server — After pip install "sglang[all]" (the quotes keep shells such as zsh from interpreting the brackets), you can start the server with python -m sglang.launch_server --model-path <your current model> --tp <number of GPUs> --host 0.0.0.0 --port 30000. In environments without Docker, set your HuggingFace token with export HF_TOKEN=<your-token> before starting. Gotcha: In multi-GPU Docker environments, running without --shm-size 32g will produce shared memory errors.

  2. Change only base_url in your existing client to the SGLang port — Switch to http://localhost:30000/v1 and verify that existing tests still pass as-is. You can confirm whether it works while leaving all other code untouched. Gotcha: It's important not to judge performance based on just the first request without warmup.

  3. Compare throughput and TTFT with the same workload — Use python -m sglang.bench_serving to simulate your actual traffic patterns (whether RAG context is included, whether it's multi-turn). If the numbers improve meaningfully, moving on to Docker deployment is recommended. Gotcha: 70B+ large models see the gap narrow to 3–5%, so don't apply small model benchmarks to large model expectations.

