Migrating LLM Inference from vLLM to SGLang: Why RAG and Multi-Turn Workloads See a 6x Throughput Difference

Honestly, my first reaction when I came across SGLang was "another new framework?" vLLM was working well enough, and touching a serving stack that's already running stably is always a burden. But while digging into performance bottlenecks in a pipeline processing tens of thousands of RAG requests per day, I looked at SGLang's RadixAttention benchmark numbers and thought, "If this is real, I need to tear apart our current stack."

One thing worth clarifying upfront: the "6x" figure in the title is achieved in RAG and multi-turn architectures where thousands of requests share the same document context. For single-turn workloads where every request is entirely new content, the two frameworks are nearly identical. Understanding which side your service is closer to comes first — this article was written to help you make that judgment. If you have experience with vLLM or are already running LLM serving in production, it's structured so you can follow along right through the code examples.

The key point is this: because both servers expose an OpenAI-compatible API, the client-side change is a single base_url, and the rest of the client code stays untouched. The barrier to adoption is low, so if you're currently serving RAG or multi-turn conversations, it's well worth running an experiment.

Core Concepts

Why the KV Cache Becomes a Bottleneck

When an LLM generates tokens, it accumulates the attention computation results for previously processed tokens (Key-Value cache, or KV cache) in GPU memory. This is to avoid recomputing the same content, but the problem is that as this cache grows heavy, GPU memory fills up quickly. In a RAG pipeline, each request carries thousands of tokens of document context, so how you manage this directly determines serving performance.
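To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. The layer count, KV-head count, and head dimension are the Llama-3.1-8B configuration values as I understand them; the 4,000-token context and FP16 storage are assumptions chosen purely for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor 2 accounts for storing both a Key and a Value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
context_tokens = 4_000  # assumed size of one RAG context
per_request_mib = per_token * context_tokens / 2**20

print(f"{per_token // 1024} KiB per token")                  # 128 KiB per token
print(f"{per_request_mib:.0f} MiB per 4,000-token context")  # 500 MiB per 4,000-token context
```

Under these assumptions a single request's context takes about half a gigabyte of cache, which is why recomputing or duplicating it for every request adds up quickly.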

vLLM — The Industry Standard Built on PagedAttention

vLLM is a project that started at UC Berkeley, and it manages the GPU KV cache in page-sized units like an operating system's virtual memory, using a technique called PagedAttention. This reduces memory fragmentation and improves batching efficiency. It quickly became the industry standard after its 2023 debut and currently has the broadest model support and community ecosystem.

PagedAttention: A technique that divides and manages the KV cache in GPU memory into fixed-size pages. It allows more requests to be processed concurrently without memory waste.
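The bookkeeping idea behind paging can be sketched in a few lines. This is a toy model under my own assumptions (the class and method names are hypothetical and no GPU memory is involved), not vLLM's actual implementation: each request holds a table of page ids and takes a new page only when its current one is full.

```python
class PagedKVAllocator:
    """Toy model of page-based KV cache bookkeeping (no real GPU memory)."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # request id -> list of page ids
        self.token_counts = {}   # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, taking a new page only when needed."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % self.page_size == 0:  # current page is full, or no page yet
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.token_counts[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(40):  # a 40-token request needs ceil(40 / 16) = 3 pages
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]), len(alloc.free_pages))  # 3 5
alloc.release("req-a")
print(len(alloc.free_pages))  # 8
```

Because pages are fixed-size and handed out on demand, at most one partially filled page per request is wasted, which is the fragmentation benefit described above.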

SGLang — Pushing Prefix Reuse to the Extreme

SGLang (Structured Generation Language) is a framework built by the LMSYS group, and two core technologies create its differentiating edge.

RadixAttention: Manages the KV cache using a Radix Tree data structure. When multiple requests share the same prefix (system prompt, RAG context, conversation history), it automatically reuses the cached KV entries for that prefix instead of recomputing them. In RAG pipelines, thousands of requests often prepend the same document context, so recomputing it every time is an obvious waste. vLLM also offers prefix caching, but SGLang keeps every cached sequence in a single Radix Tree, so it can find the longest shared prefix of any new request and reuse exactly that part.

Radix Tree: A tree data structure that shares common prefixes among strings. SGLang applies this to sharing KV cache for common prefix token sequences.
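The lookup idea can be illustrated with a deliberately simplified sketch. Note the swap: this is an uncompressed trie keyed by token id, not a true radix tree (which additionally merges single-child chains into one edge), and it stores no KV tensors. The class name and token values are hypothetical.

```python
class PrefixCache:
    """Toy prefix cache: an uncompressed trie keyed by token id."""

    def __init__(self) -> None:
        self.root = {}

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.get(tok)
            if child is None:
                # First miss: from here on every node is new, so nothing else can match.
                child = node[tok] = {}
            else:
                matched += 1
            node = child
        return matched


cache = PrefixCache()
shared = list(range(1000))  # stands in for a tokenized shared document

print(cache.match_and_insert(shared + [7001, 7002]))  # 0 (cold cache)
print(cache.match_and_insert(shared + [8001]))        # 1000 (whole document reused)
```

In a real server the matched length is the number of tokens whose attention state does not need to be recomputed for the new request.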

XGrammar-based Structured Output: For JSON, regex, and EBNF constrained decoding, SGLang uses XGrammar. Two separate numbers are quoted for it: grammar compilation is about 80x faster than the standard approach, and JSON token decoding throughput is roughly 3x higher. These measure different stages (compilation and decoding), so they should not be conflated. The difference matters in agent and tool-call pipelines where JSON responses must be enforced.
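From the client side, constrained JSON output is requested through the response_format field of the OpenAI-compatible API. The following is a minimal sketch, assuming the server accepts the json_schema form of response_format; the schema, the helper names build_request and ask, and the port are my own illustrative choices.

```python
import json

# Hypothetical schema for a tool-call style answer; adjust to your own fields.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}


def build_request(question: str, model: str) -> dict:
    """Build keyword arguments for chat.completions.create with a JSON schema constraint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": ANSWER_SCHEMA},
        },
    }


def ask(question: str, base_url: str = "http://localhost:30000/v1") -> dict:
    """Send the request to a running server and parse the constrained JSON reply."""
    from openai import OpenAI  # imported lazily so the helpers above work without it

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    kwargs = build_request(question, "meta-llama/Llama-3.1-8B-Instruct")
    response = client.chat.completions.create(**kwargs)
    return json.loads(response.choices[0].message.content)
```

With a server running, ask("Is the sky blue?") should return a dict with the answer and confidence keys, since the decoder is only allowed to emit tokens that keep the output valid against the schema.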

Comparing the Two Frameworks

The first thing that caught my eye in this table was the TTFT row. TTFT (Time To First Token) is the time from sending a request until the first token of the response comes back. A 23% reduction looks small on paper, but at p95 latency it felt quite different in practice.

Item	vLLM	SGLang
KV cache management	PagedAttention	RadixAttention (prefix reuse)
Structured output	Basic support	XGrammar (~3x JSON throughput)
Model support range	Broader	Focused on major models
Hardware compatibility	Broad, including AMD ROCm	NVIDIA-centric, expanding
OpenAI API compatibility	Supported	Supported
First request latency	Consistent	Stable after warmup
TTFT (multi-turn p95)	103ms	79ms (23% improvement)
Official PyTorch integration	—	March 2025

Official PyTorch Integration (March 2025): SGLang was officially included in the PyTorch ecosystem. This marks it as a recognized member of that ecosystem rather than just a third-party project, which makes long-term maintenance and continuity a more reasonable expectation.
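Returning to the TTFT row of the comparison table: the 23% figure and a p95 value can be reproduced from raw latency samples with a few lines. This sketch uses the nearest-rank percentile definition, which is my choice and may differ from what a given benchmark tool reports; the sample list is made up for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, in percent."""
    return (before_ms - after_ms) / before_ms * 100


print(f"{improvement_pct(103, 79):.1f}%")  # 23.3%

# Made-up TTFT samples in milliseconds, for illustration only.
ttft_ms = [61, 64, 66, 70, 71, 72, 74, 75, 77, 79,
           80, 82, 85, 88, 90, 93, 95, 99, 120, 140]
print(percentile(ttft_ms, 95))  # 120
```

The made-up samples also show why p95 is worth tracking separately: the median here is around 80ms while the tail is 50% higher.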

Practical Application

Here's the example flow upfront: drop-in replacement to verify behavior → Docker for production setup → optimize for RAG workloads → confirm numbers with benchmarks. Feel free to jump directly to whichever example fits your situation.

Example 1: Drop-in Replacement — Changing Only the Endpoint

The first thing I wanted to verify was "can you really avoid changing the code?" In practice, the server side only needs a different startup command. If you're serving a large MoE model such as DeepSeek V3, swapping the arguments to --model-path deepseek-ai/DeepSeek-V3 --tp 8 automatically applies SGLang's MLA optimization kernels.

bash

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2
 
# Start SGLang server (same model, same API format)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2 \
  --host 0.0.0.0 \
  --port 30000

In the client code, only the base_url needs to change.

python

from openai import OpenAI
 
# Before: the client pointed at the vLLM server (default port 8000)
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# After: only base_url changes, now pointing at the SGLang server (port 30000)
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

Changed Item	vLLM	SGLang
Server launch module	`vllm.entrypoints.openai.api_server`	`sglang.launch_server`
TP argument name	`--tensor-parallel-size`	`--tp`
Default port	8000	30000 (configurable)
Client changes	—	Modify `base_url` only

Tensor Parallelism (TP): A method that distributes the model's weight matrices across multiple GPUs for parallel processing. --tp 4 splits the model across 4 GPUs. In SGLang, --tp is the most stable option for multi-GPU operation.
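Conceptually, splitting a weight matrix across GPUs looks like the sketch below. It is a pure-Python toy under my own assumptions: it shows only a column-wise split of one matrix with two simulated devices, whereas real tensor-parallel implementations combine several split patterns and use collective communication between GPUs.

```python
def matmul(x: list[list[float]], w: list[list[float]]) -> list[list[float]]:
    """Plain matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Split a weight matrix into `parts` equal column shards, one per simulated GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across shards"
    width = n // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]                       # activations, replicated on every GPU
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]   # full weight matrix (2 x 4)

shards = split_columns(w, parts=2)                 # each simulated GPU holds half the columns
partials = [matmul(x, shard) for shard in shards]  # each GPU computes its slice of the output
# Concatenate the slices row by row, standing in for the gather step between GPUs.
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]

print(combined == matmul(x, w))  # True
```

Each shard holds half the weights, which is why TP lets a model that does not fit on one GPU run across several.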

Example 2: Serving SGLang with Docker

In production environments, Docker is much more convenient. There's an official image, so you can bring the server up without installing SGLang and its dependencies on the host.

bash

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<your-token>" \
  --ipc=host \
  lmsysorg/sglang:latest-runtime \
  python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

--shm-size 32g prevents errors caused by insufficient shared memory during multi-GPU tensor parallel processing. I ran it without this option the first time and spent a long time troubleshooting. In environments without Docker, set the environment variable with export HF_TOKEN=<your-token> before starting the server.
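After the container starts, model loading can take a while before the port answers. Here is a minimal sketch for waiting until the server is reachable, assuming it serves the OpenAI-compatible /v1/models route on port 30000 as configured above. The function name and the timeout values are my own choices.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet, or still returning errors
        time.sleep(interval_s)
    return False


# Example call once the container is running:
#   wait_until_ready("http://localhost:30000/v1/models")
```

Running this before warmup or benchmarks avoids mistaking "still loading weights" for a failed deployment.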

Example 3: Maximizing RadixAttention Effect in a RAG Pipeline

In RAG workloads, thousands of requests share the same document context. The thousands of tokens of documents that go into SHARED_CONTEXT in the code below are the primary reuse target for RadixAttention. The more concurrent requests there are, and the longer the shared prefix, the wider the performance gap with vLLM grows.

python

import asyncio
from openai import AsyncOpenAI
 
client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
SHARED_CONTEXT = """
You are a document analysis assistant.
[Document content — thousands of tokens of RAG context]
"""
 
async def query_with_rag(user_question: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SHARED_CONTEXT},  # target for prefix reuse
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content
 
async def main() -> list[str]:
    # Questions asked from various angles about the same document.
    # Every request starts with the same system message, so this pattern
    # maximizes RadixAttention effectiveness.
    questions = [
        "What is the central argument of this document?",
        "Please summarize the evidence the author presents.",
        "Summarize the content of the third paragraph in one sentence.",
        "What limitations are mentioned in this document?",
        "Please explain the conclusion in plain language.",
    ]
    # Send all questions concurrently; results come back in question order
    results = await asyncio.gather(*[query_with_rag(q) for q in questions])
    return results

if __name__ == "__main__":
    for answer in asyncio.run(main()):
        print(answer)
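Prefix reuse only applies to the leading part of a prompt that is identical across requests, which is why Example 3 puts the shared document in the system message, ahead of the question. The sketch below compares two prompt layouts using a character-level common prefix as a rough proxy for the token-level prefix; the function name and the stand-in document are hypothetical.

```python
import os

DOCUMENT = "[Document content] " * 200  # stand-in for a long shared RAG context


def shared_prefix_ratio(prompts: list[str]) -> float:
    """Fraction of the average prompt length covered by the prefix common to all prompts."""
    common = os.path.commonprefix(prompts)
    average_len = sum(len(p) for p in prompts) / len(prompts)
    return len(common) / average_len


questions = ["What is the central argument?", "What limitations are mentioned?"]

# Layout A: shared document first, per-request text last.
context_first = [f"{DOCUMENT}\nQuestion: {q}" for q in questions]
# Layout B: per-request text first, which ends the common prefix almost immediately.
question_first = [f"Question: {q}\n{DOCUMENT}" for q in questions]

print(f"context first:  {shared_prefix_ratio(context_first):.2f}")
print(f"question first: {shared_prefix_ratio(question_first):.2f}")
```

Layout A shares almost the entire prompt, while Layout B shares almost nothing even though both contain the same document. Anything that varies per request (timestamps, user ids, the question itself) belongs after the shared context.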

Example 4: Server Warmup and Benchmark Measurement

SGLang needs the first few requests to warm up, and the RadixAttention cache for a given context only fills once a request carrying that context has been processed. Measuring performance immediately after server startup can therefore produce lower-than-actual numbers, so it's good practice to send a few dummy requests beforehand.

python

import asyncio
from openai import AsyncOpenAI
 
async def warmup_server(base_url: str, model: str, n: int = 5) -> None:
    client = AsyncOpenAI(base_url=base_url, api_key="EMPTY")
    tasks = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        for _ in range(n)
    ]
    await asyncio.gather(*tasks)
    print(f"Warmup complete ({n} requests)")
 
# Run after server startup, before actual traffic
asyncio.run(
    warmup_server("http://localhost:30000/v1", "meta-llama/Llama-3.1-8B-Instruct")
)

Benchmarks can be measured using SGLang's built-in bench_serving script.

bash

# SGLang built-in benchmark
python -m sglang.bench_serving \
  --backend sglang \
  --host localhost \
  --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 10
 
# For comparison with vLLM — change only backend and port
python -m sglang.bench_serving \
  --backend vllm \
  --host localhost \
  --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 10

Pros and Cons Analysis

Advantages

Item	Details
RAG/multi-turn throughput	Up to 6.4x improvement over vLLM (when prefix reuse rate is high)
Small model general throughput	~29% advantage on H100
DeepSeek/MoE inference	Built-in MLA optimization kernel, 3.1x faster than vLLM
Structured output	JSON decoding throughput ~3x vs. standard approach
TTFT (multi-turn p95)	vLLM 103ms → SGLang 79ms (23% improvement)
Drop-in replacement	No client code changes needed with OpenAI-compatible API
Kubernetes support	Official Helm Chart provided
Quantization	FP4/FP8/INT4/AWQ/GPTQ support
Ecosystem	Official PyTorch integration in March 2025

MLA (Multi-head Latent Attention): An attention variant introduced by DeepSeek that compresses the KV cache into low-dimensional latent vectors to reduce memory. SGLang has built-in kernel optimizations specialized for this architecture, outperforming vLLM in both memory efficiency and inference speed when combined with MoE architectures.
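The size of the compression can be estimated with simple arithmetic. The dimensions below (128 attention heads of size 128, a 512-dimensional KV latent, and a 64-dimensional shared RoPE key) are what I understand the DeepSeek-V3 configuration to be; treat them as assumptions and check the model's config file before relying on the ratio.

```python
def mha_values_per_token(num_heads: int, head_dim: int) -> int:
    """Values cached per token per layer when full Keys and Values are stored for every head."""
    return 2 * num_heads * head_dim


def mla_values_per_token(kv_latent_dim: int, rope_dim: int) -> int:
    """Values cached per token per layer with MLA: one compressed latent plus a shared RoPE key."""
    return kv_latent_dim + rope_dim


# Assumed DeepSeek-V3-like dimensions (see the note above).
mha = mha_values_per_token(num_heads=128, head_dim=128)
mla = mla_values_per_token(kv_latent_dim=512, rope_dim=64)

print(mha, mla, round(mha / mla, 1))  # 32768 576 56.9
```

Under these assumptions the cache per token shrinks by more than 50x compared with storing full per-head Keys and Values, which is the memory saving the definition above refers to.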

Disadvantages and Caveats

Item	Details	Mitigation
Warmup required	Initial requests spend time loading RadixAttention cache	Send warmup requests after server startup (see Example 4)
DP instability	High performance variability when using `--dp` flag	Recommend using `--tp` (Tensor Parallelism) instead
Mixed-bit quantization	Mixed-bit quantization causes some conflicts	Use a single quantization scheme like FP8/AWQ
MoE gate layer	`mlp.gate` quantization kernel unsupported in some cases	Exclude that layer from quantization or check latest version
Large model gap	Performance gap narrows to 3–5% for 70B+ models	Judge based on workload characteristics for large models
Model support range	Non-mainstream architectures may be unsupported	Check official model list beforehand
Hardware compatibility	vLLM has the edge for AMD ROCm stability	Recommend keeping vLLM for AMD environments
Single-turn independent requests	No prefix reuse, almost no performance difference	Understand workload characteristics before deciding

Data Parallelism (DP): A method that places identical model copies on multiple GPUs and distributes requests among them. In SGLang, using --dp currently shows high performance variability and is not recommended. In multi-GPU environments, it's better to start by tuning --tp first.

The Most Common Mistakes in Practice

Judging by first-response latency without warmup — SGLang needs time to build up the RadixAttention cache during the first few requests. Concluding "no difference from vLLM" after just one request is a misjudgment. Run the warmup code from Example 4 first and look at numbers after sufficient traffic has passed through.
Expecting 6x from single-turn independent request workloads — If every request is entirely new content, RadixAttention's prefix reuse won't occur. In this case, the performance of the two frameworks is nearly identical. It's worth first understanding how much prefix sharing occurs in your workload.
Using the --dp flag instead of --tp — When utilizing multiple GPUs, Data Parallelism may feel more familiar, but --tp is currently much more stable in SGLang. For multi-GPU setups, it's recommended to tune --tp values before reaching for --dp.

Closing Thoughts

Migrating to SGLang is well worth trying if even one of four keywords applies to you: RAG, multi-turn, structured output, or DeepSeek-family models. Migration costs are low thanks to the OpenAI-compatible API, and if it doesn't fit, switching back to vLLM is just a matter of changing one endpoint. The fact that you can test it without much risk is its greatest advantage.

Three steps you can start right now:

1. Install SGLang and run a local server: After pip install "sglang[all]" (quoted so the shell does not expand the brackets), start the server with python -m sglang.launch_server --model-path <your current model> --tp <number of GPUs> --host 0.0.0.0 --port 30000. Outside Docker, set your HuggingFace token with export HF_TOKEN=<your-token> before starting. Gotcha: In multi-GPU Docker environments, running without --shm-size 32g can produce shared memory errors.
2. Change only base_url in your existing client to the SGLang port: Switch to http://localhost:30000/v1 and verify that your existing tests still pass with all other code untouched. Gotcha: Don't judge performance from the first request alone, before warmup.
3. Compare throughput and TTFT with the same workload: Use python -m sglang.bench_serving with settings that resemble your actual traffic (whether RAG context is included, whether it's multi-turn). If the numbers improve meaningfully, move on to Docker deployment. Gotcha: For 70B+ models the gap narrows to 3–5%, so don't carry small-model benchmark results over to large-model expectations.

