vLLM vs SGLang Performance Comparison — Choosing an Inference Engine Through the Lens of 2026 KV Cache Architecture

Have you ever served an LLM yourself? At first it seemed like model selection was everything, but once you deploy to production, you quickly realize that the choice of inference engine is what truly determines performance and cost. Early on, I brushed it off with "just use vLLM, right?"—until a multi-turn RAG pipeline produced a monthly bill 2.3× higher than expected. That's when I seriously compared the two engines. It turned out the problem wasn't the model; the system prompt and document context were being recomputed from scratch on every single request.

Both vLLM and SGLang are open-source LLM inference engines that originated at UC Berkeley. They may look similar at a glance, but their internal KV cache management approaches are fundamentally different—leading to performance gaps of up to 6× depending on workload characteristics. This article will help you understand the core differences between the two engines and give you clear criteria to confidently choose the right one for your situation.

Let's walk through, one by one, why the two engines produce such different results on different workloads.

Core Concepts

PagedAttention vs RadixAttention — Two Ways of Viewing the KV Cache

The most fundamental difference between the two engines is how they manage the KV (Key-Value) cache. Honestly, if you understand just this part, the rest of the selection criteria follows naturally.

KV Cache (Key-Value Cache): A memory space that stores the Key and Value matrices computed when a Transformer model processes previous tokens. Reusing these avoids recomputing already-processed portions, significantly boosting throughput.
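To get a feel for why this cache is worth managing carefully, here is a rough sizing sketch. The shape values (32 layers, 8 KV heads, head dimension 128, 2 bytes per element for an FP16 cache) are my assumptions based on the published Llama-3.1-8B configuration, and `kv_bytes_per_token` is a hypothetical helper name. Real engines add allocator overhead on top of this.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for ONE token: one Key and one Value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed Llama-3.1-8B shape (grouped-query attention, FP16 cache)
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
context_tokens = 8192
total_gib = per_token * context_tokens / 2**30

print(f"{per_token // 1024} KiB per token")
print(f"{total_gib:.1f} GiB for one {context_tokens}-token request")
```

Under these assumptions a single long request already occupies about a gigabyte of GPU memory, which is why both sharing and fragmentation matter.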

vLLM's PagedAttention was inspired by virtual memory paging in operating systems. It divides GPU memory into fixed-size pages and allocates them non-contiguously, which greatly reduces memory fragmentation and lets requests of varying lengths share the memory pool efficiently. In the basic scheme, however, each request receives its own pages, so content that is common across requests is recomputed from scratch unless a separate prefix-caching mechanism is enabled.

[vLLM - PagedAttention concept]
 
Request A: [Page1][Page2][Page3]
Request B: [Page4][Page5]
Request C: [Page6][Page7][Page8][Page9]
 
→ Each request is allocated independent memory pages
→ No shared prefix, every request computed from scratch
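The paging idea in the diagram can be sketched as a toy allocator. This is not vLLM's implementation: `PagedKVAllocator`, the page size of 16 tokens, and the first-free allocation policy are all assumptions chosen for illustration. It only shows how a per-request block table maps to non-contiguous physical pages.

```python
class PagedKVAllocator:
    """Toy block allocator: each request maps to a list of physical page ids."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_tokens(self, request_id: str, num_tokens: int) -> list[int]:
        """Reserve room for num_tokens more tokens, allocating pages only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        total = self.token_counts.get(request_id, 0) + num_tokens
        pages_needed = -(-total // self.page_size)  # ceiling division
        extra = pages_needed - len(table)
        if extra > len(self.free_pages):
            raise MemoryError("out of KV cache pages")
        for _ in range(extra):
            table.append(self.free_pages.pop(0))
        self.token_counts[request_id] = total
        return table

    def release(self, request_id: str) -> None:
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

alloc = PagedKVAllocator(num_pages=8, page_size=16)
alloc.append_tokens("A", 40)   # needs 3 pages
alloc.append_tokens("B", 20)   # needs 2 pages
alloc.release("A")             # A finishes, its pages become reusable
alloc.append_tokens("C", 60)   # needs 4 pages, taken from wherever is free
print(alloc.block_tables)
```

Request C ends up on pages that are not adjacent in memory, and nothing is wasted on padding beyond the last partially filled page.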

SGLang's RadixAttention, by contrast, manages the KV cache using a Radix Tree structure.

Radix Tree: A tree structure that shares common prefixes. It's analogous to storing "car", "card", and "care" in an alphabetical dictionary by keeping "car" once and branching only on "d" and "e". Applied to token sequences, a common leading token group is computed once and shared across multiple requests.

The key insight is that it automatically detects and reuses common prefixes (leading token sequences) across requests. If there's a repeating portion like a system prompt or document context, it is computed only once and all subsequent requests pull from that cached result.

[SGLang - RadixAttention tree structure]
 
System prompt: "You are a helpful assistant. [Document A]"
                          │  (computed once, stored in cache)
            ┌─────────────┼─────────────┐
         Request A     Request B     Request C
    "Main argument?"  "Counter?"  "Summarize?"
         │                │               │
    (extra tokens)  (extra tokens)  (extra tokens)
      newly computed  newly computed  newly computed
 
→ Common portion (system prompt + document) computed only once
→ Only each request's unique portion is newly processed
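The prefix-sharing behavior in the diagram can be sketched with a small token-level trie. To keep it short I use a plain trie rather than a compressed radix tree, and I only track which prefixes are cached, not the KV tensors themselves. `PrefixCache` and `match_and_insert` are hypothetical names, not SGLang's API.

```python
class PrefixCacheNode:
    def __init__(self):
        self.children: dict[int, "PrefixCacheNode"] = {}

class PrefixCache:
    """Token-level trie recording which prefixes already have cached KV entries."""

    def __init__(self):
        self.root = PrefixCacheNode()

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        for tok in tokens[matched:]:      # only the unmatched tail is "newly computed"
            node.children[tok] = PrefixCacheNode()
            node = node.children[tok]
        return matched

cache = PrefixCache()
system_and_doc = [1, 2, 3, 4, 5, 6]       # stand-in token ids for the shared context
questions = ([10, 11], [20], [10, 12])
hits = [cache.match_and_insert(system_and_doc + q) for q in questions]
print(hits)  # [0, 6, 7]
```

The first request pays for everything, the second reuses the six shared tokens, and the third reuses those plus the one question token it has in common with the first.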

The Difference in Numbers

The benchmarks below use an H100 GPU and the Llama-3.1-8B model, as of 2026. Note that the gap narrows to 3–5% for larger 70B+ models, so interpretation varies by model size.

Metric	vLLM	SGLang	Notes
Throughput (tok/s, H100 / 8B)	~12,600	~16,200	SGLang ~29% ahead
Prefix-sharing workloads	baseline	up to 6.4×	RadixAttention effect
TTFT p95	baseline	5–8% lower	Time to first token
High concurrency (100+ requests)	~4,741 tok/s	Difficult to compare directly due to environment differences	vLLM C++ routing advantage
Model support breadth	Very wide	Relatively narrow	vLLM advantage
Hardware support	GPU, TPU, Trainium, Gaudi	GPU-centric	vLLM advantage

The figure 16,200 tok/s on H100 may not be intuitive, but in practical terms it means that even with 100 simultaneous users, each gets 162 tok/s. If an average sentence is 20 tokens, that's over 8 sentences generated per second per user.
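That back-of-the-envelope calculation can be written out as follows. It assumes throughput is split evenly across users, which is an idealization since real schedulers and uneven request lengths will skew it. `per_user_rate` is just a hypothetical helper.

```python
def per_user_rate(aggregate_tok_s: float, concurrent_users: int,
                  tokens_per_sentence: int = 20) -> tuple[float, float]:
    """Return (tokens/s per user, sentences/s per user) under an even split."""
    tok_s = aggregate_tok_s / concurrent_users
    return tok_s, tok_s / tokens_per_sentence

# Throughput figures from the table above
for engine, throughput in {"vLLM": 12_600, "SGLang": 16_200}.items():
    tok_s, sentences = per_user_rate(throughput, concurrent_users=100)
    print(f"{engine}: {tok_s:.0f} tok/s per user, {sentences:.1f} sentences/s")
```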

TTFT (Time To First Token): The time from when a user sends a request to when the first token is returned. The key metric determining perceived responsiveness in streaming responses.

The reason vLLM remains stable under high concurrency is that its internal routing is implemented in C++. In CPython, the GIL (Global Interpreter Lock) allows only one thread at a time to execute Python bytecode, whereas C++ code has no such constraint, so even with 100+ concurrent requests, queue management itself doesn't become a bottleneck.

Practical Applications

Example 1: Using SGLang in a Multi-Turn RAG Pipeline

This is exactly the scenario that made me switch engines. In RAG pipelines, retrieved document context is often repeated across multiple user requests—the pattern of "answer questions based on this document." With vLLM, if 100 questions are asked about the same document, all 100 computations start from scratch.

python

# Start SGLang server (settings optimized for multi-turn RAG)
# python -m sglang.launch_server \
#   --model-path meta-llama/Llama-3.1-8B-Instruct \
#   --host 0.0.0.0 \
#   --port 30000 \
#   --mem-fraction-static 0.9 \
#   --max-prefill-tokens 16384
 
import openai
 
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
SHARED_CONTEXT = (
    "You are a document analysis expert.\n\n"
    "[Reference Document]\n"
    "{document_content}"
)
 
def ask_question(document: str, question: str, history: list) -> str:
    messages = [
        {
            "role": "system",
            "content": SHARED_CONTEXT.format(document_content=document),
        },
        *history,
        {"role": "user", "content": question},
    ]
 
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        max_tokens=512,
    )
    return response.choices[0].message.content
 
# Actual document content (paste a long report, paper, etc. here)
sample_doc = (
    "The global AI market size in 2024 is estimated at approximately $200 billion. "
    "Key growth drivers are the popularization of generative AI services and cloud infrastructure expansion. "
    "North America holds 40% of the market, with the Asia-Pacific region showing notable growth."
)
 
history = []
questions = ["What is the main argument of this document?", "Are there any counterarguments?", "Summarize the conclusion."]
 
for question in questions:
    answer = ask_question(sample_doc, question, history)
    history.extend([
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ])
    print(f"Q: {question}\nA: {answer}\n")

Code Point	Description
`SHARED_CONTEXT`	The portion common to all requests → automatically cached by RadixAttention
Accumulating `history`	Cache hit rate increases as the multi-turn context grows longer
Changing `base_url`	Use the OpenAI SDK as-is; just swap the URL
`mem-fraction-static 0.9`	Enlarges the cache pool — the larger it is, the greater the cache reuse benefit

When 100 users repeatedly ask questions about the same document, the system prompt + document context is computed only once, and the rest is retrieved from cache. Cases have been reported where this pattern reduced GPU usage by approximately 30%.

Example 2: Using vLLM for Large-Scale Batch Document Classification

When the prompt structure is fixed but each document's content is completely different, there is almost no opportunity for cache reuse beyond the short instruction header, and this is where vLLM shines.

python

# Start vLLM server (optimized for batch processing)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.95
 
import asyncio
import openai
 
client = openai.AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
 
CLASSIFICATION_PROMPT = (
    "Please classify the following text into one of the categories below.\n"
    "Categories: [Technology, Business, Sports, Entertainment, Politics]\n\n"
    "Text:\n{text}\n\n"
    "Answer with just one word for the category."
)
 
async def classify_document(text: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{
            "role": "user",
            "content": CLASSIFICATION_PROMPT.format(text=text),
        }],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content.strip()
 
async def batch_classify(documents: list[str]) -> list[str]:
    tasks = [classify_document(doc) for doc in documents]
    return await asyncio.gather(*tasks)
 
# List of documents each with unique content (load from files in actual use)
documents = [
    "Samsung Electronics announced successful mass production of 3nm process chips.",
    "Son Heung-min scored a hat-trick to lead his team to victory.",
    "The Federal Reserve cut its benchmark interest rate by 0.25 percentage points.",
]
results = asyncio.run(batch_classify(documents))
for doc, label in zip(documents, results):
    print(f"[{label}] {doc[:30]}...")

Code Point	Description
`tensor-parallel-size 4`	Distributes the 70B model across 4 GPUs — vLLM's extensive parallelism support
`asyncio.gather`	Dispatches requests concurrently; vLLM's continuous batching groups them internally
`max_tokens=10`	Short output length → maximizes throughput
Unique context per request	No opportunity for cache reuse → PagedAttention's memory efficiency is key
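For truly large batches, firing every request at once with `asyncio.gather` can exhaust client-side sockets or hit timeouts, so it may be worth capping the number of in-flight requests. Here is a minimal sketch of that idea. `bounded_gather` and the limit of 32 are my own assumptions, and `fake_classify` is a stand-in for `classify_document` so the snippet runs without a server.

```python
import asyncio

async def bounded_gather(items, worker, max_concurrency: int = 32):
    """Run worker(item) for every item, at most max_concurrency at a time; keeps input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))

# Stand-in for classify_document so the sketch runs without a server
async def fake_classify(text: str) -> str:
    await asyncio.sleep(0.01)
    return "Sports" if "scored" in text else "Other"

labels = asyncio.run(
    bounded_gather(["He scored twice.", "Rates were cut."], fake_classify, max_concurrency=1)
)
print(labels)
```

The server still batches whatever arrives together, so the cap mainly protects the client side. The right value depends on your setup and is best found by measurement.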

Example 3: Serving DeepSeek Models — SGLang as the De Facto Standard

If you're serving DeepSeek-R1 or DeepSeek-V3, SGLang is strongly recommended. SGLang has built-in dedicated optimizations for the DeepSeek family that demonstrate meaningful advantages over vLLM.

Three key optimizations to highlight:

MLA (Multi-head Latent Attention): A KV cache compression technique developed by DeepSeek. It projects Keys and Values into a low-dimensional latent space to dramatically reduce cache memory.
MoE (Mixture of Experts): An architecture where only some "expert" networks out of the model's total parameters are selectively activated per token. Maintains large model capacity while keeping actual computation low.
Elastic Expert Parallelism: A strategy that dynamically redistributes MoE expert layers across GPUs. Even if a specific GPU fails, the workload is automatically rebalanced so serving is not interrupted.
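To make the "only some experts are activated per token" idea concrete, here is a generic top-k gating sketch in plain Python. It is not DeepSeek's actual router. `route_token`, the eight experts, and the logits are made up for illustration.

```python
import math

def route_token(gate_logits: list[float], top_k: int = 2) -> dict[int, float]:
    """Pick the top_k experts for one token and normalize their softmax weights to sum to 1."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:top_k]
    exp = {i: math.exp(gate_logits[i]) for i in top}
    total = sum(exp.values())
    return {i: v / total for i, v in exp.items()}

# One token's (hypothetical) gate scores over 8 experts
weights = route_token([0.1, 2.0, -1.0, 2.0, 0.5, 0.0, -0.3, 1.0], top_k=2)
print(weights)  # only experts 1 and 3 run for this token
```

Only two of the eight expert networks do any work for this token, which is how MoE models keep per-token compute low despite a large parameter count.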

bash

# Start DeepSeek-R1 server (SGLang recommended settings)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --trust-remote-code \
  --tp 8 \
  --dp 2 \
  --host 0.0.0.0 \
  --port 30000 \
  --reasoning-parser deepseek-r1
  # MLA attention backend auto-enabled
  # MoE fault tolerance via Elastic Expert Parallelism

Once the server is up, you can receive inference results in Python like this. A notable feature of DeepSeek-R1 is that it returns the reasoning process inside <think> tags.

python

import openai
 
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{
        "role": "user",
        "content": (
            "Find and fix the bug in the following code:\n\n"
            "def factorial(n):\n"
            "    if n == 0:\n"
            "        return 0\n"
            "    return n * factorial(n-1)"
        ),
    }],
    max_tokens=2048,
    temperature=0.6,
)
 
message = response.choices[0].message
full_response = message.content or ""
 
# With --reasoning-parser enabled, the server may return the reasoning in a
# separate reasoning_content field; otherwise fall back to parsing the tags.
reasoning = getattr(message, "reasoning_content", None)
answer = full_response
if not reasoning and "</think>" in full_response:
    # partition() is safe even when the opening <think> tag is missing
    head, _, tail = full_response.partition("</think>")
    reasoning = head.replace("<think>", "", 1).strip()
    answer = tail.strip()
 
if reasoning:
    print(f"[Reasoning Process]\n{reasoning}\n")
    print(f"[Final Answer]\n{answer}")
else:
    print(full_response)

Code Point	Description
`--tp 8 --dp 2`	Tensor parallel 8 + data parallel 2, recommended combination for DeepSeek-R1's scale
`--reasoning-parser deepseek-r1`	Enables `<think>` tag parsing, allowing separation of reasoning and answer
`reasoning_content` with `</think>` fallback	Useful for separately logging or debugging R1's chain of reasoning; the fallback covers servers that leave the tags in `content`

Pros and Cons Analysis

Advantages

vLLM

Item	Details
Model support breadth	Widest in the industry. Nearly all open-source models including Meta, Mistral, Cohere, Google
Hardware diversity	Supports NVIDIA GPU, TPU, AWS Trainium, Intel Gaudi
Community maturity	~3× more contributors than SGLang; documentation and ecosystem are stable
OpenAI compatibility	Existing OpenAI SDK apps work as-is with just one URL change
High-concurrency stability	C++ routing maintains stable throughput even with 100+ simultaneous requests
Speculative Decoding	Supports various spec decode methods (EAGLE, MTP, etc.) for 1.3–2× speed gains

SGLang

Item	Details
Prefix-reuse workloads	Up to 6.4× performance and up to 30% GPU memory savings with repeated context
Raw throughput	16,200 tok/s on H100 / 8B, ~29% ahead of vLLM
Structured output	FSM-based constrained decoding guarantees JSON and regex-formatted output
DeepSeek optimization	De facto standard serving engine for the DeepSeek model family
TTFT	5–8% reduction in p95 time to first token
Cloud support	Official support on AWS, GCP, Azure, Oracle Cloud

Constrained Decoding: A technique that forces model output to conform to a specific format (JSON schema, regex, etc.). Uses a finite state machine (FSM) to allow only valid tokens at each token generation step. Eliminates parsing errors at the source in function-calling and agent workflows.
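A toy version of the FSM idea looks like this. The five-state machine, the tiny token set, and `toy_scores` are all invented for illustration. A real engine compiles the FSM from a JSON schema or regex and applies it as a mask over the full vocabulary's logits.

```python
import json

# state -> {allowed token: next state}; state 5 is the accepting state
FSM = {
    0: {"{": 1},
    1: {'"label"': 2},
    2: {":": 3},
    3: {'"yes"': 4, '"no"': 4},
    4: {"}": 5},
}

def constrained_decode(score_fn, start: int = 0, accept: int = 5) -> str:
    """Greedy decoding that only ever considers tokens the FSM allows in the current state."""
    state, out = start, []
    while state != accept:
        allowed = FSM[state]
        token = max(allowed, key=lambda t: score_fn(out, t))
        out.append(token)
        state = allowed[token]
    return "".join(out)

# Hypothetical model that would love to say "hello" and prefers "no" over "yes"
def toy_scores(prefix, token):
    return {"hello": 9.0, '"no"': 2.0, '"yes"': 1.0}.get(token, 0.0)

text = constrained_decode(toy_scores)
print(text, json.loads(text))
```

Even though the toy model scores "hello" highest, it can never be emitted because no state allows it, so the output always parses as JSON.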

Speculative Decoding: A method where a small "draft model" predicts multiple tokens ahead, and the large model verifies them all at once. Improves processing speed by 1.3–2× while maintaining output quality.
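The draft-then-verify loop can be sketched in a simplified greedy form. This is a toy under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, verification is done token by token here (a real engine checks all draft positions in one batched forward pass), and the usual "bonus token" after a fully accepted draft is omitted.

```python
def speculative_decode(target_next, draft_next, prompt: list[int],
                       num_tokens: int, k: int = 4) -> tuple[list[int], int]:
    """Greedy speculative decoding. Returns (generated tokens, number of verification rounds)."""
    seq, rounds = list(prompt), 0
    while len(seq) - len(prompt) < num_tokens:
        # 1) Draft model cheaply proposes k tokens
        proposal, ctx = [], list(seq)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) Target model verifies the proposal (one batched pass in a real engine)
        rounds += 1
        for tok in proposal:
            expected = target_next(seq)
            seq.append(expected)      # the target's token is always the one kept
            if expected != tok:       # first mismatch: discard the rest of the draft
                break
    return seq[len(prompt):len(prompt) + num_tokens], rounds

# Toy models: the target counts upward mod 10; the draft agrees except right after a 4
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 0 if s[-1] == 4 else (s[-1] + 1) % 10

tokens, rounds = speculative_decode(target, draft, prompt=[0], num_tokens=8)
print(tokens, rounds)  # [1, 2, 3, 4, 5, 6, 7, 8] 3
```

The output is identical to what the target alone would produce, but it takes 3 verification rounds instead of 8 sequential target steps. That is where the speedup comes from when the draft is usually right.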

Drawbacks and Caveats

vLLM

Item	Details	Mitigation
Prefix-sharing limitations	SGLang reaches up to 6.4× higher throughput on fixed-context repeat workloads, on top of its ~29% raw-throughput lead on H100 / 8B	Consider switching to SGLang for those workloads
Multi-turn cache efficiency	Recomputation occurs on every request with long conversation histories	Consider migrating to SGLang

SGLang

To be honest, looking at this list of drawbacks makes SGLang seem inferior—but in practice, the perceived difference in DeepSeek serving or RAG workloads is quite significant.

Item	Details	Mitigation
Model support breadth	Fewer supported models than vLLM	Verify your model is on the support list first
Hardware limitations	TPU, Trainium, Gaudi support is in early stages	No issue if you're on an NVIDIA GPU environment
Community size	Relatively smaller than vLLM	Major cloud providers officially supporting it since 2025; growing rapidly
Operational maturity	Rapid update cycle leads to frequent API changes	Recommended to pin versions and upgrade incrementally

Which Engine to Choose — Decision Flow

If you've read the whole article and are thinking "so which one should I use?", the flow below may help.

Does my workload have a repeating common system prompt / document context?
  │
  ├─ Yes → Does the fixed context account for 30%+ of the total prompt?
  │           ├─ Yes → SGLang ✓ (maximizes RadixAttention effect)
  │           └─ No  → Are you using DeepSeek-family models?
  │                       ├─ Yes → SGLang ✓ (built-in dedicated optimizations)
  │                       └─ No  → Both are similar; vLLM is easier to operate
  │
  └─ No  → Are there 100+ concurrent requests?
              ├─ Yes → vLLM ✓ (C++ routing stability)
              └─ No  → Mixed GPU/TPU environment or frequent model changes?
                          ├─ Yes → vLLM ✓ (hardware & model diversity)
                          └─ No  → SGLang if throughput is priority,
                                   vLLM if ecosystem stability is priority
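If you prefer code to diagrams, the same flow can be written as a small function. The parameter names are my own, and the 30% and 100-request thresholds are simply the rules of thumb from the flow above, not hard limits.

```python
def choose_engine(shared_prefix_ratio: float, deepseek_family: bool,
                  concurrent_requests: int, mixed_hardware_or_model_churn: bool,
                  throughput_first: bool = True) -> str:
    """Mirror of the decision flow above; returns "vLLM" or "SGLang"."""
    if shared_prefix_ratio > 0:            # a repeating system prompt / document context exists
        if shared_prefix_ratio >= 0.30:
            return "SGLang"                # maximizes the RadixAttention effect
        if deepseek_family:
            return "SGLang"                # built-in dedicated optimizations
        return "vLLM"                      # both are similar; vLLM is easier to operate
    if concurrent_requests >= 100:
        return "vLLM"                      # routing stability under load
    if mixed_hardware_or_model_churn:
        return "vLLM"                      # hardware and model diversity
    return "SGLang" if throughput_first else "vLLM"

print(choose_engine(0.6, deepseek_family=False, concurrent_requests=20,
                    mixed_hardware_or_model_churn=False))
```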

The Most Common Mistakes in Practice

I've made similar mistakes myself, so I'm sharing them here.

Choosing an engine without analyzing your workload — I also initially decided with "vLLM is well-known, so that's fine," and later had to rebuild my deployment pipeline from scratch when switching to multi-turn RAG. The right order is to first understand what patterns of requests will come in, then choose the engine.
Judging benchmarks by general numbers rather than your own workload — SGLang's advantage is clear for 7B–8B models, but the gap shrinks to 3–5% for 70B+ large models. Rather than general benchmarks, I recommend directly measuring numbers that match your model size and request patterns. Reproducing your actual traffic pattern with locust or wrk and directly observing TTFT, throughput, and GPU utilization makes things much clearer.
Overlooking high-concurrency scenarios — I missed this too, and making decisions based solely on raw throughput numbers can lead to surprises in production. In environments with 100+ concurrent requests, vLLM's C++ routing can be advantageous, so you need to factor in your expected concurrent user count.
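For the "measure it yourself" part, TTFT can be captured from any streaming response with a few lines. This is a minimal sketch: `measure_stream` is a hypothetical helper that accepts any iterable of chunks, and `fake_stream` simulates a server so the snippet runs on its own.

```python
import time
from typing import Iterable, Iterator

def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a chunk stream and report time to first chunk, total time, and chunk count."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()   # the moment the first chunk arrives
        count += 1
    end = time.perf_counter()
    return {
        "ttft_s": (first - start) if first is not None else float("nan"),
        "total_s": end - start,
        "chunks": count,
    }

# Stand-in for a streaming response: 50 ms until the first chunk, then 10 ms per chunk
def fake_stream(n: int = 5, first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"

stats = measure_stream(fake_stream())
print(f"TTFT {stats['ttft_s'] * 1000:.0f} ms, {stats['chunks']} chunks in {stats['total_s']:.2f} s")
```

Against a real server you would pass in the chunks of a `stream=True` chat completion instead of `fake_stream()`, then collect these numbers across many concurrent requests to get p95 values.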

Closing Thoughts

Your workload determines your engine. Start with vLLM for one-off batch jobs, and SGLang for repeated context or agent workloads.

Three steps you can take right now:

Identify your workload pattern: Check how often a common prefix (system prompt, document context) is repeated across requests. If the fixed context accounts for 30%+ of the total prompt, SGLang is likely the better choice.
Spin up both engines locally: Install each with pip install vllm and pip install "sglang[all]" (the quotes keep shells such as zsh from expanding the brackets), then run them in OpenAI-compatible API mode. Both engines work with your existing code with just a base_url change.
Run benchmarks with your actual traffic pattern: Reproduce your real request patterns with locust or wrk and directly measure TTFT, throughput, and GPU utilization. Numbers tailored to your situation are far more valuable than general benchmarks.
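For the first step, a quick way to estimate the fixed-context share is to compare logged prompts directly. The sketch below measures characters rather than tokens, which is only an approximation, and `shared_prefix_ratio` is a hypothetical helper. It relies on `os.path.commonprefix`, which compares strings character by character despite living in `os.path`.

```python
import os

def shared_prefix_ratio(prompts: list[str]) -> float:
    """Fraction of the average prompt length covered by the prefix common to ALL prompts."""
    if len(prompts) < 2:
        return 0.0
    common = os.path.commonprefix(prompts)   # character-wise longest common prefix
    avg_len = sum(len(p) for p in prompts) / len(prompts)
    return len(common) / avg_len if avg_len else 0.0

# Hypothetical logged prompts: same system prompt and document, different questions
context = "You are a document analysis expert.\n\n[Reference Document]\n" + "lorem ipsum " * 20
logged = [context + "What is the main argument?", context + "Summarize the conclusion."]

ratio = shared_prefix_ratio(logged)
print(f"shared prefix: {ratio:.0%}")
```

If this number sits well above 30% across a representative sample of your traffic, the prefix-reuse numbers discussed earlier are likely to apply to you.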



vLLM vs SGLang Performance Comparison — Choosing an Inference Engine Through the Lens of 2026 KV Cache Architecture

Let's walk through, one by one, why the two engines produce such different results on different workloads.

Core Concepts

PagedAttention vs RadixAttention — Two Ways of Viewing the KV Cache

The most fundamental difference between the two engines is how they manage the KV (Key-Value) cache. Honestly, if you understand just this part, the rest of the selection criteria follows naturally.

KV Cache (Key-Value Cache): A memory space that stores the Key and Value matrices computed when a Transformer model processes previous tokens. Reusing these avoids recomputing already-processed portions, significantly boosting throughput.

[vLLM - PagedAttention concept]
 
Request A: [Page1][Page2][Page3]
Request B: [Page4][Page5]
Request C: [Page6][Page7][Page8][Page9]
 
→ Each request is allocated independent memory pages
→ No shared prefix, every request computed from scratch

SGLang's RadixAttention, by contrast, manages the KV cache using a Radix Tree structure.

Radix Tree: A tree structure that shares common prefixes. It's analogous to storing "car", "card", and "care" in an alphabetical dictionary by keeping "car" once and branching only on "d" and "e". Applied to token sequences, a common leading token group is computed once and shared across multiple requests.

sql

[SGLang - RadixAttention tree structure]
 
System prompt: "You are a helpful assistant. [Document A]"
                          │  (computed once, stored in cache)
            ┌─────────────┼─────────────┐
         Request A     Request B     Request C
    "Main argument?"  "Counter?"  "Summarize?"
         │                │               │
    (extra tokens)  (extra tokens)  (extra tokens)
      newly computed  newly computed  newly computed
 
→ Common portion (system prompt + document) computed only once
→ Only each request's unique portion is newly processed

The Difference in Numbers

Benchmarks based on an H100 GPU with the Llama-3.1-8B model as of 2026. Note that the gap narrows to 3–5% for larger 70B+ models, so interpretation varies by model size.

Metric	vLLM	SGLang	Notes
Throughput (tok/s, H100 / 8B)	~12,600	~16,200	SGLang ~29% ahead
Prefix-sharing workloads	baseline	up to 6.4×	RadixAttention effect
TTFT p95	baseline	5–8% lower	Time to first token
High concurrency (100+ requests)	~4,741 tok/s	Difficult to compare directly due to environment differences	vLLM C++ routing advantage
Model support breadth	Very wide	Relatively narrow	vLLM advantage
Hardware support	GPU, TPU, Trainium, Gaudi	GPU-centric	vLLM advantage

TTFT (Time To First Token): The time from when a user sends a request to when the first token is returned. The key metric determining perceived responsiveness in streaming responses.

Practical Applications

Example 1: Using SGLang in a Multi-Turn RAG Pipeline

python

# Start SGLang server (settings optimized for multi-turn RAG)
# python -m sglang.launch_server \
#   --model-path meta-llama/Llama-3.1-8B-Instruct \
#   --host 0.0.0.0 \
#   --port 30000 \
#   --mem-fraction-static 0.9 \
#   --max-prefill-tokens 16384
 
import openai
 
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
SHARED_CONTEXT = (
    "You are a document analysis expert.\n\n"
    "[Reference Document]\n"
    "{document_content}"
)
 
def ask_question(document: str, question: str, history: list) -> str:
    messages = [
        {
            "role": "system",
            "content": SHARED_CONTEXT.format(document_content=document),
        },
        *history,
        {"role": "user", "content": question},
    ]
 
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        max_tokens=512,
    )
    return response.choices[0].message.content
 
# Actual document content (paste a long report, paper, etc. here)
sample_doc = (
    "The global AI market size in 2024 is estimated at approximately $200 billion. "
    "Key growth drivers are the popularization of generative AI services and cloud infrastructure expansion. "
    "North America holds 40% of the market, with the Asia-Pacific region showing notable growth."
)
 
history = []
questions = ["What is the main argument of this document?", "Are there any counterarguments?", "Summarize the conclusion."]
 
for question in questions:
    answer = ask_question(sample_doc, question, history)
    history.extend([
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ])
    print(f"Q: {question}\nA: {answer}\n")

Code Point	Description
`SHARED_CONTEXT`	The portion common to all requests → automatically cached by RadixAttention
Accumulating `history`	Cache hit rate increases as the multi-turn context grows longer
Changing `base_url`	Use the OpenAI SDK as-is; just swap the URL
`mem-fraction-static 0.9`	Enlarges the cache pool — the larger it is, the greater the cache reuse benefit

Example 2: Using vLLM for Large-Scale Batch Document Classification

When the prompt structure is fixed but each document's content is completely different—i.e., workloads with no opportunity for cache reuse—vLLM shines.

python

# Start vLLM server (optimized for batch processing)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.95
 
import asyncio
import openai
 
client = openai.AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
 
CLASSIFICATION_PROMPT = (
    "Please classify the following text into one of the categories below.\n"
    "Categories: [Technology, Business, Sports, Entertainment, Politics]\n\n"
    "Text:\n{text}\n\n"
    "Answer with just one word for the category."
)
 
async def classify_document(text: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{
            "role": "user",
            "content": CLASSIFICATION_PROMPT.format(text=text),
        }],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content.strip()
 
async def batch_classify(documents: list[str]) -> list[str]:
    tasks = [classify_document(doc) for doc in documents]
    return await asyncio.gather(*tasks)
 
# List of documents each with unique content (load from files in actual use)
documents = [
    "Samsung Electronics announced successful mass production of 3nm process chips.",
    "Son Heung-min scored a hat-trick to lead his team to victory.",
    "The Federal Reserve cut its benchmark interest rate by 0.25 percentage points.",
]
results = asyncio.run(batch_classify(documents))
for doc, label in zip(documents, results):
    print(f"[{label}] {doc[:30]}...")

Code Point	Description
`tensor-parallel-size 4`	Distributes the 70B model across 4 GPUs — vLLM's extensive parallelism support
`asyncio.gather`	Dispatches requests concurrently; vLLM's continuous batching groups them internally
`max_tokens=10`	Short output length → maximizes throughput
Unique context per request	No opportunity for cache reuse → PagedAttention's memory efficiency is key

Example 3: Serving DeepSeek Models — SGLang as the De Facto Standard

If you're serving DeepSeek-R1 or DeepSeek-V3, SGLang is strongly recommended. SGLang has built-in dedicated optimizations for the DeepSeek family that demonstrate meaningful advantages over vLLM.

Three key optimizations to highlight:

MLA (Multi-head Latent Attention): A KV cache compression technique developed by DeepSeek. It projects Keys and Values into a low-dimensional latent space to dramatically reduce cache memory.
MoE (Mixture of Experts): An architecture where only some "expert" networks out of the model's total parameters are selectively activated per token. Maintains large model capacity while keeping actual computation low.
Elastic Expert Parallelism: A strategy that dynamically redistributes MoE expert layers across GPUs. Even if a specific GPU fails, the workload is automatically rebalanced so serving is not interrupted.

bash

# Start DeepSeek-R1 server (SGLang recommended settings)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --trust-remote-code \
  --tp 8 \
  --dp 2 \
  --host 0.0.0.0 \
  --port 30000 \
  --reasoning-parser deepseek-r1
  # MLA attention backend auto-enabled
  # MoE fault tolerance via Elastic Expert Parallelism

Once the server is up, you can receive inference results in Python like this. A notable feature of DeepSeek-R1 is that it returns the reasoning process inside <think> tags.

python

import openai
 
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
 
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{
        "role": "user",
        "content": (
            "Find and fix the bug in the following code:\n\n"
            "def factorial(n):\n"
            "    if n == 0:\n"
            "        return 0\n"
            "    return n * factorial(n-1)"
        ),
    }],
    max_tokens=2048,
    temperature=0.6,
)
 
full_response = response.choices[0].message.content
 
# Separate the reasoning process wrapped in <think> tags from the final answer
if "<think>" in full_response:
    think_end = full_response.find("</think>")
    reasoning = full_response[7:think_end]        # after <think>
    answer = full_response[think_end + 8:].strip() # after </think>
    print(f"[Reasoning Process]\n{reasoning}\n")
    print(f"[Final Answer]\n{answer}")
else:
    print(full_response)

Code Point	Description
`--tp 8 --dp 2`	Tensor parallel 8 + data parallel 2, recommended combination for DeepSeek-R1's scale
`--reasoning-parser deepseek-r1`	Enables `<think>` tag parsing, allowing separation of reasoning and answer
`think_end = find("</think>")`	Useful for separately logging or debugging R1's chain of reasoning

Pros and Cons Analysis

Advantages

vLLM

Item	Details
Model support breadth	Widest in the industry. Nearly all open-source models including Meta, Mistral, Cohere, Google
Hardware diversity	Supports NVIDIA GPU, TPU, AWS Trainium, Intel Gaudi
Community maturity	~3× more contributors than SGLang; documentation and ecosystem are stable
OpenAI compatibility	Existing OpenAI SDK apps work as-is with just one URL change
High-concurrency stability	C++ routing maintains stable throughput even with 100+ simultaneous requests
Speculative Decoding	Supports various spec decode methods (EAGLE, MTP, etc.) for 1.3–2× speed gains

SGLang

Item	Details
Prefix-reuse workloads	Up to 6.4× performance and up to 30% GPU memory savings with repeated context
Raw throughput	16,200 tok/s on H100 / 8B, ~29% ahead of vLLM
Structured output	FSM-based constrained decoding guarantees JSON and regex-formatted output
DeepSeek optimization	De facto standard serving engine for the DeepSeek model family
TTFT	5–8% reduction in p95 time to first token
Cloud support	Official support on AWS, GCP, Azure, Oracle Cloud

Constrained Decoding: A technique that forces model output to conform to a specific format (JSON schema, regex, etc.). Uses a finite state machine (FSM) to allow only valid tokens at each token generation step. Eliminates parsing errors at the source in function-calling and agent workflows.

Speculative Decoding: A method where a small "draft model" predicts multiple tokens ahead, and the large model verifies them all at once. Improves processing speed by 1.3–2× while maintaining output quality.

Drawbacks and Caveats

vLLM

Item	Details	Mitigation
Prefix-sharing limitations	Up to 29% lower throughput than SGLang on fixed-context repeat workloads	Consider switching to SGLang for those workloads
Multi-turn cache efficiency	Recomputation occurs on every request with long conversation histories	Consider migrating to SGLang

SGLang

To be honest, looking at this list of drawbacks makes SGLang seem inferior—but in practice, the perceived difference in DeepSeek serving or RAG workloads is quite significant.

Item	Details	Mitigation
Model support breadth	Fewer supported models than vLLM	Verify your model is on the support list first
Hardware limitations	TPU, Trainium, Gaudi support is in early stages	No issue if you're on an NVIDIA GPU environment
Community size	Relatively smaller than vLLM	Major cloud providers officially supporting it since 2025; growing rapidly
Operational maturity	Rapid update cycle leads to frequent API changes	Recommended to pin versions and upgrade incrementally

Which Engine to Choose — Decision Flow

If you've read the whole article and are thinking "so which one should I use?", the flow below may help.

yaml

Does my workload have a repeating common system prompt / document context?
  │
  ├─ Yes → Does the fixed context account for 30%+ of the total prompt?
  │           ├─ Yes → SGLang ✓ (maximizes RadixAttention effect)
  │           └─ No  → Are you using DeepSeek-family models?
  │                       ├─ Yes → SGLang ✓ (built-in dedicated optimizations)
  │                       └─ No  → Both are similar; vLLM is easier to operate
  │
  └─ No  → Are there 100+ concurrent requests?
              ├─ Yes → vLLM ✓ (C++ routing stability)
              └─ No  → Mixed GPU/TPU environment or frequent model changes?
                          ├─ Yes → vLLM ✓ (hardware & model diversity)
                          └─ No  → SGLang if throughput is priority,
                                   vLLM if ecosystem stability is priority
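The same flow can be written as a small function, which is handy if you want to run it against a list of services. This is only a sketch of the tree above: the `Workload` fields and the `choose_engine` name are hypothetical, and the 30% and 100-request thresholds are the rules of thumb from this article, not hard limits.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    has_shared_prefix: bool            # repeating system prompt / document context?
    fixed_context_ratio: float         # fixed context as a fraction of the prompt (0.0-1.0)
    uses_deepseek: bool                # serving a DeepSeek-family model?
    concurrent_requests: int           # expected concurrent requests
    mixed_hw_or_model_churn: bool      # mixed GPU/TPU or frequent model changes?
    throughput_first: bool             # throughput over ecosystem stability?


def choose_engine(w: Workload) -> str:
    """Mirror the decision flow; returns "SGLang" or "vLLM"."""
    if w.has_shared_prefix:
        if w.fixed_context_ratio >= 0.30:
            return "SGLang"            # maximizes the RadixAttention effect
        if w.uses_deepseek:
            return "SGLang"            # built-in dedicated optimizations
        return "vLLM"                  # both are similar; vLLM is easier to operate
    if w.concurrent_requests >= 100:
        return "vLLM"
    if w.mixed_hw_or_model_churn:
        return "vLLM"                  # hardware and model diversity
    return "SGLang" if w.throughput_first else "vLLM"


rag_service = Workload(True, 0.6, False, 20, False, True)
batch_job = Workload(False, 0.0, False, 500, False, True)
print(choose_engine(rag_service), choose_engine(batch_job))
```

Treat the result as a starting hypothesis to benchmark, not as a verdict.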

The Most Common Mistakes in Practice

I've made similar mistakes myself, so I'm sharing them here.

Choosing an engine without analyzing your workload — I also initially decided with "vLLM is well-known, so that's fine," and later had to rebuild my deployment pipeline from scratch when switching to multi-turn RAG. The right order is to first understand what patterns of requests will come in, then choose the engine.
Judging benchmarks by general numbers rather than your own workload — SGLang's advantage is clear for 7B–8B models, but the gap shrinks to 3–5% for 70B+ large models. Rather than general benchmarks, I recommend directly measuring numbers that match your model size and request patterns. Reproducing your actual traffic pattern with locust or wrk and directly observing TTFT, throughput, and GPU utilization makes things much clearer.
Overlooking high-concurrency scenarios — I missed this too, and making decisions based solely on raw throughput numbers can lead to surprises in production. In environments with 100+ concurrent requests, vLLM's C++ routing can be advantageous, so you need to factor in your expected concurrent user count.
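If you measure TTFT and throughput yourself, the core of it is just timestamping a token stream. Below is a minimal sketch: `measure_stream` is a hypothetical helper, and `fake_stream` stands in for whatever iterator your client library returns when streaming a response. The sleep durations are made up purely for the demo.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and report TTFT, total time, and tokens/sec."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now       # time to first token ends here
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_s": first_token_at - start,
        "total_s": total,
        "tokens": count,
        "tokens_per_s": count / total if total > 0 else float("inf"),
    }


def fake_stream(n_tokens=5, prefill_s=0.05, per_token_s=0.01):
    # Stand-in for a streaming response (assumption): a slow first token
    # simulates prefill, followed by faster decode steps.
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


print(measure_stream(fake_stream()))
```

Run it once per request under your real concurrency level and compare the distributions between engines, since a single request tells you little about behavior under load.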

Closing Thoughts

Your workload determines your engine. Start with vLLM for one-off batch jobs, and SGLang for repeated context or agent workloads.

Three steps you can take right now:

Identify your workload pattern: Check how often a common prefix (system prompt, document context) is repeated across requests. If the fixed context accounts for 30%+ of the total prompt, SGLang is likely the better choice.
Spin up both engines locally: Install each with pip install vllm and pip install "sglang[all]" (the quotes keep shells such as zsh from interpreting the square brackets), then run them in OpenAI-compatible API mode. Existing OpenAI-client code works with either engine after changing only base_url.
Run benchmarks with your actual traffic pattern: Reproduce your real request patterns with locust or wrk and directly measure TTFT, throughput, and GPU utilization. Numbers tailored to your situation are far more valuable than general benchmarks.
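For the first step, a rough estimate of the shared-prefix share can be computed from a sample of logged prompts. This sketch assumes whitespace-separated words are an acceptable proxy for tokens (the engines cache at the granularity of the model's real tokenizer, so treat the number as approximate), and `shared_prefix_ratio` is a hypothetical helper name.

```python
def shared_prefix_ratio(prompts):
    """Fraction of all prompt tokens covered by the prefix common to every prompt."""
    tokenized = [p.split() for p in prompts]   # whitespace split as a token proxy
    total = sum(len(t) for t in tokenized)
    if total == 0:
        return 0.0
    prefix_len = 0
    # zip(*...) walks position by position and stops at the shortest prompt.
    for column in zip(*tokenized):
        if len(set(column)) != 1:              # first position where prompts differ
            break
        prefix_len += 1
    return prefix_len * len(tokenized) / total


system = "You are a helpful assistant. Answer using the context."
sample = [
    system + " What is vLLM?",
    system + " Explain RadixAttention briefly.",
]
ratio = shared_prefix_ratio(sample)
print(f"{ratio:.0%} of prompt tokens are shared prefix")
```

A value above the 30% rule of thumb suggests the workload is a good candidate for prefix reuse. Real traffic often has several distinct prefixes, so grouping prompts by endpoint or tenant before measuring gives a more honest number.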
