SGLang EPD Disaggregation and Pipeline Parallelism — An Architecture That Splits Vision-Language Model Serving into 3 Stages to Reduce TTFT by Up to 8x

Even if you've never directly served multimodal AI before, that's fine. These days, AI features that accept image input are becoming so widespread so quickly that sooner or later everyone will need to serve a Vision-Language Model (VLM) — an AI model that understands both text and images together. And when that moment arrives, you'll hit a wall much sooner than expected. "Why is it this slow? I just added a few images."

I ran into this problem myself the first time I set up VLM serving. I spent a long time trying to figure out why GPU utilization on the Prefill node kept stuttering every time an image was encoded, and ultimately the problem was the architecture itself: Vision Transformer (ViT) encoding and language model processing were tied together on the same GPU. PD (Prefill-Decode) disaggregation, which has become the standard for LLM serving, is powerful, but it has a critical blind spot for VLMs.

EPD (Encoder-Prefill-Decode) Disaggregation is an architectural shift that fully separates these three stages into independent server pools, reducing TTFT (time to first token) by up to 6–8x for multi-image VLM serving. Combining this with Pipeline Parallelism allows even massive models like DeepSeek-V3 to scale naturally across multiple nodes. Let's explore why EPD is necessary, how to actually set it up, and the pitfalls you absolutely need to know before adopting it.

Core Concepts

Why PD Disaggregation Came First

To understand EPD, you need to look at its predecessor: PD Disaggregation. In LLM serving, Prefill and Decode have completely different computational characteristics.

Prefill: processes all input tokens in a single pass, so it is compute-bound
Decode: generates tokens one at a time autoregressively, so it is bound by memory bandwidth

Running these two workloads together on the same GPU causes them to interfere with each other. That's why PD Disaggregation — running each in its own dedicated GPU pool — has become the industry standard. Major companies including Meta, LinkedIn, and Mistral already use this architecture in production.
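A rough way to see the difference is arithmetic intensity, the number of FLOPs performed per byte of weights read. The sketch below is a back-of-the-envelope roofline model, not a measurement: it assumes FP16 weights, counts only the weight matmuls, and the hardware numbers passed to `classify` are illustrative placeholders rather than the spec of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Each weight contributes about 2 FLOPs (multiply + add) per token and is
    read from memory once per pass, no matter how many tokens share the pass.
    """
    return (2 * tokens_per_pass) / bytes_per_param


def ridge_point(peak_tflops: float, mem_bandwidth_tb_per_s: float) -> float:
    """FLOPs/byte above which the roofline model calls a kernel compute-bound."""
    return peak_tflops / mem_bandwidth_tb_per_s


def classify(tokens_per_pass: int,
             peak_tflops: float = 1000.0,          # assumed, illustrative
             mem_bandwidth_tb_per_s: float = 3.0   # assumed, illustrative
             ) -> str:
    """Label a pass as compute-bound or memory-bound under the toy model."""
    intensity = arithmetic_intensity(tokens_per_pass)
    if intensity >= ridge_point(peak_tflops, mem_bandwidth_tb_per_s):
        return "compute-bound"
    return "memory-bound"


print("prefill, 2048-token prompt:", classify(2048))
print("decode, 1 token per step:  ", classify(1))
```

In this model, batching decode requests raises `tokens_per_pass` for the decode step, which is one reason decode pools are tuned around batch size while prefill pools are tuned around raw compute.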

Why PD Alone Isn't Enough for VLMs

The problem is that VLMs add ViT (Vision Transformer) encoding on top of all this. The ViT converts images or video into vectors (embeddings) that the language model can understand. Processing this on the Prefill node alongside everything else triggers three problems simultaneously.

Problem	Description
TP scaling inefficiency	Increasing Tensor Parallelism, used to accelerate the LLM, barely improves ViT processing speed
LLM GPU idle time	As image count grows, wait time for ViT encoding to finish increases dramatically
Memory consumption	Encoder memory eats into KV cache space and batch size

You might wonder why TP scaling doesn't benefit ViT. Tensor Parallelism splits each layer's operations across multiple GPUs and synchronizes the results with all-reduce communication. In a ViT, the sequence length per layer is fixed by the input image resolution and each layer does relatively little work, so the communication cost quickly eats up the savings from splitting the computation. Adding more GPUs ends up barely improving speed.
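The effect can be illustrated with a toy cost model. This is a sketch under simplifying assumptions: per-layer compute divides evenly by the TP degree, every layer pays one fixed all-reduce cost once TP is above 1, and the millisecond values are made up for illustration, not measured on any real ViT or LLM.

```python
def tp_speedup(compute_ms: float, allreduce_ms: float, tp_size: int) -> float:
    """Per-layer speedup under tensor parallelism in a toy cost model.

    Compute time shrinks by tp_size, but each layer adds a fixed
    all-reduce cost as soon as more than one GPU is involved.
    """
    if tp_size == 1:
        return 1.0
    return compute_ms / (compute_ms / tp_size + allreduce_ms)


# Hypothetical per-layer costs: a light ViT layer vs. a heavy LLM layer.
for name, compute_ms in [("ViT layer", 0.10), ("LLM layer", 2.00)]:
    for tp in (2, 4, 8):
        speedup = tp_speedup(compute_ms, allreduce_ms=0.05, tp_size=tp)
        print(f"{name}  TP={tp}  speedup={speedup:.2f}x")
```

With these assumed numbers the light layer never reaches 2x even at TP=8, while the heavy layer gets most of the ideal speedup. Real all-reduce cost also depends on message size and topology, so treat this only as intuition.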

EPD Architecture — Full Separation into Three Stages

EPD solves this by completely separating these three stages into independent server pools.

[Image/Video Input]
       ↓
┌─────────────────────────┐
│     Encoder Node        │  ← ViT-only, horizontal scaling with data parallelism, embedding caching
└────────────┬────────────┘
             │ Vision embedding transfer (ZMQ / Mooncake)
             ↓
┌─────────────────────────┐
│     Prefill Node        │  ← LLM language prefill only, compute-bound
└────────────┬────────────┘
             │ KV Cache transfer
             ↓
┌─────────────────────────┐
│     Decode Node         │  ← Autoregressive token generation, memory-bound
└─────────────────────────┘

Because each stage is fully separated, you can simply add Encoder nodes when image request volume increases, and scale Decode nodes horizontally when concurrent sessions increase. The separation also lets you act on the observation that ViT scales much more effectively with Data Parallelism than with Tensor Parallelism.

TP vs DP at a glance: Tensor Parallelism (TP) splits a single layer's operations across multiple GPUs. Data Parallelism (DP) places identical copies of the model on multiple GPUs, where each copy handles different requests independently. Since TP incurs high communication overhead for ViT, using DP to respond linearly to increased request volume is far more efficient.

Vision embeddings and KV Cache need to be transferred between nodes. SGLang supports two transfer backends: ZMQ and Mooncake. ZMQ is simpler to configure for small-scale tests or single-cluster environments, while Mooncake offers higher transfer throughput for high-performance production environments using InfiniBand (a dedicated high-speed network for inter-GPU-server communication). vLLM uses a similar NIXL transfer layer for this purpose.

Combining with Pipeline Parallelism — For Large-Scale MoE Models and Long Contexts

Separately from EPD, SGLang announced in early 2026 the ability to combine chunked pipeline parallelism with PD Disaggregation.

Large-scale MoE (Mixture-of-Experts) models with hundreds of billions of parameters — like DeepSeek-V3 — are difficult to fit entirely into a single node's GPU memory. Pipeline Parallelism distributes model layers across multiple nodes like a pipeline, enabling scales that are impossible on a single node. When layers are distributed across multiple nodes, the KV cache for each node is also distributed, allowing processing of context lengths that a single node could never handle.
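To see why chunking matters for pipeline parallelism, it helps to count time steps. The sketch below assumes every stage takes the same time per chunk (real stages are not perfectly balanced) and uses a hypothetical 96K-token prompt; the 12,288-token chunk size and the 4 stages match the configuration discussed in this article.

```python
import math


def pipeline_steps(num_chunks: int, num_stages: int) -> int:
    """Time steps to push all chunks through all stages when each stage
    processes one chunk per step (pipeline fill plus drain)."""
    return num_chunks + num_stages - 1


def prefill_steps(prompt_tokens: int, chunk_size: int, num_stages: int):
    """Return (chunks, pipelined steps, steps without any overlap)."""
    num_chunks = math.ceil(prompt_tokens / chunk_size)
    pipelined = pipeline_steps(num_chunks, num_stages)
    unpipelined = num_chunks * num_stages  # each chunk walks all stages alone
    return num_chunks, pipelined, unpipelined


chunks, pipelined, unpipelined = prefill_steps(
    prompt_tokens=98_304,  # hypothetical 96K-token prompt
    chunk_size=12_288,
    num_stages=4,
)
print(f"{chunks} chunks: {pipelined} steps pipelined vs {unpipelined} without overlap")
```

Smaller chunks fill the pipeline faster but make each step less efficient on the GPU, which is why the optimal chunk size has to be benchmarked per model.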

yaml

# DeepSeek-V3.1 serving configuration (H20 cluster, SGLang)
 
PP4 TP8 configuration:
  - PP: layer pipeline distributed across 4 nodes
  - TP: 8 GPU tensor parallelism within each node
  - Chunk size: 12K tokens
 
Compared to TP32 single configuration:
  - Prefill Throughput: +30.5%
  - TTFT: -67.9% (approximately 3.1x reduction)
  - Strong Scaling Efficiency: 82.8%

Strong Scaling Efficiency: A metric indicating how close to N times the throughput you get when you add N times as many nodes. 100% is ideal; values in the 80s are considered quite good in practice.
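Percentage decreases and "x" factors are easy to mix up, so here is a small sketch of the conversions used in this article. The function names are my own, and the 3.2x-on-4x example for scaling efficiency is a hypothetical input, not a figure from the benchmark.

```python
def reduction_factor(percent_decrease: float) -> float:
    """Convert 'X% lower' into 'N times lower' (e.g. latency)."""
    return 1.0 / (1.0 - percent_decrease / 100.0)


def percent_decrease(factor: float) -> float:
    """Convert 'N times lower' back into a percentage decrease."""
    return 100.0 * (1.0 - 1.0 / factor)


def strong_scaling_efficiency(speedup: float, resource_multiple: float) -> float:
    """Achieved speedup as a percentage of the ideal (linear) speedup."""
    return 100.0 * speedup / resource_multiple


# A 67.9% TTFT decrease corresponds to roughly a 3.1x reduction.
print(f"{reduction_factor(67.9):.2f}x")
# Hypothetical: 3.2x faster on 4x the resources.
print(f"{strong_scaling_efficiency(3.2, 4):.1f}%")
```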

MoE (Mixture-of-Experts): An architecture that doesn't activate all model parameters at once, but instead selectively uses only some "expert" layers per input. DeepSeek-V3 achieves low computation relative to its parameter count thanks to this MoE structure, but requires EP (Expert Parallelism) since the full parameters must be distributed across multiple GPUs.
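The "selectively uses only some experts" part can be sketched as generic softmax top-k gating. This is a textbook-style illustration and not DeepSeek-V3's actual router; the logits and the choice of k are arbitrary.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts for one token and renormalize
    their gate weights so they sum to 1."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# One token, four experts, two of them activated.
for expert_id, weight in top_k_route([0.1, 2.0, -1.0, 1.5], k=2):
    print(f"expert {expert_id}: weight {weight:.3f}")
```

Only the chosen experts run for that token, but all experts must still live in GPU memory somewhere, which is what Expert Parallelism distributes.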

Practical Application

Here are three practical examples: one for enabling EPD directly with a VLM workload, one for serving DeepSeek-V3-scale models across multiple nodes, and one for vLLM users. Feel free to start with whichever fits your situation.

Example 1: Multi-Image VLM Serving with SGLang EPD (Qwen2-VL)

Let's assume you have a VLM workload processing 4–8 images per request. Enabling SGLang EPD means launching Encoder, Prefill, and Decode servers separately. The commands below are based on SGLang v0.4.x; parameter names and router entrypoints may differ by version, so it's worth checking the official epd_disaggregation.md documentation as well.

bash

# 1. Start Encoder server (ViT-only, scale horizontally with --dp-size based on image request volume)
python -m sglang.launch_server \
  --model-path Qwen/Qwen2-VL-7B-Instruct \
  --port 30000 \
  --role encoder \
  --tp-size 1 \
  --dp-size 4
 
# 2. Start Prefill server
python -m sglang.launch_server \
  --model-path Qwen/Qwen2-VL-7B-Instruct \
  --port 30001 \
  --role prefill \
  --tp-size 4 \
  --disagg-prefill-port 40001 \
  --encoder-server-url http://localhost:30000
 
# 3. Start Decode server
python -m sglang.launch_server \
  --model-path Qwen/Qwen2-VL-7B-Instruct \
  --port 30002 \
  --role decode \
  --tp-size 4 \
  --disagg-decode-port 40002
 
# 4. Start Router (Proxy)
# The entrypoint and parameters may vary by SGLang version
# Please refer to the official epd_disaggregation.md commands first
python -m sglang.launch_server \
  --role router \
  --prefill http://localhost:30001 \
  --decode http://localhost:30002 \
  --port 8000

Client code can use the standard OpenAI SDK as-is, so you barely need to change existing code. This was one of the more pleasant surprises when first adopting EPD.

python

import openai
 
client = openai.Client(base_url="http://localhost:8000/v1", api_key="none")
 
response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img3.jpg"}},
                {"type": "text", "text": "What do these three images have in common?"},
            ],
        }
    ],
)
 
# Print the model's answer
print(response.choices[0].message.content)

Component	Role	Scaling Strategy
`--role encoder`	Runs ViT only, returns vision embeddings	Scale horizontally by increasing `--dp-size`
`--role prefill`	Handles language model prefill only	Accelerate computation by increasing `--tp-size`
`--role decode`	Autoregressive token generation	Add nodes to match concurrent session count
Router (Proxy)	Request routing and flow coordination	Keep the router itself lightweight
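Since TTFT is the headline metric for EPD, it is worth measuring it from the client side before and after the change. The helper below is a minimal sketch: `measure_ttft` is a hypothetical name of my own, and it works on any iterable of chunks, so it can wrap the stream returned by the OpenAI SDK when `stream=True` is passed.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    stream: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (seconds until the first chunk, total seconds) for a stream.

    The first value is None if the stream produced no chunks at all.
    """
    start = clock()
    ttft = None
    for _chunk in stream:
        if ttft is None:
            ttft = clock() - start
    return ttft, clock() - start


# Stand-in stream so the example runs without a server.
def fake_stream():
    time.sleep(0.05)  # pretend encoder + prefill latency
    yield "Hello"
    yield " world"


ttft, total = measure_ttft(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms")
```

Against a real server you would pass `client.chat.completions.create(..., stream=True)` as the stream. Note that the first streamed chunk may carry only the role and no text, so this measures time to first chunk, which is a close but not identical proxy.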

Example 2: DeepSeek-V3-Scale Serving with SGLang Pipeline Parallelism

Large-scale MoE models like DeepSeek-V3/R1 run out of GPU memory on a single node with TP alone. Combining Pipeline Parallelism with PD Disaggregation enables natural multi-node scaling.

bash

# Prefill server — PP4 TP8 configuration (4 nodes, 8 GPUs per node)
# Run with --node-rank set to 0, 1, 2, 3 differently on each node
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --role prefill \
  --tp-size 8 \
  --pp-size 4 \
  --chunked-prefill-size 12288 \
  --dist-init-addr "node0:29500" \
  --node-rank 0 \
  --nnodes 4
 
# Decode server — includes EP32 large-scale Expert Parallelism
# Run on node4–node7, also with --node-rank set to 0–3 on each node
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --role decode \
  --tp-size 8 \
  --ep-size 32 \
  --dist-init-addr "node4:29500" \
  --node-rank 0 \
  --nnodes 4

The following throughput figures come from the LMSYS report, which states them per node on a 12-node, 96-GPU H100 cluster. "52,300 tokens/sec" may not mean much at first glance, but the report describes it as nearly equivalent to the figures announced for DeepSeek's official deployment, a remarkably impressive result for open-source serving.

python

# 96 H100 GPU configuration throughput (LMSYS report, per node, based on 2000-token input)
performance_h100 = {
    "input_throughput": "52,300 tokens/sec",  # Per node; close to DeepSeek official figures
    "output_throughput": "22,300 tokens/sec",  # Per node
}
 
# 64 H200 GPU configuration (improvement over single-node TP baseline)
performance_h200 = {
    "throughput_improvement": "3.8x",
    "ttft_reduction":  "3.5x (approximately 71% decrease)",
}

Example 3: Enabling Encoder Disaggregation in vLLM

Native encoder disaggregation is supported starting from vLLM v0.11.1. When serving models like LLaVA-OneVision, it's activated with a single CLI option. As explained above, applying TP to ViT offers little benefit because of communication costs, so leaving --encoder-tensor-parallel-size at 1 is the standard strategy.

bash

# Run vLLM CLI server — with NIXL transfer layer
vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf \
  --tensor-parallel-size 4 \
  --encoder-tensor-parallel-size 1 \
  --kv-transfer-config '{"kv_connector":"NixlConnector"}' \
  --port 8000

python

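# When calling directly via the Python API
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    tensor_parallel_size=4,
    max_num_seqs=256,
    gpu_memory_utilization=0.9,
)
 
# Run a short text-only generation to confirm the engine is up
outputs = llm.generate(["Describe what a vision encoder does."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)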

Pros and Cons Analysis

Before actually adopting EPD, it's worth checking one thing first. If your average image count per request is fewer than 2, the benefits of EPD will be limited. Forcing it on a text-heavy workload will only increase operational complexity. Measuring your current Prefill node's GPU utilization and idle time ratio with Prometheus and Grafana first will help you prioritize whether to adopt it.
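If you already log requests in the OpenAI chat format, the average image count is easy to compute. The sketch below assumes each log entry is the `messages` list of one request; the function names and the sample log are hypothetical, and the threshold of 2 comes from the rule of thumb above.

```python
def images_per_request(messages: list[dict]) -> int:
    """Count image_url parts across all messages of one chat request."""
    count = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # multimodal content is a list of parts
            count += sum(1 for part in content if part.get("type") == "image_url")
    return count


def epd_worth_evaluating(request_log: list[list[dict]], threshold: float = 2.0) -> bool:
    """True if the average image count per request reaches the threshold."""
    if not request_log:
        return False
    average = sum(images_per_request(m) for m in request_log) / len(request_log)
    return average >= threshold


image_part = {"type": "image_url", "image_url": {"url": "https://example.com/img.jpg"}}
text_part = {"type": "text", "text": "Compare these."}
sample_log = [
    [{"role": "user", "content": [image_part, image_part, image_part, text_part]}],
    [{"role": "user", "content": "Plain text question"}],
    [{"role": "user", "content": [image_part, image_part, image_part, text_part]}],
]
print(epd_worth_evaluating(sample_log))
```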

Advantages

Item	Description
TTFT reduction	Up to 6–8x reduction for multi-image workloads (based on 4–8 images per request)
Throughput improvement	Up to 57% increase in end-to-end throughput
Memory efficiency	Up to 15x GPU memory savings at the encoding stage, enabling up to 22x larger batch sizes
Independent scaling	Encoder/Prefill/Decode can each be horizontally scaled to match workload characteristics
Embedding caching	ViT computation for repeatedly seen images can be cached and reused at the encoder server
Large-scale model serving	Combined with PP, enables multi-node scaling for models impossible on a single node
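The embedding caching idea can be sketched as a content-addressed LRU cache in front of the encoder. This is an illustration of the concept only, not SGLang's implementation; `EmbeddingCache` and `encode_fn` are hypothetical names, and the stand-in encoder just returns a dummy vector.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache keyed by the SHA-256 of the raw image bytes."""

    def __init__(self, encode_fn, max_entries: int = 1024):
        self._encode_fn = encode_fn
        self._max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._encode_fn(image_bytes)  # the expensive ViT call
        self._store[key] = embedding
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return embedding


# Stand-in encoder so the example runs without a model.
cache = EmbeddingCache(encode_fn=lambda data: [float(len(data))], max_entries=2)
cache.get(b"logo.png bytes")
cache.get(b"logo.png bytes")  # second request for the same image
print(f"hits={cache.hits} misses={cache.misses}")
```

Hashing the bytes rather than the URL means the same image served from two different URLs still hits the cache, at the cost of downloading it first.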

Disadvantages and Caveats

Item	Description	Mitigation
High-speed network required	Vision embeddings and KV cache must be transferred between nodes — overhead on standard Ethernet can negate the benefits	Ensure InfiniBand or 400GbE or better; use Mooncake/NIXL
Increased operational complexity	Three independent server pools must each be managed separately	Profile Encoder:Prefill:Decode ratios in advance based on workload characteristics
Chunk size sensitivity	Optimal chunk size for Pipeline Parallelism varies by model	DeepSeek-V3.1: 12K, Qwen3-235B: 6K — per-model benchmarking required
Role-switching overhead	Reassigning roles between nodes when workload changes incurs cost	Consider dynamic schedulers or autoscaling policies where possible
Model support scope	SGLang EPD supports major models including LLaVA-OneVision, Qwen2-VL, LLaVA-Video	For unsupported models, check the vLLM-Omni roadmap or develop an adapter

KV Cache transfer: The process of passing the Key-Value tensors (the memory of attention) computed on the Prefill server to the Decode server. If this transfer becomes a bottleneck, much of the gain from disaggregation is lost, so a high-speed interconnect is a prerequisite.
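A quick estimate shows why the network matters. The sketch below uses the standard KV cache size formula for a model with grouped-query attention; the model dimensions are illustrative values for a model of roughly 7B scale, the transfer time ignores protocol overhead and latency, and the formula does not apply to architectures that store a compressed KV representation.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence (factor 2 = one K and one V)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_ms(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time in milliseconds over a link of the given speed."""
    return num_bytes * 8 / (link_gbps * 1e9) * 1e3


# Assumed dimensions: 28 layers, 4 KV heads, head_dim 128, FP16, 8192 tokens.
size = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128, num_tokens=8192)
print(f"KV cache: {size / 2**20:.0f} MiB")
for gbps in (10, 100, 400):
    print(f"{gbps:>3} Gb/s link: {transfer_ms(size, gbps):.1f} ms")
```

With these assumptions the same transfer takes tens of times longer on a 10 Gb/s link than on a 400 Gb/s one, and that time is added directly to every request's TTFT.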

The Most Common Mistakes in Practice

Adopting EPD without considering network bandwidth — In a standard Ethernet environment, embedding and KV transfer overhead can cancel out TTFT gains. It's recommended to first verify whether you have InfiniBand or 400GbE or better.
Leaving chunk size at the default — The optimal chunk size for Pipeline Parallelism varies significantly depending on model architecture and sequence length. Even with the same model, retuning may be necessary if the input distribution changes.
Fixing the Encoder:Prefill:Decode ratio — It's common for daytime hours to have a high proportion of image requests, while longer text-heavy requests dominate at night. Fixing the ratio leaves one side idle, so it's recommended to consider dynamic schedulers or autoscaling policies where possible.
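The ratio problem can be made concrete with a small sizing sketch. It is deliberately crude: it treats each replica as serving one request at a time (real servers batch), and the request rate and per-stage seconds are made-up values, so only the shift in ratio between the two profiles is meaningful, not the absolute counts.

```python
import math


def size_pools(requests_per_sec: float,
               stage_seconds: dict[str, float],
               target_utilization: float = 0.7) -> dict[str, int]:
    """Replicas per stage so each pool stays at or below the target utilization."""
    return {
        stage: max(1, math.ceil(requests_per_sec * seconds / target_utilization))
        for stage, seconds in stage_seconds.items()
    }


# Hypothetical per-request stage times for two traffic profiles.
daytime = {"encoder": 0.30, "prefill": 0.10, "decode": 0.50}    # image-heavy
nighttime = {"encoder": 0.05, "prefill": 0.25, "decode": 0.50}  # long text

print("day:  ", size_pools(20, daytime))
print("night:", size_pools(20, nighttime))
```

Feeding measured stage times into a function like this on a schedule is the simplest form of the autoscaling policy suggested above.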

Closing Thoughts

Honestly, I don't think EPD is necessary for every VLM serving setup. If you have 1–2 images or fewer per request, the operational complexity of managing three separate server pools may not yield enough benefit to justify it. However, when the proportion of image requests is high — especially when you're processing 4 or more images per request — EPD becomes less of an option and more of a necessity. If reading this article has helped you make an informed judgment about whether EPD is right for your own serving environment, that's the most valuable takeaway.

EPD Disaggregation is an architectural shift that moves VLM serving from a 2-stage to a 3-stage paradigm, and its value grows as the number of images per request increases. To summarize how you can get started right now:

✓ Test EPD on a single machine with SGLang v0.4.x: After pip install "sglang[all]" (quoted so the shell does not try to expand the brackets), you can run Encoder/Prefill/Decode servers locally using a relatively small VLM like Qwen2-VL-7B. Step-by-step commands are available in the official epd_disaggregation.md documentation.

✓ Measure your current workload's image request ratio — Using Prometheus and Grafana to measure GPU utilization and idle time ratio on your current Prefill node will help you prioritize adoption. EPD's benefits become clearly evident once the average number of images per request exceeds 2.

✓ Adopt incrementally — Start by applying PD Disaggregation first on node pairs where InfiniBand or high-speed Ethernet is available, confirm stability, then add Encoder disaggregation afterwards to reduce risk.

