SGLang EPD Disaggregation and Pipeline Parallelism — An Architecture That Splits Vision-Language Model Serving into 3 Stages to Reduce TTFT by Up to 8x
Even if you've never directly served multimodal AI before, that's fine. These days, AI features that accept image input are becoming so widespread so quickly that sooner or later everyone will need to serve a Vision-Language Model (VLM) — an AI model that understands both text and images together. And when that moment arrives, you'll hit a wall much sooner than expected. "Why is it this slow? I just added a few images."
I ran into this problem myself the first time I set up VLM serving. I spent a long time trying to figure out why GPU utilization on the Prefill node kept stuttering every time an image was encoded, and ultimately the problem was the architecture itself — tying Vision Transformer (ViT) encoding and language model processing together on the same GPU. PD (Prefill-Decode) disaggregation, which has become the standard for LLM serving, is powerful, but it had a critical blind spot for VLMs.
EPD (Encoder-Prefill-Decode) Disaggregation is an architectural shift that fully separates these three stages into independent server pools, reducing TTFT (time to first token) by up to 6–8x for multi-image VLM serving. Combining this with Pipeline Parallelism allows even massive models like DeepSeek-V3 to scale naturally across multiple nodes. Let's explore why EPD is necessary, how to actually set it up, and the pitfalls you absolutely need to know before adopting it.
Core Concepts
Why PD Disaggregation Came First
To understand EPD, you need to look at its predecessor: PD Disaggregation. In LLM serving, Prefill and Decode have completely different computational characteristics.
- Prefill: A compute-bound operation that processes all input tokens at once (compute-intensive)
- Decode: A memory-bound operation that generates tokens one at a time in an autoregressive fashion (memory-bandwidth-intensive)
Running these two workloads together on the same GPU causes them to interfere with each other. That's why PD Disaggregation — running each in its own dedicated GPU pool — has become the industry standard. Major companies including Meta, LinkedIn, and Mistral already use this architecture in production.
Why PD Alone Isn't Enough for VLMs
The problem is that VLMs add ViT (Vision Transformer) encoding on top of all this. The ViT converts images or video into vectors (embeddings) that the language model can understand. Processing this on the Prefill node alongside everything else triggers three problems simultaneously.
| Problem | Description |
|---|---|
| TP scaling inefficiency | Increasing Tensor Parallelism, used to accelerate the LLM, barely improves ViT processing speed |
| LLM GPU idle time | As image count grows, wait time for ViT encoding to finish increases dramatically |
| Memory consumption | Encoder memory eats into KV cache space and batch size |
You might wonder why TP scaling doesn't benefit ViT. Tensor Parallelism splits layer operations across multiple GPUs and synchronizes results (all-reduce communication). Because the sequence length per layer in ViT is fixed by input image resolution and the layers themselves are relatively simple, communication costs quickly eat up the savings from distributed computation. You end up in a situation where adding more GPUs barely improves speed at all.
EPD Architecture — Full Separation into Three Stages
EPD solves this by completely separating these three stages into independent server pools.
[Image/Video Input]
↓
┌─────────────────────────┐
│ Encoder Node │ ← ViT-only, horizontal scaling with data parallelism, embedding caching
└────────────┬────────────┘
│ Vision embedding transfer (ZMQ / Mooncake)
↓
┌─────────────────────────┐
│ Prefill Node │ ← LLM language prefill only, compute-bound
└────────────┬────────────┘
│ KV Cache transfer
↓
┌─────────────────────────┐
│ Decode Node │ ← Autoregressive token generation, memory-bound
└─────────────────────────┘
Because each stage is fully separated, when image request volume increases you can simply add more Encoder nodes, and when concurrent sessions increase you can horizontally scale Decode nodes independently. The separation also naturally enables the insight that ViT scales much more effectively with Data Parallelism than Tensor Parallelism.
TP vs DP at a glance: Tensor Parallelism (TP) splits a single layer's operations across multiple GPUs. Data Parallelism (DP) places identical copies of the model on multiple GPUs, where each copy handles different requests independently. Since TP incurs high communication overhead for ViT, using DP to respond linearly to increased request volume is far more efficient.
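To make the TP-versus-DP trade-off concrete, here is a toy cost model. Every constant in it (encode time, layer count, all-reduce cost, two all-reduces per layer) is an assumption chosen for illustration, not a measurement of any real ViT or GPU, so treat the output as a shape rather than a benchmark.

```python
# Toy model of ViT encoding under TP vs DP.
# All constants are illustrative assumptions, not measurements.

def tp_latency_ms(compute_ms: float, layers: int, allreduce_ms: float, tp: int) -> float:
    """Per-image encode latency when one ViT copy is sharded over `tp` GPUs."""
    if tp == 1:
        return compute_ms  # single GPU: no cross-GPU synchronization
    # Compute shrinks with tp, but each layer pays a fixed synchronization cost
    # (assumed here: 2 all-reduces per transformer layer).
    return compute_ms / tp + 2 * layers * allreduce_ms


def images_per_second(latency_ms: float, replicas: int) -> float:
    """Aggregate throughput of `replicas` independent copies."""
    return replicas * 1000.0 / latency_ms


COMPUTE_MS = 40.0    # assumed single-GPU encode time per image
LAYERS = 32          # assumed ViT depth
ALLREDUCE_MS = 0.15  # assumed cost of one all-reduce

for gpus in (1, 2, 4, 8):
    tp_tput = images_per_second(
        tp_latency_ms(COMPUTE_MS, LAYERS, ALLREDUCE_MS, gpus), replicas=1
    )
    dp_tput = images_per_second(COMPUTE_MS, replicas=gpus)
    print(f"{gpus} GPUs  TP: {tp_tput:6.1f} img/s   DP: {dp_tput:6.1f} img/s")
```

In this model TP still lowers per-image latency, but the fixed synchronization term caps the gain, while DP throughput grows in proportion to the number of replicas. Real numbers depend on the interconnect and the encoder, so it is worth measuring before settling on a layout.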
Vision embeddings and KV Cache need to be transferred between nodes. SGLang supports two transfer backends: ZMQ and Mooncake. ZMQ is simpler to configure for small-scale tests or single-cluster environments, while Mooncake offers higher transfer throughput for high-performance production environments using InfiniBand (a dedicated high-speed network for inter-GPU-server communication). vLLM uses a similar NIXL transfer layer for this purpose.
Combining with Pipeline Parallelism — For Large-Scale MoE Models and Long Contexts
Separately from EPD, SGLang announced in early 2026 the ability to combine chunked pipeline parallelism with PD Disaggregation.
Large-scale MoE (Mixture-of-Experts) models with hundreds of billions of parameters — like DeepSeek-V3 — are difficult to fit entirely into a single node's GPU memory. Pipeline Parallelism distributes model layers across multiple nodes like a pipeline, enabling scales that are impossible on a single node. When layers are distributed across multiple nodes, the KV cache for each node is also distributed, allowing processing of context lengths that a single node could never handle.
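Why chunking matters for pipeline-parallel prefill can be sketched with an idealized timing model. It assumes equal-cost chunks, perfectly balanced stages, and no communication cost, and it ignores that attention gets more expensive as context grows, so it is an upper bound on the benefit rather than a prediction. The function names and the 120,000-token prompt are made up for the example.

```python
import math


def num_chunks(prompt_tokens: int, chunk_size: int) -> int:
    """Number of prefill chunks for a prompt of the given length."""
    return math.ceil(prompt_tokens / chunk_size)


def chunked_pipeline_time(total_compute: float, stages: int, chunks: int) -> float:
    """Idealized prefill time on a `stages`-deep pipeline.

    `total_compute` is the time one device group would need for the whole
    prompt through all layers. Each (chunk, stage) cell costs an equal slice,
    and the pipeline needs `chunks + stages - 1` steps to drain.
    """
    cell = total_compute / (stages * chunks)
    return (chunks + stages - 1) * cell


chunks = num_chunks(prompt_tokens=120_000, chunk_size=12_288)  # hypothetical prompt
unchunked = chunked_pipeline_time(1.0, stages=4, chunks=1)
chunked = chunked_pipeline_time(1.0, stages=4, chunks=chunks)
print(f"chunks={chunks}  unchunked={unchunked:.3f}  chunked={chunked:.3f}")
```

With a single chunk, only one stage is busy at a time and the pipeline gives no speedup. Once the prompt is split into several chunks, the stages overlap, which is the effect the chunk size setting is tuning.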
DeepSeek-V3.1 serving configuration (H20 cluster, SGLang)
PP4 TP8 configuration:
- PP: layer pipeline distributed across 4 nodes
- TP: 8 GPU tensor parallelism within each node
- Chunk size: 12K tokens
Compared to TP32 single configuration:
- Prefill Throughput: +30.5%
- TTFT: -67.9% (approximately 3.1x reduction)
- Strong Scaling Efficiency: 82.8%
Strong Scaling Efficiency: A metric indicating how close to N times the throughput you get when you add N times as many nodes. 100% is ideal; values in the 80s are considered quite good in practice.
MoE (Mixture-of-Experts): An architecture that doesn't activate all model parameters at once, but instead selectively uses only some "expert" layers per input. DeepSeek-V3 achieves low computation relative to its parameter count thanks to this MoE structure, but requires EP (Expert Parallelism) since the full parameters must be distributed across multiple GPUs.
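Percentages and "Nx" multipliers describe the same change in two ways, and it is easy to mix them up when comparing reports. The helpers below are a small sketch of the arithmetic; the 3.312x speedup passed to the efficiency function is a hypothetical input chosen to show the formula, not a figure from the benchmark.

```python
def reduction_pct_to_factor(pct: float) -> float:
    """A 67.9% latency reduction leaves (1 - 0.679) of the original latency."""
    return 1.0 / (1.0 - pct / 100.0)


def factor_to_reduction_pct(factor: float) -> float:
    """An Nx reduction leaves 1/N of the original latency."""
    return (1.0 - 1.0 / factor) * 100.0


def strong_scaling_efficiency(speedup: float, scale: float) -> float:
    """Achieved speedup as a percentage of the ideal `scale`x speedup."""
    return speedup / scale * 100.0


print(round(reduction_pct_to_factor(67.9), 2))           # 3.12
print(round(factor_to_reduction_pct(3.5), 1))            # 71.4
print(round(strong_scaling_efficiency(3.312, 4.0), 1))   # hypothetical speedup on 4x nodes
```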
Practical Application
Here are three practical examples: one for directly enabling EPD with a VLM workload, one for serving DeepSeek-V3-scale models across multiple nodes, and one for users of vLLM. Feel free to start with whichever example fits your situation.
Example 1: Multi-Image VLM Serving with SGLang EPD (Qwen2-VL)
Let's assume you have a VLM workload processing 4–8 images per request. Enabling SGLang EPD means launching Encoder, Prefill, and Decode servers separately. The commands below are based on SGLang v0.4.x; parameter names and router entrypoints may differ by version, so it's worth checking the official epd_disaggregation.md documentation as well.
# 1. Start Encoder server (ViT-only, scale horizontally with --dp-size based on image request volume)
python -m sglang.launch_server \
--model-path Qwen/Qwen2-VL-7B-Instruct \
--port 30000 \
--role encoder \
--tp-size 1 \
--dp-size 4
# 2. Start Prefill server
python -m sglang.launch_server \
--model-path Qwen/Qwen2-VL-7B-Instruct \
--port 30001 \
--role prefill \
--tp-size 4 \
--disagg-prefill-port 40001 \
--encoder-server-url http://localhost:30000
# 3. Start Decode server
python -m sglang.launch_server \
--model-path Qwen/Qwen2-VL-7B-Instruct \
--port 30002 \
--role decode \
--tp-size 4 \
--disagg-decode-port 40002
# 4. Start Router (Proxy)
# The entrypoint and parameters may vary by SGLang version
# Please refer to the official epd_disaggregation.md commands first
python -m sglang.launch_server \
--role router \
--prefill http://localhost:30001 \
--decode http://localhost:30002 \
--port 8000
Client code can use the standard OpenAI SDK as-is, so you barely need to change existing code. This was one of the more pleasant surprises when first adopting EPD.
import openai
client = openai.Client(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="Qwen2-VL-7B-Instruct",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
{"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
{"type": "image_url", "image_url": {"url": "https://example.com/img3.jpg"}},
{"type": "text", "text": "What do these three images have in common?"},
],
}
],
)
| Component | Role | Scaling Strategy |
|---|---|---|
| --role encoder | Runs ViT only, returns vision embeddings | Scale horizontally by increasing --dp-size |
| --role prefill | Handles language model prefill only | Accelerate computation by increasing --tp-size |
| --role decode | Autoregressive token generation | Add nodes to match concurrent session count |
| Router (Proxy) | Request routing and flow coordination | Keep the router itself lightweight |
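Since TTFT is the metric EPD is meant to improve, it helps to measure it from the client side before and after the change. The sketch below assumes the router exposes OpenAI-compatible streaming, as the client code above suggests. `time_to_first_token` and `text_deltas` are helper names invented for this example, and the clock is injectable so the timing logic can be checked without a server.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def text_deltas(stream: Iterable) -> Iterator[str]:
    """Yield text pieces from an OpenAI-style streaming chat completion."""
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def time_to_first_token(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from issuing the request until the first non-empty text piece.

    `start_stream` must send the request when called, so that network and
    queueing time are included in the measurement.
    """
    start = clock()
    for text in start_stream():
        if text:
            return clock() - start
    return None  # the stream ended without producing any text


# Usage against a running server (not executed here):
# ttft = time_to_first_token(lambda: text_deltas(
#     client.chat.completions.create(model=..., messages=..., stream=True)))
```

Running this over the same multi-image prompts with and without the Encoder pool gives a like-for-like TTFT comparison for your own workload.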
Example 2: DeepSeek-V3-Scale Serving with SGLang Pipeline Parallelism
Large-scale MoE models like DeepSeek-V3/R1 run out of GPU memory on a single node with TP alone. Combining Pipeline Parallelism with PD Disaggregation enables natural multi-node scaling.
# Prefill server — PP4 TP8 configuration (4 nodes, 8 GPUs per node)
# Run with --node-rank set to 0, 1, 2, 3 differently on each node
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--role prefill \
--tp-size 8 \
--pp-size 4 \
--chunked-prefill-size 12288 \
--dist-init-addr "node0:29500" \
--node-rank 0 \
--nnodes 4
# Decode server — includes EP32 large-scale Expert Parallelism
# Run on node4–node7, also with --node-rank set to 0–3 on each node
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--role decode \
--tp-size 8 \
--ep-size 32 \
--dist-init-addr "node4:29500" \
--node-rank 0 \
--nnodes 4
The following are actual throughput figures from the LMSYS report. "52,300 tokens/sec" may not mean much at first glance, but it's nearly equivalent to the figures announced in DeepSeek's official deployment, a remarkably impressive result for open-source serving.
# 96 H100 GPU configuration throughput (LMSYS report, based on 2000-token input)
performance_h100 = {
"input_throughput": "52,300 tokens/sec", # Equivalent to DeepSeek official figures
"output_throughput": "22,300 tokens/sec",
}
# 64 H200 GPU configuration (improvement over single-node TP baseline)
performance_h200 = {
"throughput_improvement": "3.8x",
"ttft_reduction": "3.5x (approximately 71% decrease)",
}
Example 3: Enabling Encoder Disaggregation in vLLM
Native encoder disaggregation is supported starting from vLLM v0.11.1. When serving models like LLaVA-OneVision, it's activated with a single CLI option. As explained above, applying TP to ViT offers little benefit because communication costs offset most of the gain, so leaving --encoder-tensor-parallel-size at 1 is the standard strategy.
# Run vLLM CLI server — with NIXL transfer layer
vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf \
--tensor-parallel-size 4 \
--encoder-tensor-parallel-size 1 \
--kv-transfer-config '{"kv_connector":"NixlConnector"}' \
--port 8000
# When calling directly via the Python API
from vllm import LLM, SamplingParams
llm = LLM(
model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
tensor_parallel_size=4,
max_num_seqs=256,
gpu_memory_utilization=0.9,
)
Pros and Cons Analysis
Before actually adopting EPD, it's worth checking one thing first. If your average image count per request is fewer than 2, the benefits of EPD will be limited. Forcing it on a text-heavy workload will only increase operational complexity. Measuring your current Prefill node's GPU utilization and idle time ratio with Prometheus and Grafana first will help you prioritize whether to adopt it.
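If you log incoming requests in the OpenAI chat format, the average image count is easy to compute. This is a minimal sketch that assumes each logged request is a list of messages shaped like the client example earlier; the function names are my own.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def count_images(messages: List[Message]) -> int:
    """Count image_url parts in one OpenAI-style chat request."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content carries no images
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


def average_images_per_request(requests: List[List[Message]]) -> float:
    """Mean number of images across logged requests (0.0 for an empty log)."""
    if not requests:
        return 0.0
    return sum(count_images(messages) for messages in requests) / len(requests)


sample_log = [
    [{"role": "user", "content": "Summarize this paragraph."}],
    [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
        {"type": "text", "text": "Compare these."},
    ]}],
]
print(average_images_per_request(sample_log))
```

An average near or below 2 suggests EPD is a lower priority. It is also worth looking at the distribution, since a small share of image-heavy requests can dominate tail TTFT.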
Advantages
| Item | Description |
|---|---|
| TTFT reduction | Up to 6–8x reduction for multi-image workloads (based on 4–8 images per request) |
| Throughput improvement | Up to 57% increase in end-to-end throughput |
| Memory efficiency | Up to 15x GPU memory savings at the encoding stage, enabling up to 22x larger batch sizes |
| Independent scaling | Encoder/Prefill/Decode can each be horizontally scaled to match workload characteristics |
| Embedding caching | ViT computation for repeatedly seen images can be cached and reused at the encoder server |
| Large-scale model serving | Combined with PP, enables multi-node scaling for models impossible on a single node |
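The embedding caching row can be pictured as a content-addressed cache sitting in front of the ViT. The class below is only an illustration of the idea, assuming images are keyed by a hash of their raw bytes and evicted in LRU order. It is not how SGLang implements its cache, and `EmbeddingCache` and its `encode` callback are hypothetical names.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """LRU cache keyed by the SHA-256 of the raw image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]], capacity: int = 1024):
        self._encode = encode        # stand-in for the real ViT forward pass
        self._capacity = capacity
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        self.misses += 1
        embedding = self._encode(image_bytes)   # the expensive step we want to skip
        self._entries[key] = embedding
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return embedding
```

Hashing the bytes rather than the URL means the same image served from two URLs still hits the cache, at the cost of downloading it first.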
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| High-speed network required | Vision embeddings and KV cache must be transferred between nodes — overhead on standard Ethernet can negate the benefits | Ensure InfiniBand or 400GbE or better; use Mooncake/NIXL |
| Increased operational complexity | Three independent server pools must each be managed separately | Profile Encoder:Prefill:Decode ratios in advance based on workload characteristics |
| Chunk size sensitivity | Optimal chunk size for Pipeline Parallelism varies by model | DeepSeek-V3.1: 12K, Qwen3-235B: 6K — per-model benchmarking required |
| Role-switching overhead | Reassigning roles between nodes when workload changes incurs cost | Consider dynamic schedulers or autoscaling policies where possible |
| Model support scope | SGLang EPD supports major models including LLaVA-OneVision, Qwen2-VL, LLaVA-Video | For unsupported models, check the vLLM-Omni roadmap or develop an adapter |
KV Cache transfer: The process of passing the Key-Value tensors (the attention layers' memory of the prompt) computed on the Prefill server to the Decode server. If this transfer becomes a bottleneck, it can cancel out much of the gain from disaggregation, so a high-speed interconnect is a prerequisite.
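A back-of-the-envelope estimate shows why the link speed matters. The sketch assumes a standard attention layout with separate Key and Value tensors per layer in 16-bit precision. The model shape used (28 layers, 4 KV heads, head dimension 128) is an assumed 7B-class configuration for illustration, and the link is treated as fully utilized with no protocol overhead. Models that compress the KV cache, such as those using multi-head latent attention, need a different formula.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2) -> int:
    """Size of the KV cache for `tokens` prompt tokens.

    The factor 2 accounts for one Key and one Value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


def transfer_seconds(num_bytes: int, link_gbit_per_s: float) -> float:
    """Ideal transfer time over a link of the given speed (no overhead)."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Assumed shape for illustration; check your model's config for real values.
size = kv_cache_bytes(tokens=8192, layers=28, kv_heads=4, head_dim=128)
for gbit in (10, 100, 400):
    ms = transfer_seconds(size, gbit) * 1000
    print(f"{gbit:>3} Gb/s: {size / 1e6:.0f} MB in {ms:.1f} ms")
```

The same arithmetic applies to vision embeddings sent from the Encoder to the Prefill node, with the embedding dimension and the number of image tokens in place of the KV shape.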
The Most Common Mistakes in Practice
- Adopting EPD without considering network bandwidth. In a standard Ethernet environment, embedding and KV transfer overhead can cancel out TTFT gains. It's recommended to first verify whether you have InfiniBand or 400GbE or better.
- Leaving chunk size at the default. The optimal chunk size for Pipeline Parallelism varies significantly depending on model architecture and sequence length. Even with the same model, retuning may be necessary if the input distribution changes.
- Fixing the Encoder:Prefill:Decode ratio. It's common for daytime hours to have a high proportion of image requests, while longer text-heavy requests dominate at night. Fixing the ratio leaves one side idle, so it's recommended to consider dynamic schedulers or autoscaling policies where possible.
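The ratio problem in the last item can be explored with a rough sizing rule: the number of busy replicas in a pool is roughly the arrival rate times the replica time each request consumes. The sketch below applies that rule per stage. All workload numbers are hypothetical, and the per-request times are assumed to be already amortized over batching, which is a strong simplification for the Decode pool in particular.

```python
import math
from typing import Dict


def replicas_needed(arrivals_per_s: float, seconds_per_item: float,
                    target_utilization: float = 0.7) -> int:
    """Replicas required to keep average utilization at or below the target."""
    busy = arrivals_per_s * seconds_per_item
    return max(1, math.ceil(busy / target_utilization))


def pool_sizes(requests_per_s: float, images_per_request: float,
               encode_s_per_image: float, prefill_s: float, decode_s: float,
               target_utilization: float = 0.7) -> Dict[str, int]:
    """Rough Encoder/Prefill/Decode replica counts for one traffic profile."""
    return {
        "encoder": replicas_needed(requests_per_s * images_per_request,
                                   encode_s_per_image, target_utilization),
        "prefill": replicas_needed(requests_per_s, prefill_s, target_utilization),
        "decode": replicas_needed(requests_per_s, decode_s, target_utilization),
    }


# Hypothetical profiles: image-heavy daytime vs. long-text nighttime traffic.
day = pool_sizes(20, images_per_request=6, encode_s_per_image=0.05,
                 prefill_s=0.2, decode_s=0.5)
night = pool_sizes(20, images_per_request=0.5, encode_s_per_image=0.05,
                   prefill_s=0.6, decode_s=0.5)
print("day:  ", day)
print("night:", night)
```

Even at the same request rate, the two profiles call for very different Encoder and Prefill pool sizes, which is the argument for rebalancing on a schedule or with an autoscaler instead of fixing the ratio.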
Closing Thoughts
Honestly, I don't think EPD is necessary for every VLM serving setup. If you have 1–2 images or fewer per request, the operational complexity of managing three separate server pools may not yield enough benefit to justify it. However, when the proportion of image requests is high — especially when you're processing 4 or more images per request — EPD becomes less of an option and more of a necessity. If reading this article has helped you make an informed judgment about whether EPD is right for your own serving environment, that's the most valuable takeaway.
EPD Disaggregation is an architectural shift that moves VLM serving from a 2-stage to a 3-stage paradigm, and its value grows with the number of images per request. To summarize how you can get started right now:
✓ Test EPD on a single machine with SGLang v0.4.x: after pip install "sglang[all]" (quoted so the shell does not interpret the brackets), you can run Encoder/Prefill/Decode servers locally using a relatively small VLM like Qwen2-VL-7B. Step-by-step commands are available in the official epd_disaggregation.md documentation.
✓ Measure your current workload's image request ratio — Using Prometheus and Grafana to measure GPU utilization and idle time ratio on your current Prefill node will help you prioritize adoption. EPD's benefits become clearly evident once the average number of images per request exceeds 2.
✓ Adopt incrementally — Start by applying PD Disaggregation first on node pairs where InfiniBand or high-speed Ethernet is available, confirm stability, then add Encoder disaggregation afterwards to reduce risk.
References
- EPD Disaggregation: Elastic Encoder Scaling for VLMs in SGLang | LMSYS Blog
- Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts | LMSYS Blog
- Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs | LMSYS Blog
- Efficiently Serving Large Multimodal Models Using EPD Disaggregation | arxiv 2501.05460
- Encoder Disaggregation for Scalable Multimodal Model Serving | vLLM Blog
- SGLang EPD Disaggregation Official Documentation | GitHub
- Disaggregated Encoder | vLLM Official Documentation
- vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models | arxiv 2602.02204
- Prefill-Decode Disaggregation | Ray Serve Official Documentation
- EPDServe Official Implementation | GitHub