SGLang Architecture That Extracts 6× Throughput from the Same GPUs — PD Disaggregation and HiCache
When I first took on LLM serving, the most baffling question was "We have enough GPUs, so why is this so slow?" The monitoring dashboard showed GPU memory nearly maxed out and compute running continuously, yet actual response times fell far short of expectations. For a while I kept thinking the answer was buying more GPUs — but as it turned out, the root cause was somewhere else entirely. Prefill (processing the entire input) and Decode (generating tokens one at a time) are two fundamentally different workloads, and we had them fighting over the same GPU pool.
The entire industry is now moving to address this problem at its root. Nearly every production LLM serving framework (NVIDIA Dynamo, vLLM, Ray Serve LLM, llm-d) has adopted or is integrating a PD-separated architecture. This trend, which started with Hao AI Lab's DistServe research, has since extended to Moonshot AI's Kimi service and Together AI's CPD (Cache-aware PD) architecture. SGLang officially supports PD Disaggregation (Prefill-Decode separation) and ships a hierarchical KV cache system called HiCache. The reported results are concrete: up to 5× output throughput on the same resources from PD separation alone (measured on a cluster of 96 H100 GPUs), up to 6× once HiCache is added, and up to an 84% reduction in TTFT (Time To First Token).
This article is written primarily for backend and ML engineers who have experience with, or interest in, LLM serving. Familiarity with basic distributed inference concepts like Tensor Parallelism will make it easier to follow, but any abbreviation appearing for the first time will be explained on the spot. After reading this, you will have a framework for judging why PD should be separated, how HiCache works, and which combination to try first in your own environment.
Core Concept: Why Prefill and Decode Should Not Share the Same Pool
They Are Completely Different Beasts from a Hardware Perspective
For a long time I lumped Prefill and Decode together as just "the first stage and the second stage." But from a GPU resource standpoint, these are entirely different kinds of work.
| | Prefill | Decode |
|---|---|---|
| Nature | Compute-bound | Memory-bound |
| Input | Thousands to hundreds of thousands of tokens processed in parallel | Tokens generated sequentially, one at a time |
| Bottleneck | FLOPs (compute volume) | HBM bandwidth |
| Optimal GPU | High Tensor Core density | High HBM bandwidth and capacity |
HBM (High Bandwidth Memory): The ultra-fast memory attached to a GPU — what people commonly call "GPU memory." An H100 carries 80 GB of HBM3. Memory-bound workloads like Decode are primarily bottlenecked by this HBM's bandwidth.
Prefill processes all input tokens at once, making it a compute-intensive workload that fully utilizes GPU cores. Decode, by contrast, repeatedly reads an already-built KV cache to generate tokens one at a time, making HBM bandwidth the bottleneck. When both workloads share the same GPU pool, neither can be optimized properly.
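A rough roofline-style estimate makes the split concrete. The sketch below is a simplification under stated assumptions: a dense decoder that spends about 2 FLOPs per parameter per token and reads every weight from HBM once per forward pass, with attention and KV cache reads ignored. The H100 figures are approximate datasheet values for the SXM part, not measurements.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read for one forward pass.

    Simplified model: ~2 FLOPs per parameter per token, and every weight
    is read from HBM once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


# Approximate H100 SXM figures (assumptions, not measurements).
PEAK_FLOPS = 989e12       # dense BF16 FLOP/s
PEAK_BANDWIDTH = 3.35e12  # HBM3 bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity at which compute and memory balance


def bound(tokens_per_pass: int) -> str:
    """Classify a forward pass by comparing its intensity with the ridge point."""
    if arithmetic_intensity(tokens_per_pass) >= RIDGE:
        return "compute-bound"
    return "memory-bound"


print(f"ridge point: {RIDGE:.0f} FLOPs/byte")
print("prefill, 4096-token prompt:", bound(4096))
print("decode, batch of 32 sequences:", bound(32))
```

Under this model a Prefill pass over thousands of tokens sits far above the ridge point, while a Decode step only crosses it at batch sizes in the hundreds, which is why the two phases want different batching and hardware.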
The problems do not stop there. When a large Prefill batch arrives, in-flight Decode streams stall. This is prefill interruption — from the user's perspective, the response simply freezes. On top of that, DP Attention Imbalance compounds the issue. This imbalance occurs when long-context requests concentrate on a specific DP (Data Parallel) rank: that rank becomes overloaded while the rest sit idle, causing unpredictable latency spikes.
TP/EP/DP/PP: Abbreviations for the distributed parallelization strategies commonly used in LLMs. TP (Tensor Parallelism) splits weight matrices across GPUs to parallelize each layer's computation; EP (Expert Parallelism) distributes the expert layers of MoE models; DP (Data Parallelism) distributes requests across identical model replicas; PP (Pipeline Parallelism) groups layers and pipelines them across multiple GPUs.
PD Disaggregation: Physically Separating the Two
The solution is exactly what the name says. Separate the Prefill server pool from the Decode server pool, and apply independently optimized hardware and parallelization strategies to each.
          [Client Request]
                  │
                  ▼
           [Load Balancer]
                  │
       ┌──────────┴──────────┐
       ▼                     ▼
   [Prefill               [Decode
    Server                 Server
    Pool]                  Pool]
       │                     ▲
       └──KV Cache Transfer──┘
             (RDMA/RoCE)
Prefill servers process all input tokens in parallel to produce the KV cache, then hand it off to Decode servers via an RDMA-based high-speed network. Decode servers use the received KV cache to generate tokens one at a time.
An important point is that different GPUs can be attached to each server pool. A heterogeneous mix is possible: compute-dense GPUs on Prefill servers and GPUs with more HBM capacity and bandwidth on Decode servers (for instance, H100s for Prefill and H200s, which carry 141 GB of HBM3e, for Decode). Because KV cache transfer accounts for roughly 25% of total latency, the network between Prefill and Decode should be InfiniBand or RoCE (RDMA over Converged Ethernet).
RDMA (Remote Direct Memory Access): A technology that accesses another server's memory directly over the network without going through the CPU. It is essentially mandatory for transferring KV caches that can reach several GB with low latency.
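To see why the network matters, it helps to put numbers on the KV cache itself. The sketch below uses the standard per-token KV size formula with the published Llama-3.1-70B shape (80 layers, 8 KV heads, head dimension 128, BF16). The 80% link efficiency is an assumption chosen for illustration, and real transfers also pay latency and protocol overhead that this ignores.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: one K and one V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: int, link_gbit_per_s: float, efficiency: float = 0.8) -> float:
    """Ideal transfer time over a link, derated by an assumed efficiency factor."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_second


# Llama-3.1-70B shape: 80 layers, 8 KV heads (GQA), head dimension 128, BF16 values.
size = kv_cache_bytes(tokens=32_768, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache for a 32K-token prompt: {size / 1e9:.1f} GB")
for name, gbit in [("10 GbE", 10), ("100 Gb/s RoCE", 100), ("400 Gb/s InfiniBand", 400)]:
    print(f"{name:>20}: {transfer_seconds(size, gbit):.2f} s")
```

A single 32K-token prompt already produces a cache of roughly 10 GB, so the gap between a 10 Gb/s link and a 400 Gb/s link is the gap between seconds and fractions of a second added to every request.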
SGLang has officially supported this architecture, built on the Mooncake TransferEngine, since April 2025.
HiCache: A Three-Tier Architecture That Doesn't Discard Evicted KV Caches
From RadixAttention to HiRadixTree
Honestly, when I first saw the HiCache concept I thought "isn't this just cache tiering?" — but to really understand it, you first need to know SGLang's RadixAttention.
RadixAttention is a prefix-sharing KV cache technique developed by SGLang. It uses a Radix tree structure to manage KV caches so that requests sharing the same system prompt or context can reuse them. For example, if 100 requests all carry the same long system prompt, the KV cache for that prompt is computed once and reused from the tree.
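The prefix-matching idea can be illustrated with a toy index. This is a plain token-level trie, not SGLang's compressed radix tree, and it tracks only which prefixes have been seen rather than holding any KV tensors; the class and method names are hypothetical.

```python
class PrefixNode:
    """One trie node per token; a path from the root spells out a cached prefix."""

    def __init__(self):
        self.children = {}


class PrefixCache:
    """Toy prefix index: records which token prefixes already have KV entries."""

    def __init__(self):
        self.root = PrefixNode()

    def match_prefix(self, tokens: list) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list) -> None:
        """Record that KV entries now exist for every prefix of this sequence."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())


cache = PrefixCache()
system_prompt = list(range(1000))            # stand-in for a long shared system prompt
first = system_prompt + [5001, 5002]
second = system_prompt + [6001, 6002, 6003]

print(cache.match_prefix(first))   # 0: nothing is cached yet
cache.insert(first)
print(cache.match_prefix(second))  # 1000: the shared system prompt is reused
```

Only the tokens after the matched prefix need Prefill compute, which is where the savings for shared system prompts come from.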
The catch is that RadixAttention operates only on GPU HBM. When memory runs out, the cache is evicted, and if the same context arrives again, it is recomputed from scratch. HiCache takes inspiration from the CPU's three-level cache design and extends this to multiple tiers. A HiRadixTree acts as a page table, tracking which tier each KV cache page currently resides in, while a Cache Controller automatically coordinates data movement between tiers in the background.
The Three-Tier Structure and GPU-Assisted I/O
L1: GPU HBM (fastest, capacity-limited: tens of GB)
↕ cudaMemcpyAsync / GPU-assisted I/O kernels
L2: CPU DRAM (several times L1's capacity: hundreds of GB)
↕ storage backend I/O (RDMA, zero-copy where supported)
L3: Distributed storage (virtually unlimited: Mooncake, Tair, 3FS...)
The L1↔L2 path builds on standard cudaMemcpyAsync, and HiCache adds GPU-assisted I/O kernels optimized for KV cache transfers between CPU and GPU memory. The HiCache design documentation reports up to 3× higher transfer throughput for these kernels than the cudaMemcpyAsync baseline. The L2↔L3 path goes through the configured storage backend, which for Mooncake means RDMA-based zero-copy transfers.
Rather than being discarded, caches evicted from L1 are preserved in L2 or L3, so when the same context arrives again it can be loaded without the expensive recomputation.
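The demote-instead-of-discard behavior can be sketched with a small tiered LRU. This is a toy model under simplifying assumptions (capacity counted in entries, synchronous moves, promotion straight to L1 on every hit) and does not reflect HiCache's actual controller or page management; all names are hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy tiered cache: an eviction demotes the entry to the next tier instead of dropping it."""

    def __init__(self, capacities: list):
        # tiers[0] plays the role of L1 (GPU HBM), tiers[1] L2 (CPU DRAM), tiers[2] L3 (storage).
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def _put(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.tiers):
            return  # fell off the last tier: only now is the entry really lost
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used entry
            self._put(level + 1, old_key, old_value)        # demote rather than discard

    def put(self, key: str, value: bytes) -> None:
        self._put(0, key, value)

    def get(self, key: str):
        """Return (value, tier_number) and promote to L1; (None, None) means recompute."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._put(0, key, value)
                return value, level + 1
        return None, None


cache = TieredKVCache(capacities=[2, 4, 100])
for name in ["a", "b", "c"]:
    cache.put(name, b"kv-" + name.encode())
print(cache.get("a"))    # (b'kv-a', 2): evicted from L1 but still available in L2
print(cache.get("zzz"))  # (None, None): never cached, so Prefill must recompute it
```

A hit in a lower tier costs a transfer instead of a full Prefill pass, which is the trade HiCache is making.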
Mooncake Store: A KVCache-centric distributed storage system developed by Moonshot AI. It supports high-bandwidth, low-latency transfer via RDMA and zero-copy technology, and won the FAST 2025 Best Paper award. It is the most commonly used L3 backend for SGLang HiCache.
When the Two Technologies Meet
PD Disaggregation isolates the GPU compute load of Prefill servers; HiCache preserves the KV caches those Prefill servers compute across tiers, so repeated requests do not trigger recomputation. For scenarios where the same long context recurs, such as multi-turn conversations and agentic workloads, this combination is especially powerful: benchmarks combining the two technologies report up to 6× throughput and up to an 84% reduction in TTFT.
Pros and Cons Analysis
This section is placed early so you can first assess whether this fits your environment before looking at code examples.
Advantages
| Item | Details |
|---|---|
| Independent scaling | Prefill (compute-bound) and Decode (memory-bound) servers can be scaled up or down separately to match demand, enabling cost-efficient capacity planning |
| Higher KV cache reuse | HiCache L2/L3 preserves evicted caches, enabling up to 84% TTFT reduction for multi-turn and agentic workloads |
| Interference elimination | Prefill interruption — where Prefill halts Decode — is eliminated, improving TPOT stability |
| Heterogeneous hardware support | Compute-dense GPUs can be placed on Prefill, while GPUs with more HBM capacity and bandwidth (e.g., H200) can be placed on Decode |
| Throughput improvement | Up to 5× output throughput versus pure TP on the same resources (96× H100, PD separation alone) |
| Cost reduction | $0.20/1M output tokens by LMSYS measurements, enabling economical operation versus standard deployments |
TPOT (Time Per Output Token): The time required to generate a single token. Along with TTFT (Time To First Token), it is a key metric for LLM serving quality, directly influenced by Decode-stage stability.
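Both metrics fall out of per-token arrival timestamps on a streamed response. A minimal sketch, assuming you already record when the request was sent and when each token arrived; the trace values are made up for illustration.

```python
def ttft_and_tpot(request_sent: float, token_times: list) -> tuple:
    """Compute TTFT and mean TPOT in seconds from a send time and per-token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical trace: request at t=0, first token after 0.8 s, then one token every 50 ms.
arrivals = [0.8 + 0.05 * i for i in range(5)]
ttft, tpot = ttft_and_tpot(0.0, arrivals)
print(f"TTFT={ttft:.3f}s  TPOT={tpot * 1000:.1f}ms")
```

TTFT is dominated by Prefill (plus KV transfer in a PD setup), while TPOT reflects Decode, so tracking them separately shows which pool a regression belongs to.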
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| KV cache transfer overhead | Prefill→Decode KV cache transfer accounts for roughly 25% of total latency; longer contexts produce multi-GB data movement | InfiniBand or RoCE-based RDMA networking is required |
| Operational complexity | Prefill servers, Decode servers, HiCache backends, and load balancers must each be operated, adding fault-tolerance management | Consider standardizing and automating configuration with Kubernetes + Helm |
| HiCache tier promotion latency | Loading KV caches from L2/L3 to L1 adds time, potentially causing latency spikes on cache misses | Profiling workload patterns and tuning to keep hot caches in L1/L2 is necessary |
| PP + PD + HiCache integration complexity | Triple integration of PP+PD+HiCache for very large models (60+ layers) is a new feature whose compatibility was only achieved in January 2026 | Allow sufficient validation time before production use |
| Cache consistency issues | Cache consistency bugs have been reported with HiCache + PP combinations (GitHub Issue #22607) | Check the latest SGLang release notes and issue tracker before using this combination |
| Initial infrastructure cost | The cost of building RDMA networking and distributed storage like Mooncake/3FS is substantial | It is possible to start with L1+L2 only (no L3) at small scale and expand incrementally |
The production readiness by combination is summarized below.
| Combination | Status | Notes |
|---|---|---|
| PD Disaggregation (standalone) | ✅ Production-ready | Officially supported since April 2025 |
| PD + HiCache L1+L2 | ✅ Production-ready | Stable through CPU offload |
| PD + HiCache L3 (Mooncake) | ⚠️ Validation recommended | Stable, but tuning may be needed depending on environment |
| PP + PD + HiCache | ⚠️ Experimental | Compatibility achieved January 2026; cache consistency bugs exist |
The Most Common Mistakes in Practice
1. Attempting KV cache transfer over standard Ethernet
I also started out thinking "let me just test with 10GbE first" — and only after seeing latency nearly double did I truly appreciate the necessity of RDMA. Transferring multi-GB KV caches over standard Ethernet cancels out the benefits of PD separation with overhead. If you cannot provision InfiniBand or RoCE for the Prefill-Decode network, it is worth reconsidering the adoption entirely.
2. Fixing the Prefill:Decode ratio at 1:3
The Prefill:Decode = 1:3 ratio used in LMSYS's DeepSeek-R1 deployment is frequently cited, which leads some to treat it as a universal ratio. This ratio was chosen specifically for DeepSeek-R1's long reasoning outputs, an extremely Decode-heavy characteristic. Long input contexts with short outputs push load toward Prefill; long streamed outputs push it toward Decode. It is better to profile your actual traffic patterns first and then decide on the ratio.
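A first-order estimate of the ratio follows from how much node time each request consumes in each pool. The sketch below is a back-of-the-envelope model, assuming per-node throughput is constant regardless of batch composition (it is not in practice). The throughput values are the LMSYS per-node figures quoted in this article, used here only as illustrative inputs; substitute your own measurements.

```python
def prefill_decode_ratio(avg_input_tokens: float, avg_output_tokens: float,
                         prefill_tps_per_node: float, decode_tps_per_node: float) -> float:
    """Decode nodes needed per Prefill node so both pools saturate at the same request rate."""
    prefill_node_seconds = avg_input_tokens / prefill_tps_per_node   # per request
    decode_node_seconds = avg_output_tokens / decode_tps_per_node    # per request
    return decode_node_seconds / prefill_node_seconds


# Illustrative per-node throughputs; measure your own before sizing anything.
PREFILL_TPS, DECODE_TPS = 52_300, 22_300
workloads = [("RAG, short answers", 8000, 300), ("reasoning, long outputs", 1000, 4000)]
for name, n_in, n_out in workloads:
    ratio = prefill_decode_ratio(n_in, n_out, PREFILL_TPS, DECODE_TPS)
    print(f"{name}: about {ratio:.2f} decode nodes per prefill node")
```

The two hypothetical workloads land on opposite sides of 1:1, which is the point: the ratio is a property of the traffic, not of the framework.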
3. "I enabled HiCache L3 but somehow it got slower"
HiCache shines in workloads with repeated long common prefixes — system prompts, RAG documents, and the like. If each request brings a completely different context, all you are adding is L3 access overhead. Measuring your current workload's cache hit rate before applying HiCache lets you immediately judge whether investing in L3 is warranted.
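One way to estimate the hit rate offline is to replay a log of tokenized prompts against an idealized prefix cache. The sketch below assumes an unbounded cache that never evicts, so the result is an upper bound on what any real tier configuration can achieve; the request data is synthetic.

```python
def prefix_hit_rate(requests: list) -> float:
    """Fraction of prompt tokens an unbounded prefix cache could have reused.

    Replays requests in order against a nested-dict trie of everything seen so far.
    """
    trie = {}
    reused = total = 0
    for tokens in requests:
        node, matching = trie, True
        for tok in tokens:
            if matching and tok in node:
                reused += 1           # this token extends a previously seen prefix
            else:
                matching = False      # first divergence: the rest must be computed
            node = node.setdefault(tok, {})
        total += len(tokens)
    return reused / total if total else 0.0


shared = list(range(100))                                    # common system prompt
rag_like = [shared + [1000 + i] for i in range(10)]          # same prefix, different question
unique = [[i * 1000 + j for j in range(101)] for i in range(10)]  # nothing in common

print(f"shared-prefix workload: {prefix_hit_rate(rag_like):.1%}")
print(f"unique-context workload: {prefix_hit_rate(unique):.1%}")
```

If even this optimistic bound is low for your traffic, adding L3 is unlikely to pay for its access overhead.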
Practical Application
Example 1: Basic PD Disaggregation Setup
You can start with the simplest form. This example assumes a Mooncake cluster is already installed and configured. If you want to start without Mooncake, you can use --disaggregation-transfer-backend nixl for NVIDIA NIXL, or verify the concept first with transfers between different GPUs on a single machine.
# Start Prefill server (run on prefill-host)
# A 70B model in BF16 does not fit on a single 80 GB GPU, so tensor parallelism is set explicitly.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --tp 4 \
  --host 0.0.0.0 \
  --port 30000
# Start Decode server (run on decode-host)
# A prefill-mode server cannot complete requests on its own without a Decode server.
# A Decode server must always be present to receive the KV cache.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --tp 4 \
  --host 0.0.0.0 \
  --port 30001
# Start load balancer: the router that ties together Prefill/Decode servers
# Note: the router module path and flag names have changed across SGLang releases.
# Confirm them with --help for your installed version.
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://prefill-host:30000 \
  --decode http://decode-host:30001 \
  --port 8080
| Parameter | Description |
|---|---|
| --disaggregation-mode | Assign role as prefill or decode. Both roles must exist together |
| --disaggregation-transfer-backend | KV cache transfer backend. mooncake or nixl available |
| launch_router | Acts as a proxy tying together the Prefill/Decode servers |
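Once the three processes are up, clients talk only to the router. The sketch below builds a request for SGLang's native /generate endpoint using only the standard library. It assumes the router listens on localhost:8080 as in the example and forwards /generate unchanged; the helper names are hypothetical, and nothing is sent over the network unless you call generate() yourself.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str,
                           max_new_tokens: int = 64) -> urllib.request.Request:
    """Build a POST for the /generate endpoint (no network I/O happens here)."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, timeout: float = 60.0) -> dict:
    """Send the request to the router and return the parsed JSON response."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Assumed router address; adjust to your deployment.
req = build_generate_request("http://localhost:8080", "Explain PD disaggregation in one sentence.")
print(req.get_method(), req.full_url)
```

From the client's point of view nothing reveals that Prefill and Decode ran on different servers, which makes it easy to A/B the disaggregated setup against a single-pool baseline with the same client code.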
Example 2: Enabling HiCache L2/L3 Tiers
By default, SGLang keeps the KV cache only in L1 (GPU HBM). To also use CPU memory and external storage, add the following settings.
# Enable L2 (CPU DRAM) + L3 (Mooncake) via CLI
# --hicache-size sets the host (CPU) KV cache pool in GB.
# (--cpu-offload-gb is a different option: it offloads model weights, not KV cache.)
# Mooncake connection details (master and metadata server addresses) are configured
# separately; see the Mooncake HiCache integration guide.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --enable-hierarchical-cache \
  --hicache-size 64 \
  --hicache-storage-backend mooncake
# Python API approach (sglang_server_config.py)
# Note: Parameter names in ServerArgs may vary across SGLang versions.
# Always verify against the --help output or official docs for your installed version.
from sglang.srt.server_args import ServerArgs
args = ServerArgs(
    model_path="meta-llama/Llama-3.1-70B-Instruct",
    enable_hierarchical_cache=True,
    hicache_size=64,
    hicache_storage_backend="mooncake",
)
Example 3: Reference Configuration for DeepSeek-R1-Scale Multi-Node Deployment
The following is a conceptual representation of the key ratios in the configuration deployed by the LMSYS team across 12 nodes (96× H100). This is pseudocode for reference purposes only — not a runnable config file.
# Conceptual reference configuration (not runnable; see official docs for each component for actual deployment)
prefill_nodes:
  count: 3            # 25% of total, TP=8 (Tensor Parallelism)
  gpus_per_node: 8
  role: handles compute-bound workloads
decode_nodes:
  count: 9            # 75% of total, EP+DP (Expert Parallelism + Data Parallelism)
  gpus_per_node: 8
  role: handles memory-bound workloads
network:
  backend: InfiniBand # RDMA required
  kv_transfer_latency_budget: less than 25% of total latency
results:
  input_tps: 52300    # input tokens/second/node
  output_tps: 22300   # output tokens/second/node
  cost: "$0.20 / 1M output tokens"
The Prefill:Decode = 1:3 ratio was chosen to match DeepSeek-R1's characteristic of long reasoning outputs. In practice it varies with workload characteristics (context length, batch size, request patterns), so treat these numbers as a starting point and adjust based on profiling results.
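The cost figure can be sanity-checked against the throughput figure with simple arithmetic. The sketch below derives the node-hour price at which 22,300 output tokens per second per node corresponds to $0.20 per million output tokens. It assumes the cost is attributed to decode nodes only and that each node has 8 GPUs as in the configuration above; the resulting price is a derived number, not one stated in the source.

```python
def implied_node_hour_cost(output_tps_per_node: float,
                           usd_per_million_output_tokens: float) -> float:
    """Node-hour price at which the quoted per-token cost breaks even."""
    tokens_per_hour = output_tps_per_node * 3600
    return tokens_per_hour / 1e6 * usd_per_million_output_tokens


node_hour = implied_node_hour_cost(22_300, 0.20)
print(f"implied cost: ${node_hour:.2f} per node-hour, ${node_hour / 8:.2f} per GPU-hour")
```

Running the same calculation with your own measured throughput and your actual GPU price gives a quick estimate of what the architecture would cost per million tokens in your environment.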
Example 4: Integrating Alibaba Cloud Tair as the L3 Backend (Agentic Workloads)
For workloads where the same long system prompt recurs — multi-turn conversations, agent workloads — using a Redis-compatible store as L3 is another option. The Prefill server in this example also requires a separate Decode server and load balancer to be fully operational.
# Prefill server configuration using Tair as L3 backend
# Note: the backend identifier below is a placeholder, and the Tair endpoint
# (redis://tair-endpoint:6379 in this scenario) is configured separately.
# Check the Alibaba Cloud Tair + SGLang HiCache documentation for the exact values your version expects.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --enable-hierarchical-cache \
  --hicache-size 128 \
  --hicache-storage-backend tair \
  --disaggregation-mode prefill
In this configuration, multiple SGLang instances within the same cluster share the KV cache in Tair. When an agent repeatedly processes the same long context (tool descriptions, history, etc.), TTFT is significantly reduced.
Closing Thoughts
Combining PD Disaggregation and HiCache is a story about "using the GPUs you have properly," not "buying more GPUs." Figures like up to 6× throughput and 84% TTFT reduction on the same infrastructure became reality between 2025 and 2026, and from hands-on experimentation, the biggest insight is that the gains come from measuring ratios and cache hit rates. The fastest path is to understand the architecture first, measure your actual workload, and then decide whether to adopt.
Here are three steps you can take right now.
- Run a single-machine PD Disaggregation experiment locally: In a Python 3.10+ environment, run pip install "sglang[all]", then start a Prefill server and a Decode server on different GPUs in the same machine using different ports. Proof-of-concept validation is possible via NVLink or PCIe bandwidth even without InfiniBand. The SGLang official PD Disaggregation documentation has well-organized step-by-step examples.
- Start with HiCache L2 (CPU offload): Without any distributed storage, the --enable-hierarchical-cache --hicache-size 64 options alone let you preserve KV caches evicted from GPU HBM into CPU memory. If you have multi-turn conversation workloads, you can immediately measure the change in TTFT.
- Profile your actual workload before deciding on ratios: Use SGLang's built-in benchmarking tool sglang.bench_serving or a custom script to first measure your current traffic's average input token count, output token count, and cache hit rate. These numbers are the most reliable basis for deciding the Prefill:Decode ratio and prioritizing investment in HiCache tiers.
References
- SGLang Official PD Disaggregation Documentation | sglang.ai
- DeepSeek Deployment with PD Disaggregation on 96 H100 GPUs | LMSYS
- SGLang HiCache Official Announcement | LMSYS
- EPD Disaggregation for Vision-Language Models | LMSYS
- Pipeline Parallelism in SGLang - Million-Token Contexts | LMSYS
- SGLang HiCache System Design Documentation | sglang.io
- Mooncake x SGLang HiCache System Design | kvcache-ai
- Complete Guide to Mooncake HiCache Integration | kvcache-ai
- NVIDIA NIXL Disaggregated Inference Guide | spheron.network
- Prefill-Decode Disaggregation on MI300X with SGLang | AMD ROCm
- NVIDIA Dynamo SGLang Disaggregation Documentation | nvidia.com
- 18-Month Retrospective on Disaggregated Inference | Hao AI Lab
- Cache-aware Prefill-Decode Disaggregation (CPD) | Together AI
- Mooncake Paper (FAST 2025 Best Paper) | arXiv
- Alibaba Cloud Tair + SGLang HiCache Case Study | alibabacloud.com