DEV BAK - TECH BLOG

SGLang Architecture That Extracts 6× Throughput from the Same GPUs — PD Disaggregation and HiCache

When I first took on LLM serving, the most baffling question was "We have enough GPUs, so why is this so slow?" The monitoring dashboard showed GPU memory nearly maxed out and compute running continuously, yet actual response times fell far short of expectations. For a while I kept thinking the answer was buying more GPUs — but as it turned out, the root cause was somewhere else entirely. Prefill (processing the entire input) and Decode (generating tokens one at a time) are two fundamentally different workloads, and we had them fighting over the same GPU pool.

The entire industry is now moving to address this problem at its root. Nearly every production LLM serving framework, including NVIDIA Dynamo, vLLM, Ray Serve LLM, and llm-d, has adopted or is integrating a PD-separated architecture. The trend started with Hao AI Lab's DistServe research and has since extended to Moonshot AI's Kimi service and Together AI's CPD (Cache-aware PD) architecture. SGLang officially supports PD Disaggregation (Prefill-Decode separation) and pairs it with HiCache, its hierarchical KV cache system. On a cluster of 96 H100 GPUs, PD separation alone is reported to deliver up to 5× the output throughput of a pure tensor-parallel deployment on the same resources. Benchmarks that add HiCache report up to 6× throughput and up to 84% lower TTFT (Time To First Token).

This article is written primarily for backend and ML engineers who have experience with, or interest in, LLM serving. Familiarity with basic distributed inference concepts like Tensor Parallelism will make it easier to follow, but any abbreviation appearing for the first time will be explained on the spot. After reading this, you will have a framework for judging why PD should be separated, how HiCache works, and which combination to try first in your own environment.


Core Concept: Why Prefill and Decode Should Not Share the Same Pool

They Are Completely Different Beasts from a Hardware Perspective

For a long time I lumped Prefill and Decode together as just "the first stage and the second stage." But from a GPU resource standpoint, these are entirely different kinds of work.

| | Prefill | Decode |
|---|---|---|
| Nature | Compute-bound | Memory-bound |
| Input | Thousands to hundreds of thousands of tokens processed in parallel | Tokens generated sequentially, one at a time |
| Bottleneck | FLOPs (compute volume) | HBM bandwidth |
| Optimal GPU | High Tensor Core density | High HBM bandwidth and capacity |

HBM (High Bandwidth Memory): The ultra-fast memory attached to a GPU — what people commonly call "GPU memory." An H100 carries 80 GB of HBM3. Memory-bound workloads like Decode are primarily bottlenecked by this HBM's bandwidth.

Prefill processes all input tokens at once, making it a compute-intensive workload that fully utilizes GPU cores. Decode, by contrast, repeatedly reads an already-built KV cache to generate tokens one at a time, making HBM bandwidth the bottleneck. When both workloads share the same GPU pool, neither can be optimized properly.
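To make the compute-bound versus memory-bound split concrete, here is a back-of-the-envelope roofline sketch. It rests on loud assumptions: a dense 70B model, about 2 FLOPs per parameter per token, weights read once per forward pass, and approximate H100 SXM peak figures. It ignores attention FLOPs and KV cache reads, so treat it as intuition rather than a performance model.

```python
# Rough roofline sketch; every constant here is an illustrative assumption.
PARAMS = 70e9            # dense 70B-parameter model
BYTES_PER_PARAM = 2      # BF16 weights
PEAK_FLOPS = 989e12      # approx. H100 SXM dense BF16 FLOP/s
HBM_BANDWIDTH = 3.35e12  # approx. H100 SXM HBM3 bytes/s


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass."""
    flops = 2 * PARAMS * tokens_per_pass    # ~2 FLOPs per parameter per token
    bytes_read = PARAMS * BYTES_PER_PARAM   # weights are streamed once per pass
    return flops / bytes_read


def bound(tokens_per_pass: int) -> str:
    """Classify a forward pass by comparing its intensity with the GPU's ridge point."""
    ridge = PEAK_FLOPS / HBM_BANDWIDTH      # FLOPs/byte where both limits meet
    if arithmetic_intensity(tokens_per_pass) > ridge:
        return "compute-bound"
    return "memory-bound"


print(bound(4096))  # prefill: thousands of prompt tokens in one pass -> compute-bound
print(bound(1))     # decode: one new token per pass -> memory-bound
```

Under these assumptions the intensity equals the number of tokens per pass, which is also why Decode pools lean on large batches: batching 64 sequences raises the intensity to 64, yet that still sits well below the ridge point of roughly 295 FLOPs per byte.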

The problems do not stop there. When a large Prefill batch arrives, in-flight Decode streams stall. This is prefill interruption — from the user's perspective, the response simply freezes. On top of that, DP Attention Imbalance compounds the issue. This imbalance occurs when long-context requests concentrate on a specific DP (Data Parallel) rank: that rank becomes overloaded while the rest sit idle, causing unpredictable latency spikes.
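Prefill interruption is easy to picture with a toy timeline. The sketch below is a simplification under stated assumptions: one decode stream, fixed step times, and a shared pool that runs any arrived prefill batch to completion before the next decode step (no chunked prefill). The function names and the numbers are illustrative, not taken from SGLang.

```python
def token_times(num_tokens, step_ms, prefill_arrivals, prefill_ms, shared_pool):
    """Return the emission time (ms) of each decoded token for one stream."""
    pending = sorted(prefill_arrivals)
    now, times = 0.0, []
    for _ in range(num_tokens):
        # In a shared pool, every prefill that has arrived runs before the next decode step.
        while shared_pool and pending and pending[0] <= now:
            pending.pop(0)
            now += prefill_ms
        now += step_ms
        times.append(now)
    return times


def worst_gap(times):
    """Largest pause between consecutive tokens, which is what the user sees as a freeze."""
    return max(b - a for a, b in zip([0.0] + times[:-1], times))


shared = token_times(5, step_ms=20, prefill_arrivals=[30], prefill_ms=400, shared_pool=True)
split = token_times(5, step_ms=20, prefill_arrivals=[30], prefill_ms=400, shared_pool=False)
print(worst_gap(shared), worst_gap(split))
```

With a 400 ms prefill landing mid-stream, the shared pool shows a 420 ms pause between two tokens, while the separated pool keeps its 20 ms cadence.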

TP/EP/DP/PP: Abbreviations for the distributed parallelization strategies commonly used in LLMs. TP (Tensor Parallelism) splits weight matrices across GPUs to parallelize each layer's computation; EP (Expert Parallelism) distributes the expert layers of MoE models; DP (Data Parallelism) distributes requests across identical model replicas; PP (Pipeline Parallelism) groups layers and pipelines them across multiple GPUs.

PD Disaggregation: Physically Separating the Two

The solution is exactly what the name says. Separate the Prefill server pool from the Decode server pool, and apply independently optimized hardware and parallelization strategies to each.

[Client Request]
       │
       ▼
[Load Balancer]
       │
  ┌────┴────┐
  ▼         ▼
[Prefill    [Decode
 Server     Server
 Pool]      Pool]
  │              ▲
  └──KV Cache Transfer─┘
     (RDMA/RoCE)

Prefill servers process all input tokens in parallel to produce the KV cache, then hand it off to Decode servers via an RDMA-based high-speed network. Decode servers use the received KV cache to generate tokens one at a time.

An important point is that each server pool can use different GPUs. For example, a heterogeneous mix is possible: compute-dense H100s on Prefill servers, and a less expensive part such as the 80 GB A100, which matches the H100's 80 GB of HBM capacity at lower compute throughput, on Decode servers. Because KV cache transfer accounts for roughly 25% of total latency, the network between Prefill and Decode must be InfiniBand or RoCE (RDMA over Converged Ethernet).

RDMA (Remote Direct Memory Access): A technology that accesses another server's memory directly over the network without going through the CPU. It is essentially mandatory for transferring KV caches that can reach several GB with low latency.
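To see why the link matters, it helps to estimate how large one request's KV cache is. The sketch below uses the standard size formula (keys and values, per layer, per KV head, per token). The defaults follow the published Llama-3.1-70B configuration (80 layers, 8 KV heads, head dimension 128, BF16). The 80% link efficiency is my own assumption, and the function names are illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size for one request: K and V, per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(num_bytes, link_gbit_per_s, efficiency=0.8):
    """Idealized wire time; efficiency is the assumed fraction of line rate achieved."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_second


size = kv_cache_bytes(32_000)
print(f"KV cache for 32k tokens: {size / 1e9:.1f} GB")
for name, gbit in [("10GbE", 10), ("400 Gb/s InfiniBand", 400)]:
    print(f"{name}: {transfer_seconds(size, gbit):.2f} s")
```

Under these assumptions a 32k-token context produces about 10.5 GB of KV cache, which takes on the order of ten seconds over 10GbE and roughly a quarter of a second over a 400 Gb/s link, before counting any protocol or CPU-copy overhead.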

SGLang has officially supported this architecture, built on the Mooncake TransferEngine, since April 2025.


HiCache: A Three-Tier Architecture That Doesn't Discard Evicted KV Caches

From RadixAttention to HiRadixTree

Honestly, when I first saw the HiCache concept I thought "isn't this just cache tiering?" — but to really understand it, you first need to know SGLang's RadixAttention.

RadixAttention is a prefix-sharing KV cache technique developed by SGLang. It uses a Radix tree structure to manage KV caches so that requests sharing the same system prompt or context can reuse them. For example, if 100 requests all carry the same long system prompt, the KV cache for that prompt is computed once and reused from the tree.
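The prefix-sharing idea fits in a few lines. This is a minimal sketch, not SGLang's implementation: it uses a plain token-level trie instead of a compressed radix tree, stores no real KV tensors, and never evicts. The class and method names are hypothetical.

```python
class PrefixCache:
    """Token-level trie standing in for a radix tree of cached KV entries."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record that KV entries now exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system_prompt = list(range(1000))      # stand-in token IDs for a shared system prompt
computed = 0
for user_turn in ([5001, 5002], [6001], [7001, 7002, 7003]):
    request = system_prompt + user_turn
    hit = cache.match(request)
    computed += len(request) - hit     # only the unmatched suffix needs prefill
    cache.insert(request)
print(computed)  # 1006 tokens prefilled instead of 3006
```

The first request pays for the full 1,000-token system prompt, and the next two only pay for their own suffixes.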

The catch is that RadixAttention operates only on GPU HBM. When memory runs out, the cache is evicted, and if the same context arrives again, it is recomputed from scratch. HiCache takes inspiration from the CPU's three-level cache design and extends this to multiple tiers. A HiRadixTree acts as a page table, tracking which tier each KV cache page currently resides in, while a Cache Controller automatically coordinates data movement between tiers in the background.

The Three-Tier Structure and GPU-Assisted I/O

L1: GPU HBM              - fastest, capacity-limited (tens of GB)
     ↕ GPU-assisted I/O kernels (baseline: cudaMemcpyAsync)
L2: CPU DRAM             - several times L1's capacity (hundreds of GB)
     ↕ storage backend transfer (for example, RDMA zero-copy with Mooncake)
L3: Distributed storage  - virtually unlimited (Mooncake, Tair, 3FS...)

The baseline for the L1↔L2 path is standard cudaMemcpyAsync, but copying many small, scattered KV cache pages one call at a time leaves much of the available bandwidth unused. HiCache therefore adds GPU-assisted I/O kernels tailored to KV cache transfers between CPU memory and GPU HBM, and the HiCache announcement reports up to 3× higher CPU-GPU transfer throughput than the cudaMemcpyAsync baseline. The L2↔L3 path is handled by the storage backend. With Mooncake, for instance, data moves over RDMA with zero-copy transfers.

Rather than being discarded, caches evicted from L1 are preserved in L2 or L3, so when the same context arrives again it can be loaded without the expensive recomputation.
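The "demote instead of discard" behavior can be sketched with three small LRU maps. This is a toy model under my own assumptions: keys stand in for KV cache pages, promotion always goes straight to L1, and there is no background controller or prefetching. The class is hypothetical and not HiCache's API.

```python
from collections import OrderedDict


class TieredCache:
    """Toy three-tier cache: evictions from one tier are demoted to the next."""

    def __init__(self, capacities):
        # One LRU map per tier, ordered fastest (L1) to slowest (L3).
        self.tiers = [OrderedDict() for _ in capacities]
        self.capacities = list(capacities)

    def _put(self, level, key, value):
        if level >= len(self.tiers):
            return                                    # fell off the last tier: discarded
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)
            self._put(level + 1, old_key, old_value)  # demote the LRU entry

    def put(self, key, value):
        self._put(0, key, value)

    def get(self, key):
        """Return (value, tier_index) and promote to L1, or (None, None) on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._put(0, key, value)
                return value, level
        return None, None


cache = TieredCache([1, 1, 2])
for name in ("a", "b", "c"):
    cache.put(name, f"kv-{name}")
print(cache.get("a"))  # found in the slowest tier (index 2), then promoted to L1
```

A request for "a" would have meant a full recomputation with an HBM-only cache; here it costs a load from the slowest tier instead.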

Mooncake Store: A KVCache-centric distributed storage system developed by Moonshot AI. It supports high-bandwidth, low-latency transfer via RDMA and zero-copy technology, and won the FAST 2025 Best Paper award. It is the most commonly used L3 backend for SGLang HiCache.

When the Two Technologies Meet

PD Disaggregation isolates the compute load of Prefill onto its own servers. HiCache then preserves the KV caches those servers compute across tiers, so repeated requests do not trigger recomputation. The combination is especially powerful where the same long context recurs, as in multi-turn conversations and agentic workloads. Benchmarks that combine the two report up to 6× throughput and up to 84% lower TTFT.


Pros and Cons Analysis

This section is placed early so you can first assess whether this fits your environment before looking at code examples.

Advantages

| Item | Details |
|---|---|
| Independent scaling | Prefill (compute-bound) and Decode (memory-bound) servers can be scaled up or down separately to match demand, enabling cost-efficient capacity planning |
| Higher KV cache reuse | HiCache L2/L3 preserves evicted caches, enabling up to 84% TTFT reduction for multi-turn and agentic workloads |
| Interference elimination | Prefill interruption, where Prefill halts Decode, is eliminated, improving TPOT stability |
| Heterogeneous hardware support | Compute-dense GPUs (e.g., H100) can be placed on Prefill, while less expensive GPUs with comparable HBM capacity (e.g., 80 GB A100) can be placed on Decode |
| Throughput improvement | Up to 5× output throughput versus pure TP on the same resources (96 H100 GPUs, PD separation alone) |
| Cost reduction | $0.20/1M output tokens by LMSYS measurements, enabling economical operation versus standard deployments |

TPOT (Time Per Output Token): The time required to generate a single token. Along with TTFT (Time To First Token), it is a key metric for LLM serving quality, directly influenced by Decode-stage stability.
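Both metrics fall out of per-token arrival timestamps, which any streaming client can record. A minimal sketch follows; the function name is my own, and it assumes timestamps in seconds from a single clock.

```python
def ttft_and_tpot(request_sent, token_times):
    """Compute TTFT and mean TPOT (both in seconds) from per-token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


print(ttft_and_tpot(0.0, [0.5, 0.75, 1.0]))  # (0.5, 0.25)
```

TTFT mostly reflects Prefill (plus queueing and KV transfer), while TPOT reflects Decode, which is why it helps to track the two separately when judging a PD split.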

Disadvantages and Caveats

| Item | Details | Mitigation |
|---|---|---|
| KV cache transfer overhead | Prefill→Decode KV cache transfer accounts for roughly 25% of total latency; longer contexts produce multi-GB data movement | InfiniBand or RoCE-based RDMA networking is required |
| Operational complexity | Prefill servers, Decode servers, HiCache backends, and load balancers must each be operated, adding fault-tolerance management | Consider standardizing and automating configuration with Kubernetes + Helm |
| HiCache tier promotion latency | Loading KV caches from L2/L3 to L1 adds time, potentially causing latency spikes on cache misses | Profiling workload patterns and tuning to keep hot caches in L1/L2 is necessary |
| PP + PD + HiCache integration complexity | Triple integration of PP+PD+HiCache for very large models (60+ layers) is a new feature whose compatibility was only achieved in January 2026 | Allow sufficient validation time before production use |
| Cache consistency issues | Cache consistency bugs have been reported with HiCache + PP combinations (GitHub Issue #22607) | Check the latest SGLang release notes and issue tracker before using this combination |
| Initial infrastructure cost | The cost of building RDMA networking and distributed storage like Mooncake/3FS is substantial | It is possible to start with L1+L2 only (no L3) at small scale and expand incrementally |

The production readiness by combination is summarized below.

| Combination | Status | Notes |
|---|---|---|
| PD Disaggregation (standalone) | ✅ Production-ready | Officially supported since April 2025 |
| PD + HiCache L1+L2 | ✅ Production-ready | Stable through CPU offload |
| PD + HiCache L3 (Mooncake) | ⚠️ Validation recommended | Stable, but tuning may be needed depending on environment |
| PP + PD + HiCache | ⚠️ Experimental | Compatibility achieved January 2026; cache consistency bugs exist |

The Most Common Mistakes in Practice

1. Attempting KV cache transfer over standard Ethernet

I also started out thinking "let me just test with 10GbE first" — and only after seeing latency nearly double did I truly appreciate the necessity of RDMA. Transferring multi-GB KV caches over standard Ethernet cancels out the benefits of PD separation with overhead. If you cannot provision InfiniBand or RoCE for the Prefill-Decode network, it is worth reconsidering the adoption entirely.

2. Fixing the Prefill:Decode ratio at a borrowed value such as 1:3

The Prefill:Decode = 1:3 ratio used in LMSYS's DeepSeek-R1 deployment is frequently cited, which leads some to treat it as a universal ratio. That ratio was chosen specifically for DeepSeek-R1's long reasoning outputs, an extremely Decode-heavy profile. Long inputs with short outputs (RAG, summarization) shift the load toward Prefill, while long generations and many concurrent streams shift it toward Decode. It is better to profile your actual traffic patterns first and then decide on the ratio.
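Once you have measured traffic, a first-cut ratio is simple arithmetic. The sketch below is a sizing heuristic under stated assumptions: the per-node capacities are placeholders borrowed from the 52,300 and 22,300 tokens/s figures quoted in this article (measure your own), and the 70% headroom is an arbitrary safety margin. The function name is hypothetical.

```python
import math


def pool_sizes(input_tok_per_s, output_tok_per_s,
               prefill_cap_per_node, decode_cap_per_node, headroom=0.7):
    """Nodes per pool so that each pool runs at no more than `headroom` of capacity."""
    prefill = math.ceil(input_tok_per_s / (prefill_cap_per_node * headroom))
    decode = math.ceil(output_tok_per_s / (decode_cap_per_node * headroom))
    return prefill, decode


# RAG-like traffic: long inputs, short outputs.
print(pool_sizes(200_000, 20_000, 52_300, 22_300))   # (6, 2)
# Reasoning-like traffic: short inputs, long outputs.
print(pool_sizes(50_000, 120_000, 52_300, 22_300))   # (2, 8)
```

The same hardware ends up Prefill-heavy (3:1) for one workload and Decode-heavy (1:4) for the other, which is the whole argument against copying someone else's ratio.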

3. "I enabled HiCache L3 but somehow it got slower"

HiCache shines in workloads with repeated long common prefixes — system prompts, RAG documents, and the like. If each request brings a completely different context, all you are adding is L3 access overhead. Measuring your current workload's cache hit rate before applying HiCache lets you immediately judge whether investing in L3 is warranted.
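A quick break-even estimate shows when L3 pays off. This is a deliberately simple model with assumed inputs: every request pays a fixed L3 lookup cost, a hit replaces prefill with a load, and a miss recomputes as before. Real prefixes can hit partially, so use it only for a first sanity check. The function names and the millisecond figures are illustrative.

```python
def expected_ttft_ms(hit_rate, prefill_ms, load_ms, lookup_ms):
    """Expected TTFT when a fraction `hit_rate` of requests load their prefix from L3."""
    hit = lookup_ms + load_ms        # found in L3: pay the lookup and the load
    miss = lookup_ms + prefill_ms    # not found: pay the lookup, then recompute anyway
    return hit_rate * hit + (1 - hit_rate) * miss


def break_even_hit_rate(prefill_ms, load_ms, lookup_ms):
    """Hit rate above which L3 beats plain recomputation (needs load_ms < prefill_ms)."""
    return lookup_ms / (prefill_ms - load_ms)


# Assumed numbers: 1000 ms to recompute, 200 ms to load from L3, 40 ms per lookup.
print(break_even_hit_rate(1000, 200, 40))      # 0.05
print(expected_ttft_ms(0.0, 1000, 200, 40))    # 1040.0, slower than no L3 at all
```

With a hit rate below the break-even point, the lookup overhead is pure loss, which matches the "it got slower" symptom.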


Practical Application

Example 1: Basic PD Disaggregation Setup

You can start with the simplest form. This example assumes a Mooncake cluster is already installed and configured. If you want to start without Mooncake, you can use --disaggregation-transfer-backend nixl for NVIDIA NIXL, or verify the concept first with transfers between different GPUs on a single machine.

bash
# Start the Prefill server on prefill-host.
# Llama-3.1-70B in BF16 needs more than one 80 GB GPU, so this sketch assumes --tp-size 4.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --host 0.0.0.0 \
  --port 30000

# Start the Decode server on decode-host.
# A prefill-mode server cannot complete requests on its own:
# a Decode server must always be present to receive the KV cache.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --host 0.0.0.0 \
  --port 30001

# Start the router, the proxy that ties the Prefill and Decode servers together.
# It ships in the sglang-router package. Flag names have changed across releases,
# so confirm them with --help for your installed version.
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://prefill-host:30000 \
  --decode http://decode-host:30001 \
  --host 0.0.0.0 \
  --port 8080

| Parameter | Description |
|---|---|
| --disaggregation-mode | Assigns the role: prefill or decode. Both roles must exist together |
| --disaggregation-transfer-backend | KV cache transfer backend: mooncake or nixl |
| launch_router | Acts as a proxy tying together the Prefill/Decode servers |
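Once all three processes are up, clients talk only to the router. Here is a small client sketch using SGLang's native /generate endpoint. The router address is an assumption matching port 8080 from the example above, and the helper names are my own. The `generate` function needs the servers to be running, so nothing is sent at import time.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:8080"  # assumed router address; adjust to your setup


def build_generate_request(prompt, max_new_tokens=32):
    """Build a POST request for the /generate endpoint behind the router."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"{ROUTER_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt through the router and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=60) as resp:
        return json.loads(resp.read())["text"]
```

Calling `generate("Explain PD disaggregation in one sentence.")` exercises the full path: the router picks a Prefill and a Decode server, the KV cache moves between them, and the client sees a single response.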

Example 2: Enabling HiCache L2/L3 Tiers

Without HiCache, SGLang's RadixAttention keeps KV caches only in L1 (GPU HBM). To also use CPU memory and external storage, add the following settings.

bash
# Enable L2 (CPU DRAM) + L3 (Mooncake) via CLI.
# --hicache-size sets the host KV cache pool in GB. Note that --cpu-offload-gb is a
# different option (it offloads model weights, not KV cache) and is not used here.
# Mooncake connection settings (master address, metadata server, RDMA device) are
# supplied separately, as described in the Mooncake HiCache integration guide.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --enable-hierarchical-cache \
  --hicache-size 64 \
  --hicache-storage-backend mooncake

python
# Python API approach (sglang_server_config.py)
# Note: Parameter names in ServerArgs may vary across SGLang versions.
# Always verify against the --help output or official docs for your installed version.
from sglang.srt.server_args import ServerArgs

args = ServerArgs(
    model_path="meta-llama/Llama-3.1-70B-Instruct",
    enable_hierarchical_cache=True,
    hicache_size=64,
    hicache_storage_backend="mooncake",
)

Example 3: Reference Configuration for DeepSeek-R1-Scale Multi-Node Deployment

The following is a conceptual representation of the key ratios in the configuration the LMSYS team deployed across 12 nodes (96 H100 GPUs). It is pseudocode for reference only, not a runnable config file.

yaml
# Conceptual reference configuration (not runnable — see official docs for each component for actual deployment)
prefill_nodes:
  count: 3          # 25% of total, TP=8 (Tensor Parallelism)
  gpus_per_node: 8
  role: handles compute-bound workloads
 
decode_nodes:
  count: 9          # 75% of total, EP+DP (Expert Parallelism + Data Parallelism)
  gpus_per_node: 8
  role: handles memory-bound workloads
 
network:
  backend: InfiniBand  # RDMA required
  kv_transfer_latency_budget: less than 25% of total latency
 
results:
  input_tps: 52300    # input tokens/second/node
  output_tps: 22300   # output tokens/second/node
  cost: "$0.20 / 1M output tokens"

The Prefill:Decode = 1:3 ratio was chosen to match DeepSeek-R1's characteristic of long reasoning outputs. In practice it varies with workload characteristics (context length, batch size, request patterns), so treat these numbers as a starting point and adjust based on profiling results.

Example 4: Integrating Alibaba Cloud Tair as the L3 Backend (Agentic Workloads)

For workloads where the same long system prompt recurs — multi-turn conversations, agent workloads — using a Redis-compatible store as L3 is another option. The Prefill server in this example also requires a separate Decode server and load balancer to be fully operational.

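bash
# Prefill server configuration using Tair as the L3 backend.
# The backend value below is a placeholder: the exact backend name and how the
# endpoint (here assumed to be redis://tair-endpoint:6379) is passed depend on the
# Tair integration you install, so check its documentation first.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --enable-hierarchical-cache \
  --hicache-size 128 \
  --hicache-storage-backend tair \
  --disaggregation-mode prefill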

In this configuration, multiple SGLang instances within the same cluster share the KV cache in Tair. When an agent repeatedly processes the same long context (tool descriptions, history, etc.), TTFT is significantly reduced.


Closing Thoughts

Combining PD Disaggregation and HiCache is a story about "using the GPUs you have properly," not "buying more GPUs." Figures like up to 6× throughput and 84% TTFT reduction on the same infrastructure became reality between 2025 and 2026. In my own experiments, the biggest lesson was that the gains come from measuring Prefill:Decode ratios and cache hit rates rather than guessing them. The fastest path is to understand the architecture first, measure your actual workload, and then decide whether to adopt.

Here are three steps you can take right now.

  1. Run a single-machine PD Disaggregation experiment locally: In a Python 3.10+ environment, run pip install "sglang[all]", then start a Prefill server and a Decode server on different GPUs in the same machine using different ports. Proof-of-concept validation is possible via NVLink or PCIe bandwidth even without InfiniBand. The SGLang official PD Disaggregation documentation has well-organized step-by-step examples.

  2. Start with HiCache L2 (CPU offload): Without any distributed storage, the --enable-hierarchical-cache --hicache-size 64 options alone let you preserve KV caches evicted from GPU HBM into CPU memory (verify the flag names against --help for your SGLang version). If you have multi-turn conversation workloads, you can immediately measure the change in TTFT.

  3. Profile your actual workload before deciding on ratios: Use SGLang's built-in official benchmarking tool sglang.bench_serving or a custom script to first measure your current traffic's average input token count, output token count, and cache hit rate. These numbers are the most reliable basis for deciding the Prefill:Decode ratio and prioritizing investment in HiCache tiers.


References

  • SGLang Official PD Disaggregation Documentation | sglang.ai
  • DeepSeek Deployment with PD Disaggregation on 96 H100 GPUs | LMSYS
  • SGLang HiCache Official Announcement | LMSYS
  • EPD Disaggregation for Vision-Language Models | LMSYS
  • Pipeline Parallelism in SGLang - Million-Token Contexts | LMSYS
  • SGLang HiCache System Design Documentation | sglang.io
  • Mooncake x SGLang HiCache System Design | kvcache-ai
  • Complete Guide to Mooncake HiCache Integration | kvcache-ai
  • NVIDIA NIXL Disaggregated Inference Guide | spheron.network
  • Prefill-Decode Disaggregation on MI300X with SGLang | AMD ROCm
  • NVIDIA Dynamo SGLang Disaggregation Documentation | nvidia.com
  • 18-Month Retrospective on Disaggregated Inference | Hao AI Lab
  • Cache-aware Prefill-Decode Disaggregation (CPD) | Together AI
  • Mooncake Paper (FAST 2025 Best Paper) | arXiv
  • Alibaba Cloud Tair + SGLang HiCache Case Study | alibabacloud.com