Why Serving DeepSeek-V3 on 96 H100s Is Possible: SGLang Expert Parallelism's Communication Optimization and Memory Fragmentation Solutions
52,300 input tokens/s per node. This is the figure LMSYS announced in May 2025 when they became the first to openly deploy DeepSeek-V3 on 96 H100 GPUs. At first it seems counterintuitive that a 685B MoE model could achieve this throughput. A single node simply doesn't have enough VRAM, and blindly scaling up Tensor Parallelism means GPUs spend more time waiting on each other than actually computing. When I first saw this number, my immediate reaction was "how on earth?"
The secret was Expert Parallelism (EP), but understanding EP as simply "spreading Experts across multiple GPUs" only gets you halfway there. Achieving this performance requires solving three problems simultaneously: inter-node communication latency, which an overlap strategy hides behind computation; uneven token load across Experts, which EPLB balances; and the silent memory fragmentation that occurs during TP sharding.
This article is written for developers and ML engineers interested in LLM serving. It walks through the entire picture in one flow: the mechanics of EP's 4-stage pipeline, the backends and configurations you need to choose for real deployments, and the difficult-to-diagnose fragmentation problem. After reading this, you'll have a solid basis for deciding which combination of TP, EP, and DP to choose in large-scale MoE deployments.
If you're already familiar with MoE and EP concepts, feel free to jump straight to the Practical Application section.
Core Concepts
The Relationship Between MoE Gating and Expert Parallelism
In a Mixture-of-Experts (MoE) model, input tokens are routed through a gating network to only the top-K experts. Think of it as "send this token to Expert 3 and Expert 7." While the total parameter count is in the hundreds of billions, the actual active computation is only a fraction of what a Dense model would require.
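To make the routing step concrete, here is a minimal sketch of top-K gating for a single token. It assumes a plain softmax over the selected logits, which is a generic choice and not necessarily the scoring function DeepSeek-V3 uses; the function name `top_k_gate` and the eight router logits are made up for illustration.

```python
import math

def top_k_gate(logits, k):
    """Return (expert_ids, weights) for one token's router logits."""
    # Pick the k experts with the highest router scores.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return top, [x / total for x in exps]

# Hypothetical router output for one token over 8 experts.
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 1.9]
experts, weights = top_k_gate(router_logits, k=2)
print(experts)                         # indices of the two selected experts
print([round(w, 3) for w in weights])  # their mixing weights
```

Only the selected experts run their FFN for this token, which is why active compute stays a fraction of the total parameter count.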
The idea behind Expert Parallelism is straightforward. By placing each Expert on a different GPU, computation only needs to happen on the GPU that hosts the relevant Expert when a token is routed. Rather than slicing matrices and distributing the pieces as in TP, EP physically separates the Experts themselves.
Expert Parallelism (EP): A parallelization strategy that distributes MoE model Expert weights across multiple GPUs. Unlike Tensor Parallelism, each GPU handles independent Expert computation, so the communication pattern remains relatively simple as you scale.
When comparing EP directly against a TP-only configuration, throughput improvements of up to 5x have been reported for DeepSeek-V3 under conditions of 2,000-token inputs and sufficient batch sizes. Results vary by batch size and sequence length, so measuring against your own workload is advisable.
In practice, large-scale deployments often combine EP with TP rather than using EP alone. Since NVLink provides sufficient bandwidth within a node, keeping TP small and using EP for inter-node scaling — an EP + TP hybrid — can be a flexible choice.
SGLang EP's 4-Stage Pipeline
SGLang cleanly separates the MoE forward pass into four distinct stages. This structure matters because each stage can have its own independent kernel and optimization strategy applied to it.
| Stage | Role | Key Operation |
|---|---|---|
| Dispatch | Send tokens to their assigned Expert GPU | all-to-all communication |
| Pre-permute | Reorder tokens by Expert assignment | Memory rearrangement |
| Core Runner | Execute Expert computation on the assigned GPU | Grouped GEMM |
| Post-permute + Combine | Restore original order and collect results | all-to-all communication |
I initially wondered why the Pre-permute stage was necessary. The reason is that the Grouped GEMM kernel used in the Core Runner requires tokens assigned to the same Expert to be laid out contiguously in memory to operate efficiently. If token order is scrambled after Dispatch, the GPU ends up hunting through memory for each token individually, and Tensor Core utilization drops significantly.
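The reordering described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not SGLang's kernel: it assumes one expert per token for simplicity, and the helper names `permute_by_expert` and `unpermute` are hypothetical.

```python
def permute_by_expert(expert_ids):
    """Return the token order that groups tokens by expert, plus per-expert counts."""
    # A stable sort of token indices by expert id makes each expert's tokens contiguous.
    order = sorted(range(len(expert_ids)), key=lambda t: expert_ids[t])
    counts = {}
    for e in expert_ids:
        counts[e] = counts.get(e, 0) + 1
    return order, counts

def unpermute(values, order):
    """Scatter permuted values back to the original token order (the Post-permute step)."""
    restored = [None] * len(values)
    for pos, tok in enumerate(order):
        restored[tok] = values[pos]
    return restored

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
expert_ids = [3, 7, 3, 1, 7, 3]  # expert assigned to each token after Dispatch
order, counts = permute_by_expert(expert_ids)
permuted = [tokens[t] for t in order]
print(permuted)                      # tokens grouped by expert
print(dict(sorted(counts.items())))  # group sizes a grouped GEMM would consume
print(unpermute(permuted, order) == tokens)
```

The per-expert counts are what let a grouped GEMM treat each contiguous slice as one matrix multiply, and the saved `order` is what Post-permute uses to undo the shuffle.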
It's natural to wonder whether having two all-to-all communication rounds doubles the communication cost — and that question leads directly to the core topic of the next section.
Practical Application
Strategy 1: SBO and TBO — Filling Communication Wait Time with Computation
Inter-node communication inevitably becomes a bottleneck. On H100, NVLink provides up to 900 GB/s of aggregate bandwidth per GPU within a node, while NDR InfiniBand between nodes runs at 400 Gb/s per link, which is at most 50 GB/s once you convert bits to bytes. Note the different units: in practice, inter-node communication is more than 10x slower than intra-node. As node count grows, this gap dominates overall latency.
SGLang's approach is not to make communication faster, but to run other computation in parallel while communication is in progress.
SBO (Single-Batch Overlap): While all-to-all communication is in flight for one batch, the Shared Expert computation — which every token passes through — is processed simultaneously. This eliminates the idle time GPUs spend waiting on the network.
TBO (Two-Batch Overlap): The batch is split into two micro-batches, so while the first micro-batch is communicating, the second is computing. This raises throughput and also lowers peak memory usage, since each micro-batch carries roughly half the tokens.
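A toy timing model helps show why both strategies pay off. This is a back-of-the-envelope sketch, not a description of SGLang's scheduler: it treats one MoE layer as communication followed by compute, models TBO as a two-stage pipeline over two half-size micro-batches, and uses made-up millisecond figures.

```python
def serial_ms(comm, shared, routed):
    """No overlap: every phase waits for the previous one."""
    return comm + shared + routed

def sbo_ms(comm, shared, routed):
    """Single-batch overlap: shared-expert compute runs while all-to-all is in flight."""
    return max(comm, shared) + routed

def tbo_ms(comm, shared, routed):
    """Two-batch overlap: the second micro-batch communicates while the first computes."""
    c = comm / 2                 # communication time of one micro-batch
    p = (shared + routed) / 2    # compute time of one micro-batch
    return c + max(c, p) + p

# Hypothetical per-layer timings in milliseconds.
comm, shared, routed = 4.0, 1.0, 5.0
print(serial_ms(comm, shared, routed))
print(sbo_ms(comm, shared, routed))
print(tbo_ms(comm, shared, routed))
```

In this model SBO can hide at most the shared-expert time, while TBO can hide up to half of the communication, which matches the intuition that TBO suits throughput-oriented workloads with enough tokens to split.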
Below is the CLI command to launch an SGLang server with an EP configuration. This is based on SGLang v0.4 and above; flag names can change between versions, so checking the official documentation for your version alongside this is recommended.
# Single-node 8-GPU EP configuration
#   --tp-size 1            minimize TP
#   --ep-size 8            use all 8 GPUs in the node for EP
#   --dp-size 1            single-node baseline
#   --enable-dp-attention  eliminate KV cache replication
# (comments are kept off the continued lines: text after a trailing "\" breaks the command)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 1 \
  --ep-size 8 \
  --dp-size 1 \
  --enable-dp-attention

# For multi-node scaling (12 nodes × 8 GPUs = 96 H100s)
# add --dp-size 12 and run the same command on each node
# overlap strategy is specified via config file or version-specific additional flags

| Overlap Strategy | Characteristics | Recommended When |
|---|---|---|
| SBO | Single batch, simple implementation, stable latency | Latency-sensitive services, first time applying EP |
| TBO | Dual batch, simultaneous throughput↑ + memory↓ | Throughput-optimized batch workloads |
If you've enabled EP but performance falls short of expectations, the first thing to check is whether communication overlap is actually being applied. If all-to-all is running as a synchronous blocking call, more than half of EP's advantage disappears.
Backend Selection: DeepEP and RDMA-Based Asynchronous Communication
Standard NCCL's all-to-all is synchronous and blocking. The GPU simply stalls and waits until communication completes. DeepEP was developed to solve this problem.
RDMA (Remote Direct Memory Access): Technology that transfers data directly between GPU memory buffers without CPU involvement. While NCCL passes through multiple software layers internally, RDMA communicates at the hardware level, drastically reducing CPU overhead and latency. Particularly effective in InfiniBand environments.
DeepEP implements EP-dedicated all-to-all using pure RDMA-based asynchronous kernels, and provides two modes suited to different workload characteristics.
# Specify DeepEP as the SGLang backend
export SGLANG_EP_BACKEND=deepep
# Prefill nodes: high-throughput Normal mode
export DEEPEP_MODE=normal
# Decode nodes: CUDA Graph-compatible low-latency mode
export DEEPEP_MODE=low_latency

Support for these environment variables may vary by SGLang version. It is recommended to verify against the release notes and official documentation for your version of SGLang before deploying.
DeepEP Low-latency mode: A kernel designed exclusively for Decode. Its hook-based design occupies the minimum number of SMs (Streaming Multiprocessors, GPU compute units), preventing the paradoxical situation where communication overlap actually reduces computational throughput.
Using the same mode on both Prefill and Decode nodes means neither will run optimally. In PD Disaggregation configurations, it's important to specify the mode separately per node type.
Beyond DeepEP, alternatives such as MoriEP (with direct SGLang integration) and UCCL-EP (focused on cloud portability) have been emerging competitively since 2025. They all share the same direction: implementing EP-dedicated communication without depending on NCCL.
Diagnosing Memory Fragmentation: DeepSeek-V3's 18,432 Problem
This is the part that silently eats away at performance in production. It's more insidious precisely because it's invisible.
DeepSeek-V3's FFN intermediate dimension is 18,432. With TP32 applied, each shard is 18,432 ÷ 32 = 576 wide, which is 4.5 × 128 and therefore not a multiple of 128, the alignment boundary for H100 Tensor Cores. The shard gets padded up to the next boundary, and Tensor Core utilization quietly drops. It's a classic pattern where GPU utilization metrics look healthy but actual computational efficiency is low. I only traced this problem after profiling SM utilization directly.
# Misalignment fragmentation diagnosis: check alignment and padding ratio in a TP configuration
ffn_hidden_dim = 18432
tp_size = 32
alignment = 128

shard_size = ffn_hidden_dim // tp_size  # 576
is_aligned = (shard_size % alignment) == 0

# Default to zero so the report below also works when the shard is already aligned
padding_ratio = 0.0
if not is_aligned:
    # Round the shard up to the next multiple of the alignment boundary
    aligned_size = ((shard_size + alignment - 1) // alignment) * alignment  # 640
    padding_ratio = (aligned_size - shard_size) / aligned_size

print(f"Shard size: {shard_size}")
print(f"128-aligned: {is_aligned}")  # False
print(f"Padding ratio: {padding_ratio:.1%}")  # 10.0% of the padded shard is wasted
print("→ Recommend monitoring actual SM utilization with nvidia-smi dmon")

KV cache fragmentation is a separate issue. Under high concurrent streaming requests, internal fragmentation accumulates in the PagedAttention block pool due to variable-length sequences, and cases have been reported where reserving more than 2x the actual VRAM in use becomes necessary.
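To see where internal fragmentation in a paged KV cache comes from, here is a small sketch that counts unused slots in the last block of each sequence. The sequence lengths and block sizes are hypothetical, and the model ignores everything except tail waste (no prefix sharing, no eviction), so treat it as an illustration of the mechanism and not as a measurement.

```python
def paged_waste(seq_lens, block_size):
    """Return (allocated_slots, used_slots, wasted_fraction) for a paged KV pool."""
    # Each sequence holds ceil(len / block_size) blocks; the tail of its last block is unused.
    blocks = sum((n + block_size - 1) // block_size for n in seq_lens)
    allocated = blocks * block_size
    used = sum(seq_lens)
    return allocated, used, (allocated - used) / allocated

# Hypothetical mix of short and long in-flight sequences (token counts).
seq_lens = [5, 17, 33, 64, 130]
for block_size in (16, 64):
    allocated, used, wasted = paged_waste(seq_lens, block_size)
    print(f"block={block_size:<3} allocated={allocated:<4} used={used:<4} wasted={wasted:.1%}")
```

Larger blocks waste more on short sequences, so the gap between reserved and actually used VRAM grows with both block size and the share of short requests.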
| Mitigation Strategy | Effect | Notes |
|---|---|---|
| Prefer DP over TP | Eliminates FFN alignment fragmentation at the source | Combines naturally with EP |
| DP Attention | Eliminates KV cache replication | --enable-dp-attention flag |
| MLA | Up to 93.3% reduction in KV cache | Built into DeepSeek-V3 by default |
| FP8 KV cache quantization | ~50% reduction in cache memory | Negligible quality loss |
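The FP8 row is easy to sanity-check with arithmetic. The sketch below estimates KV cache bytes per token under MLA, assuming the values from DeepSeek-V3's published configuration as I understand it (a 512-wide compressed latent, a 64-wide RoPE key, 61 layers) and ignoring any per-block scale factors an FP8 cache may store.

```python
def mla_kv_bytes_per_token(kv_lora_rank, rope_dim, num_layers, bytes_per_elem):
    """Approximate KV cache footprint of one token under MLA."""
    # MLA caches one compressed latent vector plus the decoupled RoPE key per layer.
    return (kv_lora_rank + rope_dim) * num_layers * bytes_per_elem

# Assumed DeepSeek-V3 dimensions; check config.json for the checkpoint you deploy.
bf16_bytes = mla_kv_bytes_per_token(512, 64, 61, bytes_per_elem=2)
fp8_bytes = mla_kv_bytes_per_token(512, 64, 61, bytes_per_elem=1)
print(bf16_bytes, fp8_bytes, fp8_bytes / bf16_bytes)
```

Multiplying the per-token figure by your expected number of in-flight tokens gives a quick first estimate of how much VRAM to set aside for the cache.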
Pros and Cons Analysis
Advantages
Personally, EP's greatest strength is that "the scaling behavior is predictable." With TP, all-reduce communication grows sharply in proportion to the number of participating GPUs. With EP's all-to-all, each node only exchanges its own Expert results, so the growth in communication as node count increases is much more gradual than with TP. The difference was quite pronounced when I compared the two configurations directly.
| Item | Description |
|---|---|
| Enables serving very large models | Distributes 685B+ models that exceed single-GPU VRAM limits across multiple nodes |
| Maximizes sparse activation efficiency | Only activated Experts compute, dramatically reducing effective computation compared to Dense models |
| More gradual scaling than TP | Uses all-to-all without all-reduce, so communication overhead grows more slowly with additional nodes than TP |
| Independent per-stage optimization | Dispatch, GEMM, and Combine can each be swapped for optimal kernels independently |
| Automated load balancing with EPLB | Dynamically redistributes token-heavy Experts to minimize GPU idle time |
Disadvantages and Caveats
Honestly, the most bewildering moment when I first encountered EP in production was asking "the configuration looks right — why is it slow?" In most cases the answer was the second or third item in the table below.
| Item | Description | Mitigation |
|---|---|---|
| Token load imbalance | If tokens concentrate on specific Experts, that GPU becomes the bottleneck for everything | Apply EPLB, configure Expert redundancy |
| Inter-node bandwidth gap | NVLink (up to 900 GB/s on H100) vs. InfiniBand effective ~50 GB/s, more than 10x difference | Prioritize intra-node EP, hide inter-node with SBO/TBO overlap |
| all-to-all blocking | Standard NCCL is synchronous blocking, causing GPU stalls | Replace with DeepEP/MoriEP asynchronous RDMA backend |
| SM occupancy conflicts | If overlap implementation over-consumes SMs, computational throughput paradoxically decreases | Use DeepEP hook-based design, choose mode carefully |
| FFN alignment fragmentation | Tensor Core alignment boundary violations during TP sharding reduce utilization | Reduce TP and shift to DP scaling |
| KV cache internal fragmentation | Variable-length sequences waste block pool capacity, requiring reservations over 2x actual usage | Apply MLA + FP8 quantization combination |
EPLB (Expert Parallelism Load Balancer): An Expert placement load balancing algorithm published by DeepSeek. It analyzes token routing statistics and replicates heavily loaded Experts across multiple GPUs. It is now being adopted as standard in serving stacks for major MoE models such as DeepSeek-V3, Qwen3, and Kimi-K2.
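The replicate-then-place idea can be sketched with a simple greedy heuristic. This is not DeepSeek's EPLB implementation, just a toy version of the same intuition: give extra replicas to the hottest Experts, then put the heaviest replicas on the least-loaded GPUs. The function names and token counts are hypothetical.

```python
def replicate(loads, num_redundant):
    """Give each extra replica to the expert with the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(num_redundant):
        hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hot] += 1
    return replicas

def place(loads, replicas, num_gpus):
    """Greedy placement: heaviest replica first, onto the least-loaded eligible GPU."""
    items = sorted(
        ((loads[e] / replicas[e], e) for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    gpu_load = [0.0] * num_gpus
    gpu_experts = [[] for _ in range(num_gpus)]
    for load, e in items:
        # Prefer GPUs that do not already host a replica of this expert.
        free = [g for g in range(num_gpus) if e not in gpu_experts[g]] or list(range(num_gpus))
        g = min(free, key=lambda i: gpu_load[i])
        gpu_load[g] += load
        gpu_experts[g].append(e)
    return gpu_load, gpu_experts

def imbalance(gpu_load):
    """Ratio of the busiest GPU to the average; 1.0 means perfectly balanced."""
    return max(gpu_load) / (sum(gpu_load) / len(gpu_load))

# Hypothetical routed-token counts: expert 0 is far hotter than the rest.
loads = [900, 100, 100, 100, 100, 100, 100, 100]
before, _ = place(loads, [1] * len(loads), num_gpus=4)
replicas = replicate(loads, num_redundant=3)
after, layout = place(loads, replicas, num_gpus=4)
print(imbalance(before), imbalance(after))
```

Since every GPU in an all-to-all step waits for the slowest one, the max-to-mean ratio is a reasonable first metric to watch when tuning how many redundant Expert slots to allow.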
Most Common Mistakes in Production
- Scaling TP first: The instinct of "I need more GPUs, so let me increase TP first" backfires with MoE. At high TP degrees the FFN shards can fall off the 128 alignment boundary, and all-reduce communication grows with every GPU added to the group. I've frequently seen teams go through pain by approaching it in this order. It's better to evaluate EP + DP combinations first.
- Applying EP without SBO/TBO: If you've enabled EP but performance is below expectations, communication overlap is often simply not active. If all-to-all is running as a synchronous blocking call, more than half of EP's advantage is gone.
- Using the same settings for Prefill and Decode: DeepEP's Normal mode and Low-latency mode are not just options; they are separate kernels designed for different workload characteristics. In PD Disaggregation configurations, applying the same settings to both node types means neither will operate optimally.
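For the first mistake in the list, a quick sweep over TP degrees shows where the alignment problem starts for an FFN dimension of 18,432. The 128 boundary is the one used throughout this article; the helper name `shard_report` is made up for this sketch.

```python
def shard_report(dim, tp_sizes, alignment=128):
    """Return (tp, shard_size, is_aligned, padding_fraction) for each usable TP degree."""
    rows = []
    for tp in tp_sizes:
        if dim % tp != 0:
            continue  # skip TP degrees that do not divide the dimension evenly
        shard = dim // tp
        padded = -(-shard // alignment) * alignment  # round up to the next boundary
        rows.append((tp, shard, shard % alignment == 0, (padded - shard) / padded))
    return rows

rows = shard_report(18432, [1, 2, 4, 8, 16, 32])
for tp, shard, aligned, waste in rows:
    print(f"TP{tp:<3} shard={shard:<6} aligned={aligned!s:<5} padding={waste:.1%}")
```

Because 18,432 = 144 × 128, every TP degree that divides 144 stays aligned, so in this sweep only TP32 picks up padding. That is a useful reminder to check the specific shard size and not assume every increase in TP is equally harmful.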
Closing Thoughts
At the scale of 96 H100s, the key is the EP + DP combination, and the benefits only materialize when you solve all three problems together: communication overlap (SBO), load balancing (EPLB), and memory alignment. If you're applying EP for the first time, starting with SBO, verifying workload characteristics, and then transitioning to TBO is the safest path.
Here are three steps you can take right now.
- Pre-diagnose memory fragmentation: Before finalizing your deployment configuration, check whether your TP shard size is a multiple of 128. Use the diagnostic code above to calculate the padding ratio, and observe actual SM utilization patterns with nvidia-smi dmon; this will make subsequent bottleneck analysis much easier. If there's a large gap between reserved and actual VRAM usage, that's a signal that internal fragmentation is underway.
- Small-scale EP test: Start with a single node of 8 GPUs using --ep-size 8 --dp-size 1 and compare throughput before and after applying SBO. sglang.bench_serving makes this easy to measure. Enabling EPLB and DP Attention one at a time at this stage lets you isolate the contribution of each component.
- Verify environment and bandwidth: If you're preparing for a multi-node configuration, use python -m sglang.check_env to check your NCCL version and GPU-to-GPU NVLink topology, and ib_write_bw to measure effective InfiniBand bandwidth in advance. If inter-node bandwidth is lower than expected, this will help you determine whether SBO alone is sufficient or whether you need the DeepEP asynchronous backend.
References
- Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs | LMSYS Blog
- Expert Parallelism | SGLang Official Documentation
- Expert Parallelism for MoE Models | DeepWiki
- DeepEP: an efficient expert-parallel communication library | GitHub
- UCCL-EP: Portable Expert-Parallel Communication | arXiv
- Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling | arXiv
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | arXiv
- SGLang Official GitHub Repository
- PD Disaggregation | SGLang Official Documentation