Why Serving DeepSeek-V3 on 96 H100s Is Possible: SGLang Expert Parallelism's Communication Optimization and Memory Fragmentation Solutions

52,300 input tokens/s per node. This is the figure LMSYS announced in May 2025 when they became the first to openly deploy DeepSeek-V3 on 96 H100 GPUs. At first it seems counterintuitive that a 685B MoE model could achieve this throughput. A single node simply doesn't have enough VRAM, and blindly scaling up Tensor Parallelism means GPUs spend more time waiting on each other than actually computing. When I first saw this number, my immediate reaction was "how on earth?"

The secret was Expert Parallelism (EP) — but understanding EP as simply "spreading Experts across multiple GPUs" only gets you halfway there. Achieving this performance requires solving three problems simultaneously: an overlap strategy that hides inter-node communication behind computation, EPLB to balance the token load across Experts, and the silent memory fragmentation that occurs during TP sharding.

This article is written for developers and ML engineers interested in LLM serving. It walks through the entire picture in one flow: the mechanics of EP's 4-stage pipeline, the backends and configurations you need to choose for real deployments, and the difficult-to-diagnose fragmentation problem. After reading this, you'll have a solid basis for deciding which combination of TP, EP, and DP to choose in large-scale MoE deployments.

If you're already familiar with MoE and EP concepts, feel free to jump straight to the Practical Application section.

Core Concepts

The Relationship Between MoE Gating and Expert Parallelism

In a Mixture-of-Experts (MoE) model, a gating network routes each input token to only its top-K experts. Think of it as "send this token to Expert 3 and Expert 7." While the total parameter count is in the hundreds of billions, the computation actually performed per token is only a fraction of what a Dense model of the same size would require.
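To make the routing step concrete, here is a minimal sketch of top-K gating for a single token in plain Python. The logits are made up, and renormalizing the softmax scores of the selected experts is just one common convention; real models (DeepSeek-V3 included) use their own scoring and normalization details, so treat this as an illustration rather than the model's actual gate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts.

    Weights are the softmax scores of the selected experts, renormalized
    so they sum to 1 (an assumed convention for this sketch).
    """
    scores = softmax(gate_logits)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top)
    return [(i, scores[i] / norm) for i in top]


# Hypothetical gate output for one token over 8 experts.
logits = [0.1, -1.2, 0.3, 2.0, -0.5, 0.0, 0.4, 1.5]
routing = route_top_k(logits, k=2)
for expert_id, weight in routing:
    print(f"Expert {expert_id}: weight {weight:.3f}")
```

With these logits the token goes to Expert 3 and Expert 7; the other six experts do no work for it, which is where the compute savings come from.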

The idea behind Expert Parallelism is straightforward. By placing each Expert on a different GPU, computation only needs to happen on the GPU that hosts the relevant Expert when a token is routed. Rather than slicing matrices and distributing the pieces as in TP, EP physically separates the Experts themselves.

Expert Parallelism (EP): A parallelization strategy that distributes MoE model Expert weights across multiple GPUs. Unlike Tensor Parallelism, each GPU handles independent Expert computation, so the communication pattern remains relatively simple as you scale.

When comparing EP directly against a TP-only configuration, throughput improvements of up to 5x have been reported for DeepSeek-V3 under conditions of 2,000-token inputs and sufficient batch sizes. Results vary by batch size and sequence length, so measuring against your own workload is advisable.

In practice, large-scale deployments often combine EP with TP rather than using EP alone. Since NVLink provides sufficient bandwidth within a node, keeping TP small and using EP for inter-node scaling — an EP + TP hybrid — can be a flexible choice.

SGLang EP's 4-Stage Pipeline

SGLang cleanly separates the MoE forward pass into four distinct stages. This structure matters because each stage can have its own independent kernel and optimization strategy applied to it.

Stage	Role	Key Operation
Dispatch	Send tokens to their assigned Expert GPU	all-to-all communication
Pre-permute	Reorder tokens by Expert assignment	Memory rearrangement
Core Runner	Execute Expert computation on the assigned GPU	Grouped GEMM
Post-permute + Combine	Restore original order and collect results	all-to-all communication

I initially wondered why the Pre-permute stage was necessary. The reason is that the Grouped GEMM kernel used in the Core Runner requires tokens assigned to the same Expert to be laid out contiguously in memory to operate efficiently. If token order is scrambled after Dispatch, the GPU ends up hunting through memory for each token individually, and Tensor Core utilization drops significantly.
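The four stages can be sketched on a single process to show what each one is responsible for. This is a toy model, not SGLang's implementation: it uses top-1 routing, a contiguous expert-to-GPU placement, and a stand-in `expert_fn` in place of a real FFN, all of which are assumptions made for illustration.

```python
from collections import defaultdict

NUM_EXPERTS = 4
NUM_GPUS = 2
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS


def owner_gpu(expert_id):
    """Contiguous placement: experts 0-1 on GPU 0, experts 2-3 on GPU 1."""
    return expert_id // EXPERTS_PER_GPU


def expert_fn(expert_id, value):
    """Stand-in for an expert FFN: a toy function that depends on the expert."""
    return value * (expert_id + 1)


def moe_forward(tokens, assignments):
    """tokens: list of floats; assignments: one expert id per token (top-1)."""
    # Stage 1, Dispatch: group (position, expert, value) by destination GPU.
    inbox = defaultdict(list)
    for pos, (value, expert_id) in enumerate(zip(tokens, assignments)):
        inbox[owner_gpu(expert_id)].append((pos, expert_id, value))

    results = []
    for gpu in sorted(inbox):
        # Stage 2, Pre-permute: sort so each expert's tokens are contiguous.
        batch = sorted(inbox[gpu], key=lambda item: item[1])
        # Stage 3, Core Runner: walk the contiguous per-expert segments.
        for pos, expert_id, value in batch:
            results.append((pos, expert_fn(expert_id, value)))

    # Stage 4, Post-permute + Combine: restore the original token order.
    output = [0.0] * len(tokens)
    for pos, value in results:
        output[pos] = value
    return output


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
assignments = [3, 0, 2, 0, 3]
print(moe_forward(tokens, assignments))
```

Carrying the original position through every stage is what lets the final stage undo the permutation; in a real system the two grouping steps are the all-to-all exchanges.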

It's natural to wonder whether having two all-to-all communication rounds doubles the communication cost — and that question leads directly to the core topic of the next section.

Practical Application

Strategy 1: SBO and TBO — Filling Communication Wait Time with Computation

Inter-node communication inevitably becomes a bottleneck. On H100, NVLink provides up to 900 GB/s of aggregate bandwidth per GPU within a node, but NDR InfiniBand between nodes runs at 400 Gb/s, which translates to at most about 50 GB/s per port. Note the different units (gigabits versus gigabytes): in practice, inter-node communication is more than 10x slower than intra-node. As node count grows, this gap dominates overall latency.
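The unit conversion is easy to get wrong, so here is the arithmetic as a short sketch. The payload size (4,096 tokens with a 7,168-wide hidden state at one byte per value) is a hypothetical example, and the two NVLink figures are the commonly quoted aggregate numbers for A100-class (600 GB/s) and H100-class (900 GB/s) parts; the transfer time ignores latency and protocol overhead.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a link rate from gigabits/s to gigabytes/s (8 bits per byte)."""
    return gbit_per_s / 8


def transfer_ms(payload_bytes, gbyte_per_s):
    """Ideal transfer time in milliseconds, ignoring latency and overhead."""
    return payload_bytes / (gbyte_per_s * 1e9) * 1e3


ib_gbyte = gbit_to_gbyte(400)      # NDR InfiniBand: 400 Gb/s -> 50.0 GB/s
payload = 4096 * 7168 * 1          # hypothetical dispatch payload in bytes

print(f"InfiniBand: {ib_gbyte:.1f} GB/s, {transfer_ms(payload, ib_gbyte):.3f} ms per payload")
for nvlink in (600.0, 900.0):      # assumed A100-class and H100-class NVLink figures
    print(f"NVLink {nvlink:.0f} GB/s is {nvlink / ib_gbyte:.0f}x the InfiniBand rate")
```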

SGLang's approach is not to make communication faster, but to run other computation in parallel while communication is in progress.

SBO (Single-Batch Overlap): While all-to-all communication is in flight for one batch, the Shared Expert computation — which every token passes through — is processed simultaneously. This eliminates the idle time GPUs spend waiting on the network.

TBO (Two-Batch Overlap): The batch is split into two micro-batches, so while the first batch is communicating, the second batch is computing. This simultaneously increases throughput and reduces peak memory usage by roughly half — a dual benefit.
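A toy timeline model shows why splitting the batch helps. The sketch below is not SGLang code: `layer_pipeline_ms` is a hypothetical helper that assumes one compute stream and one communication stream, costs that scale linearly with micro-batch size, and no SM contention between the two.

```python
def layer_pipeline_ms(num_layers, compute_ms, comm_ms, num_microbatches):
    """Total time for num_layers MoE layers under a simple two-stream model.

    compute_ms / comm_ms are per-layer costs for the FULL batch; each
    micro-batch is assumed to cost a proportional share of them.
    """
    per_compute = compute_ms / num_microbatches
    per_comm = comm_ms / num_microbatches
    compute_free = 0.0                    # when the compute stream is next idle
    comm_free = 0.0                       # when the network is next idle
    ready = [0.0] * num_microbatches      # when each micro-batch may compute again

    for _ in range(num_layers):
        for mb in range(num_microbatches):
            # Compute waits for the stream and for this micro-batch's last comm.
            end_compute = max(compute_free, ready[mb]) + per_compute
            compute_free = end_compute
            # Communication starts once the compute result exists and the link is idle.
            end_comm = max(comm_free, end_compute) + per_comm
            comm_free = end_comm
            ready[mb] = end_comm
    return max(ready)


blocking = layer_pipeline_ms(num_layers=4, compute_ms=2.0, comm_ms=2.0, num_microbatches=1)
overlapped = layer_pipeline_ms(num_layers=4, compute_ms=2.0, comm_ms=2.0, num_microbatches=2)
print(f"one batch, no overlap: {blocking} ms")
print(f"two micro-batches:     {overlapped} ms")
```

When compute and communication cost about the same, the overlapped schedule approaches half the blocking time as the layer count grows. When one side dominates, the total is bounded by the slower stream, which is why measuring both costs on your own workload matters.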

Below is the CLI command to launch an SGLang server with an EP configuration. This is based on SGLang v0.4 and above; flag names can change between versions, so checking the official documentation for your version alongside this is recommended.

bash

# Single-node 8-GPU EP configuration
# --tp-size 1: minimize TP
# --ep-size 8: use all 8 GPUs in the node for EP
# --dp-size 1: single-node baseline
# --enable-dp-attention: eliminate KV cache replication
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 1 \
  --ep-size 8 \
  --dp-size 1 \
  --enable-dp-attention
 
# For multi-node scaling (12 nodes × 8 GPUs = 96 H100s)
# add --dp-size 12 plus the multi-node flags (--nnodes, --node-rank, --dist-init-addr)
# and run the command on each node with that node's rank
# overlap strategy is specified via config file or version-specific additional flags

Overlap Strategy	Characteristics	Recommended When
SBO	Single batch, simple implementation, stable latency	Latency-sensitive services, first time applying EP
TBO	Dual batch, simultaneous throughput↑ + memory↓	Throughput-optimized batch workloads

If you've enabled EP but performance falls short of expectations, the first thing to check is whether communication overlap is actually being applied. If all-to-all is running as a synchronous blocking call, more than half of EP's advantage disappears.

Backend Selection: DeepEP and RDMA-Based Asynchronous Communication

Standard NCCL's all-to-all is synchronous and blocking. The GPU simply stalls and waits until communication completes. DeepEP was developed to solve this problem.

RDMA (Remote Direct Memory Access): Technology that transfers data directly between GPU memory buffers without CPU involvement. While NCCL passes through multiple software layers internally, RDMA communicates at the hardware level, drastically reducing CPU overhead and latency. Particularly effective in InfiniBand environments.

DeepEP implements EP-dedicated all-to-all using pure RDMA-based asynchronous kernels, and provides two modes suited to different workload characteristics.

bash

# Specify DeepEP as the SGLang backend
export SGLANG_EP_BACKEND=deepep
 
# On prefill nodes only: high-throughput Normal mode
export DEEPEP_MODE=normal

# On decode nodes only: CUDA Graph-compatible low-latency mode
export DEEPEP_MODE=low_latency

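Treat these variable names as illustrative. Recent SGLang releases expose the same choice through launch_server flags such as --deepep-mode (normal, low_latency, or auto), and the option names have changed between versions, so verify against the release notes and official documentation for your version of SGLang before deploying.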

DeepEP Low-latency mode: A kernel designed exclusively for Decode. Its hook-based design occupies the minimum number of SMs (Streaming Multiprocessors, GPU compute units), preventing the paradoxical situation where communication overlap actually reduces computational throughput.

Using the same mode on both Prefill and Decode nodes means neither will run optimally. In PD Disaggregation configurations, it's important to specify the mode separately per node type.
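One way to avoid applying a single mode everywhere is to derive the settings from the node's role in your deployment tooling. The sketch below is a hypothetical helper: the variable names simply mirror the shell snippet in this article and are assumptions, so map them to whatever your SGLang version actually reads.

```python
DEEPEP_MODE_BY_ROLE = {
    "prefill": "normal",        # throughput-oriented kernels
    "decode": "low_latency",    # CUDA Graph-compatible kernels
}


def deepep_env(role):
    """Return the environment variables for one node role.

    Variable names follow this article's shell snippet and are assumptions.
    """
    if role not in DEEPEP_MODE_BY_ROLE:
        raise ValueError(f"unknown node role: {role!r}")
    return {
        "SGLANG_EP_BACKEND": "deepep",
        "DEEPEP_MODE": DEEPEP_MODE_BY_ROLE[role],
    }


for role in ("prefill", "decode"):
    print(role, deepep_env(role))
```

Failing loudly on an unknown role is deliberate: a silent default is exactly how both node types end up with the same mode.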

Beyond DeepEP, alternatives such as MoriEP (with direct SGLang integration) and UCCL-EP (focused on cloud portability) have been emerging competitively since 2025. They all share the same direction: implementing EP-dedicated communication without depending on NCCL.

Diagnosing Memory Fragmentation: DeepSeek-V3's 18,432 Problem

This is the part that silently eats away at performance in production. It's more insidious precisely because it's invisible.

DeepSeek-V3's FFN intermediate dimension is 18,432. With TP32 applied, 18,432 ÷ 32 = 576, which is not a multiple of 128, the alignment boundary for H100 Tensor Cores (576 ÷ 128 = 4.5). Each shard is padded up to the next boundary, 640, so about 10% of the padded width does no useful work and Tensor Core utilization quietly drops. It's a classic pattern where GPU utilization metrics look healthy but actual computational efficiency is low. I only traced this problem after profiling SM utilization directly.

python

# Misalignment fragmentation diagnosis — check alignment and padding ratio in TP configuration
ffn_hidden_dim = 18432
tp_size = 32
alignment = 128
 
shard_size = ffn_hidden_dim // tp_size    # 576
is_aligned = (shard_size % alignment) == 0
 
if not is_aligned:
    aligned_size = ((shard_size + alignment - 1) // alignment) * alignment
    padding_ratio = (aligned_size - shard_size) / aligned_size
    print(f"Shard size: {shard_size}")
    print(f"128-aligned: {is_aligned}")                    # False
    print(f"Padding ratio: {padding_ratio:.1%}")           # 10.0% wasted
    print("→ Recommend monitoring actual SM utilization with nvidia-smi dmon")
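
The same check can be swept across TP degrees to see where alignment breaks. This is a small sketch built on the diagnostic above; `shard_report` is a hypothetical helper name, and the 128 alignment is the article's assumption rather than a value read from any kernel.

```python
def shard_report(dim, tp_size, alignment=128):
    """Return (shard_size, is_aligned, padding_ratio) for one TP degree."""
    if dim % tp_size != 0:
        raise ValueError("dim must be divisible by tp_size")
    shard = dim // tp_size
    padded = -(-shard // alignment) * alignment    # round up to the next multiple
    return shard, shard == padded, (padded - shard) / padded


for tp in (1, 2, 4, 8, 16, 32):
    shard, aligned, waste = shard_report(18432, tp)
    print(f"TP{tp:<2} shard={shard:<5} aligned={aligned} padding={waste:.1%}")
```

For a dimension of 18,432, every power-of-two TP degree up to 16 stays aligned (TP16 gives 1,152 = 9 × 128); TP32 is the first to fall off the boundary.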

KV cache fragmentation is a separate issue. Under high concurrent streaming requests, internal fragmentation accumulates in the PagedAttention block pool due to variable-length sequences, and cases have been reported where reserving more than 2x the actual VRAM in use becomes necessary.
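The per-sequence part of this waste is simple to estimate: every sequence reserves whole blocks, so the tail of its last block is allocated but unused. The sketch below uses a made-up mix of sequence lengths and two hypothetical block sizes; it models only this tail waste, not every source of over-reservation in a real serving system.

```python
import math


def block_pool_waste(seq_lens, block_size):
    """Return (allocated_tokens, used_tokens, wasted_fraction).

    Each sequence is rounded up to a whole number of blocks.
    """
    used = sum(seq_lens)
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return allocated, used, (allocated - used) / allocated


# Hypothetical mix of in-flight sequence lengths (tokens).
seq_lens = [17, 130, 33, 1, 65]
for block_size in (16, 64):
    allocated, used, waste = block_pool_waste(seq_lens, block_size)
    print(f"block={block_size}: allocated={allocated} used={used} wasted={waste:.1%}")
```

With many short or just-over-a-boundary sequences, larger blocks push the allocated total toward twice the tokens actually stored, which is consistent with the kind of gap described above.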

Mitigation Strategy	Effect	Notes
Prefer DP over TP	Eliminates FFN alignment fragmentation at the source	Combines naturally with EP
DP Attention	Eliminates KV cache replication	`--enable-dp-attention` flag
MLA	Up to 93.3% reduction in KV cache	Built into DeepSeek-V3 by default
FP8 KV cache quantization	~50% reduction in cache memory	Negligible quality loss
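The FP8 row is plain arithmetic on bytes per cached value. The sketch below assumes DeepSeek-V3's MLA cache shape (61 layers, a 512-dimensional latent plus 64 RoPE dimensions per token per layer) and a hypothetical 40 GiB KV budget; it ignores any per-block scaling metadata that an FP8 cache may carry.

```python
def kv_bytes_per_token(num_layers, cached_dims_per_layer, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    return num_layers * cached_dims_per_layer * bytes_per_value


# Assumed DeepSeek-V3 MLA shape: 61 layers, 512 latent + 64 RoPE dims per token.
LAYERS, CACHED_DIMS = 61, 512 + 64

bf16 = kv_bytes_per_token(LAYERS, CACHED_DIMS, 2)
fp8 = kv_bytes_per_token(LAYERS, CACHED_DIMS, 1)
print(f"BF16: {bf16} bytes/token, FP8: {fp8} bytes/token, saving {1 - fp8 / bf16:.0%}")

vram_budget = 40 * 1024**3    # hypothetical 40 GiB reserved for KV cache
print(f"tokens that fit: BF16 {vram_budget // bf16}, FP8 {vram_budget // fp8}")
```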

Pros and Cons Analysis

Advantages

Personally, I think EP's greatest strength is that "the scaling behavior is predictable." With TP, all-reduce communication grows sharply in proportion to the number of participating GPUs. With EP's all-to-all, each node only exchanges its own Expert results, so the growth in communication as node count increases is much more gradual than with TP. The difference was quite pronounced when I compared the two configurations directly.

Item	Description
Enables serving very large models	Distributes 685B+ models that exceed single-GPU VRAM limits across multiple nodes
Maximizes sparse activation efficiency	Only activated Experts compute, dramatically reducing effective computation compared to Dense models
More gradual scaling than TP	Uses all-to-all instead of all-reduce, so communication overhead grows more slowly than with TP as nodes are added
Independent per-stage optimization	Dispatch, GEMM, and Combine can each be swapped for optimal kernels independently
Automated load balancing with EPLB	Dynamically redistributes token-heavy Experts to minimize GPU idle time

Disadvantages and Caveats

Honestly, the most bewildering moment when I first encountered EP in production was asking "the configuration looks right — why is it slow?" In most cases the answer was the second or third item in the table below.

Item	Description	Mitigation
Token load imbalance	If tokens concentrate on specific Experts, that GPU becomes the bottleneck for everything	Apply EPLB, configure Expert redundancy
Inter-node bandwidth gap	NVLink (up to 900 GB/s on H100) vs. NDR InfiniBand at roughly 50 GB/s, more than 10x difference	Prioritize intra-node EP, hide inter-node with SBO/TBO overlap
all-to-all blocking	Standard NCCL is synchronous blocking, causing GPU stalls	Replace with DeepEP/MoriEP asynchronous RDMA backend
SM occupancy conflicts	If overlap implementation over-consumes SMs, computational throughput paradoxically decreases	Use DeepEP hook-based design, choose mode carefully
FFN alignment fragmentation	Tensor Core alignment boundary violations during TP sharding reduce utilization	Reduce TP and shift to DP scaling
KV cache internal fragmentation	Variable-length sequences waste block pool capacity, requiring reservations over 2x actual usage	Apply MLA + FP8 quantization combination

EPLB (Expert Parallelism Load Balancer): An Expert placement load balancing algorithm published by DeepSeek. It analyzes token routing statistics and replicates heavily loaded Experts across multiple GPUs. It is increasingly used when serving major MoE models such as DeepSeek-V3, Qwen3, and Kimi-K2.
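The replicate-then-place idea can be sketched in a few lines. This is a toy greedy balancer, not DeepSeek's EPLB algorithm: `replicate_and_place` is a hypothetical function, the per-expert loads are made up, and it does not enforce that replicas of one expert land on different GPUs.

```python
import heapq


def replicate_and_place(expert_load, num_gpus, num_redundant):
    """Replicate hot experts, then pack replicas onto GPUs greedily.

    expert_load: tokens routed to each expert (index = expert id).
    Returns (placement, gpu_load); placement[g] lists expert ids on GPU g.
    """
    replicas = [1] * len(expert_load)
    # Step 1: give each redundant slot to the expert with the highest load per replica.
    for _ in range(num_redundant):
        hottest = max(range(len(expert_load)),
                      key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1

    # Step 2: split each expert's load evenly across its replicas.
    items = [(expert_load[e] / replicas[e], e)
             for e in range(len(expert_load)) for _ in range(replicas[e])]
    items.sort(reverse=True)

    # Step 3: heaviest replica first, always onto the least-loaded GPU.
    heap = [(0.0, g) for g in range(num_gpus)]
    placement = [[] for _ in range(num_gpus)]
    for load, expert in items:
        current, g = heapq.heappop(heap)
        placement[g].append(expert)
        heapq.heappush(heap, (current + load, g))

    gpu_load = [0.0] * num_gpus
    for total, g in heap:
        gpu_load[g] = total
    return placement, gpu_load


load = [900, 100, 100, 100, 100, 100]    # hypothetical: expert 0 is hot
_, before = replicate_and_place(load, num_gpus=4, num_redundant=0)
_, after = replicate_and_place(load, num_gpus=4, num_redundant=2)
print("max GPU load without replicas:", max(before))
print("max GPU load with 2 replicas: ", max(after))
```

Because every layer finishes only when its slowest GPU does, the maximum per-GPU load is the number to watch; replication lowers it by spreading the hot expert's tokens across several devices.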

Most Common Mistakes in Production

Scaling TP first: The instinct of "I need more GPUs, so let me increase TP first" backfires with MoE. The larger TP becomes, the worse FFN alignment fragmentation gets, and all-reduce communication scales up proportionally. I've frequently seen teams go through pain by approaching it in this order. It's better to evaluate EP + DP combinations first.
Applying EP without SBO/TBO: If you've enabled EP but performance is below expectations, communication overlap is often simply not active. If all-to-all is running as a synchronous blocking call, more than half of EP's advantage is gone.
Using the same settings for Prefill and Decode: DeepEP's Normal mode and Low-latency mode are not just options — they are separate kernels designed for different workload characteristics. In PD Disaggregation configurations, applying the same settings to both node types means neither will operate optimally.

Closing Thoughts

At the scale of 96 H100s, the key is the EP + DP combination, and the benefits only materialize when you solve all three problems together: communication overlap (SBO), load balancing (EPLB), and memory alignment. If you're applying EP for the first time, starting with SBO, verifying workload characteristics, and then transitioning to TBO is the safest path.

Here are three steps you can take right now.

Pre-diagnose memory fragmentation: Before finalizing your deployment configuration, check whether your TP shard size is a multiple of 128. Use the diagnostic code above to calculate the padding ratio, and observe actual SM utilization patterns with nvidia-smi dmon — this will make subsequent bottleneck analysis much easier. If there's a large gap between reserved and actual VRAM usage, that's a signal that internal fragmentation is underway.
Small-scale EP test: Start with a single node of 8 GPUs using --ep-size 8 --dp-size 1 and compare throughput before and after applying SBO. sglang.bench_serving makes this easy to measure. Enabling EPLB and DP Attention one at a time at this stage lets you isolate the contribution of each component.
Verify environment and bandwidth: If you're preparing for a multi-node configuration, use python -m sglang.check_env to check your NCCL version and GPU-to-GPU NVLink topology, and ib_write_bw to measure effective InfiniBand bandwidth in advance. If inter-node bandwidth is lower than expected, this will help you determine whether SBO alone is sufficient or whether you need the DeepEP asynchronous backend.



