DEV BAK - TECH BLOG

Why Serving DeepSeek-V3 on 96 H100s Is Possible: SGLang Expert Parallelism's Communication Optimization and Memory Fragmentation Solutions

52,300 input tokens/s per node. This is the figure LMSYS announced in May 2025 for their open-source deployment of DeepSeek-V3 on 96 H100 GPUs (12 nodes of 8 GPUs each). It is initially counterintuitive that a 685B MoE model could achieve this throughput. A single node simply doesn't have enough VRAM, and blindly scaling up Tensor Parallelism means GPUs spend more time waiting on each other than actually computing. When I first saw this number, my immediate reaction was "how on earth?"

The secret was Expert Parallelism (EP), but understanding EP as simply "spreading Experts across multiple GPUs" only gets you halfway there. Achieving this performance requires solving three problems simultaneously: inter-node communication latency, which an overlap strategy hides behind computation; uneven token load across Experts, which EPLB balances; and the silent memory fragmentation that occurs during TP sharding.

This article is written for developers and ML engineers interested in LLM serving. It walks through the entire picture in one flow: the mechanics of EP's 4-stage pipeline, the backends and configurations you need to choose for real deployments, and the difficult-to-diagnose fragmentation problem. After reading this, you'll have a solid basis for deciding which combination of TP, EP, and DP to choose in large-scale MoE deployments.

If you're already familiar with MoE and EP concepts, feel free to jump straight to the Practical Application section.


Core Concepts

The Relationship Between MoE Gating and Expert Parallelism

In a Mixture-of-Experts (MoE) model, input tokens are routed through a gating network to only the top-K experts. Think of it as "send this token to Expert 3 and Expert 7." While the total parameter count is in the hundreds of billions, the actual active computation is only a fraction of what a Dense model would require.
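
To make the routing step concrete, here is a minimal sketch of top-K gating for a single token. It assumes a plain softmax over the selected logits, which is a common textbook formulation and not necessarily the exact router DeepSeek-V3 uses; the function name and the logits are made up for illustration.

```python
import math

def top_k_gate(logits, k):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]

# One token, eight experts, top-2 routing (hypothetical router logits).
router_logits = [0.1, -1.2, 0.3, 2.0, -0.5, 0.0, 0.4, 1.5]
routes = top_k_gate(router_logits, k=2)
print(routes)  # the token goes to Expert 3 and Expert 7
```

Only the two selected experts run for this token; the other six do no work, which is where the compute savings over a Dense model come from.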

The idea behind Expert Parallelism is straightforward. By placing each Expert on a different GPU, computation only needs to happen on the GPU that hosts the relevant Expert when a token is routed. Rather than slicing matrices and distributing the pieces as in TP, EP physically separates the Experts themselves.

Expert Parallelism (EP): A parallelization strategy that distributes MoE model Expert weights across multiple GPUs. Unlike Tensor Parallelism, each GPU handles independent Expert computation, so the communication pattern remains relatively simple as you scale.
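
As a rough illustration of what "placing each Expert on a different GPU" means, the sketch below maps expert ids to EP ranks with a simple contiguous layout and counts how much routed work lands on each rank. The contiguous layout and the helper names are my assumptions for the example; real systems may place experts differently, especially once load balancing is involved.

```python
from collections import Counter

def expert_to_rank(expert_id, num_experts, ep_size):
    """Contiguous placement: rank r hosts experts [r*per_rank, (r+1)*per_rank)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return expert_id // per_rank

def tokens_per_rank(routed_experts, num_experts, ep_size):
    """Count how many (token, expert) pairs each rank has to process."""
    counts = Counter()
    for experts in routed_experts:
        for e in experts:
            counts[expert_to_rank(e, num_experts, ep_size)] += 1
    return [counts[r] for r in range(ep_size)]

# 256 routed experts over 8 GPUs means 32 experts per GPU.
print(expert_to_rank(3, 256, 8), expert_to_rank(40, 256, 8), expert_to_rank(255, 256, 8))

# Three tokens with top-2 routing each (expert ids are made up).
load = tokens_per_rank([[3, 40], [5, 7], [200, 255]], 256, 8)
print(load)
```

Even in this tiny example the per-rank counts are uneven, which is the load-imbalance problem that comes up again later.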

When comparing EP directly against a TP-only configuration, throughput improvements of up to 5x have been reported for DeepSeek-V3 under conditions of 2,000-token inputs and sufficient batch sizes. Results vary by batch size and sequence length, so measuring against your own workload is advisable.

In practice, large-scale deployments often combine EP with TP rather than using EP alone. Since NVLink provides sufficient bandwidth within a node, keeping TP small and using EP for inter-node scaling — an EP + TP hybrid — can be a flexible choice.

SGLang EP's 4-Stage Pipeline

SGLang cleanly separates the MoE forward pass into four distinct stages. This structure matters because each stage can have its own independent kernel and optimization strategy applied to it.

| Stage | Role | Key Operation |
| --- | --- | --- |
| Dispatch | Send tokens to their assigned Expert GPU | all-to-all communication |
| Pre-permute | Reorder tokens by Expert assignment | Memory rearrangement |
| Core Runner | Execute Expert computation on the assigned GPU | Grouped GEMM |
| Post-permute + Combine | Restore original order and collect results | all-to-all communication |
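
The four stages can be traced end to end with a toy model. This is a single-process sketch, not SGLang code: tokens are plain floats, each "expert" is a stand-in function that multiplies by (expert_id + 1), and the all-to-all exchanges are simulated with Python lists.

```python
def moe_forward(tokens, routes, num_experts, ep_size):
    """Toy EP forward pass over scalar tokens.

    tokens: list of floats; routes[i]: list of (expert_id, weight) for token i.
    Each expert is a stand-in function x -> x * (expert_id + 1).
    """
    per_rank = num_experts // ep_size

    # 1. Dispatch: bucket (token_index, expert_id, weight, value) by destination rank.
    inbox = [[] for _ in range(ep_size)]
    for i, x in enumerate(tokens):
        for expert_id, weight in routes[i]:
            inbox[expert_id // per_rank].append((i, expert_id, weight, x))

    outputs = [0.0] * len(tokens)
    for rank_items in inbox:
        # 2. Pre-permute: make tokens of the same expert contiguous.
        rank_items.sort(key=lambda item: item[1])
        # 3. Core runner: apply each expert to its tokens.
        results = [(i, w, x * (e + 1)) for i, e, w, x in rank_items]
        # 4. Post-permute + combine: send results home, weighted sum per token.
        for i, w, y in results:
            outputs[i] += w * y
    return outputs

tokens = [1.0, 2.0]
routes = [[(0, 0.5), (3, 0.5)], [(1, 1.0)]]
result = moe_forward(tokens, routes, num_experts=4, ep_size=2)
print(result)  # [2.5, 4.0]
```

Token 0 visits experts on both ranks, so it crosses the "network" twice: once on the way out in Dispatch and once on the way back in Combine.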

I initially wondered why the Pre-permute stage was necessary. The reason is that the Grouped GEMM kernel used in the Core Runner requires tokens assigned to the same Expert to be laid out contiguously in memory to operate efficiently. If token order is scrambled after Dispatch, the GPU ends up hunting through memory for each token individually, and Tensor Core utilization drops significantly.
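
Here is a small sketch of what Pre-permute and Post-permute compute, assuming tokens arrive tagged with a local expert id. The function names are hypothetical; the point is the pair of outputs a grouped GEMM needs: a permutation that makes each expert's tokens contiguous, and offsets marking where each group starts.

```python
def permute_by_expert(expert_ids, num_local_experts):
    """Return (order, offsets): a permutation that groups tokens by expert,
    and offsets so that expert e owns order[offsets[e]:offsets[e + 1]]."""
    # Stable sort keeps the arrival order within each expert group.
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    offsets = [0] * (num_local_experts + 1)
    for e in expert_ids:
        offsets[e + 1] += 1
    for e in range(num_local_experts):
        offsets[e + 1] += offsets[e]
    return order, offsets

def unpermute(values, order):
    """Invert the permutation: put values back in arrival order."""
    restored = [None] * len(values)
    for pos, original_index in enumerate(order):
        restored[original_index] = values[pos]
    return restored

arrived = [2, 0, 1, 0, 2, 2]   # local expert id of each token, in arrival order
order, offsets = permute_by_expert(arrived, num_local_experts=3)
grouped = [arrived[i] for i in order]
print(order)    # [1, 3, 2, 0, 4, 5]
print(offsets)  # [0, 2, 3, 6]
print(grouped)  # [0, 0, 1, 2, 2, 2]
```

With the offsets in hand, each expert's GEMM reads one contiguous slice, and `unpermute` restores the original order before Combine.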

It's natural to wonder whether having two all-to-all communication rounds doubles the communication cost — and that question leads directly to the core topic of the next section.


Practical Application

Strategy 1: SBO and TBO — Filling Communication Wait Time with Computation

Inter-node communication becoming a bottleneck is inevitable. NVLink on H100 provides roughly 900 GB/s of aggregate bandwidth per GPU within a node, but NDR InfiniBand between nodes runs at 400 Gb/s per port, which translates to about 50 GB/s. Note the different units (bytes versus bits): in practice, inter-node communication is more than 10x slower than intra-node. As node count grows, this gap dominates overall latency.
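
The unit conversion is easy to get wrong, so here is the arithmetic spelled out. The payload size is an assumption chosen for illustration (4,096 tokens at hidden size 7,168 in BF16), and the transfer times are ideal figures that ignore latency and protocol overhead. The two NVLink values are nominal aggregate figures for A100-class and H100-class GPUs; substitute the one for your hardware.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link rate in Gb/s (bits) to GB/s (bytes)."""
    return gbit_per_s / 8

def transfer_ms(num_bytes, gbyte_per_s):
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return num_bytes / (gbyte_per_s * 1e9) * 1e3

ib = gbit_to_gbyte_per_s(400)   # NDR InfiniBand: 400 Gb/s is 50 GB/s
# Assumed payload: 4,096 tokens x hidden size 7,168 x 2 bytes (BF16).
payload = 4096 * 7168 * 2

print(f"InfiniBand: {ib:.0f} GB/s, {transfer_ms(payload, ib):.2f} ms")
for nvlink in (600.0, 900.0):   # nominal NVLink figures; pick the one for your GPUs
    print(f"NVLink {nvlink:.0f} GB/s: {transfer_ms(payload, nvlink):.3f} ms, "
          f"{nvlink / ib:.0f}x faster than InfiniBand")
```

Roughly a millisecond per exchange sounds small, but it is paid twice per MoE layer, which is why hiding it matters.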

SGLang's approach is not to make communication faster, but to run other computation in parallel while communication is in progress.

SBO (Single-Batch Overlap): While all-to-all communication is in flight for one batch, the Shared Expert computation — which every token passes through — is processed simultaneously. This eliminates the idle time GPUs spend waiting on the network.

TBO (Two-Batch Overlap): The batch is split into two micro-batches, so while the first micro-batch is communicating, the second is computing. Throughput rises because communication is hidden behind computation, and peak memory usage falls by roughly half because each micro-batch is half the size.
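
To see why splitting the batch helps, the following toy scheduler models one MoE layer as four stages that alternate between a compute resource and a communication resource. The stage durations are invented, and the model assumes that halving the batch halves every stage, which real kernels only approximate. It illustrates the TBO idea only and does not model SBO's shared-expert overlap.

```python
def makespan(stages, num_micro_batches):
    """Greedy schedule of a layer's stages over two resources.

    stages: list of (resource, duration) for the *full* batch, run in order.
    Splitting into micro-batches is assumed to divide every duration evenly.
    A resource ("compute" or "comm") runs one stage at a time.
    """
    free_at = {"compute": 0.0, "comm": 0.0}   # when each resource is next idle
    ready_at = [0.0] * num_micro_batches       # when each micro-batch can continue
    for resource, duration in stages:
        for m in range(num_micro_batches):
            start = max(ready_at[m], free_at[resource])
            end = start + duration / num_micro_batches
            ready_at[m] = end
            free_at[resource] = end
    return max(ready_at)

# attention/gate, dispatch, expert GEMM, combine (arbitrary time units)
layer = [("compute", 4), ("comm", 2), ("compute", 4), ("comm", 2)]
print(makespan(layer, 1))  # 12.0: every stage waits for the previous one
print(makespan(layer, 2))  # 9.0: one micro-batch computes while the other communicates
```

The same total work finishes in 9 units instead of 12 because most of the communication now runs while the compute resource is busy with the other micro-batch.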

Below is the CLI command to launch an SGLang server with an EP configuration. This is based on SGLang v0.4 and above; flag names can change between versions, so checking the official documentation for your version alongside this is recommended.

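bash
# Single-node 8-GPU EP configuration
#   --tp-size 1            minimize TP
#   --ep-size 8            use all 8 GPUs in the node for EP
#   --dp-size 1            single-node baseline
#   --enable-dp-attention  eliminate KV cache replication
# Flag semantics vary by release: in some SGLang versions --tp-size is the total
# GPU count per replica and --ep-size must divide it, so confirm these values
# with `python -m sglang.launch_server --help` before launching.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 1 \
  --ep-size 8 \
  --dp-size 1 \
  --enable-dp-attention

# For multi-node scaling (12 nodes × 8 GPUs = 96 H100s)
# add --dp-size 12 and run the same command on each node,
# together with the multi-node flags (--nnodes, --node-rank, --dist-init-addr)
# overlap strategy is specified via config file or version-specific additional flags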

| Overlap Strategy | Characteristics | Recommended When |
| --- | --- | --- |
| SBO | Single batch, simple implementation, stable latency | Latency-sensitive services, first time applying EP |
| TBO | Dual batch, simultaneous throughput↑ + memory↓ | Throughput-optimized batch workloads |

If you've enabled EP but performance falls short of expectations, the first thing to check is whether communication overlap is actually being applied. If all-to-all is running as a synchronous blocking call, more than half of EP's advantage disappears.

Backend Selection: DeepEP and RDMA-Based Asynchronous Communication

With standard NCCL, the all-to-all is a blocking step in the forward pass: the Expert GEMMs cannot start until it completes, and unless something else is scheduled alongside it, the GPU simply stalls and waits. DeepEP was developed to solve this problem.

RDMA (Remote Direct Memory Access): Technology that transfers data directly between memory buffers on different machines (GPU memory included, when GPUDirect is available) without involving the remote CPU. NCCL can also use RDMA over InfiniBand, but its general-purpose collectives pass through multiple software layers; an EP-dedicated library can issue the same transfers with far less CPU overhead and latency. Particularly effective in InfiniBand environments.

DeepEP implements EP-dedicated all-to-all as asynchronous dispatch and combine kernels, using NVLink within a node and RDMA between nodes, and provides two modes suited to different workload characteristics.

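bash
# Specify DeepEP as the SGLang backend
export SGLANG_EP_BACKEND=deepep

# Set exactly one of the following, depending on the node's role.
# Prefill nodes: high-throughput Normal mode
export DEEPEP_MODE=normal

# Decode nodes: CUDA Graph-compatible low-latency mode
export DEEPEP_MODE=low_latency

# Some SGLang releases select the mode with the --deepep-mode launch flag instead;
# check `python -m sglang.launch_server --help` for your version.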

Support for these environment variables may vary by SGLang version. It is recommended to verify against the release notes and official documentation for your version of SGLang before deploying.

DeepEP Low-latency mode: A kernel designed exclusively for Decode. Its hook-based design lets communication overlap with computation without occupying SMs (Streaming Multiprocessors, the GPU's compute units), preventing the paradoxical situation where communication overlap actually reduces computational throughput.

Using the same mode on both Prefill and Decode nodes means neither will run optimally. In PD Disaggregation configurations, it's important to specify the mode separately per node type.

Beyond DeepEP, alternatives such as MoriEP (with direct SGLang integration) and UCCL-EP (focused on cloud portability) have been emerging as competitors since 2025. They all share the same direction: implementing EP-dedicated communication without depending on NCCL.

Diagnosing Memory Fragmentation: DeepSeek-V3's 18,432 Problem

This is the part that silently eats away at performance in production. It's more insidious precisely because it's invisible.

DeepSeek-V3's FFN intermediate dimension is 18,432. With TP32 applied, 18,432 ÷ 32 = 576 — which is not a multiple of 128, the alignment boundary for H100 Tensor Cores. The hardware internally adds padding, and Tensor Core utilization quietly drops. It's a classic pattern where GPU utilization metrics look healthy but actual computational efficiency is low. I only traced this problem after profiling SM utilization directly.

python
# Misalignment fragmentation diagnosis: check alignment and padding ratio in TP configuration
ffn_hidden_dim = 18432
tp_size = 32
alignment = 128

shard_size = ffn_hidden_dim // tp_size    # 576
is_aligned = (shard_size % alignment) == 0

print(f"Shard size: {shard_size}")                         # 576
print(f"{alignment}-aligned: {is_aligned}")                # False
if not is_aligned:
    # Round the shard up to the next multiple of the alignment boundary
    aligned_size = ((shard_size + alignment - 1) // alignment) * alignment   # 640
    padding_ratio = (aligned_size - shard_size) / aligned_size               # 64 / 640
    print(f"Padding ratio: {padding_ratio:.1%}")           # 10.0% wasted
    print("→ Recommend monitoring actual SM utilization with nvidia-smi dmon")
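
The same check is more useful as a sweep across TP degrees, since it shows where alignment breaks. The sketch below is a hypothetical helper built on the same arithmetic; it assumes that rounding up to the next multiple of 128 is a fair model of the waste, which is a simplification of what real kernels do.

```python
def shard_report(dim, tp_size, alignment=128):
    """Return (shard_size, padded_size, padding_ratio) for one TP degree."""
    if dim % tp_size != 0:
        raise ValueError("dim must be divisible by tp_size")
    shard = dim // tp_size
    padded = -(-shard // alignment) * alignment   # round up to the alignment
    return shard, padded, (padded - shard) / padded

for tp in (1, 2, 4, 8, 16, 32, 64):
    shard, padded, ratio = shard_report(18432, tp)
    status = "aligned" if ratio == 0 else f"padded to {padded} ({ratio:.1%} waste)"
    print(f"TP{tp:<3} shard={shard:<6} {status}")
```

Because 18,432 = 144 × 128 and 144 = 16 × 9, shards stay aligned up to TP16 and break from TP32 onward, where the shard is 576 (TP64 gives 288, padded to 384).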

KV cache fragmentation is a separate issue. Under high concurrent streaming requests, internal fragmentation accumulates in the PagedAttention block pool due to variable-length sequences, and cases have been reported where reserving more than 2x the actual VRAM in use becomes necessary.
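
A minimal sketch of where internal fragmentation comes from in a paged KV pool: each sequence is allocated whole blocks, so the unused tail of its last block is wasted. The sequence lengths and block sizes are made up, and the model covers only this tail waste, not any additional headroom a scheduler may reserve.

```python
def paged_kv_waste(seq_lens, block_size):
    """Return (allocated_slots, used_slots, wasted_fraction) for a paged KV pool.

    Each sequence gets ceil(len / block_size) blocks; the unused tail of the
    last block is internal fragmentation.
    """
    allocated = sum(-(-n // block_size) * block_size for n in seq_lens)
    used = sum(seq_lens)
    return allocated, used, (allocated - used) / allocated

lens = [17, 33, 5, 129, 64, 1]   # hypothetical in-flight sequence lengths
for block_size in (16, 64):
    allocated, used, waste = paged_kv_waste(lens, block_size)
    print(f"block={block_size}: allocated={allocated} used={used} waste={waste:.1%}")
```

With many short or just-over-a-boundary sequences, larger blocks waste a bigger share of the pool, so block size is worth checking when reserved and used VRAM diverge.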

| Mitigation Strategy | Effect | Notes |
| --- | --- | --- |
| Prefer DP over TP | Eliminates FFN alignment fragmentation at the source | Combines naturally with EP |
| DP Attention | Eliminates KV cache replication | --enable-dp-attention flag |
| MLA | Up to 93.3% reduction in KV cache | Built into DeepSeek-V3 by default |
| FP8 KV cache quantization | ~50% reduction in cache memory | Negligible quality loss |
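
For a sense of scale, here is a back-of-the-envelope KV cache sizing under MLA. The layer count and dimensions are the values as I read them from the published DeepSeek-V3 config (61 layers, latent rank 512, RoPE dimension 64); treat them as assumptions and verify against the config you deploy. The FP8 figure ignores any scaling-factor metadata.

```python
def mla_kv_bytes_per_token(num_layers, kv_lora_rank, rope_dim, bytes_per_elem):
    """MLA caches one compressed latent plus one RoPE key per token per layer."""
    return num_layers * (kv_lora_rank + rope_dim) * bytes_per_elem

# Assumed DeepSeek-V3 values; verify before relying on them.
LAYERS, KV_LORA_RANK, ROPE_DIM = 61, 512, 64

bf16 = mla_kv_bytes_per_token(LAYERS, KV_LORA_RANK, ROPE_DIM, 2)
fp8 = mla_kv_bytes_per_token(LAYERS, KV_LORA_RANK, ROPE_DIM, 1)
print(bf16, fp8)  # 70272 35136
print(f"FP8 saves {1 - fp8 / bf16:.0%}; "
      f"100k cached tokens need {bf16 * 100_000 / 2**30:.2f} GiB in BF16")
```

Under these assumptions a cached token costs about 69 KiB in BF16, and moving to FP8 halves that, which matches the roughly 50% figure in the table.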

Pros and Cons Analysis

Advantages

Personally, I think EP's greatest strength is that its scaling behavior is predictable. With TP, every added GPU joins the all-reduce that sits on the critical path of each layer, so communication cost climbs quickly as the group grows. With EP's all-to-all, each node only exchanges the tokens routed to its own Experts and their results, so the growth in communication as node count increases is much more gradual than with TP. The difference was quite pronounced when I compared the two configurations directly.

| Item | Description |
| --- | --- |
| Enables serving very large models | Distributes 685B+ models that exceed single-GPU VRAM limits across multiple nodes |
| Maximizes sparse activation efficiency | Only activated Experts compute, dramatically reducing effective computation compared to Dense models |
| More gradual scaling than TP | Uses all-to-all without all-reduce, so communication overhead grows more slowly with additional nodes than TP |
| Independent per-stage optimization | Dispatch, GEMM, and Combine can each be swapped for optimal kernels independently |
| Automated load balancing with EPLB | Dynamically redistributes token-heavy Experts to minimize GPU idle time |

Disadvantages and Caveats

Honestly, the most bewildering moment when I first encountered EP in production was asking "the configuration looks right — why is it slow?" In most cases the answer was the second or third item in the table below.

| Item | Description | Mitigation |
| --- | --- | --- |
| Token load imbalance | If tokens concentrate on specific Experts, that GPU becomes the bottleneck for everything | Apply EPLB, configure Expert redundancy |
| Inter-node bandwidth gap | NVLink (~900 GB/s on H100) vs. InfiniBand effective ~50 GB/s, more than 10x difference | Prioritize intra-node EP, hide inter-node with SBO/TBO overlap |
| all-to-all blocking | Standard NCCL is synchronous blocking, causing GPU stalls | Replace with DeepEP/MoriEP asynchronous RDMA backend |
| SM occupancy conflicts | If overlap implementation over-consumes SMs, computational throughput paradoxically decreases | Use DeepEP hook-based design, choose mode carefully |
| FFN alignment fragmentation | Tensor Core alignment boundary violations during TP sharding reduce utilization | Reduce TP and shift to DP scaling |
| KV cache internal fragmentation | Variable-length sequences waste block pool capacity, requiring reservations over 2x actual usage | Apply MLA + FP8 quantization combination |

EPLB (Expert Parallelism Load Balancer): An Expert placement load balancing algorithm published by DeepSeek. It analyzes token routing statistics and replicates heavily loaded Experts across multiple GPUs. It is now being adopted as standard in serving stacks for major MoE models, including DeepSeek-V3, Qwen3, and Kimi-K2.
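
The core idea of replicating hot Experts and then packing them can be sketched in a few lines. This is a simplified illustration of the principle, not DeepSeek's published EPLB algorithm: it hands each spare slot to the Expert with the highest load per replica, then places replicas heaviest-first onto the least-loaded GPU with a free slot. The function name and load numbers are hypothetical, and the sketch does not prevent two replicas of one Expert from landing on the same GPU.

```python
import heapq

def replicate_and_place(expert_load, num_gpus, slots_per_gpu):
    """Simplified load balancing: replicate hot experts, then pack greedily.

    expert_load: tokens routed to each expert. num_gpus * slots_per_gpu must be
    at least the number of experts; the extra slots hold replicas.
    Returns (placement, gpu_load) where placement[g] is a list of expert ids.
    """
    num_slots = num_gpus * slots_per_gpu
    if num_slots < len(expert_load):
        raise ValueError("not enough slots for one copy of every expert")
    replicas = [1] * len(expert_load)
    # Give each spare slot to the expert with the highest load per replica.
    for _ in range(num_slots - len(expert_load)):
        hot = max(range(len(expert_load)), key=lambda e: expert_load[e] / replicas[e])
        replicas[hot] += 1

    # One item per replica, carrying an equal share of its expert's load.
    items = [(expert_load[e] / replicas[e], e)
             for e in range(len(expert_load)) for _ in range(replicas[e])]
    items.sort(reverse=True)

    # Heaviest item first onto the least-loaded GPU that still has a free slot.
    heap = [(0.0, g) for g in range(num_gpus)]
    placement = [[] for _ in range(num_gpus)]
    gpu_load = [0.0] * num_gpus
    for load, e in items:
        current, g = heapq.heappop(heap)
        while len(placement[g]) >= slots_per_gpu:   # skip GPUs that are full
            current, g = heapq.heappop(heap)
        placement[g].append(e)
        gpu_load[g] = current + load
        heapq.heappush(heap, (gpu_load[g], g))
    return placement, gpu_load

# Expert 0 receives half of all tokens; one spare slot lets it be replicated.
placement, gpu_load = replicate_and_place([90, 30, 20, 20, 20], num_gpus=2, slots_per_gpu=3)
print(placement)  # [[0, 1, 2], [0, 4, 3]]
print(gpu_load)   # [95.0, 85.0]
```

Without the replica, whichever GPU hosts Expert 0 would carry at least 90 of the 180 tokens plus its other Experts; with it, the hot Expert's traffic is split and the two GPUs end up at 95 and 85.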

Most Common Mistakes in Production

  1. Scaling TP first: The instinct of "I need more GPUs, so let me increase TP first" backfires with MoE. The larger TP becomes, the worse FFN alignment fragmentation gets, and all-reduce communication scales up proportionally. I've frequently seen teams go through pain by approaching it in this order. It's better to evaluate EP + DP combinations first.

  2. Applying EP without SBO/TBO: If you've enabled EP but performance is below expectations, communication overlap is often simply not active. If all-to-all is running as a synchronous blocking call, more than half of EP's advantage is gone.

  3. Using the same settings for Prefill and Decode: DeepEP's Normal mode and Low-latency mode are not just options — they are separate kernels designed for different workload characteristics. In PD Disaggregation configurations, applying the same settings to both node types means neither will operate optimally.


Closing Thoughts

At the scale of 96 H100s, the key is the EP + DP combination, and the benefits only materialize when you solve all three problems together: communication overlap (SBO/TBO), load balancing (EPLB), and memory alignment. If you're applying EP for the first time, starting with SBO, verifying workload characteristics, and then transitioning to TBO is the safest path.

Here are three steps you can take right now.

  1. Pre-diagnose memory fragmentation: Before finalizing your deployment configuration, check whether your TP shard size is a multiple of 128. Use the diagnostic code above to calculate the padding ratio, and observe actual SM utilization patterns with nvidia-smi dmon — this will make subsequent bottleneck analysis much easier. If there's a large gap between reserved and actual VRAM usage, that's a signal that internal fragmentation is underway.

  2. Small-scale EP test: Start with a single node of 8 GPUs using --ep-size 8 --dp-size 1 and compare throughput before and after applying SBO. sglang.bench_serving makes this easy to measure. Enabling EPLB and DP Attention together at this stage lets you isolate the contribution of each component.

  3. Verify environment and bandwidth: If you're preparing for a multi-node configuration, use python -m sglang.check_env to check your NCCL version and GPU-to-GPU NVLink topology, and ib_write_bw to measure effective InfiniBand bandwidth in advance. If inter-node bandwidth is lower than expected, this will help you determine whether SBO alone is sufficient or whether you need the DeepEP asynchronous backend.


References

  • Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs | LMSYS Blog
  • Expert Parallelism | SGLang Official Documentation
  • Expert Parallelism for MoE Models | DeepWiki
  • DeepEP: an efficient expert-parallel communication library | GitHub
  • UCCL-EP: Portable Expert-Parallel Communication | arXiv
  • Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling | arXiv
  • Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | arXiv
  • SGLang Official GitHub Repository
  • PD Disaggregation | SGLang Official Documentation