© 2026 DEV BAK - TECH BLOG. All rights reserved.
DEV BAK - TECH BLOG
AI

FP4 Quantization + Blackwell GPU: Conditions for 4× Throughput over H100 and When Not to Use It

RTX 5090 / B200 / vLLM FP4 / llm-compressor Results

When I first got my hands on the RTX 5090, I honestly thought, "Just another new GPU." But after seeing the FP4 quantization numbers, I changed my mind. In the MLPerf v5.0 official benchmark, a single B200 delivered 10,755 tokens per second on Llama 2 70B inference, and a DGX B200 8-GPU system recorded 3.1× the throughput of a DGX H200. The key point is that these are certified benchmark figures, not marketing numbers. And right at that moment, I started wondering, "Does this apply to my models too?" — so I ran the code myself to find out.

This article covers three things:

  • Why FP4 is a meaningful technology right now, and how it ties into the Blackwell architecture
  • Three scenarios validated with real code — LLM serving, MoE models, and image generation
  • When you should NOT use FP4 — accuracy loss figures by task and mitigation strategies

Rather than vague warnings like "quantization has accuracy loss, so be careful," this article goes all the way to showing how much loss appears in which tasks. FP4 only makes sense when paired with a Blackwell GPU, and knowing exactly those conditions and exceptions is the first step to actually using this technology. If your team is feeling the weight of GPU server costs, I'd recommend reading to the end.


Core Concepts

Why FP4 Differs from INT4: Why Nonlinear Representation Is Advantageous for LLM Weight Distributions

I initially thought, "What's the difference between two 4-bit formats?" The short answer is that the representation scheme itself is different. INT4 can only represent integer grid points in the range −8 to 7. NVFP4, by contrast, is a floating-point format with an exponent field, so it represents values at nonlinear intervals. Why does this matter? LLM weight distributions have most values clustered near zero, with rare extreme outliers mixed in. INT4 faces a dilemma: if you widen the scale to accommodate those outliers, precision near zero degrades. FP4 mitigates this problem to a meaningful degree thanks to its nonlinear spacing.
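To make the spacing difference concrete, the sketch below enumerates the values a 4-bit float can take and compares their gaps with INT4's uniform grid. It assumes the E2M1 element layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1), which is what NVFP4 uses for its elements; the function name is made up for this illustration.

```python
def e2m1_values():
    """Enumerate every value of a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit)."""
    values = set()
    for sign in (1.0, -1.0):
        for exponent in range(4):
            for mantissa in range(2):
                if exponent == 0:
                    magnitude = mantissa * 0.5  # subnormal range: 0 or 0.5
                else:
                    magnitude = (1 + mantissa / 2) * 2.0 ** (exponent - 1)
                values.add(sign * magnitude)
    return sorted(values)


fp4_positive = [v for v in e2m1_values() if v >= 0]
int4_grid = list(range(-8, 8))

# Distance between neighbouring representable values
fp4_gaps = [b - a for a, b in zip(fp4_positive, fp4_positive[1:])]
int4_gaps = {b - a for a, b in zip(int4_grid, int4_grid[1:])}

print("FP4 magnitudes:", fp4_positive)
print("FP4 gaps:      ", fp4_gaps)
print("INT4 gaps:     ", int4_gaps)
```

The FP4 gaps are small near zero and grow toward the edges of the range, which is the shape you want when most weights sit near zero and only a few are large.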

A brief breakdown of NVFP4's internal structure:

| Component | Description |
|---|---|
| Per-block FP8 scale | One scale factor shared across every 16 values |
| Per-tensor FP32 scale | An additional scale applied across the entire tensor |
| Tensor Core accumulation | FP4 products are accumulated at higher precision (FP32 on Blackwell Tensor Cores) to limit rounding error |

Looking at this table naturally raises two questions.

Why accumulate at a higher precision than the inputs? Each FP4 product is coarse, and a single dot product sums thousands of them. A wide accumulator keeps that sum from compounding rounding error, and it costs no extra memory bandwidth because partial sums stay on-chip. The bandwidth savings of FP4 come from the 4-bit weights and activations that are read from memory, not from the accumulator width.
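The block-scaling idea in the breakdown above can be sketched in plain Python. This is a simplified illustration, not NVIDIA's implementation: the block scale is kept as an ordinary float instead of being rounded to FP8, the per-tensor FP32 scale is left out, and the helper names are invented for this example. It assumes E2M1 elements, whose largest magnitude is 6.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def round_to_e2m1(x):
    """Round x to the nearest representable E2M1 value (ties go to the smaller magnitude)."""
    magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return -magnitude if x < 0 else magnitude


def quantize_block(block):
    """Return (scale, codes) such that scale * code approximates each input value."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude onto 6
    codes = [round_to_e2m1(v / scale) for v in block]
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


# 15 small weights plus one outlier in a single 16-value block
block = [0.01 * i for i in range(-7, 8)] + [0.9]
assert len(block) == BLOCK_SIZE

scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}  worst absolute error={worst:.4f}")
```

The outlier sets the scale only for its own block of 16 values. Neighbouring blocks without an outlier get a much smaller scale and therefore a finer grid, which is the practical benefit of a small block size.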

Why is the memory ratio 0.29× instead of 0.25×? A naive bit ratio (4/16) would give 0.25×, but NVFP4 has block-scale overhead. 16 values (64 bits) + 1 FP8 scale (8 bits) = 72 bits. Storing the same 16 values in FP16 would require 256 bits. Therefore 72 ÷ 256 = 0.28125, or about 0.28×; the 0.29× quoted in this article is that figure rounded up slightly. INT4 schemes typically share one scale across a much larger group (128 values is common), so their overhead is smaller and they land closer to 0.25×, but they give up that fine-grained precision control in exchange.

For reference, MXFP4 (the format led by Microsoft/Intel) uses a block size of 32 to reduce overhead. NVFP4's block of 16 versus MXFP4's block of 32 reflects a design tradeoff between precision and compression efficiency. If you're using NVIDIA GPUs, NVFP4 is the go-to; if cross-platform compatibility matters, it's worth tracking MXFP4 standardization trends as well.
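The overhead arithmetic above generalizes to any block size. The sketch below computes the storage ratio against FP16, assuming one 8-bit scale per block for both formats (FP8 for NVFP4 as described above; MXFP4's shared scale is also 8 bits wide) and ignoring the per-tensor scale, which is negligible for large tensors. The function name is illustrative.

```python
def ratio_vs_fp16(value_bits, block_size, scale_bits):
    """Storage per value, relative to 16-bit storage, including per-block scale overhead."""
    bits_per_value = value_bits + scale_bits / block_size
    return bits_per_value / 16


nvfp4_ratio = ratio_vs_fp16(value_bits=4, block_size=16, scale_bits=8)
mxfp4_ratio = ratio_vs_fp16(value_bits=4, block_size=32, scale_bits=8)

print(f"NVFP4 (block 16): {nvfp4_ratio:.5f}x of FP16")
print(f"MXFP4 (block 32): {mxfp4_ratio:.5f}x of FP16")
```

Doubling the block size halves the scale overhead, at the cost of forcing twice as many values to share one scale.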

Term: PTQ (Post-Training Quantization) — A technique that quantizes a trained model without retraining. It can be applied quickly but requires a calibration dataset and may yield lower accuracy than quantization-aware training (QAT).

W4A4: Why Throughput Jumps 4×

What's important is that FP4 doesn't just compress weights. W4A4 reduces both weights and activations to FP4. Compared to W4A16 or W4A8, which compress only the weights, W4A4 simultaneously reduces memory bandwidth and computation, leading to far greater throughput gains. The prerequisite for this combination is that Blackwell's 5th-generation Tensor Cores handle FP4 operations natively.

Key point: FP4's speed advantage does not appear on pre-Blackwell GPUs. On Hopper (H100) or Ampere, it is handled via software emulation, eliminating or even reversing the speed benefit. Always verify the GPU architecture first.

Precision Format Comparison at a Glance

| Format | Bits | Memory (vs FP16) | Blackwell Native |
|---|---|---|---|
| FP16 | 16 | 1× | ✓ |
| FP8 | 8 | 0.5× | ✓ |
| NVFP4 | 4 | 0.29× (including block scale) | ✓ (exclusive) |
| INT4 | 4 | 0.25× | △ (emulated) |

In this table, the last two rows are the most practically significant. NVFP4 compresses slightly less than INT4 at 0.29×, but runs natively on Blackwell — and that architectural difference is the premise for all the code examples that follow.


Practical Application

Example 1: LLM Quantization — Llama 3 FP4 Serving with llm-compressor + vLLM

If you're already in the vLLM ecosystem, you can build a W4A4 FP4 quantization pipeline with llm-compressor. The code is surprisingly simple once you have a calibration dataset ready. The first time I did it, I thought, "Is this really all there is?" — the boilerplate is minimal.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize every Linear layer to NVFP4 (W4A4), except the output head
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",          # Apply W4A4 FP4
    ignore=["lm_head"],      # Leave the output layer unquantized
)

# Calibrate on the dataset (512 samples recommended) and save the checkpoint
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
    output_dir="./llama3-nvfp4",
)
```

```bash
# Install FP4-supported versions (as of May 2025). The minimums below are
# permissive; NVFP4 needs recent releases, so upgrade existing installs.
pip install --upgrade "llmcompressor>=0.3.0" "vllm>=0.4.2"

# Serve the quantized model with vLLM. The quantization scheme is read from
# the checkpoint's config, so no --quantization flag is passed here.
vllm serve ./llama3-nvfp4 \
  --dtype float16
  # ~2,400 tok/s on RTX 5090 (approximately 2.8× improvement over FP16)
```

| Parameter | Description |
|---|---|
| scheme="NVFP4" | Apply W4A4 FP4 quantization |
| ignore=["lm_head"] | Keep final classification layer in FP16 (stabilizes accuracy) |
| num_calibration_samples | 512–1024 recommended. Fewer is faster but hurts accuracy |

Version note: NVFP4 support is only stable from certain versions onward. If you install via pip install llmcompressor and hit FP4-related errors, suspect a version conflict first.
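Before digging into an FP4-related error, it helps to see exactly which versions are installed. A minimal sketch using only the standard library follows; the package list is an assumption (compressed-tensors is included because llm-compressor depends on it), so adjust it to your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages involved in the llm-compressor + vLLM pipeline (assumed list)
for name in ("llmcompressor", "vllm", "compressed-tensors", "torch"):
    print(f"{name:20s} {installed_version(name) or 'not installed'}")
```

Compare the printed versions against each project's release notes for NVFP4 support before assuming the model or the recipe is at fault.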

Example 2: Selective MoE Model Quantization — NVIDIA ModelOpt

Once you've done LLM quantization, it's natural to want to apply it to MoE models like Mixtral or DeepSeek-V3. But this is where I made a mistake early on. I pushed all layers to FP4 at once and ended up with broken routing behavior.

The routing layer in an MoE model is the core component that decides which expert receives each token. Applying FP4 to this layer destabilizes the routing decisions themselves, degrading overall output quality. The nvfp4_experts_only setting in ModelOpt is the workaround for this problem.

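```python
import torch
import modelopt.torch.quantization as mtq

# Assumed to exist already: `model` (the loaded MoE model) and
# `calibration_dataloader` (a torch DataLoader yielding dict batches).

# forward_loop: passes calibration data through the model to collect the value
# statistics used to determine quantization scales
def forward_loop(model):
    with torch.no_grad():
        for batch in calibration_dataloader:
            model(**batch)

# Illustration of the intent only (not passed to mtq.quantize below).
# num_bits is (exponent bits, mantissa bits): (2, 1) is FP4 E2M1, (4, 3) is
# FP8 E4M3. Patterns are applied in order, so the broad rule comes first.
# Verify field names against your installed ModelOpt version.
quant_config = {
    "quant_cfg": {
        "*": {"num_bits": (4, 3), "axis": None},  # FP8 everywhere else
        "*experts*": {                             # NVFP4 for expert layers
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "axis": None,
        },
    }
}

# Use a predefined config (recommended)
model = mtq.quantize(model, mtq.NVFP4_EXPERTS_ONLY_CFG, forward_loop)
mtq.print_quant_summary(model)  # Check per-layer quantization results
```

```bash
# The minimum below is permissive; NVFP4 presets need a recent release
pip install --upgrade "nvidia-modelopt[torch]>=0.17.0"
```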

nvfp4_experts_only applies FP4 only to the expert layers, while keeping attention and non-expert layers at FP8. I think it's a practically well-balanced setting for the memory-versus-accuracy tradeoff.
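The selection logic is easy to reason about with plain wildcard matching. The sketch below is not ModelOpt code: it uses the standard library's fnmatch, a first-match-wins rule list invented for this example, and module names written in the style of a Mixtral checkpoint, to show which layers an "experts only" policy would touch.

```python
from fnmatch import fnmatch

# Module names in the style of a Mixtral-like MoE model (illustrative)
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.block_sparse_moe.gate",          # the router
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.block_sparse_moe.experts.7.w2",
]

# Ordered rules for this sketch: the first matching pattern wins
rules = [
    ("*experts*", "NVFP4"),
    ("*", "FP8"),
]


def assign_precision(name):
    for pattern, precision in rules:
        if fnmatch(name, pattern):
            return precision
    raise ValueError(f"no rule matches {name}")


plan = {name: assign_precision(name) for name in layer_names}
for name, precision in plan.items():
    print(f"{precision:6s} {name}")
```

Printing a plan like this before quantizing is a cheap way to confirm that the router never ends up on the FP4 side of the split.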

Example 3: Image Generation — Optimizing the FLUX Model with TorchAO + NVFP4

This example requires a brief explanation of one important design choice up front. Why use W4A16 (weights-only FP4) instead of the W4A4 described earlier? Diffusion models currently don't fully support activation quantization (A4). Transformer-based diffusion architectures have activation patterns that differ from LLMs, causing severe image quality degradation when A4 is applied, and the runtime support needed to stabilize this has not yet matured. That's why this scenario uses W4A16 — only the weights are brought down to FP4.

SVDQuant, developed by MIT HAN Lab, absorbs the outlier components of weights into a low-rank matrix via SVD before applying FP4. Once outliers are absorbed into the low-rank matrix, the remaining weights have a narrower value range, making them coverable by FP4's limited representational range. This approach achieved a PSNR of 21.5 on FLUX.1-dev. A PSNR above 20 is generally considered visually acceptable, and 21.5 is a level where the result is difficult to distinguish from the original 16-bit model for most use cases.
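For readers who want to check a PSNR figure on their own outputs, here is a minimal pure-Python sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). It assumes 8-bit pixel values (MAX = 255) passed as flat sequences; the sample data is synthetic and the function name is illustrative.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Synthetic example: every pixel is off by 20 levels, so MSE = 400
reference = [0, 64, 128, 192, 255, 32, 96, 160]
offsets = [20, -20, 20, -20, -20, 20, -20, 20]
candidate = [p + d for p, d in zip(reference, offsets)]

score = psnr(reference, candidate)
print(f"PSNR = {score:.2f} dB")
```

In practice you would flatten the pixels of an image from the 16-bit pipeline and the same-seed image from the FP4 pipeline and pass both to this function.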

```python
import torch
from diffusers import FluxPipeline
# nvfp4_weight_only is the name used in this article. torchao's NVFP4 support
# has lived under torchao.prototype in some releases, so check the import
# path against the version you install.
from torchao.quantization import quantize_, nvfp4_weight_only

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)

# W4A16: apply NVFP4 to weights only (activations remain bfloat16)
quantize_(pipe.transformer, nvfp4_weight_only())
pipe.to("cuda")

image = pipe(
    "a photo of a cat sitting on a GPU",
    num_inference_steps=20,
).images[0]
image.save("output.png")
# ~3× speed improvement over FP8 and 4× memory reduction on RTX 5090
```

```bash
# The minimums below are permissive; NVFP4 needs recent torch/torchao releases
pip install --upgrade "torch>=2.5" "torchao>=0.6.0" diffusers transformers
```

Caution: For I2V (Image-to-Video) tasks, FP4 precision may be insufficient to preserve fine details from the reference image, causing quality to fall below practical thresholds. T2V (Text-to-Video) is comparatively tolerant, but for I2V, maintaining FP8 or higher is recommended.


Pros and Cons Analysis

Now that we've seen enough numbers and code, let's return to the most important question in practice: "Is FP4 right for our team?" The tables below should help you decide.

Advantages

| Item | Figure | Note |
|---|---|---|
| Memory reduction | ~3.5× vs FP16 | Enables loading larger models into the same VRAM |
| Throughput gain | Up to 4× vs H100 FP8 | MLPerf v5.0 certified figure |
| Accuracy loss | <1% (PTQ baseline) | Based on general LLM benchmarks. Task-specific variance exists (see disadvantages table) |
| Energy efficiency | Up to 2× vs FP8 | Per NVIDIA announcement |
| Ecosystem | TensorRT-LLM, vLLM, TorchAO | SGLang support also planned |

Disadvantages and Caveats

Of these, the item encountered most often in practice is undoubtedly "hardware lock-in." Teams using RTX 4090 sometimes try FP4 only to find it slower — and this is exactly why.

| Item | Description | Mitigation |
|---|---|---|
| Hardware lock-in | Native execution requires Blackwell (SM 10.0+) | Use FP8 on Hopper/Ampere |
| Task-specific accuracy variance | Up to 8-point drop on code generation tasks vs INT4 baseline (different conditions than general benchmarks) | Keep sensitive layers at FP8 for code, math, and I2V |
| Calibration data required | A representative calibration dataset is essential | Use 512–1024 samples from real production queries |
| KV cache constraint | TRT-LLM FP4 KV cache requires offline ModelOpt | vLLM supports weights-only FP4 |
| Unfinished standards | Divergence between NVFP4 and MXFP4 formats | Cross-platform porting incurs conversion costs |
| Extreme VRAM constraints | Environments where even FP4 is too large | Consider AWQ INT4 (see below) |

Term: AWQ (Activation-aware Weight Quantization) — An INT4 quantization technique that preserves high precision for important weight channels by accounting for activation distributions. It works even without Blackwell and in extremely VRAM-constrained environments, and is broadly supported across the Transformers/llama.cpp ecosystems.

Term: MXFP4 — The FP4 variant of the MX (Microscaling) format led by Microsoft and Intel. It is not interchangeable with NVFP4 due to a different block-scaling approach (block size 32 vs. 16).

The Most Common Mistakes in Practice

  1. Attempting NVFP4 on a Hopper GPU (H100): It is handled via software emulation and ends up being slower than FP8. You can verify with torch.cuda.get_device_capability(): data-center Blackwell (B200) returns (10, 0), RTX 50-series cards return (12, 0), and Hopper (H100) returns (9, 0). Native FP4 execution requires (10, 0) or higher.

  2. Building the calibration dataset from random samples: Calibrating on data with a different distribution than what your actual service receives leads to unexpected accuracy drops in production. Use real log-based samples whenever possible.

  3. Applying FP4 uniformly to all layers: A mixed-precision strategy that keeps sensitive components like attention layers and MoE routing layers at FP8 can significantly reduce accuracy loss. Use ModelOpt's AutoQuantize or the nvfp4_experts_only setting to achieve this.
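For the second mistake, drawing calibration samples from real traffic can be as simple as the sketch below. The log format is an assumption: one JSON object per line with a "prompt" field. Both the field name and the helper are hypothetical, so map them onto whatever your serving stack actually records.

```python
import json
import random


def sample_calibration_prompts(log_lines, num_samples=512, seed=0):
    """Pick a reproducible random subset of unique prompts from JSON-lines logs."""
    prompts = []
    seen = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        prompt = json.loads(line).get("prompt")
        if prompt and prompt not in seen:  # skip records without a prompt, and duplicates
            seen.add(prompt)
            prompts.append(prompt)
    rng = random.Random(seed)  # fixed seed so reruns calibrate on the same data
    return rng.sample(prompts, min(num_samples, len(prompts)))


example_logs = [
    '{"prompt": "Summarize this contract"}',
    '{"prompt": "Write a SQL query for monthly revenue"}',
    '{"prompt": "Summarize this contract"}',
    "",
    '{"event": "healthcheck"}',
    '{"prompt": "Translate the release notes to Korean"}',
]
picked = sample_calibration_prompts(example_logs, num_samples=2)
print(picked)
```

Deduplicating before sampling keeps a handful of very frequent queries from dominating the calibration statistics.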


Closing Thoughts

Paired with Blackwell GPUs, FP4 has moved beyond "experimental technology" to the point where it deserves serious consideration for production workloads. That said, the right order is to first check your GPU architecture, task characteristics, and accuracy requirements. The decision flow below should help:

Blackwell GPU (RTX 50, B100/B200) + inference optimization  → Actively consider NVFP4
Hopper / Ampere GPU                                          → FP8 first
Code generation, math reasoning, I2V tasks                  → Keep sensitive layers at FP8, or use mixed precision
Extreme VRAM constraints (no Blackwell)                     → Consider AWQ INT4
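
The same flow can be written as a small function, which is handy in a deployment script. This is a sketch of the decision table above and nothing more: the function name, the boolean flags, and the returned strings are invented for this example.

```python
def recommend_precision(capability, sensitive_task=False, vram_constrained=False):
    """Map a CUDA compute capability and workload traits to a starting precision.

    capability: (major, minor) tuple as returned by torch.cuda.get_device_capability()
    sensitive_task: code generation, math reasoning, or I2V workloads
    vram_constrained: the model does not fit at FP8 on the available GPU
    """
    major, _minor = capability
    if major >= 10:  # Blackwell: native FP4 Tensor Cores
        if sensitive_task:
            return "NVFP4 with sensitive layers kept at FP8 (mixed precision)"
        return "NVFP4"
    if vram_constrained:
        return "AWQ INT4"
    return "FP8"


for cap, kwargs in [
    ((10, 0), {}),
    ((10, 0), {"sensitive_task": True}),
    ((9, 0), {}),
    ((8, 6), {"vram_constrained": True}),
]:
    print(cap, kwargs, "->", recommend_precision(cap, **kwargs))
```

On a machine with PyTorch and a CUDA device, pass torch.cuda.get_device_capability() as the first argument.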

Three steps you can start right now:

  1. Check your GPU compute capability first: Run python -c "import torch; print(torch.cuda.get_device_capability())". If the major version is 10 or higher, you have Blackwell and native FP4 execution is available (B200 reports (10, 0); RTX 50-series cards report (12, 0)). If it's (9, 0), you have Hopper (H100) and FP8 is the better choice.

  2. Get a feel for the speed with pre-quantized models from Hugging Face: Download a Llama 3 NVFP4 checkpoint from the nvidia/ namespace and serve it with vLLM. You can immediately see the throughput difference without doing any quantization work yourself.

  3. If you need to quantize yourself, start with llm-compressor: In the example code above, just swap in your model path and calibration dataset. You can see first results with as few as 512 calibration samples.


References

  • Introducing NVFP4 for Efficient and Accurate Low-Precision Inference | NVIDIA Technical Blog
  • Scaling NVFP4 Inference for FLUX.2 on NVIDIA Blackwell Data Center GPUs | NVIDIA Technical Blog
  • NVIDIA TensorRT Unlocks FP4 Image Generation for Blackwell GeForce RTX 50 Series | NVIDIA Technical Blog
  • FP4 Quantization on Blackwell GPUs: Throughput, Cost, and When It's Worth It | Spheron Blog
  • Nvidia publishes first Blackwell B200 MLPerf results: Up to 4X faster than H100 using FP4 | Tom's Hardware
  • fp4 Quantization with NVFP4 - LLM Compressor Docs | vLLM
  • NVFP4 Quantization | DGX Spark | NVIDIA Build
  • Faster Diffusion on Blackwell: MXFP8 and NVFP4 with Diffusers and TorchAO | PyTorch Blog
  • SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX | MIT HAN Lab
  • Testing NVFP4 Quantization on RTX 5090: Quality Gap Between T2V and I2V | Zenn
  • FP4 All the Way: Fully Quantized Training of LLMs | arXiv
  • Microbenchmarking NVIDIA's Blackwell Architecture | arXiv
  • NVIDIA Model Optimizer GitHub Repository
  • FP4 vs FP8 vs FP16 for LLM Inference: Which Precision Should You Use? | VRLA Tech
  • Diagnosing FP4 inference: layer-wise and block-wise sensitivity analysis | arXiv
#FP4Quantization #BlackwellGPU #vLLM #llm-compressor #TorchAO #Quantization #LLMInferenceOptimization #MoE #PTQ #W4A4