FP4 Quantization + Blackwell GPU: Conditions for 4× Throughput over H100 and When Not to Use It

`RTX 5090 / B200 / vLLM FP4 / llm-compressor Results*

When I first got my hands on the RTX 5090, I honestly thought, "Just another new GPU." But after seeing the FP4 quantization numbers, I changed my mind. In the MLPerf v5.0 official benchmark, a single B200 delivered 10,755 tokens per second on Llama 2 70B inference, and a DGX B200 8-GPU system recorded 3.1× the throughput of a DGX H200. The key point is that these are certified benchmark figures, not marketing numbers. And right at that moment, I started wondering, "Does this apply to my models too?" — so I ran the code myself to find out.

This article covers three things:

Why FP4 is a meaningful technology right now, and how it ties into the Blackwell architecture
Three scenarios validated with real code — LLM serving, MoE models, and image generation
When you should NOT use FP4 — accuracy loss figures by task and mitigation strategies

Rather than vague warnings like "quantization has accuracy loss, so be careful," this article goes all the way to showing how much loss appears in which tasks. FP4 only makes sense when paired with a Blackwell GPU, and knowing exactly those conditions and exceptions is the first step to actually using this technology. If your team is feeling the weight of GPU server costs, I'd recommend reading to the end.

Core Concepts

Why FP4 Differs from INT4: Why Nonlinear Representation Is Advantageous for LLM Weight Distributions

I initially thought, "What's the difference between two 4-bit formats?" The short answer is that the representation scheme itself is different. INT4 can only represent integer grid points in the range −8 to 7. NVFP4, by contrast, is a floating-point format with an exponent field, so it represents values at nonlinear intervals. Why does this matter? LLM weight distributions have most values clustered near zero, with rare extreme outliers mixed in. INT4 faces a dilemma: if you widen the scale to accommodate those outliers, precision near zero degrades. FP4 mitigates this problem to a meaningful degree thanks to its nonlinear spacing.

A brief breakdown of NVFP4's internal structure:

Component	Description
Per-block FP8 scale	One scale factor shared across every 16 values
Per-tensor FP32 scale	An additional scale applied across the entire tensor
Tensor Core accumulation	FP4 operations accumulated in FP16 to minimize error

Looking at this table naturally raises two questions.

Why accumulate in FP16 instead of FP32? FP4 operation results already have a limited value range, making an FP32 accumulator overkill. Accumulating in FP16 cuts memory bandwidth in half while still providing sufficient precision for LLM inference. It's a deliberate design choice favoring bandwidth reduction over precision.

Why is the memory ratio 0.29× instead of 0.25×? A naive bit ratio (4/16) would give 0.25×, but NVFP4 has block-scale overhead. 16 values (64 bits) + 1 FP8 scale (8 bits) = 72 bits. Storing the same 16 values in FP16 would require 256 bits. Therefore 72 ÷ 256 ≈ 0.28×, which rounds to 0.29×. INT4 has no such scale overhead and achieves 0.25×, but sacrifices that level of precision control in exchange.

For reference, MXFP4 (the format led by Microsoft/Intel) takes a block size of 32 to reduce overhead. The block-16 of NVFP4 vs. block-32 of MXFP4 reflects a design tradeoff between precision and compression efficiency. If you're using NVIDIA GPUs, NVFP4 is the go-to; if cross-platform compatibility matters, it's worth tracking MXFP4 standardization trends as well.

Term: PTQ (Post-Training Quantization) — A technique that quantizes a trained model without retraining. It can be applied quickly but requires a calibration dataset and may yield lower accuracy than quantization-aware training (QAT).

W4A4: Why Throughput Jumps 4×

What's important is that FP4 doesn't just compress weights. W4A4 reduces both weights and activations to FP4. Compared to W4A16 or W4A8, which compress only the weights, W4A4 simultaneously reduces memory bandwidth and computation, leading to far greater throughput gains. The prerequisite for this combination is that Blackwell's 5th-generation Tensor Cores handle FP4 operations natively.

Key point: FP4's speed advantage does not appear on pre-Blackwell GPUs. On Hopper (H100) or Ampere, it is handled via software emulation, eliminating or even reversing the speed benefit. Always verify the GPU architecture first.

Precision Format Comparison at a Glance

Format	Bits	Memory (vs FP16)	Blackwell Native
FP16	16	1×	✓
FP8	8	0.5×	✓
NVFP4	4	0.29× (including block scale)	✓ (exclusive)
INT4	4	0.25×	△ (emulated)

In this table, the last two rows are the most practically significant. NVFP4 compresses slightly less than INT4 at 0.29×, but runs natively on Blackwell — and that architectural difference is the premise for all the code examples that follow.

Practical Application

Example 1: LLM Quantization — Llama 3 FP4 Serving with llm-compressor + vLLM

If you're already in the vLLM ecosystem, you can build a W4A4 FP4 quantization pipeline with llm-compressor. The code is surprisingly simple once you have a calibration dataset ready. The first time I did it, I thought, "Is this really all there is?" — the boilerplate is minimal.

python

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
 
# Prepare calibration dataset (512 samples recommended)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",          # Apply W4A4 FP4
    ignore=["lm_head"],      # Keep output layer in FP16
)
 
oneshot(
    model="meta-llama/Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
    output_dir="./llama3-nvfp4",
)

bash

# Install FP4-supported version (as of May 2025)
pip install "llmcompressor>=0.3.0" "vllm>=0.4.2"
 
# Serve the quantized model with vLLM
vllm serve ./llama3-nvfp4 \
  --quantization nvfp4 \
  --dtype float16
  # ~2,400 tok/s on RTX 5090 (approximately 2.8× improvement over FP16)

Parameter	Description
`scheme="NVFP4"`	Apply W4A4 FP4 quantization
`ignore=["lm_head"]`	Keep final classification layer in FP16 (stabilizes accuracy)
`num_calibration_samples`	512–1024 recommended. Fewer is faster but hurts accuracy

Version note: NVFP4 support is only stable from certain versions onward. If you install via pip install llmcompressor and hit FP4-related errors, suspect a version conflict first.

Example 2: Selective MoE Model Quantization — NVIDIA ModelOpt

Once you've done LLM quantization, it's natural to want to apply it to MoE models like Mixtral or DeepSeek-V3. But this is where I made a mistake early on. I pushed all layers to FP4 at once and ended up with broken routing behavior.

The routing layer in an MoE model is the core component that decides which expert receives each token. Applying FP4 to this layer destabilizes the routing decisions themselves, degrading overall output quality. The nvfp4_experts_only setting in ModelOpt is the workaround for this problem.

python

import modelopt.torch.quantization as mtq
from torch.utils.data import DataLoader
 
# forward_loop: passes calibration data through the model to collect value statistics
# used to determine quantization scales
def forward_loop(model):
    for batch in calibration_dataloader:  # a DataLoader prepared in advance
        model(**batch)
 
# Expert layers only: NVFP4; everything else stays FP8
quant_config = {
    "quant_cfg": {
        "*experts*": {"num_bits": (4, 3), "axis": None},  # NVFP4
        "*": {"num_bits": (8, 7), "axis": None},           # FP8
    }
}
 
# Use a predefined config (recommended)
model = mtq.quantize(model, mtq.NVFP4_EXPERTS_ONLY_CFG, forward_loop)
mtq.print_quant_summary(model)  # Check per-layer quantization results

bash

pip install "nvidia-modelopt[torch]>=0.17.0"

nvfp4_experts_only applies FP4 only to the expert layers, while keeping attention and non-expert layers at FP8. I think it's a practically well-balanced setting for the memory-versus-accuracy tradeoff.

Example 3: Image Generation — Optimizing the FLUX Model with TorchAO + NVFP4

This example requires a brief explanation of one important design choice up front. Why use W4A16 (weights-only FP4) instead of the W4A4 described earlier? Diffusion models currently don't fully support activation quantization (A4). Transformer-based diffusion architectures have activation patterns that differ from LLMs, causing severe image quality degradation when A4 is applied, and the runtime support needed to stabilize this has not yet matured. That's why this scenario uses W4A16 — only the weights are brought down to FP4.

SVDQuant, developed by MIT HAN Lab, absorbs the outlier components of weights into a low-rank matrix via SVD before applying FP4. Once outliers are absorbed into the low-rank matrix, the remaining weights have a narrower value range, making them coverable by FP4's limited representational range. This approach achieved a PSNR of 21.5 on FLUX.1-dev. A PSNR above 20 is generally considered visually acceptable, and 21.5 is a level where the result is difficult to distinguish from the original 16-bit model for most use cases.

python

from torchao.quantization import quantize_, nvfp4_weight_only
from diffusers import FluxPipeline
import torch
 
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
 
# W4A16: apply NVFP4 to weights only (activations remain bfloat16)
quantize_(pipe.transformer, nvfp4_weight_only())
pipe.to("cuda")
 
image = pipe(
    "a photo of a cat sitting on a GPU",
    num_inference_steps=20,
).images[0]
image.save("output.png")
# ~3× speed improvement over FP8 and 4× memory reduction on RTX 5090

bash

pip install "torch>=2.5" "torchao>=0.6.0" diffusers transformers

Caution: For I2V (Image-to-Video) tasks, FP4 precision may be insufficient to preserve fine details from the reference image, causing quality to fall below practical thresholds. T2V (Text-to-Video) is comparatively tolerant, but for I2V, maintaining FP8 or higher is recommended.

Pros and Cons Analysis

Now that we've seen enough numbers and code, let's return to the most important question in practice: the table below should help you decide "Is FP4 right for our team?"

Advantages

Item	Figure	Note
Memory reduction	~3.5× vs FP16	Enables loading larger models into the same VRAM
Throughput gain	Up to 4× vs H100 FP8	MLPerf v5.0 certified figure
Accuracy loss	<1% (PTQ baseline)	Based on general LLM benchmarks. Task-specific variance exists (see disadvantages table)
Energy efficiency	Up to 2× vs FP8	Per NVIDIA announcement
Ecosystem	TensorRT-LLM, vLLM, TorchAO	SGLang support also planned

Disadvantages and Caveats

Of these, the item encountered most often in practice is undoubtedly "hardware lock-in." Teams using RTX 4090 sometimes try FP4 only to find it slower — and this is exactly why.

Item	Description	Mitigation
Hardware lock-in	Native execution requires Blackwell (SM 10.0+)	Use FP8 on Hopper/Ampere
Task-specific accuracy variance	Up to 8-point drop on code generation tasks vs INT4 baseline (different conditions than general benchmarks)	Keep sensitive layers at FP8 for code, math, and I2V
Calibration data required	A representative calibration dataset is essential	Use 512–1024 samples from real production queries
KV cache constraint	TRT-LLM FP4 KV cache requires offline ModelOpt	vLLM supports weights-only FP4
Unfinished standards	Divergence between NVFP4 and MXFP4 formats	Cross-platform porting incurs conversion costs
Extreme VRAM constraints	Environments where even FP4 is too large	Consider AWQ INT4 (see below)

Term: AWQ (Activation-aware Weight Quantization) — An INT4 quantization technique that preserves high precision for important weight channels by accounting for activation distributions. It works even without Blackwell and in extremely VRAM-constrained environments, and is broadly supported across the Transformers/llama.cpp ecosystems.

Term: MXFP4 — The FP4 variant of the MX (Microscaling) format led by Microsoft and Intel. It is not interchangeable with NVFP4 due to a different block-scaling approach (block size 32 vs. 16).

The Most Common Mistakes in Practice

Attempting NVFP4 on a Hopper GPU (H100) — It is handled via software emulation and ends up being slower than FP8. You can verify with torch.cuda.get_device_capability(): Blackwell returns (10, 0), Hopper (H100) returns (9, 0). Native FP4 execution requires (10, 0) or higher.
Building the calibration dataset from random samples — Calibrating on data with a different distribution than what your actual service receives leads to unexpected accuracy drops in production. Use real log-based samples whenever possible.
Applying FP4 uniformly to all layers — A mixed-precision strategy that keeps sensitive components like attention layers and MoE routing layers at FP8 can significantly reduce accuracy loss. Use ModelOpt's AutoQuantize or the nvfp4_experts_only setting to achieve this.

Closing Thoughts

Paired with Blackwell GPUs, FP4 has moved beyond "experimental technology" to the point where it deserves serious consideration for production workloads. That said, the right order is to first check your GPU architecture, task characteristics, and accuracy requirements. The decision flow below should help:

Blackwell GPU (RTX 50, B100/B200) + inference optimization  → Actively consider NVFP4
Hopper / Ampere GPU                                          → FP8 first
Code generation, math reasoning, I2V tasks                  → Keep sensitive layers at FP8, or use mixed precision
Extreme VRAM constraints (no Blackwell)                     → Consider AWQ INT4

Three steps you can start right now:

Check your GPU compute capability first — Run python -c "import torch; print(torch.cuda.get_device_capability())". If the result is (10, 0) or higher, you have Blackwell and native FP4 execution is available. If it's (9, 0), you have Hopper (H100) and FP8 is the better choice.
Get a feel for the speed with pre-quantized models from Hugging Face — Download a Llama 3 NVFP4 checkpoint from the nvidia/ namespace and serve it with vLLM. You can immediately see the throughput difference without doing any quantization work yourself.
If you need to quantize yourself, start with llm-compressor — In the example code above, just swap in your model path and calibration dataset. You can see first results with as few as 512 calibration samples.

References

#FP4양자화#BlackwellGPU#vLLM#llm-compressor#TorchAO#양자화#LLM추론최적화#MoE#PTQ#W4A4

FP4 Quantization + Blackwell GPU: Conditions for 4× Throughput over H100 and When Not to Use It

`RTX 5090 / B200 / vLLM FP4 / llm-compressor Results*

This article covers three things:

Why FP4 is a meaningful technology right now, and how it ties into the Blackwell architecture
Three scenarios validated with real code — LLM serving, MoE models, and image generation
When you should NOT use FP4 — accuracy loss figures by task and mitigation strategies

Core Concepts

Why FP4 Differs from INT4: Why Nonlinear Representation Is Advantageous for LLM Weight Distributions

A brief breakdown of NVFP4's internal structure:

Component	Description
Per-block FP8 scale	One scale factor shared across every 16 values
Per-tensor FP32 scale	An additional scale applied across the entire tensor
Tensor Core accumulation	FP4 operations accumulated in FP16 to minimize error

Looking at this table naturally raises two questions.

Term: PTQ (Post-Training Quantization) — A technique that quantizes a trained model without retraining. It can be applied quickly but requires a calibration dataset and may yield lower accuracy than quantization-aware training (QAT).

W4A4: Why Throughput Jumps 4×

Key point: FP4's speed advantage does not appear on pre-Blackwell GPUs. On Hopper (H100) or Ampere, it is handled via software emulation, eliminating or even reversing the speed benefit. Always verify the GPU architecture first.

Precision Format Comparison at a Glance

Format	Bits	Memory (vs FP16)	Blackwell Native
FP16	16	1×	✓
FP8	8	0.5×	✓
NVFP4	4	0.29× (including block scale)	✓ (exclusive)
INT4	4	0.25×	△ (emulated)

Practical Application

Example 1: LLM Quantization — Llama 3 FP4 Serving with llm-compressor + vLLM

python

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
 
# Prepare calibration dataset (512 samples recommended)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",          # Apply W4A4 FP4
    ignore=["lm_head"],      # Keep output layer in FP16
)
 
oneshot(
    model="meta-llama/Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
    output_dir="./llama3-nvfp4",
)

bash

# Install FP4-supported version (as of May 2025)
pip install "llmcompressor>=0.3.0" "vllm>=0.4.2"
 
# Serve the quantized model with vLLM
vllm serve ./llama3-nvfp4 \
  --quantization nvfp4 \
  --dtype float16
  # ~2,400 tok/s on RTX 5090 (approximately 2.8× improvement over FP16)

Parameter	Description
`scheme="NVFP4"`	Apply W4A4 FP4 quantization
`ignore=["lm_head"]`	Keep final classification layer in FP16 (stabilizes accuracy)
`num_calibration_samples`	512–1024 recommended. Fewer is faster but hurts accuracy

Version note: NVFP4 support is only stable from certain versions onward. If you install via pip install llmcompressor and hit FP4-related errors, suspect a version conflict first.

Example 2: Selective MoE Model Quantization — NVIDIA ModelOpt

python

import modelopt.torch.quantization as mtq
from torch.utils.data import DataLoader
 
# forward_loop: passes calibration data through the model to collect value statistics
# used to determine quantization scales
def forward_loop(model):
    for batch in calibration_dataloader:  # a DataLoader prepared in advance
        model(**batch)
 
# Expert layers only: NVFP4; everything else stays FP8
quant_config = {
    "quant_cfg": {
        "*experts*": {"num_bits": (4, 3), "axis": None},  # NVFP4
        "*": {"num_bits": (8, 7), "axis": None},           # FP8
    }
}
 
# Use a predefined config (recommended)
model = mtq.quantize(model, mtq.NVFP4_EXPERTS_ONLY_CFG, forward_loop)
mtq.print_quant_summary(model)  # Check per-layer quantization results

bash

pip install "nvidia-modelopt[torch]>=0.17.0"

Example 3: Image Generation — Optimizing the FLUX Model with TorchAO + NVFP4

python

from torchao.quantization import quantize_, nvfp4_weight_only
from diffusers import FluxPipeline
import torch
 
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
 
# W4A16: apply NVFP4 to weights only (activations remain bfloat16)
quantize_(pipe.transformer, nvfp4_weight_only())
pipe.to("cuda")
 
image = pipe(
    "a photo of a cat sitting on a GPU",
    num_inference_steps=20,
).images[0]
image.save("output.png")
# ~3× speed improvement over FP8 and 4× memory reduction on RTX 5090

bash

pip install "torch>=2.5" "torchao>=0.6.0" diffusers transformers

Caution: For I2V (Image-to-Video) tasks, FP4 precision may be insufficient to preserve fine details from the reference image, causing quality to fall below practical thresholds. T2V (Text-to-Video) is comparatively tolerant, but for I2V, maintaining FP8 or higher is recommended.

Pros and Cons Analysis

Now that we've seen enough numbers and code, let's return to the most important question in practice: the table below should help you decide "Is FP4 right for our team?"

Advantages

Item	Figure	Note
Memory reduction	~3.5× vs FP16	Enables loading larger models into the same VRAM
Throughput gain	Up to 4× vs H100 FP8	MLPerf v5.0 certified figure
Accuracy loss	<1% (PTQ baseline)	Based on general LLM benchmarks. Task-specific variance exists (see disadvantages table)
Energy efficiency	Up to 2× vs FP8	Per NVIDIA announcement
Ecosystem	TensorRT-LLM, vLLM, TorchAO	SGLang support also planned

Disadvantages and Caveats

Of these, the item encountered most often in practice is undoubtedly "hardware lock-in." Teams using RTX 4090 sometimes try FP4 only to find it slower — and this is exactly why.

Item	Description	Mitigation
Hardware lock-in	Native execution requires Blackwell (SM 10.0+)	Use FP8 on Hopper/Ampere
Task-specific accuracy variance	Up to 8-point drop on code generation tasks vs INT4 baseline (different conditions than general benchmarks)	Keep sensitive layers at FP8 for code, math, and I2V
Calibration data required	A representative calibration dataset is essential	Use 512–1024 samples from real production queries
KV cache constraint	TRT-LLM FP4 KV cache requires offline ModelOpt	vLLM supports weights-only FP4
Unfinished standards	Divergence between NVFP4 and MXFP4 formats	Cross-platform porting incurs conversion costs
Extreme VRAM constraints	Environments where even FP4 is too large	Consider AWQ INT4 (see below)

Term: AWQ (Activation-aware Weight Quantization) — An INT4 quantization technique that preserves high precision for important weight channels by accounting for activation distributions. It works even without Blackwell and in extremely VRAM-constrained environments, and is broadly supported across the Transformers/llama.cpp ecosystems.

Term: MXFP4 — The FP4 variant of the MX (Microscaling) format led by Microsoft and Intel. It is not interchangeable with NVFP4 due to a different block-scaling approach (block size 32 vs. 16).

The Most Common Mistakes in Practice

Attempting NVFP4 on a Hopper GPU (H100) — It is handled via software emulation and ends up being slower than FP8. You can verify with torch.cuda.get_device_capability(): Blackwell returns (10, 0), Hopper (H100) returns (9, 0). Native FP4 execution requires (10, 0) or higher.
Building the calibration dataset from random samples — Calibrating on data with a different distribution than what your actual service receives leads to unexpected accuracy drops in production. Use real log-based samples whenever possible.
Applying FP4 uniformly to all layers — A mixed-precision strategy that keeps sensitive components like attention layers and MoE routing layers at FP8 can significantly reduce accuracy loss. Use ModelOpt's AutoQuantize or the nvfp4_experts_only setting to achieve this.

Closing Thoughts

Blackwell GPU (RTX 50, B100/B200) + inference optimization  → Actively consider NVFP4
Hopper / Ampere GPU                                          → FP8 first
Code generation, math reasoning, I2V tasks                  → Keep sensitive layers at FP8, or use mixed precision
Extreme VRAM constraints (no Blackwell)                     → Consider AWQ INT4

Three steps you can start right now:

Check your GPU compute capability first — Run python -c "import torch; print(torch.cuda.get_device_capability())". If the result is (10, 0) or higher, you have Blackwell and native FP4 execution is available. If it's (9, 0), you have Hopper (H100) and FP8 is the better choice.
Get a feel for the speed with pre-quantized models from Hugging Face — Download a Llama 3 NVFP4 checkpoint from the nvidia/ namespace and serve it with vLLM. You can immediately see the throughput difference without doing any quantization work yourself.
If you need to quantize yourself, start with llm-compressor — In the example code above, just swap in your model path and calibration dataset. You can see first results with as few as 512 calibration samples.

References

#FP4양자화#BlackwellGPU#vLLM#llm-compressor#TorchAO#양자화#LLM추론최적화#MoE#PTQ#W4A4

Core Concepts

Why FP4 Differs from INT4: Why Nonlinear Representation Is Advantageous for LLM Weight Distributions

W4A4: Why Throughput Jumps 4×

Precision Format Comparison at a Glance

Practical Application

Example 1: LLM Quantization — Llama 3 FP4 Serving with llm-compressor + vLLM

Example 2: Selective MoE Model Quantization — NVIDIA ModelOpt

Example 3: Image Generation — Optimizing the FLUX Model with TorchAO + NVFP4

Pros and Cons Analysis

Advantages

Disadvantages and Caveats

The Most Common Mistakes in Practice

Closing Thoughts

References

Core Concepts

Why FP4 Differs from INT4: Why Nonlinear Representation Is Advantageous for LLM Weight Distributions

W4A4: Why Throughput Jumps 4×

Precision Format Comparison at a Glance

Practical Application

Example 1: LLM Quantization — Llama 3 FP4 Serving with llm-compressor + vLLM

Example 2: Selective MoE Model Quantization — NVIDIA ModelOpt

Example 3: Image Generation — Optimizing the FLUX Model with TorchAO + NVFP4

Pros and Cons Analysis

Advantages

Disadvantages and Caveats

The Most Common Mistakes in Practice

Closing Thoughts

References

Recommended Posts

Why 88% of AI Agents Fail in Production: The 5-Layer Harness Architecture Is the Answer

LangGraph Supervisor Pattern: How to Stay in Control in a Multi-Agent System

Comparing Long-Term Memory for AI Agents: Mem0 vs Letta vs Zep — Three Philosophies and How to Choose

XGrammar-2: The Design Principles Behind 80x Faster Structured Output

Why Serving DeepSeek-V3 on 96 H100s Is Possible: SGLang Expert Parallelism's Communication Optimization and Memory Fragmentation Solutions

SGLang EPD Disaggregation and Pipeline Parallelism — An Architecture That Splits Vision-Language Model Serving into 3 Stages to Reduce TTFT by Up to 8x