
# Quantization
## Quantization in TensorRT LLM
Quantization is a technique used to reduce memory footprint and computational cost by converting the model's weights and/or activations from high-precision floating-point numbers (like BF16) to lower-precision data types, such as INT8, FP8, or FP4.
TensorRT LLM offers a variety of quantization recipes to optimize LLM inference. These recipes can be broadly categorized as follows:
* FP4
* FP8 Per Tensor
* FP8 Block Scaling
* FP8 Rowwise
* FP8 KV Cache
* NVFP4 KV Cache
* W4A16 GPTQ
* W4A8 GPTQ
* W4A16 AWQ
* W4A8 AWQ
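As a rough illustration of the memory savings, the back-of-the-envelope sketch below (plain arithmetic over an assumed 8B-parameter model, not a TensorRT LLM API) compares weight footprints at different precisions:
```python
# Back-of-the-envelope weight footprint for an 8B-parameter model.
# Scale factors and other quantization metadata are ignored for simplicity.
NUM_PARAMS = 8e9
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "INT8": 8, "FP4": 4}

for dtype, bits in BITS_PER_WEIGHT.items():
    gib = NUM_PARAMS * bits / 8 / 1024**3
    print(f"{dtype:>4}: ~{gib:.1f} GiB")  # BF16 ~14.9, FP8/INT8 ~7.5, FP4 ~3.7
```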
## Usage
The default PyTorch backend supports FP4 and FP8 quantization on the latest Blackwell and Hopper GPUs.
### Running Pre-quantized Models
TensorRT LLM can directly run [pre-quantized models](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) generated with the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer).
```python
from tensorrt_llm import LLM
llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
```
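A slightly fuller sketch, assuming the same pre-quantized FP8 checkpoint and using `SamplingParams` from the LLM API, shows how generated text is typically retrieved:
```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')

# Sampling settings are optional; library defaults are used if omitted.
sampling = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["Hello, my name is"], sampling):
    print(output.outputs[0].text)
```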
#### FP8 KV Cache
```{note}
TensorRT LLM allows you to enable the FP8 KV cache manually, even for checkpoints that do not have it enabled by default.
```
Here is an example of how to set the FP8 KV Cache option:
```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig
llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='fp8'))
llm.generate("Hello, my name is")
```
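To see why a lower-precision KV cache matters, the sketch below is a rough per-token size calculation with assumed, Llama-8B-like dimensions (32 layers, 8 KV heads, head dimension 128), not measured numbers:
```python
# Rough per-token KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
layers, kv_heads, head_dim = 32, 8, 128

for name, bytes_per_elem in [("BF16", 2), ("FP8", 1)]:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    print(f"{name}: {per_token / 1024:.0f} KiB per token")  # BF16 ~128 KiB, FP8 ~64 KiB
```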
#### NVFP4 KV Cache
Enabling the NVFP4 KV cache requires offline quantization with ModelOpt; see the Offline Quantization with ModelOpt section below for instructions.
Once quantization is done, enable the NVFP4 KV cache as follows:
```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig
llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='nvfp4'))
llm.generate("Hello, my name is")
```
### Offline Quantization with ModelOpt
If a pre-quantized model is not available on the [Hugging Face Hub](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4), you can quantize it offline using ModelOpt.
Follow this step-by-step guide to quantize a model:
```bash
git clone https://github.com/NVIDIA/Model-Optimizer.git
cd Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
```
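Once the script finishes, the exported checkpoint directory can be passed to the LLM API like any other model path. The path below is a placeholder; the actual export location is reported by the script:
```python
from tensorrt_llm import LLM

# '/path/to/exported_fp8_checkpoint' is a placeholder for the directory
# produced by huggingface_example.sh.
llm = LLM(model='/path/to/exported_fp8_checkpoint')
llm.generate("Hello, my name is")
```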
#### NVFP4 KV Cache
To generate the checkpoint for NVFP4 KV cache:
```bash
git clone https://github.com/NVIDIA/Model-Optimizer.git
cd Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --kv_cache_quant nvfp4
```
Note that TensorRT LLM currently supports only FP8 weight/activation quantization when the NVFP4 KV cache is enabled, so `--quant fp8` is required here.
## Model Support Matrix
| Model | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
| :------------- | :---: | :---: | :---: | :---: | :---: | :---: |:---:| :-------: | :-------: | :--------: | :--------: |
| BERT | . | . | . | . | . | Y | . | . | . | . | . |
| DeepSeek-R1 | Y | . | . | Y | . | Y | . | . | . | . | . |
| EXAONE | . | . | Y | . | . | Y | . | Y | Y | . | . |
| Gemma 3 | . | . | Y | . | . | Y | . | Y | Y | . | . |
| GPT-OSS | . | Y | . | . | . | Y | . | . | . | . | . |
| LLaMA | Y | . | Y | . | . | Y | . | . | Y | . | Y |
| LLaMA-v2 | Y | . | Y | . | . | Y | Y | Y | Y | . | Y |
| LLaMA 3 | . | . | . | . | Y | Y | Y | Y | . | . | . |
| LLaMA 4 | Y | . | Y | . | . | Y | . | . | . | . | . |
| Mistral | . | . | Y | . | . | Y | . | . | Y | . | . |
| Mixtral | Y | . | Y | . | . | Y | . | . | . | . | . |
| Phi | . | . | . | . | . | Y | . | Y | . | . | . |
| Qwen | . | . | . | . | . | Y | . | Y | Y | . | Y |
| Qwen-2/2.5 | Y | . | Y | . | . | Y | . | Y | Y | . | Y |
| Qwen-3 | Y | . | Y | . | . | Y | Y | . | Y | . | Y |
| BLIP2-OPT | . | . | . | . | . | Y | . | . | . | . | . |
| BLIP2-T5 | . | . | . | . | . | Y | . | . | . | . | . |
| LLaVA | . | . | Y | . | . | Y | . | . | Y | . | Y |
| VILA | . | . | Y | . | . | Y | . | . | Y | . | Y |
| Nougat | . | . | . | . | . | Y | . | . | . | . | . |
```{note}
The vision component of multi-modal models (BLIP2-OPT/BLIP2-T5/LLaVA/VILA/Nougat) uses FP16 by default.
The quantization methods supported by a given multi-modal model are determined by its language component.
```
## Hardware Support Matrix
| GPU Architecture | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
| :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
| Blackwell (SM120) | Y | Y | Y | . | . | Y | . | . | . | . | . |
| Blackwell (SM100/103) | Y | Y | Y | Y | . | Y | Y | Y | Y | Y | Y |
| Hopper | . | . | Y | Y | Y | Y | . | Y | Y | Y | Y |
| Ada Lovelace | . | . | Y | . | . | Y | . | Y | Y | Y | Y |
| Ampere | . | . | . | . | . | Y | . | . | Y | . | Y |
```{note}
The FP8 block-wise scaling GEMM kernels for SM100/103 use the MXFP8 recipe (E4M3 activations/weights with UE8M0 scales), which differs slightly from the SM90 FP8 recipe (E4M3 activations/weights with FP32 scales).
```
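When it is unclear which row of the table applies, the compute capability of the local GPU can be queried with PyTorch. The sketch below is a small helper under the assumption that PyTorch is installed; the SM-to-architecture mapping is taken from the table above:
```python
import torch

# Map the device's compute capability to the architecture names used in the table above.
major, minor = torch.cuda.get_device_capability()
sm = major * 10 + minor
arch = {
    80: "Ampere", 86: "Ampere", 89: "Ada Lovelace", 90: "Hopper",
    100: "Blackwell (SM100)", 103: "Blackwell (SM103)", 120: "Blackwell (SM120)",
}.get(sm, f"unknown (SM{sm})")
print(f"Compute capability {major}.{minor} -> {arch}")
```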
## Quick Links
- [Pre-quantized Models by ModelOpt](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)
- [ModelOpt Support Matrix](https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html)