# TensorRT-LLM Quantization Toolkit Installation Guide

## Introduction

This document introduces:

- The steps to install the TensorRT-LLM quantization toolkit.
- The Python APIs to quantize the models.

The detailed LLM quantization recipes are distributed in the README.md of the corresponding model examples.

## Installation

The NVIDIA TensorRT Model Optimizer quantization toolkit is installed automatically as a dependency of TensorRT-LLM.

```bash
# Install the additional requirements
cd examples/quantization
pip install -r requirements.txt
```

## Usage
```bash
# FP8 quantization.
python quantize.py --model_dir $MODEL_PATH --qformat fp8 --kv_cache_dtype fp8 --output_dir $OUTPUT_PATH

# INT4_AWQ tp4 quantization.
python quantize.py --model_dir $MODEL_PATH --qformat int4_awq --awq_block_size 64 --tp_size 4 --output_dir $OUTPUT_PATH

# INT8 SQ with INT8 kv cache.
python quantize.py --model_dir $MODEL_PATH --qformat int8_sq --kv_cache_dtype int8 --output_dir $OUTPUT_PATH

# FP8 quantization for NeMo model.
python quantize.py --nemo_ckpt_path nemotron-3-8b-base-4k/Nemotron-3-8B-Base-4k.nemo \
                   --dtype bfloat16 \
                   --batch_size 64 \
                   --qformat fp8 \
                   --output_dir nemotron-3-8b/trt_ckpt/fp8/1-gpu

# FP8 quantization for Medusa model.
python quantize.py --model_dir $MODEL_PATH \
                   --dtype float16 \
                   --qformat fp8 \
                   --kv_cache_dtype fp8 \
                   --output_dir $OUTPUT_PATH \
                   --calib_size 512 \
                   --tp_size 1 \
                   --medusa_model_dir /path/to/medusa_head/ \
                   --num_medusa_heads 4
```

The checkpoint saved in `output_dir` can be passed directly to `trtllm-build`.
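For example, a minimal engine build from the quantized checkpoint could look like the sketch below; `$ENGINE_PATH` is a placeholder, and additional `trtllm-build` options may be required depending on the model and the TensorRT-LLM version:

```bash
# Sketch: build a TensorRT-LLM engine from the quantized checkpoint.
trtllm-build --checkpoint_dir $OUTPUT_PATH \
             --output_dir $ENGINE_PATH
```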
### Quantization Arguments

- model_dir: Hugging Face model path.
- qformat: Specify the quantization algorithm applied to the checkpoint.
  - fp8: Weights are quantized to FP8 tensor-wise. Activation ranges are calibrated tensor-wise.
  - int8_sq: Weights are smoothed and quantized to INT8 channel-wise. Activation ranges are calibrated tensor-wise.
  - int4_awq: Weights are re-scaled and quantized to INT4 block-wise. The block size is specified by `awq_block_size`.
  - w4a8_awq: Weights are re-scaled and quantized to INT4 block-wise. The block size is specified by `awq_block_size`. Activation ranges are calibrated tensor-wise.
  - int8_wo: No quantization is applied at this stage; weights are quantized to INT8 channel-wise when TensorRT-LLM builds the engine.
  - int4_wo: Same as int8_wo but in INT4.
  - full_prec: No quantization.
- output_dir: Path to save the quantized checkpoint.
- dtype: Specify the data type of the model when loading from Hugging Face.
- kv_cache_dtype: Specify the KV cache data type.
  - int8: Use INT8 KV cache.
  - fp8: Use FP8 KV cache.
  - None (default): Use the model dtype for the KV cache.
- batch_size: Batch size for calibration. Default is 1.
- calib_size: Number of calibration samples. Default is 512.
- calib_max_seq_length: Max sequence length of the calibration samples. Default is 512.
- tp_size: The checkpoint is tensor-parallelized across tp_size ranks. Default is 1.
- pp_size: The checkpoint is pipeline-parallelized across pp_size ranks. Default is 1.
- awq_block_size: AWQ-specific parameter that indicates the block size used when quantizing weights. 64 and 128 are supported by TensorRT-LLM.
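These arguments can be combined freely. As an illustrative sketch (the paths are placeholders and the calibration settings are only examples), a W4A8 AWQ export with explicit calibration settings might look like:

```bash
# Sketch: W4A8 AWQ quantization with explicit calibration settings and 2-way TP.
python quantize.py --model_dir $MODEL_PATH \
                   --qformat w4a8_awq \
                   --awq_block_size 128 \
                   --batch_size 8 \
                   --calib_size 256 \
                   --calib_max_seq_length 1024 \
                   --tp_size 2 \
                   --output_dir $OUTPUT_PATH
```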
#### NeMo model specific arguments

- nemo_ckpt_path: NeMo checkpoint path.
- calib_tp_size: TP size for NeMo checkpoint calibration.
- calib_pp_size: PP size for NeMo checkpoint calibration.
#### Medusa specific arguments

- medusa_model_dir: Model path of the Medusa heads.
- quant_medusa_head: Whether to quantize the weights of the Medusa heads.
- num_medusa_heads: Number of Medusa heads.
- num_medusa_layers: Number of Medusa layers.
- max_draft_len: Max length of the draft tokens.
- medusa_hidden_act: Activation function of the Medusa heads.
### Building Arguments

There are several arguments at the engine-building stage that are related to quantization.

- use_fp8_context_fmha: A Hopper-only feature. Use FP8 GEMMs to compute the attention operation:

```python
# Pseudocode for FP8 context FMHA: Q/K/V use a scale of 1.0, the softmax
# output is quantized to FP8 with scale 1.0, and the attention output is
# re-scaled into FP8 by output_scale.
qkv_scale = 1.0
FP_O = quantize(softmax(FP8_Q * FP8_K), scale=1.0) * FP8_V
FP8_O = FP_O * output_scale
```
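At build time this is passed as a `trtllm-build` flag. A minimal sketch, assuming an FP8 checkpoint and noting that the exact flag spelling and companion options may differ between TensorRT-LLM versions:

```bash
# Sketch: enable FP8 context FMHA when building from an FP8 checkpoint.
trtllm-build --checkpoint_dir $OUTPUT_PATH \
             --output_dir $ENGINE_PATH \
             --use_fp8_context_fmha enable
```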
### Checkpoint Conversion Arguments (not supported by all models)

- FP8
  - use_fp8_rowwise: Enable FP8 per-token per-channel quantization for linear layers. (FP8 from `quantize.py` is per-tensor.)
- INT8
  - smoothquant: Enable INT8 quantization for linear layers. Sets the α parameter (see https://arxiv.org/pdf/2211.10438.pdf) used to SmoothQuant the model and output INT8 weights. A good first try is 0.5; the value must be in [0, 1].
  - per_channel: Use per-channel quantization for weights when `smoothquant` is enabled.
  - per_token: Use per-token quantization for activations when `smoothquant` is enabled.
- Weight-Only
  - use_weight_only: Weights are quantized to INT4 or INT8 channel-wise.
  - weight_only_precision: Indicates `int4` or `int8` when `use_weight_only` is enabled, or `int4_gptq` when `quant_ckpt_path` is provided, which means the checkpoint comes from GPTQ.
  - quant_ckpt_path: Path of a GPTQ quantized model checkpoint in `.safetensors` format.
  - group_size: Group size used in GPTQ quantization.
  - per_group: Should be enabled when loading from a GPTQ checkpoint.
- KV Cache
  - int8_kv_cache: By default, the model dtype is used for the KV cache; int8_kv_cache selects INT8 quantization for the KV cache.
  - fp8_kv_cache: By default, the model dtype is used for the KV cache; fp8_kv_cache selects FP8 quantization for the KV cache.
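These flags live in the per-model `convert_checkpoint.py` scripts rather than in `quantize.py`. A minimal sketch, assuming a LLaMA-style example directory (`examples/llama/convert_checkpoint.py`) and a model that supports SmoothQuant:

```bash
# Sketch: INT8 SmoothQuant conversion with per-channel weights and per-token activations.
python examples/llama/convert_checkpoint.py --model_dir $MODEL_PATH \
                                            --smoothquant 0.5 \
                                            --per_channel \
                                            --per_token \
                                            --output_dir $OUTPUT_PATH
```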
## APIs
[`quantize.py`](./quantize.py) uses the quantization toolkit to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format). The checkpoints can be used directly by the `trtllm-build` command to build TensorRT-LLM engines. See this [`doc`](../../docs/source/architecture/checkpoint.md) for more details on the TensorRT-LLM checkpoint format.
> *This quantization step may take a long time to finish and requires large GPU memory. Please use a server-grade GPU if a GPU out-of-memory error occurs.*

> *If the model is trained with tensor parallelism across multiple GPUs, the PTQ calibration process requires the same number of GPUs as used in training.*
### PTQ (Post Training Quantization)
PTQ can be achieved with simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model.
```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM
import modelopt.torch.quantization as atq

model = AutoModelForCausalLM.from_pretrained(...)

# Select the quantization config, for example, FP8
config = atq.FP8_DEFAULT_CFG

# Prepare the calibration set and define a forward loop
calib_dataloader = DataLoader(...)

def calibrate_loop():
    for data in calib_dataloader:
        model(data)

# PTQ with in-place replacement of quantized modules
with torch.no_grad():
    atq.quantize(model, config, forward_loop=calibrate_loop)
```
### Export Quantized Model
After the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes

- One json file recording the model structure and metadata, and
- One or several rank weight files storing quantized model weights and scaling factors.

The export API is:
```python
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gptj, llama or gptnext.
        dtype,  # The exported weights data type as torch.dtype.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel=tp_size,  # The tensor parallelism size for inference.
        inference_pipeline_parallel=pp_size,  # The pipeline parallelism size for inference.
    )
```
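After the export completes, `export_dir` typically contains a single config file plus one weight file per rank. The listing below is an illustrative sketch, assuming two ranks and a shell variable `$EXPORT_DIR` pointing at the export directory:

```bash
# Illustrative layout of an exported checkpoint directory (two ranks).
ls $EXPORT_DIR
# config.json  rank0.safetensors  rank1.safetensors
```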