# TensorRT-LLM Quantization Toolkit Installation Guide

## Introduction

This document introduces:

- The steps to install the TensorRT-LLM quantization toolkit.
- The Python APIs to quantize the models.

The detailed LLM quantization recipes are provided in the README.md of the corresponding model examples.
## Installation

1. If the dev environment is a Docker container, please launch it with the following flags:

   ```bash
   docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g -it <the docker image with TensorRT-LLM installed> bash
   ```
2. Install the quantization toolkit `modelopt` and the related dependencies on top of the TensorRT-LLM installation or Docker image:

   ```bash
   # Install Modelopt
   pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt==0.9.3

   # Install the additional requirements
   cd <this example folder>
   pip install -r requirements.txt
   ```
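Optionally, as a quick sanity check, you can confirm that the quantization toolkit imports cleanly before moving on; the import below is the same one used by the APIs later in this guide:

```bash
# Optional sanity check: the import should complete without errors.
python -c "import modelopt.torch.quantization as atq; print('modelopt quantization toolkit is available')"
```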
## APIs

[`quantize.py`](./quantize.py) uses the quantization toolkit to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format). The checkpoints can be used directly by the `trtllm-build` command to build TensorRT-LLM engines. See this [`doc`](../../docs/source/checkpoint.md) for more details on the TensorRT-LLM checkpoint format.
> *This quantization step may take a long time to finish and requires a large amount of GPU memory. Please use a server-grade GPU if a GPU out-of-memory error occurs.*

> *If the model was trained with tensor parallelism across multiple GPUs, the PTQ calibration process requires the same number of GPUs as were used for training.*
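For reference, the command-line entry point wraps the APIs described below. A typical invocation looks like the sketch here, where the paths are placeholders and the exact flags and supported `--qformat` values vary across model examples and TensorRT-LLM releases, so consult the README.md of the specific model example for the authoritative arguments:

```bash
# Illustrative invocation only (FP8 as an example); check the model example README for the exact flags.
python quantize.py --model_dir <path to the Hugging Face model> \
                   --dtype float16 \
                   --qformat fp8 \
                   --output_dir <path to the exported TensorRT-LLM checkpoint>
```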
### PTQ (Post Training Quantization)

PTQ can be achieved with simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model.
```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

import modelopt.torch.quantization as atq

model = AutoModelForCausalLM.from_pretrained(...)

# Select the quantization config, for example, FP8
config = atq.FP8_DEFAULT_CFG

# Prepare the calibration set and define a forward loop
calib_dataloader = DataLoader(...)

def calibrate_loop():
    for data in calib_dataloader:
        model(data)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
    atq.quantize(model, config, forward_loop=calibrate_loop)
```
### Export Quantized Model

After the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes

- one JSON file recording the model structure and metadata, and
- one or several rank weight files storing quantized model weights and scaling factors.

The export API is:
```python
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gptj, llama or gptnext.
        dtype,  # The exported weights data type as torch.dtype.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel=tp_size,  # The tensor parallelism size for inference.
        inference_pipeline_parallel=pp_size,  # The pipeline parallelism size for inference.
    )
```
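Once the export finishes, `export_dir` holds the JSON config and rank weight files described above and can be passed to `trtllm-build` like any other TensorRT-LLM checkpoint. A minimal sketch, using placeholder paths and the default build options, is:

```bash
# Build a TensorRT-LLM engine from the exported quantized checkpoint.
# Paths are placeholders; see the model example README for recommended build flags.
trtllm-build --checkpoint_dir <path to the exported TensorRT-LLM checkpoint> \
             --output_dir <path to the output engine directory>
```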