# TensorRT-LLM Quantization Toolkit Installation Guide

## Introduction

This document introduces:

- The steps to install the TensorRT-LLM quantization toolkit.
- The Python APIs to quantize the models.

The detailed LLM quantization recipes are distributed in the README.md of the corresponding model examples.

## Installation

1. If the dev environment is a docker container, please launch the docker container with the following flags:

    ```bash
    docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g -it <the docker image with TensorRT-LLM installed> bash
    ```

2. Install the quantization toolkit `modelopt` and the related dependencies on top of the TensorRT-LLM installation or docker image.

    ```bash
    # Install Modelopt
    pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt==0.9.3
    # Install the additional requirements
    cd <this example folder>
    pip install -r requirements.txt
    ```

## APIs

[`quantize.py`](./quantize.py) uses the quantization toolkit to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format). The checkpoints can be used directly by the `trtllm-build` command to build TensorRT-LLM engines (see the command-line sketch at the end of this document). See this [`doc`](../../docs/source/checkpoint.md) for more details on the TensorRT-LLM checkpoint format.

> *This quantization step may take a long time to finish and requires large GPU memory. Please use a server-grade GPU if a GPU out-of-memory error occurs.*

> *If the model was trained with tensor parallelism across multiple GPUs, the PTQ calibration process requires the same number of GPUs as were used for training.*

### PTQ (Post Training Quantization)

PTQ can be achieved with a simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM
import modelopt.torch.quantization as atq

model = AutoModelForCausalLM.from_pretrained(...)

# Select the quantization config, for example, FP8
config = atq.FP8_DEFAULT_CFG

# Prepare the calibration set and define a forward loop
calib_dataloader = DataLoader(...)

def calibrate_loop():
    for data in calib_dataloader:
        model(data)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
    atq.quantize(model, config, forward_loop=calibrate_loop)
```

### Export Quantized Model

After the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes

- One json file recording the model structure and metadata, and
- One or several rank weight files storing quantized model weights and scaling factors.

The export API is

```python
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gptj, llama or gptnext.
        dtype,  # The exported weights data type as torch.dtype.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel=tp_size,  # The tensor parallelism size for inference.
        inference_pipeline_parallel=pp_size,  # The pipeline parallelism size for inference.
    )
```
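
Putting the pieces together, the following is a minimal command-line sketch of calibrating and exporting a checkpoint with `quantize.py` and then building an engine with `trtllm-build`. The model path and the flag names and values shown here (for example `--model_dir`, `--qformat`, `--calib_size`, `--gemm_plugin`) are illustrative assumptions and may differ between TensorRT-LLM versions and models; please follow the model-specific README for the exact recipe.

```bash
# Sketch only: paths, flag names and values are assumptions and may differ
# between TensorRT-LLM versions; see the model-specific README for the exact recipe.

# 1. Calibrate the PyTorch model and export a quantized TensorRT-LLM checkpoint
#    (a .json config plus rank .safetensors weight files) into ./quantized_fp8.
python quantize.py \
    --model_dir ./llama-7b-hf \
    --dtype float16 \
    --qformat fp8 \
    --calib_size 512 \
    --output_dir ./quantized_fp8

# 2. Build a TensorRT-LLM engine directly from the exported checkpoint.
trtllm-build \
    --checkpoint_dir ./quantized_fp8 \
    --output_dir ./engine_fp8 \
    --gemm_plugin float16
```

The directory passed to `--output_dir` of `quantize.py` holds the config file and rank weight files described above, and `trtllm-build` consumes that directory directly.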