# AMMO Installation Guide

## What's AMMO?

NVIDIA AMMO (AlgorithMic Model Optimization) is a model optimization toolkit used in TensorRT-LLM for quantization.

This document introduces:

- The steps to install AMMO.
- The Python APIs to quantize the models.

The detailed LLM quantization recipes are provided in the README.md of the corresponding model examples.
## Installation

1. If the development environment is a Docker container, launch the container with the following flags:

```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g -it <the docker image with TensorRT-LLM installed> bash
```

2. Install the quantization library `ammo` and the related dependencies on top of the TensorRT-LLM installation or Docker image.
```bash
# Obtain the CUDA version from the system. Assumes nvcc is available in PATH.
cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')

# Obtain the Python version from the system.
python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')

# Download and install the AMMO package from the DevZone.
wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
tar -xzf nvidia_ammo-0.3.0.tar.gz
pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl

# Install the additional requirements.
cd <this example folder>
pip install -r requirements.txt
```
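As a quick, optional sanity check after the install, the package should be importable from the same Python environment. A minimal sketch (the printed message is illustrative, not part of AMMO):

```python
# Verify that the AMMO wheel installed correctly.
# The module name `ammo` matches the imports used later in this guide.
import ammo.torch.quantization as atq

print("AMMO quantization module loaded:", atq.__name__)
```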
## APIs

[`ammo.py`](../../tensorrt_llm/models/quantized/ammo.py) uses AMMO to calibrate the PyTorch models and to generate a model config, saved as a JSON file (for the model structure) and npz files (for the model weights), that TensorRT-LLM can parse. The model config includes everything TensorRT-LLM needs to build the TensorRT inference engine, as explained below.

> *This quantization step may take a long time to finish and requires a large amount of GPU memory. Please use a server-grade GPU if a GPU out-of-memory error occurs.*

> *If the model was trained with tensor parallelism across multiple GPUs, the PTQ calibration process requires the same number of GPUs as were used for training.*
### PTQ (Post Training Quantization)
PTQ can be achieved with a simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model.

```python
import torch
from transformers import AutoModelForCausalLM

import ammo.torch.quantization as atq

model = AutoModelForCausalLM.from_pretrained("...")

# Select the quantization config, for example, FP8
config = atq.FP8_DEFAULT_CFG

# Prepare the calibration set and define a forward loop
def forward_loop():
    for data in calib_set:
        model(data)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
    atq.quantize(model, config, forward_loop)
```
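`calib_set` above is any iterable of model inputs used only for calibration. A minimal sketch of building it, assuming a Hugging Face tokenizer and a hypothetical list of representative text samples:

```python
from transformers import AutoTokenizer

# Hypothetical calibration data: in practice use 128-512 representative samples
# drawn from the training or evaluation set.
texts = ["sample text 1", "sample text 2"]

tokenizer = AutoTokenizer.from_pretrained("...")  # same checkpoint as the model

# Each element is a batch of token ids that `forward_loop` feeds to the model;
# move the tensors to the model's device (assumed here to be a GPU).
calib_set = [tokenizer(t, return_tensors="pt").input_ids.cuda() for t in texts]
```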
### Export Quantized Model

After the model is quantized, the model config can be exported. The model config files include all the information needed by TensorRT-LLM to generate the deployable engine, including the quantized scaling factors.

The exported model config is stored as:

- A single JSON file recording the model structure and metadata, and
- A group of npz files, each recording the model on a single tensor parallel rank (model weights and scaling factors per GPU).

The export API is:
```python
from ammo.torch.export import export_model_config

with torch.inference_mode():
    export_model_config(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gptj, llama or gptnext.
        dtype,  # The exported weights data type as torch.dtype.
        quantization,  # The quantization algorithm applied, e.g. fp8 or int8_sq.
        export_dir,  # The directory where the exported files will be stored.
        inference_gpus,  # The number of GPUs used at inference time for tensor parallelism.
    )
```
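After the call returns, `export_dir` contains the JSON config and the per-rank npz weight files described above. A minimal sketch for inspecting that output, assuming only the layout described in this section (the exact file names depend on the model and export settings):

```python
import glob
import json
import os

import numpy as np

export_dir = "exported_model"  # the same directory passed to export_model_config

# The single JSON file describes the model structure and metadata.
for json_path in glob.glob(os.path.join(export_dir, "*.json")):
    with open(json_path) as f:
        config = json.load(f)
    print(json_path, "->", list(config.keys()))

# Each npz file holds the weights and scaling factors of one tensor parallel rank.
for npz_path in sorted(glob.glob(os.path.join(export_dir, "*.npz"))):
    weights = np.load(npz_path)
    print(npz_path, "->", len(weights.files), "arrays")
```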