# TensorRT-LLM Checkpoint
## Overview
The earlier versions (pre-0.8) of TensorRT-LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT-LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT-LLM. This file documents the workflow around the TensorRT-LLM checkpoint and the set of CLI tools to generate checkpoints, build engines, and evaluate engines.
There are three steps in the workflow:
1. Convert weights from different source frameworks into TensorRT-LLM checkpoint.
2. Build the TensorRT-LLM checkpoint into TensorRT engines with a unified build command.
3. Load the engines to TensorRT-LLM model runner and evaluate with different evaluation tasks.
```
NeMo -------------
                  |
HuggingFace ------
                  |   convert                        build                     load
Modelopt --------- ----------> TensorRT-LLM Checkpoint --------> TensorRT Engine ------> TensorRT-LLM ModelRunner
                  |
JAX --------------
                  |
DeepSpeed --------
```
## Prepare the TensorRT-LLM Checkpoint
TensorRT-LLM aims to support models from different sources:
1. Trained models from NVIDIA NeMo, Microsoft DeepSpeed, and JAX
2. Quantized models from NVIDIA Modelopt
3. Popular models from HuggingFace
TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:
1. One config `json` file, which contains several model hyper-parameters.
2. One or several rank weights files; each file contains a dictionary of tensors (weights). The different files are loaded by different ranks in a multi-GPU (multi-process) scenario.
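As a concrete illustration, here is a minimal sketch of that on-disk layout written with standard tools. It assumes the `torch` and `safetensors` packages are available; the single tensor and its shape are placeholders, not a real model:

```python
# Minimal sketch of the checkpoint layout: one config.json plus one
# safetensors file per rank (illustrative values only).
import json
import torch
from safetensors.torch import save_file

config = {
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "vocab_size": 50272,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_act": "relu",
    "mapping": {"world_size": 1, "tp_size": 1, "pp_size": 1},
}

weights = {
    # Placeholder tensor; a real checkpoint holds every model parameter.
    "transformer.layers.0.attention.qkv.weight": torch.zeros(3 * 768, 768, dtype=torch.float16),
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
save_file(weights, "rank0.safetensors")  # one weights file per rank
```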
### Config
| Field | Type | Default Value |
| :------------------------------------- | :--------- | :------------------ |
| architecture | string | mandatory |
| dtype | string | mandatory |
| logits_dtype | string | 'float32' |
| vocab_size | int | mandatory |
| max_position_embeddings | int | null |
| hidden_size | int | mandatory |
| num_hidden_layers | int | mandatory |
| num_attention_heads | int | mandatory |
| num_key_value_heads | int | num_attention_heads |
| hidden_act | string | mandatory |
| intermediate_size | int | null |
| norm_epsilon | float | 1e-5 |
| position_embedding_type | string | 'learned_absolute' |
| mapping.world_size | int | 1 |
| mapping.tp_size | int | 1 |
| mapping.pp_size | int | 1 |
| quantization.quant_algo | str | null |
| quantization.kv_cache_quant_algo | str | null |
| quantization.group_size | int | 64 |
| quantization.has_zero_point | bool | False |
| quantization.pre_quant_scale | bool | False |
| quantization.exclude_modules | list | null |
`mapping.world_size` means that `mapping` is a dictionary containing a `world_size` sub-field. For example:
```json
{
    "architecture": "OPTForCausalLM",
    "mapping": {
        "world_size": 1
    }
}
```
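For a multi-GPU deployment, `world_size` is the total number of ranks. A small sketch, assuming the usual relation `world_size = tp_size * pp_size`:

```python
# Sketch: mapping section for a 4-GPU deployment with 2-way tensor parallelism
# and 2-way pipeline parallelism (assumes world_size = tp_size * pp_size).
tp_size, pp_size = 2, 2
mapping = {
    "world_size": tp_size * pp_size,
    "tp_size": tp_size,
    "pp_size": pp_size,
}
```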
Supported quantization algorithm list:
- W8A16
- W4A16
- W4A16_AWQ
- W4A8_AWQ
- W4A16_GPTQ
- FP8
- W8A8_SQ_PER_CHANNEL
Supported KV cache quantization algorithm list:
- FP8
- INT8
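These algorithm names are carried in the `quantization` section of `config.json`. Below is an illustrative sketch for a W4A16 AWQ checkpoint, written as a Python dict; the values are placeholders, and the convert/quantize scripts decide the exact contents:

```python
# Sketch of the `quantization` section of config.json for a W4A16 AWQ checkpoint.
# Illustrative values only; the group-wise fields are only meaningful for the
# W4 AWQ/GPTQ algorithms.
quantization = {
    "quant_algo": "W4A16_AWQ",      # one of the algorithms listed above
    "kv_cache_quant_algo": None,    # "FP8" or "INT8" when the KV cache is quantized
    "group_size": 128,              # group-wise quantization block size
    "has_zero_point": False,
    "pre_quant_scale": True,
    "exclude_modules": ["lm_head"], # hypothetical module excluded from quantization
}
```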
The config fields are extensible; a model can add its own model-specific fields.
For example, the OPT model has a `do_layer_norm_before` field.
Here is the model-specific config list:
| Field | Type | Default Value |
| :------------------------------------- | :--------- | :------------------ |
| OPT | | |
| do_layer_norm_before | bool | False |
| | | |
| Falcon | | |
| bias | bool | True |
| new_decoder_architecture | bool | False |
| parallel_attention | bool | False |
### Rank Weights
Like PyTorch, the tensor (weight) name is a string containing hierarchical information,
which is uniquely mapped to a certain parameter of a TensorRT-LLM model.
For example, each transformer layer of the OPT model contains an `Attention` layer, an `MLP` layer, and two `LayerNorm` layers.
#### Attention Weights
The `Attention` layer contains two `Linear` layers, `qkv` and `dense`; each `Linear` layer contains one weight and one bias.
There are four tensors (weights) in total, whose names are:
- `transformer.layers.0.attention.qkv.weight`
- `transformer.layers.0.attention.qkv.bias`
- `transformer.layers.0.attention.dense.weight`
- `transformer.layers.0.attention.dense.bias`
where `transformer.layers.0.attention` is the prefix name, indicating that the weights/biases are in the Attention module of the 0-th transformer layer.
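The q, k, and v projections are stored fused into the single `qkv` tensor. Here is a minimal sketch of how such a tensor could be assembled, assuming a plain multi-head model (no GQA) and placeholder data:

```python
# Sketch: the fused qkv weight stacks the q, k and v projection weights
# along the output dimension (assumes a plain multi-head model, no GQA).
import torch

hidden_size = 768
q_weight = torch.zeros(hidden_size, hidden_size, dtype=torch.float16)
k_weight = torch.zeros(hidden_size, hidden_size, dtype=torch.float16)
v_weight = torch.zeros(hidden_size, hidden_size, dtype=torch.float16)

qkv_weight = torch.cat([q_weight, k_weight, v_weight], dim=0)  # (3 * hidden_size, hidden_size)
weights = {"transformer.layers.0.attention.qkv.weight": qkv_weight.contiguous()}
```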
#### MLP Weights
The `MLP` layer also contains two `Linear` layers, `fc` and `proj`; each `Linear` layer contains one weight and one bias.
There are four tensors (weights) in total, whose names are:
- `transformer.layers.0.mlp.fc.weight`
- `transformer.layers.0.mlp.fc.bias`
- `transformer.layers.0.mlp.proj.weight`
- `transformer.layers.0.mlp.proj.bias`
where `transformer.layers.0.mlp` is the prefix name, indicating that the weights/biases are in the MLP module of the 0-th transformer layer.
#### LayerNorm Weights
Each of the two `LayerNorm` layers, namely `input_layernorm` and `post_layernorm`, contains one weight and one bias.
There are four tensors (weights) in total, whose names are:
- `transformer.layers.0.input_layernorm.weight`
- `transformer.layers.0.input_layernorm.bias`
- `transformer.layers.0.post_layernorm.weight`
- `transformer.layers.0.post_layernorm.bias`
where `transformer.layers.0.input_layernorm` and `transformer.layers.0.post_layernorm` are prefix names for the two `layernorm` modules.
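To check which tensor names a given rank file actually contains, the `safetensors` library can list the keys without loading the weight data. A small sketch (the file name is hypothetical; point it at any rank file of a TensorRT-LLM checkpoint):

```python
# Sketch: list the tensor names stored in one rank file without loading the data.
from safetensors import safe_open

with safe_open("rank0.safetensors", framework="pt") as f:
    layer0_names = [name for name in f.keys() if name.startswith("transformer.layers.0.")]

for name in sorted(layer0_names):
    print(name)
```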
#### KV Cache Quantization Scaling Factors
If we quantize the model, there will be different tensors (depending on the quantization method applied).
For example, if we quantize the KV cache, the `Attention` layer will have this extra scaling factor:
- `transformer.layers.0.attention.kv_cache_scaling_factor`
#### FP8 Quantization Scaling Factors
Here are the FP8 scaling factors of the `attention.qkv` linear layer:
- `transformer.layers.0.attention.qkv.activation_scaling_factor`
- `transformer.layers.0.attention.qkv.weights_scaling_factor`
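Conceptually, these are per-tensor scales derived from observed absolute maxima. The following is only a rough sketch that assumes the common `amax / 448` convention for FP8 E4M3; the quantization scripts define the exact semantics and calibration procedure:

```python
# Rough sketch of per-tensor FP8 scales. Assumes scale = amax / 448.0
# (the E4M3 maximum); the actual quantization workflow defines the exact
# convention and how calibration data is gathered.
import torch

E4M3_MAX = 448.0

def per_tensor_scale(t: torch.Tensor) -> torch.Tensor:
    return t.abs().max().float() / E4M3_MAX

weight = torch.randn(768, 768)            # placeholder weight
calib_activations = torch.randn(8, 768)   # placeholder calibration activations

weights_scaling_factor = per_tensor_scale(weight)
activation_scaling_factor = per_tensor_scale(calib_activations)
```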
#### AWQ Quantization Scaling Factors
Here are the AWQ scaling factors of the `mlp.fc` linear layer:
- `transformer.layers.0.mlp.fc.weights_scaling_factor`
- `transformer.layers.0.mlp.fc.prequant_scaling_factor`
```{note}
The linear weights in a TensorRT-LLM checkpoint always follow the (`out_feature`, `in_feature`) shape, whereas some quantized linear layers implemented by plugins may use the (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
```
### Example
Let's take OPT as an example and deploy the model with 2-way tensor parallelism:
```bash
cd examples/opt
python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --tp_size 2 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/
```
Here is the checkpoint directory:
```
./opt/125M/trt_ckpt/fp16/2-gpu/
    config.json
    rank0.safetensors
    rank1.safetensors
```
Here is the `config.json`:
```json
{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "do_layer_norm_before": true
}
```
## Build Checkpoint into TensorRT Engine
TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it,
you may need to add it to the `PATH`.
```bash
export PATH=/usr/local/bin:$PATH
trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./opt/125M/trt_engines/fp16/2-gpu/
```
## Make Evaluation
```bash
mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                            --batch_size 1 \
                            --test_trt_llm \
                            --hf_model_dir opt-125m \
                            --data_type fp16 \
                            --check_accuracy \
                            --tensorrt_llm_rouge1_threshold=14
```
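Besides the bundled `summarize.py` script, an engine can also be driven directly through the Python runtime, as step 3 of the workflow describes. Below is a minimal sketch for a single-GPU engine; the engine path and prompt are hypothetical, and the 2-GPU engine built above would instead be launched under `mpirun` as in the command above:

```python
# Minimal sketch: load an engine with ModelRunner and generate a few tokens.
# Assumes a single-rank engine at the (hypothetical) path below; multi-GPU
# engines are launched with mpirun instead.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("opt-125m")
runner = ModelRunner.from_dir(engine_dir="./opt/125M/trt_engines/fp16/1-gpu/")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()
outputs = runner.generate(batch_input_ids=[input_ids[0]],
                          max_new_tokens=32,
                          end_id=tokenizer.eos_token_id,
                          pad_id=tokenizer.pad_token_id)

# By default, generate() returns token ids of shape (batch, beams, seq_len).
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```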