New Workflow

Overview

There are 3 steps in the new workflow:

  1. convert weights from different source frameworks into a TensorRT-LLM checkpoint

  2. build the TensorRT-LLM checkpoint into TensorRT engine(s) with a unified build command

  3. load the engine(s) into the TensorRT-LLM model runner and run evaluation with different evaluation tasks

NeMo -------------
                  |
HuggingFace ------
                  |   convert                       build                load
AMMO -------------  ----------> TRT-LLM Checkpoint --------> TRT Engine ------> TRT-LLM ModelRunner
                  |
JAX --------------
                  |
DeepSpeed --------

Prepare TensorRT-LLM Checkpoint

There are different kinds of sources we want to support:

  1. trained models from NeMo/DeepSpeed/JAX

  2. quantized models from AMMO

  3. popular models from HuggingFace

TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:

  1. One config.json file, which contains the model hyper-parameters

  2. One or several rank weights files; each rank file contains a dictionary of tensors (weights)

Config

Field                               Type    Default Value
architecture                        string  mandatory
dtype                               string  mandatory
logits_dtype                        string  'float32'
vocab_size                          int     mandatory
max_position_embeddings             int     null
hidden_size                         int     mandatory
num_hidden_layers                   int     mandatory
num_attention_heads                 int     mandatory
num_key_value_heads                 int     num_attention_heads
hidden_act                          string  mandatory
intermediate_size                   int     null
norm_epsilon                        float   1e-5
position_embedding_type             string  'learned_absolute'
use_prompt_tuning                   bool    false
mapping.world_size                  int     1
mapping.tp_size                     int     1
mapping.pp_size                     int     1
quantization.use_smooth_quant       bool    false
quantization.per_channel            bool    false
quantization.per_token              bool    false
quantization.per_group              bool    false
quantization.group_size             int     64
quantization.int8_kv_cache          bool    false
quantization.enable_fp8             bool    false
quantization.fp8_kv_cache           bool    false
quantization.use_weight_only        bool    false
quantization.weight_only_precision  string  'int8'

The config is extensible; a model can add its own model-specific config fields. For example, the OPT model has a do_layer_norm_before field.
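As an illustration of the table above, a sketch of a minimal config containing only the mandatory fields plus one model-specific extension could be written as follows; the values simply mirror the OPT-125M example shown later in this document:

import json

# Mandatory fields from the table above; the values mirror the OPT-125M example below.
config = {
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "vocab_size": 50272,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_act": "relu",
    # Model-specific extension field (OPT only).
    "do_layer_norm_before": True,
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)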

Rank Weights

As in PyTorch, the tensor (weight) name is a string containing hierarchical information, which maps uniquely to a certain parameter of a TensorRT-LLM model.

For example, the Attention layer contains two Linear layers, qkv and dense. Each Linear layer contains one weight and one bias, so there are 4 tensors (weights) in total, whose names are:

  • “xxx.qkv.weight”

  • “xxx.qkv.bias”

  • “xxx.dense.weight”

  • “xxx.dense.bias”

xxx is the prefix name. If we quantize the KV cache, there are 2 extra scaling factors:

  • “xxx.kv_orig_quant_scale”

  • “xxx.kv_quant_orig_scale”

If we do FP8 quantization, there are 4 extra scaling factors:

  • “xxx.qkv.activation_scaling_factor”

  • “xxx.qkv.weights_scaling_factor”

  • “xxx.dense.activation_scaling_factor”

  • “xxx.dense.weights_scaling_factor”
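Putting these names together, here is a minimal sketch of how such a rank weights file could be assembled; the transformer.layers.0.attention prefix, the tensor shapes, and the use of the safetensors Python package are illustrative assumptions, not a fixed convention:

import torch
from safetensors.torch import save_file

# Illustrative prefix and shapes; actual names depend on the model definition.
prefix = "transformer.layers.0.attention"
hidden_size = 768

weights = {
    f"{prefix}.qkv.weight": torch.zeros(3 * hidden_size, hidden_size, dtype=torch.float16),
    f"{prefix}.qkv.bias": torch.zeros(3 * hidden_size, dtype=torch.float16),
    f"{prefix}.dense.weight": torch.zeros(hidden_size, hidden_size, dtype=torch.float16),
    f"{prefix}.dense.bias": torch.zeros(hidden_size, dtype=torch.float16),
}

# One weights file per rank, e.g. rank0.safetensors.
save_file(weights, "rank0.safetensors")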

Example

Let's take OPT as an example and say we want to deploy the model with tensor parallelism 2:

cd examples/opt
python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --world_size 2 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/

Here is the checkpoint directory:

./opt/125M/trt_ckpt/fp16/2-gpu/
    config.json
    rank0.safetensors
    rank1.safetensors

Here is the config.json:

{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "quantization": {
        "use_weight_only": false,
        "weight_only_precision": "int8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "do_layer_norm_before": true,
    "use_prompt_tuning": false
}
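
A small sketch for inspecting the generated checkpoint, assuming the safetensors Python package is available:

import json
from safetensors import safe_open

ckpt_dir = "./opt/125M/trt_ckpt/fp16/2-gpu"

# Read the model hyper-parameters.
with open(f"{ckpt_dir}/config.json") as f:
    config = json.load(f)
print(config["architecture"], config["dtype"])

# List the tensors stored for rank 0.
with safe_open(f"{ckpt_dir}/rank0.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))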

Build Checkpoint into TensorRT Engine

TensorRT-LLM provides a unified build command: trtllm-build. Before using it, you may need to add it to the PATH:

export PATH=/usr/local/bin:$PATH

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/125M/trt_engines/fp16/2-gpu/

Make Evaluation

mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                        --batch_size 1 \
                        --test_trt_llm \
                        --hf_model_dir opt-125m \
                        --data_type fp16 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=14
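
summarize.py wraps the TensorRT-LLM model runner. For reference, a minimal sketch of loading the engines directly is shown below; the ModelRunner.generate arguments used here are assumptions, so check the runtime API for the exact signature, and launch the script with mpirun -n 2 for this 2-GPU engine, just like the command above:

from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "./opt/125M/trt_engines/fp16/2-gpu/"
tokenizer = AutoTokenizer.from_pretrained("opt-125m")

runner = ModelRunner.from_dir(engine_dir)

# One 1-D tensor of input ids per request.
batch_input_ids = [tokenizer("Hello, my name is", return_tensors="pt").input_ids[0].int()]
output_ids = runner.generate(batch_input_ids,
                             max_new_tokens=32,
                             end_id=tokenizer.eos_token_id,
                             pad_id=tokenizer.eos_token_id)

# output_ids is assumed to have shape [batch_size, num_beams, seq_len].
print(tokenizer.decode(output_ids[0][0].tolist(), skip_special_tokens=True))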