# New Workflow
## Overview

The first versions of TensorRT-LLM were developed with a very aggressive timeline. For those versions, emphasis was not put
on defining a unified workflow. Now that TensorRT-LLM has reached some level of feature richness, the development team has
decided to put more effort into unifying the APIs and workflow of TensorRT-LLM. This document summarizes the new workflow
adopted by TensorRT-LLM at its core.

There are 3 steps in the new workflow:

1. Convert the weights from different source frameworks into a TensorRT-LLM checkpoint.
2. Build the TensorRT-LLM checkpoint into TensorRT engine(s) with a unified build command.
3. Load the engine(s) into the TensorRT-LLM model runner and evaluate with different evaluation tasks.

```
NeMo -------------
                  |
HuggingFace ------|
                  |  convert                           build                  load
AMMO -------------|----------> TensorRT-LLM Checkpoint --------> TensorRT Engine ------> TensorRT-LLM ModelRunner
                  |
JAX --------------|
                  |
DeepSpeed --------|
```

## Prepare the TensorRT-LLM Checkpoint

TensorRT-LLM aims at supporting a variety of sources:

1. Trained models from NeMo, DeepSpeed, JAX
2. Quantized models from AMMO
3. Popular models from HuggingFace

TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:

1. One config JSON file, which contains several model hyper-parameters.
2. One or several rank weights files, each containing a dictionary of tensors (weights).
   The different files are loaded by different ranks in a multi-GPU (multi-process) scenario.

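To make this concrete, here is a minimal, hedged sketch (not part of TensorRT-LLM itself) that inspects such a checkpoint directory; the path is just an example and assumes a checkpoint produced as described later in this document, with the `safetensors` package installed:

```python
# Hedged sketch: inspect a TensorRT-LLM checkpoint directory.
import json
from safetensors import safe_open

ckpt_dir = "./opt/125M/trt_ckpt/fp16/2-gpu"  # example path, see the OPT example below

# The config JSON holds the model hyper-parameters.
with open(f"{ckpt_dir}/config.json") as f:
    config = json.load(f)
print(config["architecture"], config["dtype"], config["mapping"]["world_size"])

# Each rank file holds a dictionary of tensors (weights) for one rank.
with safe_open(f"{ckpt_dir}/rank0.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
```
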
### Config

| Field                            | Type   | Default Value       |
| :------------------------------- | :----- | :------------------ |
| architecture                     | string | mandatory           |
| dtype                            | string | mandatory           |
| logits_dtype                     | string | 'float32'           |
| vocab_size                       | int    | mandatory           |
| max_position_embeddings          | int    | null                |
| hidden_size                      | int    | mandatory           |
| num_hidden_layers                | int    | mandatory           |
| num_attention_heads              | int    | mandatory           |
| num_key_value_heads              | int    | num_attention_heads |
| hidden_act                       | string | mandatory           |
| intermediate_size                | int    | null                |
| norm_epsilon                     | float  | 1e-5                |
| position_embedding_type          | string | 'learned_absolute'  |
| use_prompt_tuning                | bool   | false               |
| mapping.world_size               | int    | 1                   |
| mapping.tp_size                  | int    | 1                   |
| mapping.pp_size                  | int    | 1                   |
| quantization.quant_algo          | string | null                |
| quantization.kv_cache_quant_algo | string | null                |
| quantization.group_size          | int    | 64                  |
| quantization.has_zero_point      | bool   | false               |
| quantization.pre_quant_scale     | bool   | false               |
| quantization.exclude_modules     | list   | null                |

`mapping.world_size` means that `mapping` is a dictionary containing a `world_size` sub-field. For example:

```json
{
    "architecture": "OPTForCausalLM",
    "mapping": {
        "world_size": 1
    }
}
```

Supported quantization algorithms:

- W8A16
- W4A16
- W4A16_AWQ
- W4A8_AWQ
- W4A16_GPTQ
- FP8
- W8A8_SQ_PER_CHANNEL

Supported KV cache quantization algorithms:

- FP8
- INT8

The config is extensible: a model can add its own model-specific config fields.
For example, the OPT model has a `do_layer_norm_before` field.

Here is the list of model-specific config fields:

| Field                    | Type | Default Value |
| :----------------------- | :--- | :------------ |
| OPT                      |      |               |
| do_layer_norm_before     | bool | False         |
|                          |      |               |
| Falcon                   |      |               |
| bias                     | bool | True          |
| new_decoder_architecture | bool | False         |
| parallel_attention       | bool | False         |

### Rank Weights

Like PyTorch, the tensor (weight) name is a string containing hierarchical information,
which is uniquely mapped to a certain parameter of a TensorRT-LLM model.

For example, each transformer layer of the OPT model contains an `Attention` layer, an `MLP` layer and two `LayerNorm` layers.

#### Attention Weights

The `Attention` layer contains two `Linear` layers, qkv and dense; each `Linear` layer contains one weight and one bias.
So, there are four tensors (weights) in total, whose names are:

- "transformer.layers.0.attention.qkv.weight"
- "transformer.layers.0.attention.qkv.bias"
- "transformer.layers.0.attention.dense.weight"
- "transformer.layers.0.attention.dense.bias"

where `transformer.layers.0.attention` is the prefix name, indicating that the weights/biases are in the attention module of the 0-th transformer layer.
#### MLP Weights

The `MLP` layer also contains two `Linear` layers, fc and proj; each `Linear` layer contains one weight and one bias.
So, there are four tensors (weights) in total, whose names are:

- "transformer.layers.0.mlp.fc.weight"
- "transformer.layers.0.mlp.fc.bias"
- "transformer.layers.0.mlp.proj.weight"
- "transformer.layers.0.mlp.proj.bias"

where `transformer.layers.0.mlp` is the prefix name, indicating that the weights/biases are in the mlp module of the 0-th transformer layer.
#### LayerNorm Weights

Each of the two `LayerNorm` layers, namely input_layernorm and post_layernorm, contains one weight and one bias.
So, there are four tensors (weights) in total, whose names are:

- "transformer.layers.0.input_layernorm.weight"
- "transformer.layers.0.input_layernorm.bias"
- "transformer.layers.0.post_layernorm.weight"
- "transformer.layers.0.post_layernorm.bias"

where `transformer.layers.0.input_layernorm` and `transformer.layers.0.post_layernorm` are prefix names for the two layernorm modules.
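
To make the naming convention concrete, here is a small, hypothetical sketch (not the actual `convert_checkpoint.py` logic) that assembles the weight dictionary for the 0-th OPT transformer layer and writes it as a rank file; `hidden_size = 768` and the 4x MLP expansion are illustrative assumptions matching the OPT-125M example later in this document:

```python
# Hypothetical sketch: build a weight dict for one OPT transformer layer using
# the naming convention above, then save it as a rank weights file.
import torch
from safetensors.torch import save_file

hidden_size = 768
prefix = "transformer.layers.0"
weights = {}

# Attention: qkv and dense Linear layers, each with a weight and a bias.
weights[f"{prefix}.attention.qkv.weight"] = torch.zeros(3 * hidden_size, hidden_size, dtype=torch.float16)
weights[f"{prefix}.attention.qkv.bias"] = torch.zeros(3 * hidden_size, dtype=torch.float16)
weights[f"{prefix}.attention.dense.weight"] = torch.zeros(hidden_size, hidden_size, dtype=torch.float16)
weights[f"{prefix}.attention.dense.bias"] = torch.zeros(hidden_size, dtype=torch.float16)

# MLP: fc and proj Linear layers.
weights[f"{prefix}.mlp.fc.weight"] = torch.zeros(4 * hidden_size, hidden_size, dtype=torch.float16)
weights[f"{prefix}.mlp.fc.bias"] = torch.zeros(4 * hidden_size, dtype=torch.float16)
weights[f"{prefix}.mlp.proj.weight"] = torch.zeros(hidden_size, 4 * hidden_size, dtype=torch.float16)
weights[f"{prefix}.mlp.proj.bias"] = torch.zeros(hidden_size, dtype=torch.float16)

# The two LayerNorm layers.
for ln in ("input_layernorm", "post_layernorm"):
    weights[f"{prefix}.{ln}.weight"] = torch.zeros(hidden_size, dtype=torch.float16)
    weights[f"{prefix}.{ln}.bias"] = torch.zeros(hidden_size, dtype=torch.float16)

save_file(weights, "rank0.safetensors")
```
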
#### KV Cache Quantization Scaling Factors

Note that if we quantize the model, there will be different tensors (depending on the quantization method applied).
For example, if we quantize the KV cache, the `Attention` layer will have this extra scaling factor:

- "transformer.layers.0.attention.kv_cache_scaling_factor"
#### FP8 Quantization Scaling Factors

For example, here are the FP8 scaling factors of the attention.qkv linear layer:

- "transformer.layers.0.attention.qkv.activation_scaling_factor"
- "transformer.layers.0.attention.qkv.weights_scaling_factor"

#### AWQ Quantization Scaling Factors

For example, here are the AWQ scaling factors of the mlp.fc linear layer:

- "transformer.layers.0.mlp.fc.weights_scaling_factor"
- "transformer.layers.0.mlp.fc.prequant_scaling_factor"

**Note**: The linear weights in a TensorRT-LLM checkpoint always follow the (out_features, in_features) shape,
whereas some quantized linear layers implemented by plugins in TensorRT-LLM may use the (in_features, out_features) shape.
The `trtllm-build` command will add a transpose operation to post-process it.

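As a small illustration of that shape convention (hypothetical code, not part of `trtllm-build`), a weight stored in (in_features, out_features) layout would be transposed before being written into the checkpoint:

```python
# Illustration only: checkpoint Linear weights use the (out_features, in_features) layout.
import torch

in_features, out_features = 768, 3072
w_in_by_out = torch.zeros(in_features, out_features)   # (in, out) layout from some source
w_ckpt = w_in_by_out.t().contiguous()                   # (out, in) layout expected by the checkpoint
assert w_ckpt.shape == (out_features, in_features)
```
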
### Example

Let's take OPT as an example. Say we want to deploy the model with tensor parallelism 2:

```bash
cd examples/opt
python3 convert_checkpoint.py --model_dir ./opt-125m \
                              --dtype float16 \
                              --world_size 2 \
                              --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/
```

Here is the checkpoint directory:

```
./opt/125M/trt_ckpt/fp16/2-gpu/
    config.json
    rank0.safetensors
    rank1.safetensors
```

Here is the `config.json`:

```json
{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "quantization": {
        "use_weight_only": false,
        "weight_only_precision": "int8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "do_layer_norm_before": true,
    "use_prompt_tuning": false
}
```
## Build Checkpoint into TensorRT Engine

TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it,
you may need to add it to the `PATH`:

```bash
export PATH=/usr/local/bin:$PATH

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_output_len 100 \
             --output_dir ./opt/125M/trt_engines/fp16/2-gpu/
```
## Make Evaluation

```bash
mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                            --batch_size 1 \
                            --test_trt_llm \
                            --hf_model_dir opt-125m \
                            --data_type fp16 \
                            --check_accuracy \
                            --tensorrt_llm_rouge1_threshold=14
```
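
The evaluation script drives the engines through the TensorRT-LLM runtime. For a programmatic view of step 3 (loading the engines into the model runner), here is a hedged, minimal sketch; the `ModelRunner` calls shown are based on the `examples/run.py` script and may differ across TensorRT-LLM versions, the tokenizer is pulled from the HuggingFace hub for simplicity, and a 2-GPU engine has to be launched with `mpirun -n 2` just like the evaluation command above:

```python
# Hedged sketch: load built engines with the TensorRT-LLM runtime and generate.
# The exact ModelRunner API may differ between TensorRT-LLM versions.
import torch
from transformers import AutoTokenizer
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Each MPI rank loads its own shard of the 2-GPU engine.
runner = ModelRunner.from_dir(engine_dir="./opt/125M/trt_engines/fp16/2-gpu/",
                              rank=tensorrt_llm.mpi_rank())

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.no_grad():
    output_ids = runner.generate(batch_input_ids=[input_ids[0]],
                                 max_new_tokens=32,
                                 end_id=tokenizer.eos_token_id,
                                 pad_id=tokenizer.pad_token_id)

# output_ids is shaped [batch_size, num_beams, seq_len].
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```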