# New Workflow

## Overview

There are three steps in the new workflow:

1. Convert weights from different source frameworks into a TensorRT-LLM checkpoint.
2. Build the TensorRT-LLM checkpoint into TensorRT engine(s) with a unified build command.
3. Load the engine(s) into the TensorRT-LLM model runner and evaluate them on different evaluation tasks.

```txt
NeMo -------------
                  |
HuggingFace ------
                  |   convert                        build                load
AMMO ------------- ----------> TRT-LLM Checkpoint --------> TRT Engine ------> TRT-LLM ModelRunner
                  |
JAX --------------
                  |
DeepSpeed --------
```

## Prepare TensorRT-LLM Checkpoint

There are different kinds of sources we want to support:

1. Trained models from NeMo/DeepSpeed/JAX
2. Quantized models from AMMO
3. Popular models from HuggingFace

TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:

1. One config JSON file, which contains several model hyper-parameters.
2. One or several rank weights files; each rank file contains a dictionary of tensors (weights).

### Config

| Field                               | Type   | Default Value       |
| :---------------------------------- | :----- | :------------------ |
| architecture                        | string | mandatory           |
| dtype                               | string | mandatory           |
| logits_dtype                        | string | 'float32'           |
| vocab_size                          | int    | mandatory           |
| max_position_embeddings             | int    | null                |
| hidden_size                         | int    | mandatory           |
| num_hidden_layers                   | int    | mandatory           |
| num_attention_heads                 | int    | mandatory           |
| num_key_value_heads                 | int    | num_attention_heads |
| hidden_act                          | string | mandatory           |
| intermediate_size                   | int    | null                |
| norm_epsilon                        | float  | 1e-5                |
| position_embedding_type             | string | 'learned_absolute'  |
| use_prompt_tuning                   | bool   | false               |
| mapping.world_size                  | int    | 1                   |
| mapping.tp_size                     | int    | 1                   |
| mapping.pp_size                     | int    | 1                   |
| quantization.use_smooth_quant       | bool   | false               |
| quantization.per_channel            | bool   | false               |
| quantization.per_token              | bool   | false               |
| quantization.per_group              | bool   | false               |
| quantization.group_size             | int    | 64                  |
| quantization.int8_kv_cache          | bool   | false               |
| quantization.enable_fp8             | bool   | false               |
| quantization.fp8_kv_cache           | bool   | false               |
| quantization.use_weight_only        | bool   | false               |
| quantization.weight_only_precision  | string | 'int8'              |

The config fields are extensible; a model can add its own model-specific config fields.
For example, the OPT model has a `do_layer_norm_before` field.

### Rank Weights

Like PyTorch, the tensor (weight) name is a string containing hierarchical information,
which is uniquely mapped to a certain parameter of a TensorRT-LLM model.

For example, the `Attention` layer contains 2 `Linear` layers, qkv and dense.
Each linear layer contains one weight and one bias.
So, there are 4 tensors (weights) in total, whose names are:

- "xxx.qkv.weight"
- "xxx.qkv.bias"
- "xxx.dense.weight"
- "xxx.dense.bias"

`xxx` is the prefix name. If we quantize the KV cache, we will have 2 extra scaling factors:

- "xxx.kv_orig_quant_scale"
- "xxx.kv_quant_orig_scale"

If we use FP8 quantization, we will have 4 extra scaling factors:

- "xxx.qkv.activation_scaling_factor"
- "xxx.qkv.weights_scaling_factor"
- "xxx.dense.activation_scaling_factor"
- "xxx.dense.weights_scaling_factor"

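As an illustration of this layout, here is a minimal sketch of how a converter might emit one rank's shard with these names using the `safetensors` package. The `transformer.layers.0.attention` prefix and the tensor shapes are assumptions made for the example, not the exact names every model uses.

```python
# Illustrative only: write one rank's shard using the naming scheme above.
# The prefix and shapes are made-up examples; a real converter derives them
# from the source model.
import numpy as np
from safetensors.numpy import save_file

hidden_size = 768
prefix = "transformer.layers.0.attention"  # the "xxx" prefix in the lists above

weights = {
    f"{prefix}.qkv.weight": np.zeros((3 * hidden_size, hidden_size), dtype=np.float16),
    f"{prefix}.qkv.bias": np.zeros(3 * hidden_size, dtype=np.float16),
    f"{prefix}.dense.weight": np.zeros((hidden_size, hidden_size), dtype=np.float16),
    f"{prefix}.dense.bias": np.zeros(hidden_size, dtype=np.float16),
}

# One file per rank: rank0.safetensors, rank1.safetensors, ...
save_file(weights, "rank0.safetensors")
```
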
### Example

Let's take OPT as an example. Say we want to deploy the model with tensor parallelism 2:

```bash
cd examples/opt
python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --world_size 2 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/
```

Here is the checkpoint directory:

```txt
./opt/125M/trt_ckpt/fp16/2-gpu/
    config.json
    rank0.safetensors
    rank1.safetensors
```

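If you want to peek inside a rank file, the shard can be read back as a plain dictionary of tensors. This small sketch uses the `safetensors` package directly (it is not a TensorRT-LLM API) and assumes the directory produced above.

```python
# Sketch: list the tensors stored in one rank's shard of the checkpoint.
from safetensors.numpy import load_file

weights = load_file("./opt/125M/trt_ckpt/fp16/2-gpu/rank0.safetensors")
for name, tensor in weights.items():
    print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```
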
Here is the `config.json`:

```json
{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "quantization": {
        "use_weight_only": false,
        "weight_only_precision": "int8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "do_layer_norm_before": true,
    "use_prompt_tuning": false
}
```

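As a rough illustration of how these fields fit together, the sketch below (not part of TensorRT-LLM) loads the config and checks that the parallelism mapping is self-consistent; `trtllm-build` does its own validation, so this is only for exploration.

```python
# Sketch: read the checkpoint config and sanity-check a few fields.
import json

with open("./opt/125M/trt_ckpt/fp16/2-gpu/config.json") as f:
    config = json.load(f)

mapping = config.get("mapping", {})
tp_size = mapping.get("tp_size", 1)
pp_size = mapping.get("pp_size", 1)  # defaults to 1 when omitted, as here
world_size = mapping.get("world_size", 1)

# world_size should equal tp_size * pp_size, with one rank<i>.safetensors
# file per rank, and the head count must divide evenly across TP ranks.
assert world_size == tp_size * pp_size
assert config["num_attention_heads"] % tp_size == 0

print(f"{config['architecture']}: {world_size} rank(s), dtype={config['dtype']}")
```
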
## Build Checkpoint into TensorRT Engine

TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it,
you may need to add it to the `PATH`:

```bash
export PATH=/usr/local/bin:$PATH

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/125M/trt_engines/fp16/2-gpu/
```

## Make Evaluation

```bash
mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                            --batch_size 1 \
                            --test_trt_llm \
                            --hf_model_dir opt-125m \
                            --data_type fp16 \
                            --check_accuracy \
                            --tensorrt_llm_rouge1_threshold=14
```
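`summarize.py` drives the engines for you. If you want to load them directly from Python instead, the sketch below shows the general shape of doing so with `ModelRunner`; the exact argument names follow `examples/run.py` at the time of writing and should be treated as assumptions to verify against your installed version. For the 2-way tensor-parallel engine built above, the script itself must be launched with `mpirun -n 2`.

```python
# Sketch: load the built engines with the TensorRT-LLM model runner and
# generate a few tokens. Launch with `mpirun -n 2` for the 2-GPU engine.
import torch
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="./opt/125M/trt_engines/fp16/2-gpu/",
                              rank=tensorrt_llm.mpi_rank())

# Token IDs would normally come from the matching HuggingFace tokenizer;
# end_id/pad_id below are OPT's tokenizer defaults.
batch_input_ids = [torch.tensor([2, 100, 101, 102], dtype=torch.int32)]

outputs = runner.generate(batch_input_ids,
                          max_new_tokens=16,
                          end_id=2,
                          pad_id=1)

if tensorrt_llm.mpi_rank() == 0:
    print(outputs.shape)  # [batch_size, num_beams, output_len]
```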