# New Workflow

## Overview

There are three steps in the new workflow:

1. Convert the weights from different source frameworks into a TensorRT-LLM checkpoint.
2. Build the TensorRT-LLM checkpoint into TensorRT engine(s) with a unified build command.
3. Load the engine(s) into the TensorRT-LLM model runner and run evaluation with different evaluation tasks.

```txt
NeMo -------------
                  |
HuggingFace ------
                  |   convert                       build                load
AMMO ------------- ----------> TRT-LLM Checkpoint --------> TRT Engine ------> TRT-LLM ModelRunner
                  |
JAX --------------
                  |
DeepSpeed --------
```

## Prepare TensorRT-LLM Checkpoint

There are different kinds of sources we want to support:

1. Trained models from NeMo/DeepSpeed/JAX
2. Quantized models from AMMO
3. Popular models from HuggingFace

TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:

1. One config JSON file, which contains several model hyper-parameters.
2. One or several rank weights files; each rank file contains a dictionary of tensors (weights).

### Config

| Field                                   | Type       | Default Value       |
| :-------------------------------------- | :--------- | :------------------ |
| architecture                            | string     | mandatory           |
| dtype                                   | string     | mandatory           |
| logits_dtype                            | string     | 'float32'           |
| vocab_size                              | int        | mandatory           |
| max_position_embeddings                 | int        | null                |
| hidden_size                             | int        | mandatory           |
| num_hidden_layers                       | int        | mandatory           |
| num_attention_heads                     | int        | mandatory           |
| num_key_value_heads                     | int        | num_attention_heads |
| hidden_act                              | string     | mandatory           |
| intermediate_size                       | int        | null                |
| norm_epsilon                            | float      | 1e-5                |
| position_embedding_type                 | string     | 'learned_absolute'  |
| use_prompt_tuning                       | bool       | false               |
| mapping.world_size                      | int        | 1                   |
| mapping.tp_size                         | int        | 1                   |
| mapping.pp_size                         | int        | 1                   |
| quantization.use_smooth_quant           | bool       | false               |
| quantization.per_channel                | bool       | false               |
| quantization.per_token                  | bool       | false               |
| quantization.per_group                  | bool       | false               |
| quantization.group_size                 | int        | 64                  |
| quantization.int8_kv_cache              | bool       | false               |
| quantization.enable_fp8                 | bool       | false               |
| quantization.fp8_kv_cache               | bool       | false               |
| quantization.use_weight_only            | bool       | false               |
| quantization.weight_only_precision      | string     | 'int8'              |

The config is extensible: a model can add its own model-specific fields. For example, the OPT model has a `do_layer_norm_before` field.

### Rank Weights

Like PyTorch, the tensor (weight) name is a string containing hierarchical information, which maps uniquely to a certain parameter of a TensorRT-LLM model.

For example, the `Attention` layer contains two `Linear` layers, `qkv` and `dense`. Each linear layer contains one weight and one bias, so there are four tensors (weights) in total, whose names are:

- "xxx.qkv.weight"
- "xxx.qkv.bias"
- "xxx.dense.weight"
- "xxx.dense.bias"

`xxx` is the prefix name.
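As a rough illustration of this naming convention, a converter script could assemble and save the per-rank weight dictionary as shown below. This is a minimal sketch, not the actual conversion code: the `transformer.layers.0.attention` prefix, the tensor shapes, and the use of zero-valued tensors are assumptions made for a small example.

```python
import torch
from safetensors.torch import save_file

hidden_size = 768
# Assumed prefix for illustration only; real prefixes depend on the model definition.
prefix = "transformer.layers.0.attention"

# One dictionary of tensors per rank, keyed by hierarchical names.
weights = {
    f"{prefix}.qkv.weight": torch.zeros(3 * hidden_size, hidden_size, dtype=torch.float16),
    f"{prefix}.qkv.bias": torch.zeros(3 * hidden_size, dtype=torch.float16),
    f"{prefix}.dense.weight": torch.zeros(hidden_size, hidden_size, dtype=torch.float16),
    f"{prefix}.dense.bias": torch.zeros(hidden_size, dtype=torch.float16),
}

# Each rank's dictionary is stored in its own safetensors file, e.g. rank0.safetensors.
save_file(weights, "rank0.safetensors")
```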
If we quantize the KV cache, there are two extra scaling factors:

- "xxx.kv_orig_quant_scale"
- "xxx.kv_quant_orig_scale"

If we use FP8 quantization, there are four extra scaling factors:

- "xxx.qkv.activation_scaling_factor"
- "xxx.qkv.weights_scaling_factor"
- "xxx.dense.activation_scaling_factor"
- "xxx.dense.weights_scaling_factor"

### Example

Let's take OPT as an example and deploy the model with tensor parallelism 2:

```bash
cd examples/opt
python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --world_size 2 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/
```

Here is the resulting checkpoint directory:

```txt
./opt/125M/trt_ckpt/fp16/2-gpu/
    config.json
    rank0.safetensors
    rank1.safetensors
```

Here is the `config.json`:

```json
{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "quantization": {
        "use_weight_only": false,
        "weight_only_precision": "int8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "do_layer_norm_before": true,
    "use_prompt_tuning": false
}
```

## Build Checkpoint into TensorRT Engine

TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it, you may need to add it to the `PATH`:

```bash
export PATH=/usr/local/bin:$PATH

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/125M/trt_engines/fp16/2-gpu/
```

## Make Evaluation

```bash
mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                            --batch_size 1 \
                            --test_trt_llm \
                            --hf_model_dir opt-125m \
                            --data_type fp16 \
                            --check_accuracy \
                            --tensorrt_llm_rouge1_threshold=14
```
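Besides `summarize.py`, the built engine can also be loaded programmatically through the runtime `ModelRunner`, as mentioned in step 3 of the overview. The following is a minimal sketch of that path, not a definitive recipe: the prompt text is made up, and the exact `ModelRunner`/`generate` argument names may differ slightly between TensorRT-LLM versions. For the 2-GPU engine built above, the script would be launched once per rank under `mpirun -n 2`.

```python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "./opt/125M/trt_engines/fp16/2-gpu/"
tokenizer = AutoTokenizer.from_pretrained("opt-125m")  # same local HF model dir as above

# Load the per-rank engine; each MPI rank picks up its own rank{N}.engine.
runner = ModelRunner.from_dir(engine_dir, rank=tensorrt_llm.mpi_rank())

# Tokenize one prompt; generate() expects a list of 1-D int32 tensors.
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.int()
outputs = runner.generate([input_ids[0]],
                          max_new_tokens=20,
                          end_id=tokenizer.eos_token_id,
                          pad_id=tokenizer.pad_token_id)

# Output shape is [batch, beams, seq_len]; only rank 0 prints.
if tensorrt_llm.mpi_rank() == 0:
    print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```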