# DoRA

This document shows how to run a model using DoRA adapters. DoRA is a PEFT strategy that extends LoRA and is fully supported in the Hugging Face peft library. For a more detailed description, please refer to the DoRA paper or the official repo.
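
For reference, a DoRA adapter is essentially a LoRA adapter with an additional per-channel magnitude vector, and peft enables it through a single flag. The snippet below is a minimal sketch; the rank, alpha, target modules, and output path are illustrative and not taken from the checkpoints used later in this document:

```python
# Minimal sketch of creating a DoRA adapter with Hugging Face peft.
# Assumes a peft version that supports DoRA (use_dora=True in LoraConfig).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=32,                                  # adapter rank (illustrative)
    lora_alpha=64,                         # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],   # which projections to adapt (illustrative)
    use_dora=True,                         # decompose weights into magnitude + direction
)

model = get_peft_model(base, config)
# ... train the adapter as usual, then save it:
model.save_pretrained("my-dora-adapter")   # hypothetical output directory
```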

## Support Matrix

  * FP16/BF16 adapters (on top of a base model of arbitrary precision).
  * Adapters from Hugging Face peft or from the official NVlabs checkpoints.
  * Multiple adapters, including mixed LoRA/DoRA setups.
  * In-flight loading of new adapters into a preloaded base model.
  * C++ and Python runtimes.
  * Tensor parallelism and pipeline parallelism.

## Usage

Using DoRA is almost exactly the same as using LoRA in TensorRT-LLM, with one additional preprocessing step. While the official DoRA paper describes the magnitude normalization as part of the execution flow, it can be performed once ahead of time to improve inference performance.
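
Conceptually, the preprocessing divides the DoRA magnitude vector by the norm of the adapted weight, which is constant at inference time. The sketch below illustrates the idea only; it is not the actual `normalize_weights.py` implementation, and the weight layout and scaling conventions are assumptions:

```python
import torch

def normalize_magnitude(w0: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor,
                        magnitude: torch.Tensor, scaling: float) -> torch.Tensor:
    """Pre-divide the DoRA magnitude by the norm of the adapted weight.

    DoRA computes W' = magnitude * (W0 + scaling * B @ A) / ||W0 + scaling * B @ A||.
    The norm does not change at inference time, so it can be folded into the
    magnitude vector once, offline.
    """
    adapted = w0 + scaling * (lora_b @ lora_a)   # W0 + delta W, assumed shape (out, in)
    norm = torch.linalg.norm(adapted, dim=1)     # one L2 norm per output channel
    return magnitude / norm                      # normalized magnitude, shape (out,)
```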

Start by obtaining a local copy of your desired DoRA adapter and your base model. We'll use the official NVlabs checkpoint for LLaMA3-8B as an example:

```bash
git clone https://huggingface.co/sliuau/DoRA-weights
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B
```

Next, use the `normalize_weights.py` script to normalize the DoRA magnitude vectors in the adapter checkpoint. The script requires access to both the local adapter weights and the local base model weights:

```bash
export NORMALIZED_DORA_ADAPTER=path/to/normalized/adapter/ckpt

python ./normalize_weights.py -i DoRA-weights/llama_dora_commonsense_checkpoints/LLama3-8B/dora_r32 -b Meta-Llama-3-8B -o $NORMALIZED_DORA_ADAPTER
```

The script will create a new adapter checkpoint with normalized DoRA vectors at the provided path.

Now we can convert our Llama checkpoint and build our TRT engine as described in the Llama examples. When doing so, make sure to pass `--dora_plugin=enable` to the `trtllm-build` command and to enable the LoRA plugin as well:

```bash
export CHECKPOINT_DIR=path/to/trtllm/ckpt
export ENGINE_DIR=path/to/trtllm/engine

python ../llama/convert_checkpoint.py --model_dir Meta-Llama-3-8B \
                                      --output_dir $CHECKPOINT_DIR \
                                      --dtype float16

trtllm-build --checkpoint_dir $CHECKPOINT_DIR \
             --output_dir $ENGINE_DIR \
             --gemm_plugin=auto \
             --lora_plugin=auto \
             --dora_plugin=enable \
             --lora_dir $NORMALIZED_DORA_ADAPTER
```

If you wish, you may provide additional LoRA / DoRA adapters to `trtllm-build`.

NOTE: if you omit `--dora_plugin=enable`, you will not receive any warning even if you provide a DoRA adapter to `--lora_dir`. In that case the DoRA magnitudes are simply ignored during inference, and you may get incorrect output.

Proceed to execute the engine as you would a normal LoRA engine:

```bash
python ../run.py --engine_dir $ENGINE_DIR --tokenizer_dir Meta-Llama-3-8B --lora_task_uids 0 --max_output_len 32 --input_text ...
```

## Usage with Triton Server

Using DoRA over Triton is the same as using LoRA, except that before calling `hf_lora_convert.py` you must first run `normalize_weights.py` and use the resulting normalized adapter.