
Guide to the Qwen2-Audio deployment pipeline

  1. Download the Qwen2-Audio model.

    git lfs install
    export MODEL_PATH="tmp/Qwen2-Audio-7B-Instruct"
    git clone https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct $MODEL_PATH
    
  2. Generate the TensorRT engine for the audio encoder.

    export ENGINE_DIR="./trt_engines/qwen2audio/fp16"
    python3 ../multimodal/build_multimodal_engine.py --model_type qwen2_audio --model_path $MODEL_PATH --max_batch_size 32 --output_dir ${ENGINE_DIR}/audio
    

    The TensorRT engine will be generated under ${ENGINE_DIR}/audio.

  3. Build the Qwen2 LLM TensorRT engine.

  • Convert checkpoint

    1. Install packages
    pip install -r requirements.txt
    
    2. Convert the FP16 checkpoint
    python3 ../qwen/convert_checkpoint.py --model_dir=$MODEL_PATH \
            --dtype=float16 \
            --output_dir=./tllm_checkpoint_1gpu_fp16
    

    3. (Optional) Convert an INT8 weight-only checkpoint

    python3 ../qwen/convert_checkpoint.py --model_dir=$MODEL_PATH \
            --dtype=float16 \
            --use_weight_only \
            --weight_only_precision=int8 \
            --output_dir=./tllm_checkpoint_1gpu_fp16_wo8
    
  • Build TensorRT-LLM engine

    NOTE: max_prompt_embedding_table_size = query_token_num * max_batch_size; therefore, if you change --max_batch_size, you must adjust --max_prompt_embedding_table_size accordingly.

    trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu_fp16 \
                 --gemm_plugin=float16 --gpt_attention_plugin=float16 \
                 --max_batch_size=1 --max_prompt_embedding_table_size=4096 \
                 --output_dir=${ENGINE_DIR}/llm
    

    The built Qwen engines are located in ${ENGINE_DIR}/llm.
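
    The sizing note above is simple arithmetic; this sketch only illustrates the relation, with `query_token_num` assumed to be the per-request audio placeholder budget (4096 matches the build command above with `max_batch_size=1`):

    ```python
    # Sketch of the sizing rule: the prompt-embedding table must hold the
    # audio placeholder tokens for every request in a batch.
    # query_token_num is an assumed per-request budget, not a fixed constant.
    query_token_num = 4096
    max_batch_size = 1
    max_prompt_embedding_table_size = query_token_num * max_batch_size
    print(max_prompt_embedding_table_size)  # 4096 with the values above
    ```

    Doubling `max_batch_size` would therefore require doubling `--max_prompt_embedding_table_size` as well.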

    You can also replace --checkpoint_dir with the INT8 weight-only checkpoint to build an INT8 weight-only engine. For more information about Qwen, refer to the README.md in examples/qwen.

  4. Assemble everything into the Qwen2-Audio pipeline.

    4.1 Run with FP16 LLM engine

    python3 run.py \
        --tokenizer_dir=$MODEL_PATH \
        --engine_dir=${ENGINE_DIR}/llm \
        --audio_engine_path=${ENGINE_DIR}/audio/model.engine \
        --audio_url='./audio/glass-breaking-151256.mp3'
    

    4.2 (Optional) For multiple rounds of dialogue, you can run:

    python3 run_chat.py \
        --tokenizer_dir=$MODEL_PATH \
        --engine_dir=${ENGINE_DIR}/llm \
        --audio_engine_path=${ENGINE_DIR}/audio/model.engine \
        --max_new_tokens=256
    

    Note:

    • This example supports reusing the KV cache for audio segments by assigning unique audio IDs.
    • To further optimize performance, users can also cache the audio features (the encoder output) to bypass the audio encoder when the original audio data is unchanged.
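
    The feature-caching idea above can be sketched as a simple lookup keyed by audio ID. This is illustrative only: `encode_audio` is a hypothetical stand-in for the TensorRT audio-encoder call, and the real pipeline lives in run.py / run_chat.py.

    ```python
    # Illustrative sketch: cache audio encoder output keyed by a
    # user-assigned audio ID, so unchanged audio skips re-encoding.

    feature_cache = {}

    def get_audio_features(audio_id, audio_data, encode_audio):
        """Return cached encoder output for audio_id, encoding on first use."""
        if audio_id not in feature_cache:
            feature_cache[audio_id] = encode_audio(audio_data)
        return feature_cache[audio_id]
    ```

    On a cache hit the (hypothetical) encoder is never invoked, which is the optimization the note describes.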