# Guide to Qwen-VL deployment pipeline
1. Download the Qwen vision-language model (Qwen-VL).

    ```bash
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen-VL-Chat
    ```

2. Generate the Vision Transformer (ViT) ONNX model and the TensorRT engine.

    - If you do not have an ONNX file yet, run:

      ```bash
      python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat
      ```

      The ONNX file and the TensorRT engine are generated under `./onnx/visual_encoder` and `./plan/visual_encoder` respectively.

    - If you already have an ONNX file under `./onnx/visual_encoder` and want to build a TensorRT engine with it, run:

      ```bash
      python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat --only_trt
      ```

      This command saves the test image tensor to `image.pt` for later pipeline inference.
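    Before wiring up the full pipeline, you can sanity-check the exported test image tensor. The sketch below assumes `image.pt` was written with `torch.save`, as the `.pt` extension suggests:

    ```python
    # Inspect the test image tensor saved by vit_onnx_trt.py (sketch; assumes
    # image.pt was produced with torch.save, as the .pt extension suggests).
    import torch

    image = torch.load("image.pt", map_location="cpu")
    print(type(image))               # expected: torch.Tensor
    print(image.shape, image.dtype)  # the batched image input fed to the ViT engine
    ```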
3. Build the INT4-GPTQ Qwen TensorRT engine.

    - Quantize the weights to INT4 with GPTQ.

      Install the required packages:

      ```bash
      pip install -r requirements.txt
      ```

      Quantize the weights:

      ```bash
      python3 gptq_convert.py --pretrained_model_dir ./Qwen-VL-Chat \
                              --quantized_model_dir ./Qwen-VL-Chat-4bit
      ```
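    You can quickly confirm the quantized checkpoint was written before starting the much longer engine build. A minimal peek, assuming `gptq_convert.py` produced a standard safetensors file at the path below:

    ```python
    # Peek at the GPTQ checkpoint (sketch; assumes gptq_convert.py wrote a
    # standard safetensors file at this path).
    from safetensors import safe_open

    path = "./Qwen-VL-Chat-4bit/gptq_model-4bit-128g.safetensors"
    with safe_open(path, framework="pt") as f:
        for name in list(f.keys())[:5]:                # first few tensors only
            print(name, f.get_slice(name).get_shape())
    ```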
    - Build the TensorRT-LLM engine.

      NOTE: `max_prompt_embedding_table_size = query_token_num * max_batch_size`, so if you change `max_batch_size`, `--max_prompt_embedding_table_size` must be adjusted accordingly.

      ```bash
      python3 ../qwen/build.py --model_dir=Qwen-VL-Chat \
              --quant_ckpt_path=./Qwen-VL-Chat-4bit/gptq_model-4bit-128g.safetensors \
              --dtype float16 \
              --max_batch_size 8 \
              --max_input_len 2048 \
              --max_output_len 1024 \
              --remove_input_padding \
              --use_gpt_attention_plugin float16 \
              --use_gemm_plugin float16 \
              --use_weight_only \
              --weight_only_precision int4_gptq \
              --per_group \
              --enable_context_fmha \
              --log_level verbose \
              --use_lookup_plugin float16 \
              --max_prompt_embedding_table_size 2048 \
              --output_dir=./trt_engines/Qwen-VL-7B-Chat-int4-gptq
      # --max_prompt_embedding_table_size 2048 = 256 (query_token number) * 8 (max_batch_size)
      ```

      The built Qwen engines are located in `./trt_engines/Qwen-VL-7B-Chat-int4-gptq`. For more information about Qwen, refer to the README.md in `examples/qwen`.
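    The arithmetic behind the NOTE above, spelled out with the values from the build command:

    ```python
    # max_prompt_embedding_table_size = query_token_num * max_batch_size
    query_token_num = 256  # visual query tokens Qwen-VL produces per image
    max_batch_size = 8     # must match --max_batch_size in build.py
    print(query_token_num * max_batch_size)  # -> 2048, the value passed above
    ```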
4. Assemble everything into the Qwen-VL pipeline.

    4.1 Run with the INT4-GPTQ weight-only quantization engine:

    ```bash
    python3 run.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat-int4-gptq \
        --vit_engine_dir=./plan \
        --images_path='{"image": "./pics/demo.jpeg"}' \
        --input_dir='{"image": "image.pt"}'
    ```
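    Note that `--images_path` and `--input_dir` take small JSON objects passed as strings. The sketch below illustrates how such a value round-trips; it is not the scripts' actual parsing code:

    ```python
    # How a JSON-valued CLI flag like --images_path round-trips (illustration
    # only; run.py's actual argument handling may differ).
    import json

    images_path = '{"image": "./pics/demo.jpeg"}'  # value as passed on the CLI
    mapping = json.loads(images_path)
    print(mapping["image"])                        # -> ./pics/demo.jpeg
    ```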
    4.2 (Optional) For multiple rounds of dialogue, run:

    ```bash
    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat-int4-gptq \
        --vit_engine_dir=./plan \
        --images_path='{"image": "./pics/demo.jpeg"}' \
        --input_dir='{"image": "image.pt"}'
    ```

    4.3 (Optional) To show the bounding-box result on the demo picture, install OpenCV, ZMQ, and requests:

    ```bash
    pip install opencv-python==4.5.5.64
    pip install opencv-python-headless==4.5.5.64
    pip install zmq
    pip install requests
    ```
    4.3.1 If the program runs on a remote machine, first run the following command on your local machine:

    ```bash
    python3 show_pic.py --ip=127.0.0.1 --port=8006
    ```

    Replace the `ip` and `port` values; `ip` is the IP address of your remote machine.

    Then run the following command on the remote machine:

    ```bash
    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat-int4-gptq \
        --vit_engine_dir=./plan \
        --display \
        --port=8006
    ```

    Replace the `port` value to match the one passed to `show_pic.py`.
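    For intuition, the remote-display mechanism boils down to sending an encoded image over a ZMQ socket and decoding it locally with OpenCV. The sketch below is an illustration under assumed PUB/SUB socket types; `show_pic.py`'s real protocol may differ:

    ```python
    # Local-side sketch: receive one encoded frame over ZMQ and display it.
    # Assumed PUB/SUB sockets and raw JPEG bytes; show_pic.py may differ.
    import cv2
    import numpy as np
    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.SUB)
    sock.connect("tcp://127.0.0.1:8006")  # remote machine IP and port
    sock.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to all messages

    data = sock.recv()                    # one encoded frame from the remote side
    img = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
    cv2.imshow("Qwen-VL result", img)
    cv2.waitKey(0)
    ```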
    4.3.2 If the program runs on the local machine, run:

    ```bash
    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat-int4-gptq \
        --vit_engine_dir=./plan \
        --display \
        --local_machine
    ```

    The question "Print the bounding box of the girl" is displayed, and you should see the following image:
