Guide to Qwen-VL deployment pipeline

  1. Download the Qwen vision-language model (Qwen-VL).
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen-VL-Chat
    
  2. Generate the Vision Transformer (ViT) ONNX model and the TensorRT engine.
  • If you don't have an ONNX file, run:

    python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat
    

    The ONNX model and the TensorRT engine are generated under ./onnx/visual_encoder and ./plan/visual_encoder, respectively.

  • If you already have an ONNX file under ./onnx/visual_encoder and want to build a TensorRT engine with it, run:

    python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat --only_trt
    

    This command saves the test image tensor to image.pt for later pipeline inference.
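
    The choice between the two commands above can be scripted. The following is a minimal sketch, not part of the example scripts: the helper name vit_command is hypothetical, and it assumes that any *.onnx file under ./onnx/visual_encoder counts as an existing export. It only builds the argument list; it does not run anything.

```python
from pathlib import Path


def vit_command(model_dir="./Qwen-VL-Chat", onnx_dir="./onnx/visual_encoder"):
    """Return the vit_onnx_trt.py argument list for the current state of onnx_dir."""
    cmd = ["python3", "vit_onnx_trt.py", "--pretrained_model_path", model_dir]
    onnx_path = Path(onnx_dir)
    # Assumption: any *.onnx file in onnx_dir means the export step can be skipped.
    if onnx_path.is_dir() and any(onnx_path.glob("*.onnx")):
        cmd.append("--only_trt")
    return cmd


# Print the command that would be run from the current directory.
print(" ".join(vit_command()))
```

    The returned list can be passed to subprocess.run(cmd, check=True) once the printed command looks right.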

  3. Build the Qwen TensorRT engine.
  • Convert checkpoint

    1. Install packages
    pip install -r requirements.txt
    
    2. Convert
    python3 ./examples/qwen/convert_checkpoint.py --model_dir=./Qwen-VL-Chat \
            --output_dir=./tllm_checkpoint_1gpu \
            --dtype float16
    
  • Build TensorRT-LLM engine

    NOTE: max_prompt_embedding_table_size = query_token_num * max_batch_size. If you change --max_batch_size, update --max_prompt_embedding_table_size to match. The command below uses a batch size of 8 and a table size of 2048, which corresponds to a query_token_num of 256.

    trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu \
                 --gemm_plugin=float16 --gpt_attention_plugin=float16 \
                 --max_input_len=2048 --max_seq_len=3072 \
                 --max_batch_size=8 --max_prompt_embedding_table_size=2048 \
                 --remove_input_padding=enable \
                 --output_dir=./trt_engines/Qwen-VL-7B-Chat
    

    The built Qwen engines are located in ./trt_engines/Qwen-VL-7B-Chat. For more information about Qwen, refer to the README.md in examples/qwen.
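
    To make the NOTE above concrete, here is a small worked example of the table-size rule. It is a sketch only: the helper name prompt_table_size is hypothetical, and the query_token_num value of 256 is inferred from the build command (2048 / 8), not read from the model configuration.

```python
def prompt_table_size(query_token_num, max_batch_size):
    """max_prompt_embedding_table_size = query_token_num * max_batch_size."""
    return query_token_num * max_batch_size


# Inferred from the build command: a table size of 2048 with a batch size of 8.
QUERY_TOKEN_NUM = 2048 // 8

# Print matching flag pairs for a few batch sizes.
for batch in (1, 4, 8, 16):
    size = prompt_table_size(QUERY_TOKEN_NUM, batch)
    print(f"--max_batch_size={batch} --max_prompt_embedding_table_size={size}")
```

    Doubling the batch size from 8 to 16, for example, means passing a table size of 4096 instead of 2048.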

  4. Assemble everything into the Qwen-VL pipeline.

    4.1 Run with the float16 engine built above

    python3 run.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --images_path='{"image": "./pics/demo.jpeg"}'
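
    The --images_path value above is a JSON string, which is easy to get wrong when quoting by hand. The sketch below builds it with json.dumps instead. The helper name images_path_arg is hypothetical, and the single "image" key simply mirrors the command above; other accepted shapes are not assumed here.

```python
import json
import shlex


def images_path_arg(image_path):
    """Build the --images_path option as a JSON string with one "image" key."""
    return "--images_path=" + json.dumps({"image": image_path})


# Argument list mirroring the run.py command; suitable for subprocess.run.
argv = [
    "python3", "run.py",
    "--tokenizer_dir=./Qwen-VL-Chat",
    "--qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat",
    "--vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan",
    images_path_arg("./pics/demo.jpeg"),
]

# shlex.quote adds the shell quoting needed to paste the command into a terminal.
print(" ".join(shlex.quote(a) for a in argv))
```

    Passing argv as a list to subprocess.run avoids shell quoting altogether; the quoted form is only needed when copying the command into a shell.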
    

    4.2 (Optional) For multiple rounds of dialogue, you can run:

    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --images_path='{"image": "./pics/demo.jpeg"}'
    

    4.3 (Optional) To show the bounding box result in the demo picture, install OpenCV, ZMQ, and requests:

    pip install opencv-python==4.5.5.64
    pip install opencv-python-headless==4.5.5.64
    pip install zmq
    pip install requests
    

      4.3.1 If run_chat.py runs on a remote machine, run the following command on the local machine:

    python3 show_pic.py --ip=127.0.0.1 --port=8006
    

      Replace the ip and port values with your own; ip is the IP address of the remote machine.

      Run the following command on the remote machine:

    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --display \
        --port=8006
    

      Replace the port value with the port you passed to show_pic.py.

      4.3.2 If run_chat.py runs on the local machine, run the following command:

    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --display \
        --local_machine
    

      The question "Print the bounding box of the girl" is displayed. You should see the following image:

    [Image: the demo picture with the bounding box drawn on it]
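
      For readers who want to post-process the bounding box themselves, the sketch below parses a box from a reply string and converts it to pixel coordinates. It rests on assumptions that are not stated in this guide: that the reply uses the <ref>...</ref><box>(x1,y1),(x2,y2)</box> markup described for Qwen-VL, and that coordinates are normalized to a 0 to 1000 range. The helper names and the sample reply are hypothetical.

```python
import re

# Assumed reply markup: <ref>label</ref><box>(x1,y1),(x2,y2)</box>
BOX_RE = re.compile(r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")


def parse_boxes(text):
    """Return a list of (label, (x1, y1, x2, y2)) tuples found in text."""
    return [
        (m.group(1).strip(), tuple(int(v) for v in m.groups()[1:]))
        for m in BOX_RE.finditer(text)
    ]


def to_pixels(box, width, height, scale=1000):
    """Map a box from the assumed 0..scale range to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width // scale, y1 * height // scale,
            x2 * width // scale, y2 * height // scale)


# Made-up reply, for illustration only.
reply = "<ref>the girl</ref><box>(100,200),(500,800)</box>"
for label, box in parse_boxes(reply):
    print(label, to_pixels(box, width=2000, height=1000))
```

      The pixel tuple can then be handed to a drawing call such as cv2.rectangle, using the first two values as the top-left corner and the last two as the bottom-right corner.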