Guide to Qwen-VL deployment pipeline

  1. Download the Qwen vision-language model (Qwen-VL).
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen-VL-Chat
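    If git-lfs is not available, the same checkpoint can also be fetched with the huggingface_hub Python package. This is a hedged alternative to the commands above, not part of the original instructions; the target directory simply mirrors the ./Qwen-VL-Chat path used in later steps.

    # Alternative download path (assumes `pip install huggingface_hub`).
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id="Qwen/Qwen-VL-Chat", local_dir="./Qwen-VL-Chat")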
    
  2. Generate the Vision Transformer (ViT) ONNX model and the TensorRT engine.
  • If you don't have an ONNX file yet, run:

    python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat
    

    The ONNX model and the TensorRT engine are generated under ./onnx/visual_encoder and ./plan/visual_encoder, respectively.

  • If you already have an ONNX file under ./onnx/visual_encoder and want to build a TensorRT engine with it, run:

    python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat --only_trt
    

    This command saves the test image tensor to image.pt for later pipeline inference.
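    The snippet below is an optional sanity check, not part of the original instructions: it verifies that the exported ONNX graph is well-formed and that the generated plan file deserializes on the current GPU and TensorRT version. The ONNX filename under ./onnx/visual_encoder is an assumption; adjust it to whatever vit_onnx_trt.py actually wrote.

    import onnx
    import tensorrt as trt

    # Check the exported ONNX graph (filename is assumed; list ./onnx/visual_encoder to confirm).
    onnx.checker.check_model(onnx.load("./onnx/visual_encoder/visual_encoder.onnx"))

    # Make sure the serialized ViT engine can be deserialized here.
    logger = trt.Logger(trt.Logger.WARNING)
    with open("./plan/visual_encoder/visual_encoder_fp16.plan", "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    assert engine is not None, "failed to deserialize the ViT engine"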

  3. Build the Qwen TensorRT engine.
  • Convert checkpoint

    1. Install the required packages
    pip install -r requirements.txt
    
    2. Convert the checkpoint
    python3 ./examples/models/core/qwen/convert_checkpoint.py --model_dir=./Qwen-VL-Chat \
            --output_dir=./tllm_checkpoint_1gpu \
            --dtype float16
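    Optionally, confirm the conversion produced what you expect before building the engine. This is a hedged check rather than part of the official flow; it assumes the converted checkpoint contains the usual TensorRT-LLM config.json with a top-level dtype field.

    import json

    # Inspect the converted checkpoint written by convert_checkpoint.py.
    with open("./tllm_checkpoint_1gpu/config.json") as f:
        cfg = json.load(f)
    print(cfg.get("dtype"))  # expected: "float16", matching --dtype above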
    
  • Build TensorRT LLM engine

    NOTE: max_prompt_embedding_table_size = query_token_num * max_batch_size, so if you change --max_batch_size, you must adjust --max_prompt_embedding_table_size accordingly (see the sanity check at the end of this step).

    trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu \
                 --gemm_plugin=float16 --gpt_attention_plugin=float16 \
                 --max_input_len=2048 --max_seq_len=3072 \
                 --max_batch_size=8 --max_prompt_embedding_table_size=2048 \
                 --remove_input_padding=enable \
                 --output_dir=./trt_engines/Qwen-VL-7B-Chat
    

    The built Qwen engine is located in ./trt_engines/Qwen-VL-7B-Chat. For more information about Qwen, refer to the README.md in examples/models/core/qwen.
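    As a quick sanity check on the NOTE above (a sketch, not part of the original instructions): the build command reserves 2048 prompt-table slots for a batch size of 8, i.e. 256 query tokens per request, so --max_prompt_embedding_table_size has to be recomputed whenever --max_batch_size changes.

    # query_token_num is inferred from the build command above: 2048 / 8 = 256.
    QUERY_TOKEN_NUM = 256

    def prompt_table_size(max_batch_size: int) -> int:
        return QUERY_TOKEN_NUM * max_batch_size

    print(prompt_table_size(8))   # 2048 -> matches --max_prompt_embedding_table_size above
    print(prompt_table_size(16))  # 4096 -> use this value if you double --max_batch_size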

  4. Assemble everything into the Qwen-VL pipeline.

    4.1 Run the pipeline with the engine built in step 3:

    python3 run.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --images_path='{"image": "./pics/demo.jpeg"}'
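    Before launching, it can help to confirm that every artifact produced in steps 1-3 is in place. This is an optional helper sketch, not part of run.py; the paths are exactly the ones used in the command above.

    import os

    artifacts = [
        "./Qwen-VL-Chat",                                  # tokenizer / HF checkpoint (step 1)
        "./trt_engines/Qwen-VL-7B-Chat",                   # Qwen TensorRT engine (step 3)
        "./plan/visual_encoder/visual_encoder_fp16.plan",  # ViT engine (step 2)
        "./pics/demo.jpeg",                                # demo image
    ]
    for path in artifacts:
        assert os.path.exists(path), f"missing artifact: {path}"
    print("all pipeline artifacts found")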
    

    4.2 (Optional) For multiple rounds of dialogue, you can run:

    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --images_path='{"image": "./pics/demo.jpeg"}'
    

    4.3 (Optional) To show the bounding-box result on the demo picture, install OpenCV, ZeroMQ, and requests:

    pip install opencv-python==4.5.5.64
    pip install opencv-python-headless==4.5.5.64
    pip install zmq
    pip install requests
    

      4.3.1 If the pipeline runs on a remote machine, first run the following command on your local machine:

    python3 show_pic.py --ip=127.0.0.1 --port=8006
    

      Replace the ip and port values; ip is the IP address of the remote machine.

      Run the following command on the remote machine:

    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --display \
        --port=8006
    

      Replace the port value so that it matches the one passed to show_pic.py.
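      For reference, the remote display path amounts to streaming an encoded image over a ZeroMQ socket from run_chat.py on the remote machine to show_pic.py on the local one. The sketch below only illustrates that idea; the socket type (PUSH/PULL) and JPEG payload are assumptions, and the actual message format is defined by show_pic.py and run_chat.py.

    import cv2
    import numpy as np
    import zmq

    def send_frame(frame, port=8006):
        """Remote side (run_chat.py --display --port): bind and push an encoded frame."""
        sock = zmq.Context.instance().socket(zmq.PUSH)
        sock.bind(f"tcp://*:{port}")
        ok, buf = cv2.imencode(".jpg", frame)
        assert ok, "JPEG encoding failed"
        sock.send(buf.tobytes())

    def show_frames(ip, port=8006):
        """Local side (show_pic.py --ip --port): connect to the remote machine and display frames."""
        sock = zmq.Context.instance().socket(zmq.PULL)
        sock.connect(f"tcp://{ip}:{port}")
        while True:
            data = sock.recv()
            frame = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
            cv2.imshow("qwen-vl", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break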

      4.3.2 If the pipeline runs on the local machine, run the following command:

    python3 run_chat.py \
        --tokenizer_dir=./Qwen-VL-Chat \
        --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
        --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
        --display \
        --local_machine
    

      The demo asks the question "Print the bounding box of the girl", and you should see the following image:

    [Image: demo picture with the predicted bounding box drawn around the girl]
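    For context on what section 4.3 renders: Qwen-VL marks grounded objects in its answer with <box>(x1,y1),(x2,y2)</box> tags whose coordinates are normalized to a 0-1000 grid, so drawing the result is a matter of rescaling to the image size. The sketch below is a hedged illustration of that post-processing; the parsing regex and the sample answer string are assumptions, not code taken from this example.

    import re
    import cv2

    def draw_boxes(image_path, answer, out_path="demo_with_box.jpg"):
        """Parse <box>(x1,y1),(x2,y2)</box> spans from the model answer and draw them."""
        img = cv2.imread(image_path)
        h, w = img.shape[:2]
        for x1, y1, x2, y2 in re.findall(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>", answer):
            # Coordinates are assumed to be normalized to [0, 1000); rescale to pixels.
            pt1 = (int(x1) * w // 1000, int(y1) * h // 1000)
            pt2 = (int(x2) * w // 1000, int(y2) * h // 1000)
            cv2.rectangle(img, pt1, pt2, color=(0, 0, 255), thickness=2)
        cv2.imwrite(out_path, img)

    # Hypothetical answer string purely for illustration.
    draw_boxes("./pics/demo.jpeg", "<box>(280,120),(760,900)</box>")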