# Guide to the Qwen-VL deployment pipeline
1. Download the Qwen vision-language model (Qwen-VL).

   ```bash
   git lfs install
   git clone https://huggingface.co/Qwen/Qwen-VL-Chat
   ```

2. Generate the Vision Transformer (ViT) ONNX model and the TensorRT engine.

   - If you don't have an ONNX file, run:

     ```bash
     python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat
     ```

     The ONNX model and the TensorRT engine are generated under `./onnx/visual_encoder` and `./plan/visual_encoder`, respectively.

   - If you already have an ONNX file under `./onnx/visual_encoder` and want to build a TensorRT engine with it, run:

     ```bash
     python3 vit_onnx_trt.py --pretrained_model_path ./Qwen-VL-Chat --only_trt
     ```

     This command saves the test image tensor to `image.pt` for later pipeline inference; a quick sanity check of these artifacts is sketched below.
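   A minimal sketch to sanity-check the artifacts from this step (the printed tensor shape depends on the test image and is not a guaranteed value):

   ```python
   # Sanity-check the artifacts produced by vit_onnx_trt.py.
   import tensorrt as trt
   import torch

   # Inspect the test image tensor saved for later pipeline inference.
   image = torch.load("image.pt")
   print(type(image), getattr(image, "shape", None))

   # Deserialize the generated ViT engine to confirm it is loadable.
   logger = trt.Logger(trt.Logger.WARNING)
   with open("./plan/visual_encoder/visual_encoder_fp16.plan", "rb") as f:
       engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
   print("ViT engine loaded:", engine is not None)
   ```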
3. Build the Qwen TensorRT engine.

   - Convert the checkpoint.

     - Install packages:

       ```bash
       pip install -r requirements.txt
       ```

     - Convert:

       ```bash
       python3 ./examples/qwen/convert_checkpoint.py --model_dir=./Qwen-VL-Chat \
           --output_dir=./tllm_checkpoint_1gpu \
           --dtype float16
       ```

   - Build the TensorRT-LLM engine.

     NOTE: `max_prompt_embedding_table_size = query_token_num * max_batch_size`; therefore, if you change `max_batch_size`, `--max_prompt_embedding_table_size` must be adjusted accordingly (a worked example follows this step).

     ```bash
     trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu \
         --gemm_plugin=float16 --gpt_attention_plugin=float16 \
         --max_input_len=2048 --max_seq_len=3072 \
         --max_batch_size=8 --max_prompt_embedding_table_size=2048 \
         --remove_input_padding=enable \
         --output_dir=./trt_engines/Qwen-VL-7B-Chat
     ```

     The built Qwen engines are located in `./trt_engines/Qwen-VL-7B-Chat`. For more information about Qwen, refer to the README.md in `examples/qwen`.
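   The NOTE above can be checked with a quick calculation. A minimal sketch, assuming `query_token_num = 256` visual tokens per request (inferred from the flags above, since 2048 / 8 = 256; not an official constant):

   ```python
   # Illustration of the sizing rule from the NOTE:
   #   max_prompt_embedding_table_size = query_token_num * max_batch_size
   QUERY_TOKEN_NUM = 256  # assumed visual tokens per request (2048 / 8 from the build above)

   def prompt_table_size(max_batch_size: int, query_token_num: int = QUERY_TOKEN_NUM) -> int:
       """Value to pass as --max_prompt_embedding_table_size to trtllm-build."""
       return query_token_num * max_batch_size

   assert prompt_table_size(8) == 2048  # matches the trtllm-build command above
   print(prompt_table_size(16))         # 4096: required value if max_batch_size is raised to 16
   ```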
4. Assemble everything into the Qwen-VL pipeline.
   4.1 Run the pipeline with the engines built above:

   ```bash
   python3 run.py \
       --tokenizer_dir=./Qwen-VL-Chat \
       --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
       --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
       --images_path='{"image": "./pics/demo.jpeg"}'
   ```

   4.2 (Optional) For multiple rounds of dialogue, run:

   ```bash
   python3 run_chat.py \
       --tokenizer_dir=./Qwen-VL-Chat \
       --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
       --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
       --images_path='{"image": "./pics/demo.jpeg"}'
   ```

   4.3 (Optional) To show the bounding box result in the demo picture, install OpenCV, ZMQ, and requests (a minimal sketch of the image transport used for remote display appears at the end of this guide):
   ```bash
   pip install opencv-python==4.5.5.64
   pip install opencv-python-headless==4.5.5.64
   pip install zmq
   pip install requests
   ```

   4.3.1 If the program runs on a remote machine, first run the following command on your local machine:

   ```bash
   python3 show_pic.py --ip=127.0.0.1 --port=8006
   ```

   Replace the `ip` and `port` values, where `ip` is the IP address of your remote machine.

   Then run the following command on the remote machine:

   ```bash
   python3 run_chat.py \
       --tokenizer_dir=./Qwen-VL-Chat \
       --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
       --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
       --display \
       --port=8006
   ```

   Replace the `port` value to match the one passed to `show_pic.py`.

   4.3.2 If the program runs on the local machine, run the following command:
   ```bash
   python3 run_chat.py \
       --tokenizer_dir=./Qwen-VL-Chat \
       --qwen_engine_dir=./trt_engines/Qwen-VL-7B-Chat \
       --vit_engine_path=./plan/visual_encoder/visual_encoder_fp16.plan \
       --display \
       --local_machine
   ```

   The question "Print the bounding box of the girl" is displayed, and you should see the following image:

   