# Multi-Modal
This document shows how to run multimodal pipelines with TensorRT-LLM, e.g. from image+text input modalities to text output.
## BLIP + T5
- Download Huggingface weights and convert the original checkpoint to TRT-LLM checkpoint format, following the example in `examples/enc_dec/README.md`.

  ```bash
  export MODEL_NAME=flan-t5-xl
  git clone https://huggingface.co/google/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}

  python ../enc_dec/t5/convert.py -i tmp/hf_models/${MODEL_NAME} \
      -o tmp/hf_models/${MODEL_NAME} --weight_data_type float32 \
      --inference_tensor_para_size 1
  ```
- Build the TRT-LLM engine from the TRT-LLM checkpoint.

  Add the parameter `--max_prompt_embedding_table_size`, which is not present in regular LLM build commands. `max_prompt_embedding_table_size = visual_feature_dim * max_batch_size`, so if you change `max_batch_size`, the prompt table size must be adjusted accordingly (see the sketch after this list).

  ```bash
  python ../enc_dec/build.py --model_type t5 \
      --weight_dir tmp/hf_models/${MODEL_NAME} \
      --output_dir trt_engines/${MODEL_NAME}/1-gpu \
      --engine_name ${MODEL_NAME} \
      --remove_input_padding \
      --use_bert_attention_plugin \
      --use_gpt_attention_plugin \
      --use_gemm_plugin \
      --use_rmsnorm_plugin \
      --dtype bfloat16 \
      --max_beam_width 1 \
      --max_batch_size 8 \
      --max_prompt_embedding_table_size 256 # 32 (visual_feature_dim) * 8 (max_batch_size)
  ```

  The built T5 engines are located in `./trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1`.
- Build TensorRT engines for the visual components.

  ```bash
  python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME}
  ```

  The built engines are located in `./visual_engines/${MODEL_NAME}`.
- Assemble everything into the BLIP pipeline.

  ```bash
  python run.py \
      --max_new_tokens 30 \
      --input_text "Question: which city is this? Answer:" \
      --hf_model_dir tmp/hf_models/${MODEL_NAME} \
      --visual_engine_dir visual_engines/${MODEL_NAME} \
      --llm_engine_dir trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1
  ```
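The prompt-table sizing from the build step above can be checked with a quick shell calculation. A minimal sketch, using the values from the BLIP + T5 build command (visual feature dimension 32, batch size 8); substitute your own numbers if they differ:

```bash
# Sketch: recompute --max_prompt_embedding_table_size when max_batch_size changes.
# VISUAL_FEATURE_DIM=32 matches the comment in the T5 build command above;
# it is model-specific, so use your own value for other visual encoders.
VISUAL_FEATURE_DIM=32
MAX_BATCH_SIZE=8
echo "max_prompt_embedding_table_size = $(( VISUAL_FEATURE_DIM * MAX_BATCH_SIZE ))"  # prints 256
```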
## BLIP + OPT
The OPT pipeline needs a few minor changes from the T5 pipeline:
- Convert Huggingface weights to TRT-LLM checkpoint format following `examples/opt/README.md`.

- Use the `trtllm-build` command to build the TRT-LLM engine for OPT.

- Add the `--decoder_llm` argument to the inference script, since OPT is a decoder-only LLM.

- The full list of commands is as follows:

  ```bash
  export MODEL_NAME=opt-2.7b
  git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}

  python ../opt/convert_checkpoint.py \
      --model_dir tmp/hf_models/${MODEL_NAME} \
      --dtype float16 \
      --output_dir tmp/hf_models/${MODEL_NAME}/c-model/fp16/1-gpu

  trtllm-build \
      --checkpoint_dir tmp/hf_models/${MODEL_NAME}/c-model/fp16/1-gpu \
      --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --use_gpt_attention_plugin float16 \
      --use_gemm_plugin float16 \
      --max_input_len 924 \
      --max_output_len 100 \
      --max_beam_width 1 \
      --max_batch_size 8 \
      --max_prompt_embedding_table_size 256

  python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME}

  python run.py \
      --max_new_tokens 30 \
      --input_text "Question: which city is this? Answer:" \
      --hf_model_dir tmp/hf_models/${MODEL_NAME} \
      --visual_engine_dir visual_engines/${MODEL_NAME} \
      --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --decoder_llm
  ```
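Once the engines are built, the inference step can be reused for multiple prompts without rebuilding anything. A minimal sketch, assuming the OPT engines built by the commands above; the questions themselves are placeholder examples:

```bash
# Sketch: reuse the engines built above for several prompts (placeholder questions).
QUESTIONS=(
  "Question: which city is this? Answer:"
  "Question: what is shown in the image? Answer:"
)
for q in "${QUESTIONS[@]}"; do
  python run.py \
      --max_new_tokens 30 \
      --input_text "$q" \
      --hf_model_dir tmp/hf_models/${MODEL_NAME} \
      --visual_engine_dir visual_engines/${MODEL_NAME} \
      --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --decoder_llm
done
```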
## LLaVA
- Download and install the LLaVA library. Rebuild TRT-LLM after installing LLaVA.

  ```bash
  git clone https://github.com/haotian-liu/LLaVA.git
  sudo pip install -e LLaVA
  ```
- Download Huggingface model weights. This model has both LLM and visual components, unlike the BLIP example, which downloads only LLM components from Huggingface.

  ```bash
  export MODEL_NAME="llava-v1.5-7b"
  git clone https://huggingface.co/liuhaotian/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
  ```
- Generate the TRT-LLM engine for LLaMA following the example in `examples/llama/README.md`.

  ```bash
  python ../llama/build.py \
      --model_dir tmp/hf_models/${MODEL_NAME} \
      --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --dtype float16 \
      --remove_input_padding \
      --use_gpt_attention_plugin float16 \
      --enable_context_fmha \
      --use_gemm_plugin float16 \
      --max_batch_size 1 \
      --max_prompt_embedding_table_size 576 # 576 (visual_feature_dim) * 1 (max_batch_size)
  ```
- Build TensorRT engines for the visual components.

  ```bash
  python build_visual_engine.py --model_name ${MODEL_NAME} --model_path tmp/hf_models/${MODEL_NAME}
  ```
- Add the `--decoder_llm` argument to the inference script, since LLaMA is a decoder-only LLM.

  ```bash
  python run.py \
      --max_new_tokens 30 \
      --input_text "Question: which city is this? Answer:" \
      --hf_model_dir tmp/hf_models/${MODEL_NAME} \
      --visual_engine_dir visual_engines/${MODEL_NAME} \
      --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --decoder_llm
  ```
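Before running the pipeline, it can help to confirm that both engine directories produced by the steps above exist. A minimal sketch, assuming the output paths used in the LLaVA commands above:

```bash
# Sketch: sanity-check that both engine directories exist before calling run.py.
# Paths follow the LLaVA build commands above; adjust if you changed --output_dir.
for dir in visual_engines/${MODEL_NAME} trt_engines/${MODEL_NAME}/fp16/1-gpu; do
  [ -d "$dir" ] && echo "found: $dir" || echo "missing: $dir"
done
```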