<!-- omit from toc -->
# Multi-Modal
This document shows how to run multimodal pipelines with TensorRT-LLM, e.g. from image+text input modalities to text output.
The LLM part of a multimodal model is built with one additional parameter, `--max_multimodal_len`, compared to an LLM-only build command. Under the hood, `max_multimodal_len` and `max_prompt_embedding_table_size` are effectively the same concept: embeddings (either multimodal feature embeddings or prompt tuning embeddings) that are prepended/concatenated to the LLM input embeddings. The multimodal features from the visual encoder, of shape `[batch_size, num_visual_features, visual_hidden_dim]`, are flattened to `[batch_size * num_visual_features, visual_hidden_dim]` and passed like a prompt embedding table.
We first describe how to run each model on a single GPU. We then provide general guidelines on using tensor parallelism for the LLM part of the pipeline.
- [BLIP2](#blip2)
- [CogVLM](#cogvlm)
- [Deplot](#deplot)
- [Fuyu](#fuyu)
- [Kosmos-2](#kosmos-2)
- [LLaVA, LLaVa-NeXT and VILA](#llava-llava-next-and-vila)
- [NeVA](#neva)
- [Nougat](#nougat)
- [Phi-3-vision](#phi-3-vision)
- [Video NeVA](#video-neva)
- [Enabling tensor parallelism for multi-GPU](#enabling-tensor-parallelism-for-multi-gpu)
## BLIP2
This BLIP section covers both BLIP2-OPT and BLIP2-T5, with minor changes needed when switching the LLM backbone.
1. Download Huggingface weights and convert original checkpoint to TRT-LLM checkpoint format
following the examples in `examples/opt/README.md` and `examples/enc_dec/README.md`.
```bash
export MODEL_NAME="blip2-opt-2.7b" # options: blip2-opt-6.7b, blip2-flan-t5-xl, blip2-flan-t5-xxl
git clone https://huggingface.co/Salesforce/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For BLIP2-OPT family,
```bash
python ../opt/convert_checkpoint.py --model_type blip2 \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
```
For BLIP2-T5 family,
```bash
python ../enc_dec/convert_checkpoint.py --model_type blip2 \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
--tp_size 1 \
--pp_size 1 \
--dtype bfloat16
```
2. Build TRT-LLM engine from TRT-LLM checkpoint
For BLIP2-OPT family,
```bash
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--max_beam_width 1 \
--max_batch_size 8 \
--max_seq_len 1024 \
--max_input_len 924 \
--max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)
```
For BLIP2-T5 family,
```bash
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/encoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/encoder \
--paged_kv_cache disable \
--moe_plugin disable \
--enable_xqa disable \
--gemm_plugin bfloat16 \
--bert_attention_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--context_fmha disable \
--max_beam_width 1 \
--max_batch_size 8 \
--max_input_len 924 \
--max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--enable_xqa disable \
--gemm_plugin bfloat16 \
--bert_attention_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--context_fmha disable \
--max_beam_width 1 \
--max_batch_size 8 \
--max_seq_len 1024 \
--max_encoder_input_len 924 \
--max_input_len 1 # Same command for decoder but don't set --max_multimodal_len
```
**NOTE**: `max_multimodal_len = max_batch_size * num_visual_features`, so if you change `max_batch_size`, `max_multimodal_len` **MUST** be changed accordingly.
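For example, a minimal sketch of that bookkeeping for a hypothetical batch size of 16 (BLIP2 produces 32 visual features per image, as noted in the build commands above):
```bash
# Sketch: recompute --max_multimodal_len when changing the batch size.
MAX_BATCH_SIZE=16        # hypothetical new batch size
NUM_VISUAL_FEATURES=32   # visual features per image for BLIP2
echo $(( MAX_BATCH_SIZE * NUM_VISUAL_FEATURES ))  # 512 -> pass as --max_multimodal_len
```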
3. Build TensorRT engines for vision encoders
```bash
python build_visual_engine.py --model_type blip2 --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 8
```
The built engines are located in `tmp/trt_engines/${MODEL_NAME}/vision_encoder`.
To run the BLIP2 pipeline with batch size > 1, change `--max_batch_size` argument to `build_visual_engine.py` accordingly.
4. Assemble everything into BLIP2 pipeline
For BLIP2-OPT family,
```bash
python run.py \
--max_new_tokens 30 \
--input_text "Question: which city is this? Answer:" \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```
For BLIP2-T5 family,
```bash
python run.py \
--max_new_tokens 30 \
--input_text "Question: which city is this? Answer:" \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/bfloat16
```
5. (Optional) INT8/INT4 weight-only quantization for OPT can be enabled using commands as follows (take `INT4` as an example, while `INT8` is the default precision for weight-only quantization):
```bash
python ../opt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--dtype float16 \
--output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--use_weight_only \
--weight_only_precision int4
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
--gemm_plugin float16 \
--max_beam_width 1 \
--max_batch_size 8 \
--max_multimodal_len 256 \
--max_input_len 924 \
--max_seq_len 1024
```
The built OPT engines are located in `tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu`.
You should use this directory as the `--llm_engine_dir` argument to `run.py`.
**NOTE:** INT8/INT4 option is not supported for BLIP2-T5, because quantization support has not been
added for encoder-decoder models yet.
## CogVLM
Currently, CogVLM only supports bfloat16 precision.
1. Download Huggingface weights
```bash
export MODEL_NAME="cogvlm-chat-hf"
git clone https://huggingface.co/THUDM/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export TOKENIZER_NAME="vicuna-7b-v1.5"
git clone https://huggingface.co/lmsys/${TOKENIZER_NAME} tmp/hf_models/${TOKENIZER_NAME}
```
Because ONNX currently doesn't support `xops.memory_efficient_attention`, we need to modify some source code of the Huggingface CogVLM.
```bash
cd tmp/hf_models/${MODEL_NAME}
sed -i '4s/.*//;40s/.*/        out = self.attention(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)).transpose(1, 2).contiguous()/;41s/.*//;42s/.*//' visual.py # replaces memory_efficient_attention with basic ops
cd ../../.. # return to examples/multimodal
```
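The `sed` edit above is line-number sensitive, so an optional sanity check (our own suggestion, not part of the upstream steps) that the patch landed:
```bash
# Hypothetical check: the patched attention call should now appear in visual.py.
grep -n "self.attention(q.transpose(1, 2)" tmp/hf_models/${MODEL_NAME}/visual.py
```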
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/cogvlm`
CogVLM uses a ViT encoder as its vision encoder and a modified LLaMA as its decoder.
```bash
python ../cogvlm/convert_checkpoint.py --model_dir tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_models/${MODEL_NAME} --dtype bfloat16 --use_prompt_tuning
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME} \
--output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
--gemm_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--max_batch_size 48 \
--max_input_len 2048 \
--max_seq_len 3076 \
--paged_kv_cache enable \
--enable_xqa disable \
--bert_attention_plugin disable \
--moe_plugin disable \
--max_multimodal_len 61440 # 48 (max_batch_size) * 1280 (max_num_visual_features)
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_visual_engine.py --model_type cogvlm --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 48
python run.py \
--max_new_tokens 1000 \
--input_text " [INST] please describe this image in detail [/INST] " \
--hf_model_dir tmp/hf_models/${TOKENIZER_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
--batch_size 1 \
--top_p 0.4 \
--top_k 1 \
--temperature 0.2 \
--repetition_penalty 1.2 \
--enable_context_fmha_fp32_acc
```
CogVLM uses `model_runner_cpp` by default. To switch to `model_runner`, set `--use_py_session` in the command mentioned above.
## Deplot
1. Download Huggingface weights and convert original checkpoint to TRT-LLM checkpoint format
following the example in `examples/enc_dec/README.md`.
```bash
export MODEL_NAME="deplot"
git clone https://huggingface.co/google/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../enc_dec/convert_checkpoint.py --model_type pix2struct \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/float16 \
--tp_size 1 \
--pp_size 1 \
--dtype float16
```
2. Build TRT-LLM engine from TRT-LLM checkpoint
```bash
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/float16/decoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--enable_xqa disable \
--gemm_plugin float16 \
--bert_attention_plugin float16 \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--context_fmha disable \
--max_beam_width 1 \
--max_batch_size 8 \
--max_seq_len 2558 \
--max_encoder_input_len 2048 \
--max_input_len 1
```
The built deplot engines are located in `tmp/trt_engines/${MODEL_NAME}/1-gpu/float16`.
3. Build TensorRT engines for visual components
```bash
python build_visual_engine.py --model_type pix2struct --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 8
```
The built visual engines are located in `tmp/trt_engines/${MODEL_NAME}/vision_encoder`.
To run the deplot pipeline with batch size > 1, change `--max_batch_size` argument to `build_visual_engine.py` accordingly.
4. Assemble everything into deplot pipeline
```bash
python run.py \
--max_new_tokens 100 \
--input_text "" \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16
```
## Fuyu
1. Download Huggingface weights
```bash
export MODEL_NAME="fuyu-8b"
git clone https://huggingface.co/adept/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
The LLM portion of Fuyu uses a Persimmon model.
```bash
python ../gpt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16 \
--gpt_variant persimmon
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 2048
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_visual_engine.py --model_type fuyu --model_path tmp/hf_models/${MODEL_NAME}
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```
## Kosmos-2
1. Download Huggingface weights
```bash
export MODEL_NAME="kosmos-2"
git clone https://huggingface.co/microsoft/kosmos-2-patch14-224 tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
```bash
python ../gpt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16 \
--gpt_variant ${MODEL_NAME}
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 512 \
--max_seq_len 1024 \
--max_multimodal_len 64 # 1 (max_batch_size) * 64 (num_visual_features)
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_visual_engine.py --model_type kosmos-2 --model_path tmp/hf_models/${MODEL_NAME}
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```
## LLaVA, LLaVa-NeXT and VILA
[LLaVA](https://github.com/haotian-liu/LLaVA) and [VILA](https://github.com/Efficient-Large-Model/VILA) are both visual language models (VLM) that can be deployed in TensorRT-LLM with many quantization options. [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) is an extension of LLaVA. TRT-LLM currently supports the [Mistral-7B](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and [Nous-Hermes-2-Yi-34B](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) variants of LLaVA-NeXT.
1. Download Huggingface model weights. These models have both visual and LLM components,
unlike the BLIP2 example, which downloads only the LLM components from Huggingface.
For LLaVA,
```bash
export MODEL_NAME="llava-1.5-7b-hf" # also llava-1.5-13b-hf
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For LLaVA-NeXT,
```bash
export MODEL_NAME="llava-v1.6-mistral-7b-hf" #for 34b variant "llava-v1.6-34b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For VILA, we need a few more steps until it is added to the HF model zoo:
```bash
# install the following dependency
pip install -r requirements-vila.txt
# clone original VILA repo
export VILA_PATH="tmp/hf_models/VILA"
git clone https://github.com/Efficient-Large-Model/VILA.git ${VILA_PATH}
# download VILA checkpoints
export MODEL_NAME="vila1.5-3b"
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Generate the TRT-LLM engine for LLaMA following the example in `examples/llama/README.md`.
```bash
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
# for LLaVA
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 576 # 1 (max_batch_size) * 576 (num_visual_features)
# for LLaVA-NeXT
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 5120 \
--max_num_tokens 4096 \
--max_multimodal_len 4096 # both equal 1 (max_batch_size) * 4096 (max_input_len)
# for VILA
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 4096 # 1 (max_batch_size) * 4096 (num_visual_features)
```
3. Build TensorRT engines for visual components
```bash
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type llava # for LLaVA
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type llava_next --max_batch_size 5 # for LLaVA-NeXT; max_batch_size 5 = 1 (max_batch_size) * 5 (the LLaVA-NeXT visual encoder can produce at most 5 patches)
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type vila --vila_path ${VILA_PATH} # for VILA
```
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--input_text "Question: which city is this? Answer:" # for LLaVA and for LLaVA-NeXT
```
For VILA, you can use either local files or web URLs as input images.
Suppose you have a local image `av.png` (downloaded from `https://github.com/Efficient-Large-Model/VILA/blob/main/demo_trt_llm/av.png`) and the URL of `merlion.png`:
```bash
wget -O av.png https://raw.githubusercontent.com/Efficient-Large-Model/VILA/main/demo_images/av.png
python run.py \
--max_new_tokens 100 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--image_path=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
--input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
--batch_size=1 # for VILA mode 1
python run.py \
--max_new_tokens 100 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--image_path=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
--input_text="<image>\n Please elaborate what you see in the images?" \
--batch_size=2 # for VILA mode 2
```
Note that VILA can support different modes in terms of batching:
- Mode 1: if you want to query N images as a whole using a prompt, `--batch_size=1` should be used (which is the default value). Example is given above.
- Mode 2: if you want to query N images individually using the same prompt (replicated), `--batch_size=N` should be used. Don't forget to set `--max_batch_size` and `--max_multimodal_len` accordingly during engine building, as sketched below.
Note: use `--run_profiling` for performance measurement, use `--check_accuracy` for accuracy check.
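A minimal sketch of a mode-2 rebuild for `--batch_size=2`, assuming the per-request limits simply scale with the batch size (all other options as in step 2):
```bash
# Sketch (assumption): resize the VILA engines for batch_size=2.
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type vila --vila_path ${VILA_PATH} --max_batch_size 2
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 2 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 8192 # 2 (max_batch_size) * 4096 (num_visual_features)
```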
4. (Optional) Different quantization methods supported in LLaMA can be applied to LLaVA/VILA as well, such as INT4/INT8 weight-only, SmoothQuant, and INT4 Activation-Aware Quantization (AWQ). Detailed instructions can be found in LLaMA [README](../llama/README.md).
For example,
```bash
# INT4 weight only
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--dtype float16 \
--output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--use_weight_only \
--weight_only_precision int4
# INT4 AWQ
python ../quantization/quantize.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
--dtype float16 \
--qformat int4_awq \
--calib_size 32
```
Then follow the same `trtllm-build` and `run.py` steps as before. NOTE: for `trtllm-build` command, do not use `--use_fused_mlp=enable` in these quantization modes.
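For instance, a minimal sketch of the LLaVA engine build on top of the INT4 weight-only checkpoint above, assuming the same sequence limits as the FP16 build (note the absence of `--use_fused_mlp=enable`):
```bash
# Sketch (assumption): build the LLaVA engine from the INT4 weight-only checkpoint.
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 576 # 1 (max_batch_size) * 576 (num_visual_features)
```
Then pass `tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu` as `--llm_engine_dir` to `run.py`.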
## NeVA
[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/neva/index.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.
1. Generate the TRT-LLM engine for NVGPT following the example in `examples/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
```bash
export MODEL_NAME="neva"
python ../gpt/convert_checkpoint.py \
--nemo_ckpt_path ./${MODEL_NAME}.nemo \
--dtype bfloat16 \
--output_dir tmp/trt_models/${MODEL_NAME} \
--nemo_rename_key model:model.language_model \
attention.linear_qkv.layer_norm_bias:input_layernorm.bias \
attention.linear_qkv.layer_norm_weight:input_layernorm.weight \
mlp.linear_fc1.layer_norm_bias:post_attention_layernorm.bias \
mlp.linear_fc1.layer_norm_weight:post_attention_layernorm.weight \
linear_qkv:query_key_value \
linear_fc1:dense_h_to_4h \
linear_fc2:dense_4h_to_h \
linear_proj:dense \
decoder:encoder
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME} \
--output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 729 # 1 (max_batch_size) * 729 (num_visual_features)
```
2. Build TensorRT engines for visual components
```bash
python build_visual_engine.py --model_path ./${MODEL_NAME}.nemo --model_type neva
```
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/trt_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
--input_text "Question: which city is this? Answer:"
```
Note: use `--run_profiling` for performance measurement, use `--check_accuracy` for accuracy check.
## Nougat
1. Download Huggingface weights
```bash
export MODEL_NAME="nougat-base" # also nougat-small
git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/enc_dec`
Nougat uses the mBART architecture but replaces the LLM encoder with a Swin Transformer encoder.
To achieve this, we add an extra `--nougat` flag (over mBART example) to
`convert_checkpoint.py` in `examples/enc_dec` and `trtllm-build`.
```bash
python ../enc_dec/convert_checkpoint.py --model_type bart \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
--tp_size 1 \
--pp_size 1 \
--dtype bfloat16 \
--nougat
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--enable_xqa disable \
--gemm_plugin bfloat16 \
--bert_attention_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--max_beam_width 1 \
--max_batch_size 1 \
--max_seq_len 101 \
--max_input_len 1 \
--max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_visual_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME}
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16
```
Note: Nougat models usually do not need a text prompt.
## Phi-3-vision
1. Download Huggingface weights
```bash
export MODEL_NAME="Phi-3-vision-128k-instruct" # or Phi-3.5-vision-instruct
git clone https://huggingface.co/microsoft/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/phi`.
```bash
python ../phi/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4608 \
--max_multimodal_len 4096
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_visual_engine.py --model_type phi-3-vision --model_path tmp/hf_models/${MODEL_NAME}
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
--image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
```
## Video NeVA
[Video NeVA](https://github.com/NVIDIA/NeMo/blob/main/docs/source/multimodal/mllm/video_neva.rst) is a groundbreaking addition to the NeMo Multimodal ecosystem that works with the video modality. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.
1. Generate the TRT-LLM engine for the Nemotron model following the example in `examples/nemotron/README.md`; this will be used as our base LM for inference.
```bash
pip install decord # used for loading video
python3 ../quantization/quantize.py \
--nemo_ckpt_path /path/to/nemotron/model.nemo \
--dtype bfloat16 \
--batch_size 64 \
--qformat full_prec \
--output_dir nemotron-3/trt_ckpt/bf16/1-gpu
trtllm-build \
--checkpoint_dir nemotron-3/trt_ckpt/bf16/1-gpu \
--output_dir tmp/trt_engines/nemotron-3/bf16/1-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4352 \
--max_multimodal_len 3072 # 1 (max_batch_size) * (12 num_frames) * (256 image_token_len)
```
2. Build TensorRT engines for visual components
```bash
python build_visual_engine.py --model_path /path/to/video/neva/projector.nemo --model_type video-neva --output_dir tmp/trt_engines/nemotron-3/visual_encoder
```
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir nemotron-3/trt_ckpt/bf16/1-gpu \
--visual_engine_dir tmp/trt_engines/nemotron-3/visual_encoder \
--llm_engine_dir tmp/trt_engines/nemotron-3/bf16/1-gpu \
--input_text "Question: what is in the video? Answer:" \
--video_path /path/to/your/local/video/file
```
Note: use `--run_profiling` for performance measurement, use `--check_accuracy` for accuracy check.
## Enabling tensor parallelism for multi-GPU
The LLM part of the pipeline can be run on multiple GPUs using tensor parallelism.
The visual encoder will be replicated on each GPU and operate in a data parallel fashion.
To enable tensor parallelism, both the weight conversion step (from Huggingface to the TRT-LLM checkpoint format)
and the engine building step should use additional arguments. Finally, `run.py` should be prefixed
with `mpirun -n NUM_GPUS --allow-run-as-root`.
The full set of commands to enable 2-way tensor parallelism for LLaVA is:
```bash
export MODEL_NAME="llava-1.5-7b-hf"
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/2-gpu \
--dtype float16 --tp_size 2
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/2-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 576
python build_visual_engine.py --model_type llava --model_path tmp/hf_models/${MODEL_NAME}
mpirun -n 2 --allow-run-as-root \
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
--llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu
```