<!-- omit from toc -->
# Multi-Modal
This document shows how to run multimodal pipelines with TensorRT-LLM, e.g. from image+text input modalities to text output.
Compared to an LLM-only build command, the LLM part of a multimodal model takes one additional parameter, `--max_multimodal_len`. Under the hood, `max_multimodal_len` and `max_prompt_embedding_table_size` are effectively the same concept, i.e., embeddings (either multimodal feature embeddings or prompt tuning embeddings) prepended/concatenated to the LLM input embeddings. The multimodal features from the visual encoder, of shape `[batch_size, num_visual_features, visual_hidden_dim]`, are flattened to `[batch_size * num_visual_features, visual_hidden_dim]` and passed like a prompt embedding table.
We first describe the three runtime modes for running multimodal models and how to run each model on a single GPU. We then provide general guidelines on using tensor parallelism for the LLM part of the pipeline.
- [Runtime Modes](#runtime-modes)
- [BLIP2](#blip2)
- [CogVLM](#cogvlm)
- [Deplot](#deplot)
- [Fuyu](#fuyu)
- [InternLM-XComposer2](#internlm-xcomposer2)
- [InternVL2](#internvl2)
- [Kosmos-2](#kosmos-2)
- [LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA](#llava-llava-next-llava-onevision-and-vila)
- [MLLaMA](#mllama)
- [NeVA](#neva)
- [Nougat](#nougat)
- [Phi-3-vision](#phi-3-vision)
- [Qwen2-VL](#qwen2-vl)
- [Video NeVA](#video-neva)
- [Dataset Evaluation](#dataset-evaluation)
- [Enabling Tensor Parallelism for multi-GPU](#enabling-tensor-parallelism-for-multi-gpu)
## Runtime Modes
TensorRT-LLM supports three runtime modes for running multimodal models.
- `cpp_llm_only` (default): the vision engine runs in the Python runtime, the LLM in the pybind C++ runtime
- `python`: everything runs in the Python runtime
- `cpp`: everything runs in the C++ runtime
The mode is selected with the `--session RUNTIME_MODE` argument of `run.py` (see the instructions for each model below; an example invocation is shown after the support matrix).
Not all models support the end-to-end `cpp` mode; the checked models below are supported. See the footnotes for why the remaining models are unsupported.
- [ ] BLIP-2-T5 [^1]
- [x] BLIP-2-OPT
- [x] CogVLM
- [ ] Deplot [^1]
- [ ] Pix2Struct [^1]
- [x] Fuyu
- [x] InternVL2-2b
- [x] Kosmos-2
- [x] LLaVA
- [ ] LLaVA-NeXT / OneVision [^2]
- [x] VILA [^3]
- [ ] Mllama [^1]
- [x] NeVA
- [ ] Nougat [^1]
- [ ] Phi-3-Vision [^2]
- [ ] Qwen2-VL [^4]
- [x] Video-NeVA
[^1]: The model uses cross attention to feed visual features to the LLM decoder, which is not supported
[^2]: The model requires post-processing of its encoder output features, which is not supported
[^3]: Currently the C++ runtime only supports a single image per request (VILA mode 2)
[^4]: The vision encoder requires additional inputs that are not supported by the C++ runtime
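For example, assuming the LLaVA engines from the [LLaVA section](#llava-llava-next-llava-onevision-and-vila) below have already been built, the runtime mode could be selected like this (a sketch; substitute your own model and engine directories):
```bash
# hypothetical example: run the LLaVA pipeline end-to-end in the C++ runtime
python run.py \
    --session cpp \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/llava-1.5-7b-hf \
    --engine_dir tmp/trt_engines/llava-1.5-7b-hf/fp16/1-gpu \
    --input_text "\n Which city is this?"
```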
## BLIP2
This BLIP section covers both BLIP2-OPT and BLIP2-T5, with minor changes needed when switching the LLM backbone.
1. Download Huggingface weights and convert the original checkpoint to the TRT-LLM checkpoint format,
following the examples in `examples/opt/README.md` and `examples/enc_dec/README.md`.
```bash
export MODEL_NAME="blip2-opt-2.7b" # options: blip2-opt-6.7b, blip2-flan-t5-xl, blip2-flan-t5-xxl
git clone https://huggingface.co/Salesforce/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For BLIP2-OPT family,
```bash
python ../opt/convert_checkpoint.py --model_type blip2 \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
```
For BLIP2-T5 family,
```bash
python ../enc_dec/convert_checkpoint.py --model_type blip2 \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
--tp_size 1 \
--pp_size 1 \
--dtype bfloat16
```
2. Build TRT-LLM engine from TRT-LLM checkpoint
For BLIP2-OPT family,
```bash
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin float16 \
--max_beam_width 1 \
--max_batch_size 8 \
--max_seq_len 1024 \
--max_input_len 924 \
--max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)
```
For BLIP2-T5 family,
```bash
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/encoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/llm/encoder \
--paged_kv_cache disable \
--moe_plugin disable \
--gemm_plugin bfloat16 \
--bert_attention_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--context_fmha disable \
--max_beam_width 1 \
--max_batch_size 8 \
--max_input_len 924 \
--max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/llm/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--gemm_plugin bfloat16 \
--bert_attention_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--context_fmha disable \
--max_beam_width 1 \
--max_batch_size 8 \
--max_seq_len 1024 \
--max_encoder_input_len 924 \
--max_input_len 1 # Same command for decoder but don't set --max_multimodal_len
```
**NOTE**: `max_multimodal_len = max_batch_size * num_visual_features`, so if you change `max_batch_size`, `max_multimodal_len` **MUST** be changed accordingly.
3. Build TensorRT engines for vision encoders
```bash
python build_multimodal_engine.py --model_type blip2 --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/vision --max_batch_size 8
```
The built engines are located in `tmp/trt_engines/${MODEL_NAME}/bfloat16/vision` for BLIP2-T5, and similarly for BLIP2-OPT.
To run the BLIP2 pipeline with batch size > 1, change the `--max_batch_size` argument to `build_multimodal_engine.py` accordingly, as sketched below.
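For example, to run BLIP2-OPT with batch size 16, both engines could be rebuilt with scaled limits (a sketch; the output paths are illustrative and `16 * 32 = 512` visual features):
```bash
# hypothetical batch-size-16 build; max_multimodal_len must scale with max_batch_size (16 * 32 = 512)
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/bs16/llm \
    --gemm_plugin float16 \
    --max_beam_width 1 \
    --max_batch_size 16 \
    --max_seq_len 1024 \
    --max_input_len 924 \
    --max_multimodal_len 512
python build_multimodal_engine.py --model_type blip2 --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/bs16/vision --max_batch_size 16
```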
4. Assemble everything into BLIP2 pipeline
For BLIP2-OPT family,
```bash
python run.py \
--max_new_tokens 30 \
--input_text "Question: which city is this? Answer:" \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```
For BLIP2-T5 family,
```bash
python run.py \
--max_new_tokens 30 \
--input_text "Question: which city is this? Answer:" \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/bfloat16
```
5. (Optional) INT8/INT4 weight-only quantization for OPT can be enabled using commands as follows (take `INT4` as an example, while `INT8` is the default precision for weight-only quantization):
```bash
python ../opt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--dtype float16 \
--output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--use_weight_only \
--weight_only_precision int4
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
--gemm_plugin float16 \
--max_beam_width 1 \
--max_batch_size 8 \
--max_multimodal_len 256 \
--max_input_len 924 \
--max_seq_len 1024
```
The built OPT engines are located in `tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm`.
Use this directory without the `llm` part as the `--engine_dir` argument to `run.py`, as sketched below.
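For instance, the INT4 weight-only pipeline could then be run like this (a sketch; it assumes the vision engine from step 3 has been copied or rebuilt under `tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/vision`):
```bash
# hypothetical run of the INT4 weight-only BLIP2-OPT pipeline
python run.py \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu
```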
**NOTE:** The INT8/INT4 options are not supported for BLIP2-T5, because quantization support has not been
added for encoder-decoder models yet.
## CogVLM
Currently, CogVLM only supports bfloat16 precision.
1. Download Huggingface weights
```bash
export MODEL_NAME="cogvlm-chat-hf"
git clone https://huggingface.co/THUDM/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export TOKENIZER_NAME="vicuna-7b-v1.5"
git clone https://huggingface.co/lmsys/${TOKENIZER_NAME} tmp/hf_models/${TOKENIZER_NAME}
```
Because ONNX currently does not support `xops.memory_efficient_attention`, we need to modify some of the source code of the Huggingface CogVLM checkpoint.
```bash
cd tmp/hf_models/${MODEL_NAME}
sed -i '4s/.*//;40s/.*/ out = self.attention(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)).transpose(1, 2).contiguous()/;41s/.*//;42s/.*//' visual.py # It will replace memory_efficient_attention with some basic ops
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/cogvlm`
CogVLM uses a ViT encoder for the vision part and a modified LLaMA as the decoder.
```bash
python ../cogvlm/convert_checkpoint.py --model_dir tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_models/${MODEL_NAME} --dtype bfloat16 --use_prompt_tuning
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME} \
--output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/llm \
--gemm_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--max_batch_size 48 \
--max_input_len 2048 \
--max_seq_len 3076 \
--paged_kv_cache enable \
--bert_attention_plugin disable \
--moe_plugin disable \
--max_multimodal_len 61440 # 48 (max_batch_size) * 1280 (max_num_visual_features)
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_multimodal_engine.py --model_type cogvlm --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 48 --output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/vision
python run.py \
--max_new_tokens 1000 \
--input_text " [INST] please describe this image in detail [/INST] " \
--hf_model_dir tmp/hf_models/${TOKENIZER_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
--batch_size 1 \
--top_p 0.4 \
--top_k 1 \
--temperature 0.2 \
--repetition_penalty 1.2 \
--enable_context_fmha_fp32_acc
```
CogVLM uses `model_runner_cpp` for its LLM decoder by default. To switch to `model_runner`, add `--session python` to the command above.
## Deplot
1. Download Huggingface weights and convert the original checkpoint to the TRT-LLM checkpoint format,
following the example in `examples/enc_dec/README.md`.
```bash
export MODEL_NAME="deplot"
git clone https://huggingface.co/google/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../enc_dec/convert_checkpoint.py --model_type pix2struct \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/float16 \
--tp_size 1 \
--pp_size 1 \
--dtype float16
```
2. Build TRT-LLM engine from TRT-LLM checkpoint
```bash
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/float16/decoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/llm/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--gemm_plugin float16 \
--bert_attention_plugin float16 \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--context_fmha disable \
--max_beam_width 1 \
--max_batch_size 8 \
--max_seq_len 2558 \
--max_encoder_input_len 2048 \
--max_input_len 1
```
The built deplot engines are located in `tmp/trt_engines/${MODEL_NAME}/1-gpu/float16`.
3. Build TensorRT engines for visual components
```bash
python build_multimodal_engine.py --model_type pix2struct --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 8 --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/vision
```
The built visual engines are located in `tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/vision`.
To run the deplot pipeline with batch size > 1, change `--max_batch_size` argument to `build_multimodal_engine.py` accordingly.
4. Assemble everything into deplot pipeline
```bash
python run.py \
--max_new_tokens 100 \
--input_text "" \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16
```
## Fuyu
1. Download Huggingface weights
```bash
export MODEL_NAME="fuyu-8b"
git clone https://huggingface.co/adept/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
The LLM portion of Fuyu uses a Persimmon model.
```bash
python ../gpt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16 \
--gpt_variant persimmon
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 2048
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_multimodal_engine.py --model_type fuyu --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```
## InternLM-XComposer2
**NOTE: We only support InternLM-XComposer-VL-7b for now**
First, install transformers 4.45.2:
```bash
pip install -r requirements-internlm-xcomposer2.txt
```
1. Convert Huggingface weights to TRT-LLM checkpoint format using `examples/internlm/README.md`.
2. Use the `trtllm-build` command to build the TRT-LLM engine for the LLM part.
3. The full list of commands is as follows:
```bash
export MODEL_NAME=internlm-xcomposer2-vl-7b
git lfs clone https://huggingface.co/internlm/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../internlm2/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--dtype float16 \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin float16 \
--lora_plugin float16 \
--lora_dir . \
--max_lora_rank 256 \
--max_input_len 1536 \
--max_batch_size 48 \
--max_multimodal_len 58800 # 58800 = 1225(visual token/img) * 48 (max_batch_size), as each image corresponds to 1225 visual tokens in the ViT here
python build_multimodal_engine.py \
--model_type internlm-xcomposer2 \
--model_path tmp/hf_models/${MODEL_NAME} \
--output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu/vision \
--max_batch_size 48
python run.py \
--max_new_tokens 200 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--batch_size 1
```
## InternVL2
[InternVL Family](https://github.com/OpenGVLab/InternVL) is a pioneering open-source alternative to GPT-4o that aims to close the gap to commercial multimodal models with open-source suites. Here we show how to deploy InternVL2-1B/InternVL2-2B/InternVL2-4B/InternVL2-8B/InternVL2-26B in TensorRT-LLM.
First, install transformers 4.37.2:
```bash
pip install transformers==4.37.2
```
1. Download Huggingface weights
- For InternVL2-1B
```bash
export MODEL_NAME="InternVL2-1B"
git clone https://huggingface.co/OpenGVLab/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export LLM_MODEL_NAME="qwen"
```
- For InternVL2-2B/InternVL2-8B/InternVL2-26B
```bash
export MODEL_NAME="InternVL2-2B" # or InternVL2-8B, InternVL2-26B
git clone https://huggingface.co/OpenGVLab/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export LLM_MODEL_NAME="internlm2"
```
- For InternVL2-4B
```bash
export MODEL_NAME="InternVL2-4B"
git clone https://huggingface.co/OpenGVLab/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export LLM_MODEL_NAME="phi"
```
2. Convert Huggingface weights into TRT-LLM checkpoints
```bash
python ../${LLM_MODEL_NAME}/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
```
3. Build TRT engines
```bash
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin auto \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4608 \
--max_multimodal_len 3328
```
4. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_multimodal_engine.py --model_type internvl --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
--image_path tmp/hf_models/${MODEL_NAME}/examples/image1.jpg
```
5. (Optional) FP8 and INT8 SmoothQuant quantization is supported for the InternVL2-4B variant (LLM model only).
```bash
# FP8 quantization
python ../quantization/quantize.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
--dtype bfloat16 \
--qformat fp8 \
--kv_cache_dtype fp8
# INT8 SmoothQuant quantization
python ../quantization/quantize.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/int8/1-gpu \
--dtype bfloat16 \
--qformat int8_sq
```
Then follow the same `trtllm-build`, `build_multimodal_engine.py` and `run.py` steps as before.
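For example, for the FP8 checkpoint the LLM engine build could look like this (a sketch; same limits as the FP16 build above, the output path is illustrative):
```bash
# hypothetical FP8 LLM engine build for InternVL2-4B
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/llm \
    --gemm_plugin auto \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 3328
```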
## Kosmos-2
1. Download Huggingface weights
```bash
export MODEL_NAME="kosmos-2"
git clone https://huggingface.co/microsoft/kosmos-2-patch14-224 tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
```bash
python ../gpt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16 \
--gpt_variant ${MODEL_NAME}
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 512 \
--max_seq_len 1024 \
--max_multimodal_len 64 # 1 (max_batch_size) * 64 (num_visual_features)
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_multimodal_engine.py --model_type kosmos-2 --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```
## LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA
[LLaVA](https://github.com/haotian-liu/LLaVA) and [VILA](https://github.com/Efficient-Large-Model/VILA) are both visual language models (VLMs) that can be deployed in TensorRT-LLM with many quantization options. [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) is an extension of LLaVA; TRT-LLM currently supports the [Mistral-7B](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and [Nous-Hermes-2-Yi-34B](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) variants of LLaVA-NeXT. [LLaVA-OneVision](https://huggingface.co/collections/llava-hf/llava-onevision-66bb1e9ce8856e210a7ed1fe) is another extension of LLaVA.
1. Download Huggingface model weights. These models have both visual and LLM components,
unlike the BLIP2 example, which downloads only the LLM component from Huggingface.
For LLaVA,
```bash
export MODEL_NAME="llava-1.5-7b-hf" # also llava-1.5-13b-hf
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For LLaVA-NeXT,
```bash
export MODEL_NAME="llava-v1.6-mistral-7b-hf" #for 34b variant "llava-v1.6-34b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For LLaVA-OneVision,
```bash
export MODEL_NAME="llava-onevision-qwen2-7b-ov-hf" # also llava-onevision-qwen2-0.5b-ov-hf, llava-onevision-qwen2-72b-ov-hf, etc
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For VILA, a few more steps are needed until it is added to the HF model zoo:
```bash
# install the following dependency
pip install -r requirements-vila.txt
# clone original VILA repo
export VILA_PATH="tmp/hf_models/VILA"
git clone https://github.com/Efficient-Large-Model/VILA.git ${VILA_PATH}
# download VILA checkpoints
export MODEL_NAME="vila1.5-3b" # NOTE: name must contain vila or VILA! it's used to identify whether we need to register the non-HF VILA codebase in HF Auto class
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Generate the TRT-LLM engine for the LLM part following the examples in `examples/llama/README.md` and `examples/qwen/README.md`
```bash
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
# for LLaVA
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 576 # 1 (max_batch_size) * 576 (num_visual_features)
# for LLaVA-NeXT
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 5120 \
--max_num_tokens 4096 \
--max_multimodal_len 4096 # 1 (max_batch_size) * 4096 (max_input_len)
# for VILA
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 196 # 1 (max_batch_size) * 196 (num_visual_features)
# for LLaVA-OneVision
python ../qwen/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin float16 \
--use_fused_mlp=enable \
--max_batch_size 1 \
--max_input_len 7228 \
--max_seq_len 7328 \
--max_multimodal_len 7128 # max_batch_size * num_visual_features(depends on the image size or the specified video num frame)
```
3. Build TensorRT engines for visual components
```bash
python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava # for LLaVA
python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava_next --max_batch_size 5 # 1 (max_batch_size) * 5 (because LLAVA-NeXT visual encoder can have at most 5 patches) # for LLaVA-NeXT
python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava_onevision --max_batch_size 32 # max_batch_size * patch for image or frame for video # for LLaVA-OneVision
python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type vila --vila_path ${VILA_PATH} # for VILA
```
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--input_text "\n Which city is this?" # for LLaVA and for LLaVA-NeXT
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--image_path=https://github.com/Efficient-Large-Model/VILA/raw/main/demo_images/av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
--input_text "\n Please elaborate what you see in the images?","\n Which city is this?" \
--batch_size=2 # for LLaVA
```
Note that LLaVA supports batched inference over N <one image, one prompt text> pairs; use `--batch_size=N`. There should be N images listed under `--image_path` and N text prompts listed under `--input_text`. Don't forget to set `--max_batch_size` and `--max_multimodal_len` accordingly during engine building, as sketched below.
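For example, for the `--batch_size=2` LLaVA run above, the engines would need to be built with the larger limits (a sketch; `2 * 576 = 1152`):
```bash
# hypothetical 2-batch LLaVA build; 1152 = 2 (max_batch_size) * 576 (num_visual_features)
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --use_fused_mlp=enable \
    --max_batch_size 2 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 1152
python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava --max_batch_size 2
```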
For LLaVA-OneVision, you can use either image or video as inputs.
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--input_text "What is shown in this image?" \
--image_path image.png
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--input_text "Why is this video funny?" \
--video_path video.mp4 \
--video_num_frames 8 # sample 8 frames uniformly from the video, up to 32 frames
```
For VILA, you can use either local files or web URLs as input images.
Suppose you have a local image `av.png`, downloaded from `https://github.com/Efficient-Large-Model/VILA/blob/main/demo_trt_llm/av.png`, and the URL of `merlion.png`:
```bash
wget -O av.png https://raw.githubusercontent.com/Efficient-Large-Model/VILA/main/demo_images/av.png
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--image_path=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
--input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
--batch_size=1 # for VILA mode 1
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
--image_path=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
--input_text="<image>\n Please elaborate what you see in the images?","<image>\n Which city is this?" \
--batch_size=2 \
--check_accuracy # for VILA mode 2
```
Note that VILA supports different batching modes:
- Mode 1: to query N images as a whole with one prompt, use `--batch_size=1` (the default). An example is given above.
- Mode 2: to query N <one image, one prompt text> pairs, use `--batch_size=N`. There should be N images listed under `--image_path` and N text prompts listed under `--input_text`. Don't forget to set `--max_batch_size` and `--max_multimodal_len` accordingly during engine building.
Note: use `--run_profiling` for performance measurement, use `--check_accuracy` for accuracy check.
4. (Optional) Different quantization methods supported in LLaMA and Qwen can be applied to LLaVA/VILA/LLaVA-OneVision as well, such as INT4/INT8 weight-only, SmoothQuant, and INT4 Activation-Aware Quantization (AWQ). Detailed instructions can be found in LLaMA [README](../llama/README.md) and Qwen [README](../qwen/README.md).
For example,
```bash
# INT4 weight only
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--dtype float16 \
--output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
--use_weight_only \
--weight_only_precision int4
# INT4 AWQ
python ../quantization/quantize.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
--dtype float16 \
--qformat int4_awq \
--calib_size 32
```
Then follow the same `trtllm-build` and `run.py` steps as before. NOTE: for the `trtllm-build` command, do not use `--use_fused_mlp=enable` in these quantization modes.
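For example, the INT4 weight-only LLaVA engine could be built like this (a sketch; note the absence of `--use_fused_mlp=enable`):
```bash
# hypothetical INT4 weight-only LLaVA build (no --use_fused_mlp=enable)
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 576
```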
## MLLaMA
This section shows how to build and run a LLaMA-3.2 Vision model in TensorRT-LLM. We use [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) as an example.
For the LLaMA-3.2 text models, please refer to [examples/llama/README.md](../llama/README.md), since they share the LLaMA model architecture.
### Supported data types <!-- omit from toc -->
* BF16
* Tensor Parallel
* INT8 & INT4 Weight-Only
* FP8
### Build and run vision model <!-- omit from toc -->
* build engine of vision encoder model
```bash
python examples/multimodal/build_multimodal_engine.py --model_type mllama \
--model_path Llama-3.2-11B-Vision/ \
--output_dir /tmp/mllama/trt_engines/vision/
```
* build engine of decoder model
```bash
python examples/mllama/convert_checkpoint.py --model_dir Llama-3.2-11B-Vision/ \
--output_dir /tmp/mllama/trt_ckpts \
--dtype bfloat16
trtllm-build --checkpoint_dir /tmp/mllama/trt_ckpts \
--output_dir /tmp/mllama/trt_engines/llm/ \
--max_num_tokens 4096 \
--max_seq_len 2048 \
--workers 1 \
--gemm_plugin auto \
--max_batch_size 4 \
--max_encoder_input_len 4100 \
--input_timing_cache model.cache
```
Note that for the instruct vision model, `--max_encoder_input_len` should be set to `6404`, as sketched below.
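For example, for `Llama-3.2-11B-Vision-Instruct` the decoder build could look like this (a sketch; it assumes the instruct checkpoint has been converted to `/tmp/mllama/trt_ckpts` in the same way, and only `--max_encoder_input_len` changes relative to the command above):
```bash
# hypothetical build for the instruct variant; note --max_encoder_input_len 6404
trtllm-build --checkpoint_dir /tmp/mllama/trt_ckpts \
             --output_dir /tmp/mllama/trt_engines/llm/ \
             --max_num_tokens 4096 \
             --max_seq_len 2048 \
             --workers 1 \
             --gemm_plugin auto \
             --max_batch_size 4 \
             --max_encoder_input_len 6404 \
             --input_timing_cache model.cache
```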
* Run test on multimodal/run.py with C++ runtime (LLM part only)
```bash
python3 examples/multimodal/run.py --engine_dir /tmp/mllama/trt_engines/ \
--hf_model_dir Llama-3.2-11B-Vision/ \
--image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
--input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
--max_new_tokens 50 \
--batch_size 2
```
`model_runner_cpp` is used by default. To switch to `model_runner`, add `--session python` to the command above.
To evaluate accuracy on a dataset:
```bash
python3 examples/multimodal/eval.py \
--engine_dir /tmp/mllama/trt_engines/ \
--hf_model_dir Llama-3.2-11B-Vision/ \
--test_trtllm \
--accuracy_threshold 65 \
--eval_task lmms-lab/ai2d
```
### Run the MLLaMA decoder part in FP8 <!-- omit from toc -->
```bash
# install modelopt 0.21.0
pip install nvidia-modelopt[torch]~=0.21.0
python ./examples/quantization/quantize.py --model_dir Llama-3.2-11B-Vision/ \
--dtype bfloat16 \
--qformat fp8 \
--output_dir /tmp/llama-3.2-11B-Vision/fp8/ \
--kv_cache_dtype fp8 \
--calib_size 512 \
--calib_dataset scienceqa
trtllm-build --checkpoint_dir /tmp/llama-3.2-11B-Vision/fp8/ \
--output_dir /tmp/trt_engines/llama-3.2-11B-Vision/fp8/llm \
--max_num_tokens 4096 \
--max_seq_len 2048 \
--workers 1 \
--gemm_plugin auto \
--max_batch_size 4 \
--max_encoder_input_len 4100 \
--input_timing_cache model.cache \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable
# copy the visual engine directory `/tmp/mllama/trt_engines/vision/` to the fp8 engine directory `/tmp/trt_engines/llama-3.2-11B-Vision/fp8/vision`
python3 examples/multimodal/run.py --engine_dir /tmp/trt_engines/llama-3.2-11B-Vision/fp8/ \
--hf_model_dir Llama-3.2-11B-Vision/ \
--image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
--input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
--max_new_tokens 50 \
--batch_size 2
python3 examples/multimodal/eval.py --engine_dir /tmp/trt_engines/llama-3.2-11B-Vision/fp8/ \
--hf_model_dir Llama-3.2-11B-Vision/ \
--test_trtllm \
--accuracy_threshold 65 \
--eval_task lmms-lab/ai2d
```
Note that for the instruct vision model, `--max_encoder_input_len` should be set to `6404`.
## NeVA
[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/neva/index.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.
1. Generate TRT-LLM engine for NVGPT following example in `examples/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
```bash
export MODEL_NAME="neva"
python ../gpt/convert_checkpoint.py \
--nemo_ckpt_path ./${MODEL_NAME}.nemo \
--dtype bfloat16 \
--output_dir tmp/trt_models/${MODEL_NAME} \
--nemo_rename_key model:model.language_model \
attention.linear_qkv.layer_norm_bias:input_layernorm.bias \
attention.linear_qkv.layer_norm_weight:input_layernorm.weight \
mlp.linear_fc1.layer_norm_bias:post_attention_layernorm.bias \
mlp.linear_fc1.layer_norm_weight:post_attention_layernorm.weight \
linear_qkv:query_key_value \
linear_fc1:dense_h_to_4h \
linear_fc2:dense_4h_to_h \
linear_proj:dense \
decoder:encoder
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME} \
--output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/llm \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 729 # 1 (max_batch_size) * 729 (num_visual_features)
```
2. Build TensorRT engines for visual components
```bash
python build_multimodal_engine.py --model_path ./${MODEL_NAME}.nemo --model_type neva --output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/vision
```
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/trt_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
--input_text "Question: which city is this? Answer:"
```
Note: use `--run_profiling` for performance measurement, use `--check_accuracy` for accuracy check.
## Nougat
1. Download Huggingface weights
```bash
export MODEL_NAME="nougat-base" # also nougat-small
git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/enc_dec`
Nougat uses mBART architecture but replaces the LLM encoder with a Swin Transformer encoder.
To achieve this, we add an extra `--nougat` flag (over mBART example) to
`convert_checkpoint.py` in `examples/enc_dec` and `trtllm-build`.
```bash
python ../enc_dec/convert_checkpoint.py --model_type bart \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
--tp_size 1 \
--pp_size 1 \
--dtype bfloat16 \
--nougat
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
--output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/llm/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--gemm_plugin bfloat16 \
--bert_attention_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--remove_input_padding enable \
--max_beam_width 1 \
--max_batch_size 1 \
--max_seq_len 101 \
--max_input_len 1 \
--max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_multimodal_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/vision
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16
```
Note: Nougat models usually do not need a text prompt.
## Phi-3-vision
1. Download Huggingface weights
```bash
export MODEL_NAME="Phi-3-vision-128k-instruct" # or Phi-3.5-vision-instruct
git clone https://huggingface.co/microsoft/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/phi`.
```bash
python ../phi/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4608 \
--max_multimodal_len 4096
```
3. Generate TensorRT engines for visual components and combine everything into final pipeline.
```bash
python build_multimodal_engine.py --model_type phi-3-vision --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
python run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--kv_cache_free_gpu_memory_fraction 0.7 \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
--image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
```
## Qwen2-VL
The [Qwen2-VL family](https://github.com/QwenLM/Qwen2-VL) is the latest generation of vision-language models in the Qwen model family. Here we show how to deploy Qwen2-VL 2B and 7B in TensorRT-LLM.
First, install transformers and qwen-vl-utils:
```bash
pip install -r requirements-qwen2vl.txt
```
### Supported data types <!-- omit from toc -->
* FP16
* FP8
### Build and run vision model <!-- omit from toc -->
* Download Huggingface weights
```bash
export MODEL_NAME="Qwen2-VL-7B-Instruct" # or Qwen2-VL-2B-Instruct
git clone https://huggingface.co/Qwen/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
* Build engine of decoder model
```bash
python3 ../qwen/convert_checkpoint.py \
--model_dir=tmp/hf_models/${MODEL_NAME} \
--output_dir=tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
--gemm_plugin=float16 \
--gpt_attention_plugin=float16 \
--max_batch_size=4 \
--max_input_len=2048 \
--max_seq_len=3072 \
--max_multimodal_len=1296 # 4 (max_batch_size) * 324 (num_visual_features); this is for image_shape=[504,504]
```
* Build engine of vision encoder model
```bash
python build_multimodal_engine.py --model_type qwen2_vl --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
```
* Run test on multimodal/run.py with C++ runtime (LLM part only)
```bash
python3 run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/
```
### Run the Qwen2-VL decoder part in FP8 <!-- omit from toc -->
* Build engine
```bash
python ./examples/quantization/quantize.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
--calib_size 512
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/llm \
--max_input_len=2048 \
--max_seq_len 3072 \
--gemm_plugin auto \
--max_batch_size 4 \
--max_multimodal_len=1296
# copy the visual engine directory `tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision/` to the fp8 engine directory `tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/vision`
```
* Run test on multimodal/run.py with C++ runtime (LLM part only)
```bash
python3 run.py \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/
```
## Video NeVA
[Video NeVA](https://github.com/NVIDIA/NeMo/blob/main/docs/source/multimodal/mllm/video_neva.rst) is a groundbreaking addition to the NeMo Multimodal ecosystem that works with the video modality. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.
1. Generate the TRT-LLM engine for the Nemotron model following the example in `examples/nemotron/README.md`, adhering to the NVGPT conventions of the conversion script. This will be used as our base LM for inference.
```bash
pip install decord # used for loading video
python3 ../quantization/quantize.py \
--nemo_ckpt_path /path/to/nemotron/model.nemo \
--dtype bfloat16 \
--batch_size 64 \
--qformat full_prec \
--output_dir nemotron-3/trt_ckpt/bf16/1-gpu
trtllm-build \
--checkpoint_dir nemotron-3/trt_ckpt/bf16/1-gpu \
--output_dir tmp/trt_engines/nemotron-3/bf16/1-gpu/llm \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4352 \
--max_multimodal_len 3072 # 1 (max_batch_size) * (12 num_frames) * (256 image_token_len)
```
2. Build TensorRT engines for visual components
```bash
python build_multimodal_engine.py --model_path /path/to/video/neva/projector.nemo --model_type video-neva --output_dir tmp/trt_engines/nemotron-3/bf16/1-gpu/vision
```
```bash
python run.py \
--max_new_tokens 30 \
--hf_model_dir nemotron-3/trt_ckpt/bf16/1-gpu \
--engine_dir tmp/trt_engines/nemotron-3/bf16/1-gpu \
--input_text "Question: what is in the video? Answer:" \
--video_path /path/to/your/local/video/file
```
Note: use `--run_profiling` for performance measurement, use `--check_accuracy` for accuracy check.
## Dataset Evaluation
This section explains how to evaluate datasets using our provided script, including supported models and configurations.
### Evaluation Command <!-- omit from toc -->
To run an evaluation, use the following command:
```bash
python ./examples/multimodal/eval.py \
--model_type <model_type> \
--engine_dir <engine_dir> \
--hf_model_dir <hf_model_dir> \
--dataset_dir <dataset_dir> \
--test_trtllm (or --test_hf, or both) \
--accuracy_threshold <threshold> \
--eval_task <eval_task> \
--max_ite 20 \
--visual_engine_name <visual_engine_name>
```
### Parameters <!-- omit from toc -->
- `--model_type`: Specify the model type to evaluate.
- `--engine_dir`: Path to the model engines directory.
- `--hf_model_dir`: Path to the Hugging Face model directory.
- `--dataset_dir`: Path to the dataset directory. If not specified, the dataset is loaded from HF using `--eval_task` as the dataset tag.
- `--test_trtllm` or `--test_hf`: Specify which evaluation framework to use. Both can be used simultaneously.
- `--accuracy_threshold`: Set the accuracy threshold for evaluation.
- `--eval_task`: Specify the evaluation task. Supported tasks: `['lmms-lab/ai2d', 'lmms-lab/VQAv2', 'lmms-lab/MME']`. Defaults to `'lmms-lab/VQAv2'`.
- `--max_ite`: Maximum number of iterations; defaults to 20.
- `--visual_engine_name`: Name of the visual engine.
### Supported Evaluation Tasks <!-- omit from toc -->
The following evaluation tasks are supported:
- `lmms-lab/ai2d`
- `lmms-lab/VQAv2`
- `lmms-lab/MME`
### Supported Model Types <!-- omit from toc -->
The script supports the following model types:
- `blip2`
- `fuyu`
- `kosmos-2`
- `llava`
- `llava_next`
- `llava_onevision`
- `phi-3-vision`
- `qwen2_vl`
- `mllama`
- `vila`
- `cogvlm`
- `neva`
- `internvl`
**Note:** The models `vila`, `cogvlm`, `neva`, and `internvl` do not support the `--test_hf` evaluation framework.
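As a concrete illustration, an evaluation of the LLaVA engines built earlier might look like this (a sketch; the paths are the ones used in the LLaVA section):
```bash
# hypothetical evaluation of the LLaVA pipeline on VQAv2 with the TRT-LLM backend
python ./examples/multimodal/eval.py \
    --model_type llava \
    --engine_dir tmp/trt_engines/llava-1.5-7b-hf/fp16/1-gpu \
    --hf_model_dir tmp/hf_models/llava-1.5-7b-hf \
    --test_trtllm \
    --eval_task lmms-lab/VQAv2 \
    --max_ite 20
```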
## Enabling Tensor Parallelism for multi-GPU
The LLM part of the pipeline can be run on multiple GPUs using tensor parallelism.
The visual encoder will be replicated on each GPU and operate in a data parallel fashion.
To enable tensor parallelism, both the weight conversion step (from Huggingface to the TRT-LLM checkpoint format)
and the engine building step need additional arguments. Finally, `run.py` should be prefixed
with `mpirun -n NUM_GPUS --allow-run-as-root`.
The full set of commands to enable 2-way tensor parallelism for LLaVA is:
```bash
export MODEL_NAME="llava-1.5-7b-hf"
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/2-gpu \
--dtype float16 --tp_size 2
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/2-gpu \
--output_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu/llm \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 2048 \
--max_seq_len 2560 \
--max_multimodal_len 576
python build_multimodal_engine.py --model_type llava --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu/vision
mpirun -n 2 --allow-run-as-root \
python run.py \
--max_new_tokens 30 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu
```