<!-- omit from toc -->
# Multi-Modal

This document shows how to run multimodal pipelines with TensorRT-LLM, e.g. from image+text input modalities to text output.

Multimodal models' LLM part has an additional parameter `--max_multimodal_len` compared to LLM-only build commands. Under the hood, `max_multimodal_len` and `max_prompt_embedding_table_size` are effectively the same concept, i.e., embeddings (either multimodal feature embeddings or prompt tuning embeddings) that are prepended or concatenated to the LLM input embeddings. The multimodal features from the visual encoder, of shape `[batch_size, num_visual_features, visual_hidden_dim]`, are flattened to `[batch_size * num_visual_features, visual_hidden_dim]` and passed like a prompt embedding table.
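
As a rough illustration (not tied to any particular model), the value passed to `--max_multimodal_len` is simply the product of the maximum batch size and the per-sample visual feature count:

```bash
# Sketch only: 32 visual features per image is an assumption (it matches BLIP2-OPT below);
# substitute the feature count of your model's visual encoder.
MAX_BATCH_SIZE=8
NUM_VISUAL_FEATURES=32
echo "--max_multimodal_len $((MAX_BATCH_SIZE * NUM_VISUAL_FEATURES))"  # prints: --max_multimodal_len 256
```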

We first describe three runtime modes for running multimodal models and how to run each model on a single GPU. We then provide general guidelines on using tensor parallelism for the LLM part of the pipeline.

- [Runtime Modes](#runtime-modes)
- [BLIP2](#blip2)
- [CogVLM](#cogvlm)
- [Deplot](#deplot)
- [Fuyu](#fuyu)
- [InternLM-XComposer2](#internlm-xcomposer2)
- [InternVL2](#internvl2)
- [Kosmos-2](#kosmos-2)
- [LLaVA, LLaVA-NeXT, LLaVA-OneVision and VILA](#llava-llava-next-llava-onevision-and-vila)
- [MLLaMA](#mllama)
- [NeVA](#neva)
- [Nougat](#nougat)
- [Phi-3-vision](#phi-3-vision)
- [Qwen2-VL](#qwen2-vl)
- [Video NeVA](#video-neva)
- [Dataset Evaluation](#dataset-evaluation)
- [Enabling Tensor Parallelism for multi-GPU](#enabling-tensor-parallelism-for-multi-gpu)

## Runtime Modes
TensorRT-LLM supports three runtime modes for running multimodal models:
- `cpp_llm_only` (default): the vision engine runs in the Python runtime, the LLM in the pybind C++ runtime
- `python`: everything runs in the Python runtime
- `cpp`: everything runs in the C++ runtime

The mode is selected with the `--session RUNTIME_MODE` argument of `run.py` (see the instructions for each model below).
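
For example, to run one of the pipelines below entirely in the C++ runtime (a sketch; the model and engine directories are the ones built in the corresponding section):

```bash
python run.py \
    --session cpp \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```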

Not all models support the end-to-end `cpp` mode; the checked models below are supported. See the footnotes for why the remaining models are not.
- [ ] BLIP-2-T5 [^1]
- [x] BLIP-2-OPT
- [x] CogVLM
- [ ] Deplot [^1]
- [ ] Pix2Struct [^1]
- [x] Fuyu
- [x] InternVL2-2b
- [x] Kosmos-2
- [x] LLaVA
- [ ] LLaVA-NeXT / OneVision [^2]
- [x] VILA [^3]
- [ ] Mllama [^1]
- [x] NeVA
- [ ] Nougat [^1]
- [ ] Phi-3-Vision [^2]
- [ ] Qwen2-VL [^4]
- [x] Video-NeVA

[^1]: The model uses cross attention to feed visual features to the LLM decoder, which is not supported
[^2]: The model requires post-processing of its encoder output features, which is not supported
[^3]: Currently the C++ runtime only supports a single image per request (VILA mode 2)
[^4]: The vision encoder requires additional inputs that are not supported by the C++ runtime
## BLIP2

This BLIP section covers both BLIP2-OPT and BLIP2-T5, with minor changes needed when switching the LLM backbone.

1. Download Huggingface weights and convert the original checkpoint to the TRT-LLM checkpoint format, following the examples in `examples/opt/README.md` and `examples/enc_dec/README.md`.

```bash
export MODEL_NAME="blip2-opt-2.7b" # options: blip2-opt-6.7b, blip2-flan-t5-xl, blip2-flan-t5-xxl
git clone https://huggingface.co/Salesforce/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

For the BLIP2-OPT family,
```bash
python ../opt/convert_checkpoint.py --model_type blip2 \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16
```

For the BLIP2-T5 family,
```bash
python ../enc_dec/convert_checkpoint.py --model_type blip2 \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
    --tp_size 1 \
    --pp_size 1 \
    --dtype bfloat16
```

2. Build the TRT-LLM engine from the TRT-LLM checkpoint

For the BLIP2-OPT family,
```bash
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --max_beam_width 1 \
    --max_batch_size 8 \
    --max_seq_len 1024 \
    --max_input_len 924 \
    --max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)
```

For the BLIP2-T5 family,
```bash
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/encoder \
    --output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/llm/encoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --gemm_plugin bfloat16 \
    --bert_attention_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --remove_input_padding enable \
    --context_fmha disable \
    --max_beam_width 1 \
    --max_batch_size 8 \
    --max_input_len 924 \
    --max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)

trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
    --output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/llm/decoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --gemm_plugin bfloat16 \
    --bert_attention_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --remove_input_padding enable \
    --context_fmha disable \
    --max_beam_width 1 \
    --max_batch_size 8 \
    --max_seq_len 1024 \
    --max_encoder_input_len 924 \
    --max_input_len 1 # Same command for the decoder, but do not set --max_multimodal_len
```

**NOTE**: `max_multimodal_len = max_batch_size * num_visual_features`, so if you change `max_batch_size`, `max_multimodal_len` **MUST** be changed accordingly.
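
For example, a hypothetical rebuild of the BLIP2-OPT engine with `--max_batch_size 16` would set `--max_multimodal_len 512` (16 * 32), leaving everything else unchanged:

```bash
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --max_beam_width 1 \
    --max_batch_size 16 \
    --max_seq_len 1024 \
    --max_input_len 924 \
    --max_multimodal_len 512 # 16 (max_batch_size) * 32 (num_visual_features)
```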

3. Build TensorRT engines for vision encoders

```bash
python build_multimodal_engine.py --model_type blip2 --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/bfloat16/vision --max_batch_size 8
```

The built engines are located in `tmp/trt_engines/${MODEL_NAME}/bfloat16/vision` for BLIP2-T5, and similarly for BLIP2-OPT.

To run the BLIP2 pipeline with batch size > 1, change the `--max_batch_size` argument of `build_multimodal_engine.py` accordingly.

4. Assemble everything into the BLIP2 pipeline

For the BLIP2-OPT family,
```bash
python run.py \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```

For the BLIP2-T5 family,
```bash
python run.py \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/bfloat16
```

5. (Optional) INT8/INT4 weight-only quantization for OPT can be enabled with the following commands (using `INT4` as an example; `INT8` is the default precision for weight-only quantization):
```bash
python ../opt/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --dtype float16 \
    --output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
    --use_weight_only \
    --weight_only_precision int4

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
    --gemm_plugin float16 \
    --max_beam_width 1 \
    --max_batch_size 8 \
    --max_multimodal_len 256 \
    --max_input_len 924 \
    --max_seq_len 1024
```

The built OPT engines are located in `tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm`. Pass this directory without the trailing `llm` component as the `--engine_dir` argument to `run.py`.
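
For instance, a sketch of an INT4 run mirroring the FP16 example above (this assumes the vision engine from step 3 has also been built or copied under `tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/vision`):

```bash
python run.py \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu
```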

**NOTE:** The INT8/INT4 option is not supported for BLIP2-T5, because quantization support has not been added for encoder-decoder models yet.

## CogVLM

Currently, CogVLM only supports bfloat16 precision.

1. Download Huggingface weights

```bash
export MODEL_NAME="cogvlm-chat-hf"
git clone https://huggingface.co/THUDM/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export TOKENIZER_NAME="vicuna-7b-v1.5"
git clone https://huggingface.co/lmsys/${TOKENIZER_NAME} tmp/hf_models/${TOKENIZER_NAME}
```

Because ONNX currently does not support `xops.memory_efficient_attention`, we need to modify some source code of the Huggingface CogVLM model.
```bash
cd tmp/hf_models/${MODEL_NAME}
sed -i '4s/.*//;40s/.*/ out = self.attention(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)).transpose(1, 2).contiguous()/;41s/.*//;42s/.*//' visual.py # It will replace memory_efficient_attention with some basic ops
```

2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using the scripts in `examples/cogvlm`

CogVLM uses a ViT as its vision encoder and a modified Llama as the decoder.

```bash
python ../cogvlm/convert_checkpoint.py --model_dir tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_models/${MODEL_NAME} --dtype bfloat16 --use_prompt_tuning

trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME} \
    --output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/llm \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --remove_input_padding enable \
    --max_batch_size 48 \
    --max_input_len 2048 \
    --max_seq_len 3076 \
    --paged_kv_cache enable \
    --bert_attention_plugin disable \
    --moe_plugin disable \
    --max_multimodal_len 61440 # 48 (max_batch_size) * 1280 (max_num_visual_features)
```

3. Generate TensorRT engines for visual components and combine everything into the final pipeline.

```bash
python build_multimodal_engine.py --model_type cogvlm --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 48 --output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/vision

python run.py \
    --max_new_tokens 1000 \
    --input_text " [INST] please describe this image in detail [/INST] " \
    --hf_model_dir tmp/hf_models/${TOKENIZER_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
    --batch_size 1 \
    --top_p 0.4 \
    --top_k 1 \
    --temperature 0.2 \
    --repetition_penalty 1.2 \
    --enable_context_fmha_fp32_acc
```

CogVLM uses `model_runner_cpp` for its LLM decoder by default. To switch to `model_runner`, add `--session python` to the command above.

## Deplot

1. Download Huggingface weights and convert the original checkpoint to the TRT-LLM checkpoint format, following the example in `examples/enc_dec/README.md`.

```bash
export MODEL_NAME="deplot"
git clone https://huggingface.co/google/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}

python ../enc_dec/convert_checkpoint.py --model_type pix2struct \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/float16 \
    --tp_size 1 \
    --pp_size 1 \
    --dtype float16
```

2. Build the TRT-LLM engine from the TRT-LLM checkpoint

```bash
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/float16/decoder \
    --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/llm/decoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --gemm_plugin float16 \
    --bert_attention_plugin float16 \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --context_fmha disable \
    --max_beam_width 1 \
    --max_batch_size 8 \
    --max_seq_len 2558 \
    --max_encoder_input_len 2048 \
    --max_input_len 1
```

The built Deplot engines are located in `tmp/trt_engines/${MODEL_NAME}/1-gpu/float16`.

3. Build TensorRT engines for visual components

```bash
python build_multimodal_engine.py --model_type pix2struct --model_path tmp/hf_models/${MODEL_NAME} --max_batch_size 8 --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/vision
```

The built visual engines are located in `tmp/trt_engines/${MODEL_NAME}/1-gpu/float16/vision`.

To run the Deplot pipeline with batch size > 1, change the `--max_batch_size` argument of `build_multimodal_engine.py` accordingly.

4. Assemble everything into the Deplot pipeline

```bash
python run.py \
    --max_new_tokens 100 \
    --input_text "" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float16
```

## Fuyu

1. Download Huggingface weights

```bash
export MODEL_NAME="fuyu-8b"
git clone https://huggingface.co/adept/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using the scripts in `examples/gpt`. The LLM portion of Fuyu uses a Persimmon model.

```bash
python ../gpt/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16 \
    --gpt_variant persimmon

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --use_fused_mlp=enable \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 2048
```

3. Generate TensorRT engines for visual components and combine everything into the final pipeline.

```bash
python build_multimodal_engine.py --model_type fuyu --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision

python run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```

## InternLM-XComposer2

**NOTE: We only support InternLM-XComposer-VL-7b for now**

First, install transformers 4.45.2:
```bash
pip install -r requirements-internlm-xcomposer2.txt
```

1. Convert the Huggingface weights to the TRT-LLM checkpoint format following `examples/internlm/README.md`.

2. Use the `trtllm-build` command to build the TRT-LLM engine for the LLM part.

3. The full list of commands is as follows:

```bash
export MODEL_NAME=internlm-xcomposer2-vl-7b
git lfs clone https://huggingface.co/internlm/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}

python ../internlm2/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --dtype float16 \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --lora_dir . \
    --max_lora_rank 256 \
    --max_input_len 1536 \
    --max_batch_size 48 \
    --max_multimodal_len 58800 # 58800 = 1225 (visual tokens per image) * 48 (max_batch_size), as each image corresponds to 1225 visual tokens in this ViT

python build_multimodal_engine.py \
    --model_type internlm-xcomposer2 \
    --model_path tmp/hf_models/${MODEL_NAME} \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu/vision \
    --max_batch_size 48

python run.py \
    --max_new_tokens 200 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --batch_size 1
```

## InternVL2

[InternVL Family](https://github.com/OpenGVLab/InternVL): closing the gap to commercial multimodal models with open-source suites, a pioneering open-source alternative to GPT-4o. Here we show how to deploy InternVL2-1B/InternVL2-2B/InternVL2-4B/InternVL2-8B/InternVL2-26B in TensorRT-LLM.

First, install transformers 4.37.2:
```bash
pip install transformers==4.37.2
```

1. Download Huggingface weights
- For InternVL2-1B
```bash
export MODEL_NAME="InternVL2-1B"
git clone https://huggingface.co/OpenGVLab/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export LLM_MODEL_NAME="qwen"
```

- For InternVL2-2B/InternVL2-8B/InternVL2-26B
```bash
export MODEL_NAME="InternVL2-2B" # or InternVL2-8B, InternVL2-26B
git clone https://huggingface.co/OpenGVLab/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export LLM_MODEL_NAME="internlm2"
```

- For InternVL2-4B
```bash
export MODEL_NAME="InternVL2-4B"
git clone https://huggingface.co/OpenGVLab/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
export LLM_MODEL_NAME="phi"
```

2. Convert Huggingface weights into TRT-LLM checkpoints
```bash
python ../${LLM_MODEL_NAME}/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16
```

3. Build TRT engines
```bash
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin auto \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 3328
```

4. Generate TensorRT engines for visual components and combine everything into the final pipeline.
```bash
python build_multimodal_engine.py --model_type internvl --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
python run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
    --image_path tmp/hf_models/${MODEL_NAME}/examples/image1.jpg
```

5. (Optional) FP8 and INT8 SmoothQuant quantization is supported for the InternVL2-4B variant (LLM model only).

```bash
# FP8 quantization
python ../quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8

# INT8 SmoothQuant quantization
python ../quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/int8/1-gpu \
    --dtype bfloat16 \
    --qformat int8_sq
```

Then follow the same `trtllm-build`, `build_multimodal_engine.py` and `run.py` steps as before.
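
For example, a sketch of the FP8 engine build, mirroring the step-3 flags but pointing at the quantized checkpoint (the output directory name below is an assumption following the same pattern):

```bash
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/llm \
    --gemm_plugin auto \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 3328
```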

## Kosmos-2

1. Download Huggingface weights

```bash
export MODEL_NAME="kosmos-2"
git clone https://huggingface.co/microsoft/kosmos-2-patch14-224 tmp/hf_models/${MODEL_NAME}
```

2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
```bash
python ../gpt/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16 \
    --gpt_variant ${MODEL_NAME}

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 512 \
    --max_seq_len 1024 \
    --max_multimodal_len 64 # 1 (max_batch_size) * 64 (num_visual_features)
```

3. Generate TensorRT engines for visual components and combine everything into the final pipeline.

```bash
python build_multimodal_engine.py --model_type kosmos-2 --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision

python run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu
```

## LLaVA, LLaVA-NeXT, LLaVA-OneVision and VILA

[LLaVA](https://github.com/haotian-liu/LLaVA) and [VILA](https://github.com/Efficient-Large-Model/VILA) are both visual language models (VLM) that can be deployed in TensorRT-LLM with many quantization options. [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) is an extension of LLaVA. TRT-LLM currently supports the [Mistral-7B](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and [Nous-Hermes-2-Yi-34B](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) variants of LLaVA-NeXT. [LLaVA-OneVision](https://huggingface.co/collections/llava-hf/llava-onevision-66bb1e9ce8856e210a7ed1fe) is another extension of LLaVA.

1. Download Huggingface model weights. These models have both visual and LLM components, unlike the BLIP2 example, which downloads only the LLM components from Huggingface.

For LLaVA,

```bash
export MODEL_NAME="llava-1.5-7b-hf" # also llava-1.5-13b-hf
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
For LLaVA-NeXT,

```bash
export MODEL_NAME="llava-v1.6-mistral-7b-hf" # for the 34B variant, "llava-v1.6-34b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

For LLaVA-OneVision,

```bash
export MODEL_NAME="llava-onevision-qwen2-7b-ov-hf" # also llava-onevision-qwen2-0.5b-ov-hf, llava-onevision-qwen2-72b-ov-hf, etc
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

For VILA, we need a few more steps until it is added to the HF model zoo:

```bash
# install the following dependency
pip install -r requirements-vila.txt

# clone original VILA repo
export VILA_PATH="tmp/hf_models/VILA"
git clone https://github.com/Efficient-Large-Model/VILA.git ${VILA_PATH}

# download VILA checkpoints
export MODEL_NAME="vila1.5-3b" # NOTE: name must contain vila or VILA! it's used to identify whether we need to register the non-HF VILA codebase in HF Auto class
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

2. Generate the TRT-LLM engine for the underlying LLM (LLaMA or Qwen) following the examples in `examples/llama/README.md` and `examples/qwen/README.md`

```bash
python ../llama/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

# for LLaVA
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --use_fused_mlp=enable \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 576 # 1 (max_batch_size) * 576 (num_visual_features)

# for LLaVA-NeXT
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --use_fused_mlp=enable \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 5120 \
    --max_num_tokens 4096 \
    --max_multimodal_len 4096 # 1 (max_batch_size) * 4096 (max_input_len)

# for VILA
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --use_fused_mlp=enable \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 196 # 1 (max_batch_size) * 196 (num_visual_features)

# for LLaVA-OneVision
python ../qwen/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin float16 \
    --use_fused_mlp=enable \
    --max_batch_size 1 \
    --max_input_len 7228 \
    --max_seq_len 7328 \
    --max_multimodal_len 7128 # max_batch_size * num_visual_features (depends on the image size or the specified number of video frames)
```

3. Build TensorRT engines for visual components

```bash
python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava # for LLaVA

python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava_next --max_batch_size 5 # 1 (max_batch_size) * 5 (because the LLaVA-NeXT visual encoder can have at most 5 patches) # for LLaVA-NeXT

python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type llava_onevision --max_batch_size 32 # max_batch_size * patches per image or frames per video # for LLaVA-OneVision

python build_multimodal_engine.py --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision --model_type vila --vila_path ${VILA_PATH} # for VILA
```

```bash
python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --input_text "\n Which city is this?" # for LLaVA and LLaVA-NeXT

python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_path=https://github.com/Efficient-Large-Model/VILA/raw/main/demo_images/av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text "\n Please elaborate what you see in the images?","\n Which city is this?" \
    --batch_size=2 # for LLaVA
```

Note that LLaVA supports batching N <one image, one prompt text> pairs: use `--batch_size=N`, list N images under `--image_path` and N text prompts under `--input_text`. Don't forget to set `--max_batch_size` and `--max_multimodal_len` accordingly during engine building.

For LLaVA-OneVision, you can use either an image or a video as input.
```bash
python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --input_text "What is shown in this image?" \
    --image_path image.png

python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --input_text "Why is this video funny?" \
    --video_path video.mp4 \
    --video_num_frames 8 # sample 8 frames uniformly from the video, up to 32 frames
```

For VILA, you can use either a local file or a web URL as the input image.
Suppose you have a local image `av.png` downloaded from `https://github.com/Efficient-Large-Model/VILA/blob/main/demo_trt_llm/av.png` and the URL of `merlion.png`:
```bash
wget -O av.png https://raw.githubusercontent.com/Efficient-Large-Model/VILA/main/demo_images/av.png

python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_path=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --batch_size=1 # for VILA mode 1

python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_path=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n Please elaborate what you see in the images?","<image>\n Which city is this?" \
    --batch_size=2 \
    --check_accuracy # for VILA mode 2
```

Note that VILA supports different batching modes:
- Mode 1: to query N images as a whole with one prompt, use `--batch_size=1` (the default value). An example is given above.
- Mode 2: to query N <one image, one prompt text> pairs, use `--batch_size=N`. There should be N images listed under `--image_path` and N text prompts listed under `--input_text`. Don't forget to set `--max_batch_size` and `--max_multimodal_len` accordingly during engine building.

Note: use `--run_profiling` for performance measurement and `--check_accuracy` for an accuracy check.

4. (Optional) Different quantization methods supported in LLaMA and Qwen can be applied to LLaVA/VILA/LLaVA-OneVision as well, such as INT4/INT8 weight-only, SmoothQuant, and INT4 Activation-Aware Quantization (AWQ). Detailed instructions can be found in the LLaMA [README](../llama/README.md) and Qwen [README](../qwen/README.md).

For example,

```bash
# INT4 weight only
python ../llama/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --dtype float16 \
    --output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
    --use_weight_only \
    --weight_only_precision int4

# INT4 AWQ
python ../quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --dtype float16 \
    --qformat int4_awq \
    --calib_size 32
```

Then follow the same `trtllm-build` and `run.py` steps as before. NOTE: do not use `--use_fused_mlp=enable` with the `trtllm-build` command in these quantization modes.
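
For instance, a sketch of an INT4 weight-only LLaVA build that reuses the FP16 flags from step 2 but omits `--use_fused_mlp=enable`:

```bash
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 576 # 1 (max_batch_size) * 576 (num_visual_features)
```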

## MLLaMA

This section shows how to build and run a LLaMA-3.2 Vision model in TensorRT-LLM. We use [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) as an example.

For the LLaMA-3.2 text model, please refer to [examples/llama/README.md](../llama/README.md), since it shares the LLaMA model architecture.

### Supported data types <!-- omit from toc -->
* BF16
* Tensor Parallel
* INT8 & INT4 Weight-Only
* FP8

### Build and run vision model <!-- omit from toc -->

* Build the engine of the vision encoder model

```bash
python examples/multimodal/build_multimodal_engine.py --model_type mllama \
    --model_path Llama-3.2-11B-Vision/ \
    --output_dir /tmp/mllama/trt_engines/vision/
```

* Build the engine of the decoder model

```bash
python examples/mllama/convert_checkpoint.py --model_dir Llama-3.2-11B-Vision/ \
    --output_dir /tmp/mllama/trt_ckpts \
    --dtype bfloat16

trtllm-build --checkpoint_dir /tmp/mllama/trt_ckpts \
    --output_dir /tmp/mllama/trt_engines/llm/ \
    --max_num_tokens 4096 \
    --max_seq_len 2048 \
    --workers 1 \
    --gemm_plugin auto \
    --max_batch_size 4 \
    --max_encoder_input_len 4100 \
    --input_timing_cache model.cache
```

Note that for the instruct vision model, please set `--max_encoder_input_len` to `6404`.
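
For the instruct variant, the build command is identical except for that one flag (a sketch, assuming `/tmp/mllama/trt_ckpts` was converted from the instruct checkpoint):

```bash
trtllm-build --checkpoint_dir /tmp/mllama/trt_ckpts \
    --output_dir /tmp/mllama/trt_engines/llm/ \
    --max_num_tokens 4096 \
    --max_seq_len 2048 \
    --workers 1 \
    --gemm_plugin auto \
    --max_batch_size 4 \
    --max_encoder_input_len 6404 \
    --input_timing_cache model.cache
```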

* Run a test with `multimodal/run.py` using the C++ runtime (LLM part only)

```bash
python3 examples/multimodal/run.py --engine_dir /tmp/mllama/trt_engines/ \
    --hf_model_dir Llama-3.2-11B-Vision/ \
    --image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
    --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
    --max_new_tokens 50 \
    --batch_size 2
```

`model_runner_cpp` is used by default. To switch to `model_runner`, add `--session python` to the command above.

```bash
python3 examples/multimodal/eval.py \
    --engine_dir /tmp/mllama/trt_engines/ \
    --hf_model_dir Llama-3.2-11B-Vision/ \
    --test_trtllm \
    --accuracy_threshold 65 \
    --eval_task lmms-lab/ai2d
```

### Run MLLaMA decoder part by FP8 <!-- omit from toc -->

```bash
# install modelopt 0.21.0
pip install nvidia-modelopt[torch]~=0.21.0

python ./examples/quantization/quantize.py --model_dir Llama-3.2-11B-Vision/ \
    --dtype bfloat16 \
    --qformat fp8 \
    --output_dir /tmp/llama-3.2-11B-Vision/fp8/ \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --calib_dataset scienceqa

trtllm-build --checkpoint_dir /tmp/llama-3.2-11B-Vision/fp8/ \
    --output_dir /tmp/trt_engines/llama-3.2-11B-Vision/fp8/llm \
    --max_num_tokens 4096 \
    --max_seq_len 2048 \
    --workers 1 \
    --gemm_plugin auto \
    --max_batch_size 4 \
    --max_encoder_input_len 4100 \
    --input_timing_cache model.cache \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable

# copy the visual engine directory `/tmp/mllama/trt_engines/vision/` to the fp8 engine directory `/tmp/trt_engines/llama-3.2-11B-Vision/fp8/vision`
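# for example, one way to do this (paths as above):
cp -r /tmp/mllama/trt_engines/vision/ /tmp/trt_engines/llama-3.2-11B-Vision/fp8/vision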

python3 examples/multimodal/run.py --engine_dir /tmp/trt_engines/llama-3.2-11B-Vision/fp8/ \
    --hf_model_dir Llama-3.2-11B-Vision/ \
    --image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
    --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
    --max_new_tokens 50 \
    --batch_size 2

python3 examples/multimodal/eval.py --engine_dir /tmp/trt_engines/llama-3.2-11B-Vision/fp8/ \
    --hf_model_dir Llama-3.2-11B-Vision/ \
    --test_trtllm \
    --accuracy_threshold 65 \
    --eval_task lmms-lab/ai2d
```

Note that for the instruct vision model, please set `--max_encoder_input_len` to `6404`.

## NeVA

[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/neva/index.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.

1. Generate the TRT-LLM engine for NVGPT following the example in `examples/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.

```bash
export MODEL_NAME="neva"
python ../gpt/convert_checkpoint.py \
    --nemo_ckpt_path ./${MODEL_NAME}.nemo \
    --dtype bfloat16 \
    --output_dir tmp/trt_models/${MODEL_NAME} \
    --nemo_rename_key model:model.language_model \
        attention.linear_qkv.layer_norm_bias:input_layernorm.bias \
        attention.linear_qkv.layer_norm_weight:input_layernorm.weight \
        mlp.linear_fc1.layer_norm_bias:post_attention_layernorm.bias \
        mlp.linear_fc1.layer_norm_weight:post_attention_layernorm.weight \
        linear_qkv:query_key_value \
        linear_fc1:dense_h_to_4h \
        linear_fc2:dense_4h_to_h \
        linear_proj:dense \
        decoder:encoder

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME} \
    --output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/llm \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 729 # 1 (max_batch_size) * 729 (num_visual_features)
```

2. Build TensorRT engines for visual components

```bash
python build_multimodal_engine.py --model_path ./${MODEL_NAME}.nemo --model_type neva --output_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu/vision
```

```bash
python run.py \
    --max_new_tokens 30 \
    --hf_model_dir tmp/trt_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/bf16/1-gpu \
    --input_text "Question: which city is this? Answer:"
```

Note: use `--run_profiling` for performance measurement and `--check_accuracy` for an accuracy check.

## Nougat

1. Download Huggingface weights

```bash
export MODEL_NAME="nougat-base" # also nougat-small
git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/enc_dec`

Nougat uses the mBART architecture but replaces the LLM encoder with a Swin Transformer encoder. To achieve this, we add an extra `--nougat` flag (over the mBART example) to `convert_checkpoint.py` in `examples/enc_dec` and to `trtllm-build`.

```bash
python ../enc_dec/convert_checkpoint.py --model_type bart \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
    --tp_size 1 \
    --pp_size 1 \
    --dtype bfloat16 \
    --nougat

trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
    --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/llm/decoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --gemm_plugin bfloat16 \
    --bert_attention_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --remove_input_padding enable \
    --max_beam_width 1 \
    --max_batch_size 1 \
    --max_seq_len 101 \
    --max_input_len 1 \
    --max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)
```

3. Generate TensorRT engines for visual components and combine everything into the final pipeline.

```bash
python build_multimodal_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/vision

python run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16
```

Note: Nougat models usually do not need a text prompt.

## Phi-3-vision

1. Download Huggingface weights

```bash
export MODEL_NAME="Phi-3-vision-128k-instruct" # or Phi-3.5-vision-instruct
git clone https://huggingface.co/microsoft/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/phi`.
```bash
python ../phi/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 4096
```

3. Generate TensorRT engines for visual components and combine everything into the final pipeline.

```bash
python build_multimodal_engine.py --model_type phi-3-vision --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision

python run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --kv_cache_free_gpu_memory_fraction 0.7 \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
    --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
```

## Qwen2-VL
[Qwen2-VL Family](https://github.com/QwenLM/Qwen2-VL) is the latest generation of vision language models in the Qwen model family. Here we show how to deploy Qwen2-VL 2B and 7B in TensorRT-LLM.

First, install transformers and qwen-vl-utils:
```bash
pip install -r requirements-qwen2vl.txt
```
### Supported data types <!-- omit from toc -->
* FP16
* FP8

### Build and run vision model <!-- omit from toc -->
* Download Huggingface weights
```bash
export MODEL_NAME="Qwen2-VL-7B-Instruct" # or Qwen2-VL-2B-Instruct
git clone https://huggingface.co/Qwen/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
* Build the engine of the decoder model

```bash
python3 ../qwen/convert_checkpoint.py \
    --model_dir=tmp/hf_models/${MODEL_NAME} \
    --output_dir=tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
    --gemm_plugin=float16 \
    --gpt_attention_plugin=float16 \
    --max_batch_size=4 \
    --max_input_len=2048 \
    --max_seq_len=3072 \
    --max_multimodal_len=1296 # 4 (max_batch_size) * 324 (num_visual_features); this is for image_shape=[504,504]
```

* Build the engine of the vision encoder model
```bash
python build_multimodal_engine.py --model_type qwen2_vl --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
```

* Run a test with `multimodal/run.py` using the C++ runtime (LLM part only)
```bash
python3 run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/
```
### Run Qwen2-VL decoder part by FP8 <!-- omit from toc -->
* Build the engine
```bash
python ./examples/quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
    --calib_size 512

trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp8/1-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/llm \
    --max_input_len=2048 \
    --max_seq_len 3072 \
    --gemm_plugin auto \
    --max_batch_size 4 \
    --max_multimodal_len=1296

# copy the visual engine directory `tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision/` to the fp8 engine directory `tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/vision`
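# for example, one way to do this (paths as above):
cp -r tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/vision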
```

* Run a test with `multimodal/run.py` using the C++ runtime (LLM part only)
```bash
python3 run.py \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp8/1-gpu/
```

## Video NeVA

[Video NeVA](https://github.com/NVIDIA/NeMo/blob/main/docs/source/multimodal/mllm/video_neva.rst) is a groundbreaking addition to the NeMo Multimodal ecosystem that works with the video modality. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.

1. Generate a TRT-LLM engine for the Nemotron model following the example in `examples/nemotron/README.md`, adhering to the NVGPT conventions of the conversion script. This will be used as our base LM for inference.

```bash
pip install decord # used for loading video

python3 ../quantization/quantize.py \
    --nemo_ckpt_path /path/to/nemotron/model.nemo \
    --dtype bfloat16 \
    --batch_size 64 \
    --qformat full_prec \
    --output_dir nemotron-3/trt_ckpt/bf16/1-gpu

trtllm-build \
    --checkpoint_dir nemotron-3/trt_ckpt/bf16/1-gpu \
    --output_dir tmp/trt_engines/nemotron-3/bf16/1-gpu/llm \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4352 \
    --max_multimodal_len 3072 # 1 (max_batch_size) * 12 (num_frames) * 256 (image_token_len)
```

2. Build TensorRT engines for visual components

```bash
python build_multimodal_engine.py --model_path /path/to/video/neva/projector.nemo --model_type video-neva --output_dir tmp/trt_engines/nemotron-3/bf16/1-gpu/vision
```

```bash
python run.py \
    --max_new_tokens 30 \
    --hf_model_dir nemotron-3/trt_ckpt/bf16/1-gpu \
    --engine_dir tmp/trt_engines/nemotron-3/bf16/1-gpu \
    --input_text "Question: what is in the video? Answer:" \
    --video_path /path/to/your/local/video/file
```

Note: use `--run_profiling` for performance measurement and `--check_accuracy` for an accuracy check.

## Dataset Evaluation

This section explains how to evaluate datasets using our provided script, including the supported models and configurations.

### Evaluation Command <!-- omit from toc -->
To run an evaluation, use the following command:

```bash
python ./examples/multimodal/eval.py \
    --model_type <model_type> \
    --engine_dir <engine_dir> \
    --hf_model_dir <hf_model_dir> \
    --dataset_dir <dataset_dir> \
    --test_trtllm (or --test_hf, or both) \
    --accuracy_threshold <threshold> \
    --eval_task <eval_task> \
    --max_ite 20 \
    --visual_engine_name <visual_engine_name>
```

### Parameters <!-- omit from toc -->
- `--model_type`: Specify the model type to evaluate.
- `--engine_dir`: Path to the model engines directory.
- `--hf_model_dir`: Path to the Hugging Face model directory.
- `--dataset_dir`: Path to the dataset directory. If not specified, the dataset is loaded from HF using `--eval_task` as the dataset tag.
- `--test_trtllm` or `--test_hf`: Specify which evaluation framework to use. Both can be used simultaneously.
- `--accuracy_threshold`: Set the accuracy threshold for the evaluation.
- `--eval_task`: Specify the evaluation task. Supported tasks: `['lmms-lab/ai2d', 'lmms-lab/VQAv2', 'lmms-lab/MME']`. Defaults to `'lmms-lab/VQAv2'`.
- `--max_ite`: Maximum number of iterations; defaults to 20.
- `--visual_engine_name`: Name of the visual engine.

### Supported Evaluation Tasks <!-- omit from toc -->
The following evaluation tasks are supported:
- `lmms-lab/ai2d`
- `lmms-lab/VQAv2`
- `lmms-lab/MME`

### Supported Model Types <!-- omit from toc -->
The script supports the following model types:
- `blip2`
- `fuyu`
- `kosmos-2`
- `llava`
- `llava_next`
- `llava_onevision`
- `phi-3-vision`
- `qwen2_vl`
- `mllama`
- `vila`
- `cogvlm`
- `neva`
- `internvl`

**Note:** The models `vila`, `cogvlm`, `neva`, and `internvl` do not support the `--test_hf` evaluation framework.
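
For example, a hypothetical evaluation of the LLaVA engines built earlier on `lmms-lab/ai2d` could look like this:

```bash
python ./examples/multimodal/eval.py \
    --model_type llava \
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --test_trtllm \
    --accuracy_threshold 65 \
    --eval_task lmms-lab/ai2d \
    --max_ite 20
```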

## Enabling Tensor Parallelism for multi-GPU

The LLM part of the pipeline can be run on multiple GPUs using tensor parallelism. The visual encoder will be replicated on each GPU and operate in a data-parallel fashion.

To enable tensor parallelism, both the weight conversion step (from Huggingface to the TRT-LLM checkpoint format) and the engine building step need additional arguments. Finally, `run.py` should be prefixed with `mpirun -n NUM_GPUS --allow-run-as-root`.

The full set of commands to enable 2-way tensor parallelism for LLaVA is:

```bash
export MODEL_NAME="llava-1.5-7b-hf"

python ../llama/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/2-gpu \
    --dtype float16 --tp_size 2

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/2-gpu \
    --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu/llm \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560 \
    --max_multimodal_len 576

python build_multimodal_engine.py --model_type llava --model_path tmp/hf_models/${MODEL_NAME} --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu/vision

mpirun -n 2 --allow-run-as-root \
    python run.py \
        --max_new_tokens 30 \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/2-gpu
```