From 25533a77362a3b15c64c1d76902bd0970d4870ac Mon Sep 17 00:00:00 2001
From: mayani-nv <67936769+mayani-nv@users.noreply.github.com>
Date: Fri, 9 May 2025 15:42:58 -0700
Subject: [PATCH] Updating the multimodal models README to add steps for running phi-4-multimodal instruct (#3932)

* Update run.py for draft_target_model

This change makes the draft target model work without a mismatch in the vocab size

Signed-off-by: mayani-nv <67936769+mayani-nv@users.noreply.github.com>

* updating README with phi-4-multimodal-instruct steps

* adding ENGINE_DIR, HF_DIR and CKPT_DIR as per review

* addressing review comments on PR

* updating readme

---------

Signed-off-by: mayani-nv <67936769+mayani-nv@users.noreply.github.com>
Co-authored-by: rakib-hasan
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
---
 examples/models/core/multimodal/README.md | 46 +++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/examples/models/core/multimodal/README.md b/examples/models/core/multimodal/README.md
index 29822c1129..342c05fa89 100644
--- a/examples/models/core/multimodal/README.md
+++ b/examples/models/core/multimodal/README.md
@@ -20,6 +20,7 @@ We first describe three runtime modes for running multimodal models and how to r
 - [NeVA](#neva)
 - [Nougat](#nougat)
 - [Phi-3-vision](#phi-3-vision)
+- [Phi-4-multimodal](#phi-4-multimodal)
 - [Qwen2-VL](#qwen2-vl)
 - [Video NeVA](#video-neva)
 - [Dataset Evaluation](#dataset-evaluation)
@@ -49,6 +50,7 @@ Not all models supports end-to-end `cpp` mode, the checked ones below are suppor
 - [x] NeVA
 - [ ] Nougat [^1]
 - [ ] Phi-3-Vision [^2]
+- [ ] Phi-4-multimodal
 - [ ] Qwen2-VL [^4]
 - [x] Video-NeVA

@@ -967,7 +969,51 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
     --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
     --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
 ```
+## Phi-4-multimodal
+Navigate to the folder `TensorRT-LLM/examples/models/core/multimodal`.
+1. Download Huggingface weights
+
+    ```bash
+    export MODEL_NAME="Phi-4-multimodal-instruct"
+    export HF_DIR="tmp/hf_models/${MODEL_NAME}"
+    export CKPT_DIR="tmp/trt_models/${MODEL_NAME}/fp16/1-gpu"
+    export ENGINE_DIR="tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu"
+    git clone https://huggingface.co/microsoft/${MODEL_NAME} ${HF_DIR}
+    ```
+
+2. Convert the Huggingface weights into TRT-LLM checkpoints and build TRT engines using the scripts in `examples/models/core/phi`.
+
+    ```bash
+    python ../phi/convert_checkpoint.py \
+        --model_dir ${HF_DIR} \
+        --output_dir ${CKPT_DIR} \
+        --dtype float16
+
+    trtllm-build \
+        --checkpoint_dir ${CKPT_DIR} \
+        --output_dir ${ENGINE_DIR} \
+        --gpt_attention_plugin float16 \
+        --gemm_plugin float16 \
+        --max_batch_size 1 \
+        --max_input_len 4096 \
+        --max_seq_len 4608 \
+        --max_multimodal_len 4096
+    ```
+
+3. Build the multimodal encoder components and run the end-to-end pipeline.
+*Note: for Phi-4-multimodal the encoders are not TRT engines; they run as pure PyTorch modules.*
+
+    ```bash
+    python build_multimodal_engine.py --model_type phi-4-multimodal --model_path ${HF_DIR} --output_dir ${ENGINE_DIR}
+
+    python run.py \
+        --hf_model_dir ${HF_DIR} \
+        --kv_cache_free_gpu_memory_fraction 0.7 \
+        --engine_dir ${ENGINE_DIR} \
+        --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
+        --audio_path=${HF_DIR}/examples/what_is_shown_in_this_image.wav
+    ```

 ## Qwen2-VL

 [Qwen2-VL Family](https://github.com/QwenLM/Qwen2-VL): is the latest version of the vision language models in the Qwen model families. Here we show how to deploy Qwen2-VL 2B and 7B in TensorRT-LLM.
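
For reviewers who want to try the added section end to end, the sketch below simply chains the commands introduced in this patch into one script. It is a convenience sketch, not part of the patch itself: it assumes it is run from `TensorRT-LLM/examples/models/core/multimodal` after the README change is applied, and that `git-lfs` is installed so the clone fetches the actual weight files.

```bash
#!/usr/bin/env bash
# Convenience sketch: runs the Phi-4-multimodal steps from this patch in sequence.
# Assumptions: current directory is TensorRT-LLM/examples/models/core/multimodal,
# git-lfs is installed, and a GPU with enough memory for the fp16 engine is available.
set -euo pipefail

export MODEL_NAME="Phi-4-multimodal-instruct"
export HF_DIR="tmp/hf_models/${MODEL_NAME}"
export CKPT_DIR="tmp/trt_models/${MODEL_NAME}/fp16/1-gpu"
export ENGINE_DIR="tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu"

# Step 1: download the Hugging Face checkpoint.
git clone https://huggingface.co/microsoft/${MODEL_NAME} ${HF_DIR}

# Step 2: convert to a TRT-LLM checkpoint and build the LLM engine.
python ../phi/convert_checkpoint.py \
    --model_dir ${HF_DIR} \
    --output_dir ${CKPT_DIR} \
    --dtype float16

trtllm-build \
    --checkpoint_dir ${CKPT_DIR} \
    --output_dir ${ENGINE_DIR} \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 4096

# Step 3: prepare the multimodal components and run with an image and an audio clip.
python build_multimodal_engine.py --model_type phi-4-multimodal --model_path ${HF_DIR} --output_dir ${ENGINE_DIR}

python run.py \
    --hf_model_dir ${HF_DIR} \
    --kv_cache_free_gpu_memory_fraction 0.7 \
    --engine_dir ${ENGINE_DIR} \
    --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --audio_path=${HF_DIR}/examples/what_is_shown_in_this_image.wav
```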