Updating the multimodal models README to add steps for running phi-4-multimodal instruct (#3932)
* Update run.py for draft_target_model: this change makes the draft target model work without a mismatch in the vocab size
* updating README with phi-4-multimodal-instruct steps
* adding ENGINE_DIR, HF_DIR and CKPT_DIR as per review
* addressing review comments on PR
* updating readme

Signed-off-by: mayani-nv <67936769+mayani-nv@users.noreply.github.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
parent 2d0f93a054
commit 25533a7736
@@ -20,6 +20,7 @@ We first describe three runtime modes for running multimodal models and how to r
- [NeVA](#neva)
- [Nougat](#nougat)
- [Phi-3-vision](#phi-3-vision)
- [Phi-4-multimodal](#phi-4-multimodal)
- [Qwen2-VL](#qwen2-vl)
- [Video NeVA](#video-neva)
- [Dataset Evaluation](#dataset-evaluation)
@@ -49,6 +50,7 @@ Not all models supports end-to-end `cpp` mode, the checked ones below are suppor
- [x] NeVA
- [ ] Nougat [^1]
- [ ] Phi-3-Vision [^2]
- [ ] Phi-4-multimodal
- [ ] Qwen2-VL [^4]
- [x] Video-NeVA
@@ -967,7 +969,51 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
    --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
    --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
```
## Phi-4-multimodal

Navigate to the folder `TensorRT-LLM/examples/models/core/multimodal`.
1. Download Hugging Face weights

```bash
export MODEL_NAME="Phi-4-multimodal-instruct"
export HF_DIR="tmp/hf_models/${MODEL_NAME}"
export CKPT_DIR="tmp/trt_models/${MODEL_NAME}/fp16/1-gpu"
export ENGINE_DIR="tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu"
git clone https://huggingface.co/microsoft/${MODEL_NAME} ${HF_DIR}
```
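The large weight files in the Hugging Face repo are served via Git LFS; if the clone above leaves only pointer files behind, pulling them explicitly usually resolves it (a minimal sketch, assuming `git-lfs` is installed on the machine):

```bash
# If the clone fetched only LFS pointer files instead of the actual
# weight shards, pull them explicitly (requires git-lfs).
cd ${HF_DIR} && git lfs pull && cd -
```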
2. Convert Hugging Face weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/phi`.

```bash
python ../phi/convert_checkpoint.py \
    --model_dir ${HF_DIR} \
    --output_dir ${CKPT_DIR} \
    --dtype float16

trtllm-build \
    --checkpoint_dir ${CKPT_DIR} \
    --output_dir ${ENGINE_DIR} \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 4096
```
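As a quick check that the build picked up the limits above, you can inspect the generated engine configuration (a hedged sketch, assuming `trtllm-build` records its build settings in a `config.json` under `${ENGINE_DIR}`, as in other examples in this repo):

```bash
# Print the max_* limits recorded alongside the built engine
# (assumes the build wrote ${ENGINE_DIR}/config.json).
grep -o '"max_[a-z_]*": *[0-9]*' ${ENGINE_DIR}/config.json
```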
3. Generate TensorRT engines for the visual components and combine everything into the final pipeline.

*Note: the encoders here are not TRT engines but pure PyTorch modules.*

```bash
python build_multimodal_engine.py --model_type phi-4-multimodal --model_path ${HF_DIR} --output_dir ${ENGINE_DIR}

python run.py \
    --hf_model_dir ${HF_DIR} \
    --kv_cache_free_gpu_memory_fraction 0.7 \
    --engine_dir ${ENGINE_DIR} \
    --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --audio_path=${HF_DIR}/examples/what_is_shown_in_this_image.wav
```
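For an image-only query, the same invocation works with `--audio_path` dropped (a hedged variant of the command above, using only the flags already shown in this README):

```bash
# Same pipeline, image input only (no audio clip supplied).
python run.py \
    --hf_model_dir ${HF_DIR} \
    --kv_cache_free_gpu_memory_fraction 0.7 \
    --engine_dir ${ENGINE_DIR} \
    --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
```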
## Qwen2-VL

[Qwen2-VL Family](https://github.com/QwenLM/Qwen2-VL) is the latest version of the vision-language models in the Qwen model family. Here we show how to deploy Qwen2-VL 2B and 7B in TensorRT-LLM.