# ViT in MultiModal
This document uses the LLaVA-NeXT model as an example to show how to build the vision encoder (ViT) in TRT-LLM.
LLaVA-NeXT is an extension of LLaVA. TRT-LLM currently supports the Mistral-7B and Nous-Hermes-2-Yi-34B variants of LLaVA-NeXT.
- Download the Huggingface model weights. These models have both the visual and LLM components, unlike the BLIP2 example, which downloads only the LLM components from Huggingface.

  ```bash
  export MODEL_NAME="llava-v1.6-mistral-7b-hf" # for the 34B variant, use "llava-v1.6-34b-hf"
  git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
  ```
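  The clone above relies on `git lfs` to fetch the actual weight files; without it, the repository contains only small pointer stubs. A minimal sanity check, assuming `git-lfs` is installed and the paths above were used:

  ```bash
  # Make sure git-lfs is active and the weight shards were actually downloaded,
  # not just the LFS pointer stubs (shards should be several GB each).
  git lfs install
  cd tmp/hf_models/${MODEL_NAME}
  git lfs pull
  ls -lh *.safetensors
  cd -
  ```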
- Generate the TRT-LLM engine for the visual component:

  ```bash
  python ./convert_checkpoint.py \
      --model_dir tmp/hf_models/${MODEL_NAME} \
      --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/vision \
      --dtype float16

  trtllm-build \
      --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/vision \
      --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision \
      --remove_input_padding disable \
      --bert_attention_plugin disable \
      --max_batch_size 8

  # copy the image newlines tensor to the engine directory
  cp tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/vision/image_newlines.safetensors tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
  ```
- Generate the TRT-LLM engine for the LLaMA component, following the example in `examples/models/core/llama/README.md`:

  ```bash
  python ../llama/convert_checkpoint.py \
      --model_dir tmp/hf_models/${MODEL_NAME} \
      --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/llm \
      --dtype float16

  trtllm-build \
      --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/llm \
      --output_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm \
      --gpt_attention_plugin float16 \
      --gemm_plugin float16 \
      --use_fused_mlp=enable \
      --max_batch_size 8 \
      --max_input_len 4096 \
      --max_seq_len 5120 \
      --max_num_tokens 32768 \
      --max_multimodal_len 32768 # 8 (max_batch_size) * 4096 (max_input_len)
  ```
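  Note that the run step below points `--engine_dir` at the parent directory `tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu`, so the vision and LLM engines built above must sit side by side under it. A quick sanity check of the expected layout (directory names as used in the commands above; exact engine file names may differ between releases):

  ```bash
  # Vision engine plus the copied image_newlines.safetensors
  ls tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
  # LLM engine plus its config.json
  ls tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/llm
  ```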
- Run:

  ```bash
  python ../multimodal/run.py \
      --max_new_tokens 30 \
      --hf_model_dir tmp/hf_models/${MODEL_NAME} \
      --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --input_text "Question: which city is this? Answer:" \
  ```
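  The prompt above relies on the script's default test image. To try your own input, recent TRT-LLM releases expose an `--image_path` argument on `run.py`; the flag name here is an assumption, so check `python ../multimodal/run.py --help` for the exact options in your version:

  ```bash
  # Assumes run.py accepts --image_path in your TRT-LLM version; verify with --help.
  python ../multimodal/run.py \
      --max_new_tokens 30 \
      --hf_model_dir tmp/hf_models/${MODEL_NAME} \
      --engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
      --image_path /path/to/your/image.png \
      --input_text "Question: what is shown in the image? Answer:"
  ```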
- (Optional) The quantization methods supported for LLaMA can be applied to LLaVA-NeXT as well, such as INT4/INT8 weight-only, SmoothQuant, and INT4 Activation-Aware Quantization (AWQ). Detailed instructions can be found in the LLaMA README.

  For example:

  ```bash
  # INT4 weight only
  python ../llama/convert_checkpoint.py \
      --model_dir tmp/hf_models/${MODEL_NAME} \
      --dtype float16 \
      --output_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
      --use_weight_only \
      --weight_only_precision int4

  # INT4 AWQ
  python ../../../quantization/quantize.py \
      --model_dir tmp/hf_models/${MODEL_NAME} \
      --output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu/llm \
      --dtype float16 \
      --qformat int4_awq \
      --calib_size 32
  ```

  Then follow the same `trtllm-build` and `run.py` steps as before. NOTE: for the `trtllm-build` command, do not use `--use_fused_mlp=enable` in these quantization modes.
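  As an illustration, a build command for the INT4 weight-only checkpoint could look like the FP16 one above with the fused-MLP flag dropped; the output directory below is just a suggested name mirroring the convert step:

  ```bash
  # Same build settings as the FP16 LLM engine above, minus --use_fused_mlp,
  # pointed at the INT4 weight-only checkpoint.
  trtllm-build \
      --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
      --output_dir tmp/trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu/llm \
      --gpt_attention_plugin float16 \
      --gemm_plugin float16 \
      --max_batch_size 8 \
      --max_input_len 4096 \
      --max_seq_len 5120 \
      --max_num_tokens 32768 \
      --max_multimodal_len 32768
  ```

  When running, keep in mind that `--engine_dir` expects the vision and LLM engines side by side, so the vision engine (and `image_newlines.safetensors`) would likely need to be placed next to the quantized LLM engine, mirroring the FP16 layout.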