Smaug
This document describes how to build the Smaug-72B-v0.1 model into runnable TensorRT engines on a multi-GPU node and how to perform a summarization task using those engines.
Overview
The TensorRT-LLM support for Smaug-72B-v0.1 is based on the LLaMA model; the implementation can be found in tensorrt_llm/models/llama/model.py. The Smaug model closely resembles LLaMA, except that it uses a bias term in its attention module. We therefore reuse the LLaMA example code for Smaug:
- convert_checkpoint.py to convert the model into the TensorRT-LLM checkpoint format.
In addition, two shared files in the parent examples folder are used for inference and evaluation:
- ../../../run.py to run inference on an input text;
- ../../../summarize.py to summarize the articles in the cnn_dailymail dataset.
Support Matrix
- FP16
Usage
This section walks through the whole process: converting the HF model, building TensorRT-LLM engines, and finally performing summarization.
Build TensorRT engine(s)
Run the following commands; TensorRT-LLM will first transform the HF model into its own checkpoint format and then build a TRT engine based on that checkpoint:
python ../../../llama/convert_checkpoint.py \
    --model_dir ./Smaug-72B-v0.1 \
    --output_dir ./tllm_checkpoint_8gpu_tp8 \
    --dtype float16 \
    --tp_size 8
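Before building, it can help to sanity-check the conversion output. A TensorRT-LLM checkpoint directory typically contains a config.json plus one rank<i>.safetensors file per tensor-parallel rank (a minimal check; exact file names may vary between TensorRT-LLM versions):

ls ./tllm_checkpoint_8gpu_tp8
# expected for --tp_size 8: config.json, rank0.safetensors ... rank7.safetensors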
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8 \
    --output_dir ./Smaug_72B_tp8 \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --max_batch_size 64 \
    --remove_input_padding enable
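Because the engine was built with --tp_size 8, inference must be launched with 8 MPI ranks. As a quick smoke test before the full summarization run, you can use the shared run.py script (a sketch; the prompt and --max_output_len value are arbitrary):

mpirun -n 8 -allow-run-as-root python ../../../run.py \
    --engine_dir ./Smaug_72B_tp8 \
    --tokenizer_dir ./Smaug-72B-v0.1 \
    --input_text "Born in north-east France, Soyer trained as a" \
    --max_output_len 64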
Run Summarization
After building the TRT engine, we can use it to perform various tasks. TensorRT-LLM provides handy code to run summarization on the cnn_dailymail dataset and compute ROUGE scores. The ROUGE-1 score can be used to validate model implementations.
mpirun -n 8 -allow-run-as-root python ../../../summarize.py \
    --hf_model_dir ./Smaug-72B-v0.1 \
    --engine_dir ./Smaug_72B_tp8 \
    --data_type fp16 \
    --test_hf \
    --hf_device_map_auto \
    --test_trt_llm
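The --test_hf and --hf_device_map_auto flags additionally run the original HF model as a baseline. If you only want to evaluate the TensorRT-LLM engine (for example, to save GPU memory), you can drop them; a sketch of the reduced command (--hf_model_dir is still needed for the tokenizer):

mpirun -n 8 -allow-run-as-root python ../../../summarize.py \
    --hf_model_dir ./Smaug-72B-v0.1 \
    --engine_dir ./Smaug_72B_tp8 \
    --data_type fp16 \
    --test_trt_llm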