Smaug

This document shows how to build the Smaug-72B-v0.1 model into runnable TensorRT engines on a multi-GPU node and how to perform a summarization task with those engines.

Overview

The TensorRT-LLM support for Smaug-72B-v0.1 is based on the LLaMA model; the implementation can be found in tensorrt_llm/models/llama/model.py. Smaug closely resembles LLaMA, except that it uses a bias term in its attention module, so the LLaMA example code is reused for Smaug.

In addition, two shared files in the parent examples folder are used for inference and evaluation:

  • ../../../run.py (inference)
  • ../../../summarize.py (evaluation, with ROUGE scoring)

Support Matrix

  • FP16

Usage

This section walks through the whole workflow: converting the HF model, building TensorRT-LLM engines, and finally running summarization.
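
If you do not already have the checkpoint locally, download it from the Hugging Face Hub first. This snippet assumes the weights are hosted under abacusai/Smaug-72B-v0.1; adjust the repository name if yours differs.

git lfs install  # the fp16 checkpoint is large (~145 GB), so git-lfs is required
git clone https://huggingface.co/abacusai/Smaug-72B-v0.1 ./Smaug-72B-v0.1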

Build TensorRT engine(s)

Run the following commands: TensorRT-LLM first transforms the HF model into its own checkpoint format, then builds a TRT engine from that checkpoint.

# Convert the HF checkpoint into the TensorRT-LLM checkpoint format with 8-way tensor parallelism
python ../../../llama/convert_checkpoint.py \
    --model_dir ./Smaug-72B-v0.1 \
    --output_dir ./tllm_checkpoint_8gpu_tp8 \
    --dtype float16 \
    --tp_size 8

# Build TensorRT engines from the converted checkpoint
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8 \
    --output_dir ./Smaug_72B_tp8 \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --max_batch_size 64 \
    --remove_input_padding enable
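
Before running the full summarization task, the shared run.py script can serve as a quick smoke test of the engines. This is a minimal sketch: the prompt and output length are placeholders, and 8 MPI ranks are needed because the engines were built with --tp_size 8.

# Generate a short completion with the freshly built engines
mpirun -n 8 --allow-run-as-root python ../../../run.py \
    --engine_dir ./Smaug_72B_tp8 \
    --tokenizer_dir ./Smaug-72B-v0.1 \
    --input_text "What is the capital of France?" \
    --max_output_len 64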

Run Summarization

After building the TRT engines, we can use them to perform various tasks. TensorRT-LLM provides handy code to run summarization on the cnn_dailymail dataset and obtain ROUGE scores. The ROUGE-1 score can be used to validate the model implementation.

# Summarize with both the TensorRT-LLM engines and the HF model, then report ROUGE scores
mpirun -n 8 --allow-run-as-root python ../../../summarize.py \
    --hf_model_dir ./Smaug-72B-v0.1 \
    --engine_dir ./Smaug_72B_tp8 \
    --data_type fp16 \
    --test_hf \
    --hf_device_map_auto \
    --test_trt_llm
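
For an automated pass/fail check, summarize.py can also assert a minimum ROUGE-1 score via --check_accuracy. The threshold below is illustrative only; set it from a known-good run of the model.

# Fail the run if the TensorRT-LLM ROUGE-1 score drops below the threshold
mpirun -n 8 --allow-run-as-root python ../../../summarize.py \
    --hf_model_dir ./Smaug-72B-v0.1 \
    --engine_dir ./Smaug_72B_tp8 \
    --data_type fp16 \
    --test_trt_llm \
    --check_accuracy \
    --tensorrt_llm_rouge1_threshold 14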