Smaug
This document explains how to build the Smaug-72B-v0.1 model into runnable TensorRT-LLM engines on a multi-GPU node and how to perform a summarization task with those engines.
Overview
The TensorRT-LLM support for Smaug-72B-v0.1 is based on the LLaMA model; the implementation can be found in tensorrt_llm/models/llama/model.py. The Smaug model closely resembles LLaMA, except that it uses a bias term in its attention module. We therefore reuse the LLaMA example code for Smaug:
- convert_checkpoint.py to convert the HF model into the TensorRT-LLM checkpoint format.
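Concretely, the attention bias shows up in the converted checkpoint's config.json. Below is a trimmed sketch; the field names follow the TensorRT-LLM LLaMA checkpoint format, and the exact contents depend on the model and conversion flags:
{
  "architecture": "LlamaForCausalLM",
  "dtype": "float16",
  "attn_bias": true,
  "mapping": { "world_size": 8, "tp_size": 8, "pp_size": 1 }
}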
In addition, there are two shared files in the parent folder examples for inference and evaluation:
- ../run.py to run inference on an input text.
- ../summarize.py to summarize the articles in the cnn_dailymail dataset.
Support Matrix
- FP16
Usage
This section walks through the whole process: converting the HF model, building TensorRT-LLM engines, and finally performing summarization.
Build TensorRT engine(s)
TRT-LLM first converts the HF model into its own checkpoint format, then builds a TRT engine based on that checkpoint.
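The conversion step below expects the Hugging Face checkpoint at ./Smaug-72B-v0.1. If it is not available locally yet, one way to fetch it is a git-lfs clone (a sketch; it assumes git-lfs is installed and that abacusai/Smaug-72B-v0.1 is the intended HF repository id):
# Download the HF checkpoint into the path the convert step expects
git lfs install
git clone https://huggingface.co/abacusai/Smaug-72B-v0.1 ./Smaug-72B-v0.1
With the checkpoint in place, run the following commands: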
python examples/llama/convert_checkpoint.py \
--model_dir ./Smaug-72B-v0.1 \
--output_dir ./tllm_checkpoint_8gpu_tp8 \
--dtype float16 \
--tp_size 8
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8 \
--output_dir ./Smaug_72B_tp8 \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--max_batch_size 64 \
--remove_input_padding enable
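Once the engines are built, a quick sanity check is possible with the shared ../run.py script mentioned above. This is a sketch: the input text and output length are arbitrary choices, and the flags follow the shared example script.
# Smoke test: run a single prompt through the TP8 engines
mpirun -n 8 -allow-run-as-root python examples/run.py \
--engine_dir ./Smaug_72B_tp8 \
--tokenizer_dir ./Smaug-72B-v0.1 \
--input_text "What is the capital of France?" \
--max_output_len 64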
Run Summarization
After building the TRT engines, we can use them to perform various tasks. TensorRT-LLM provides handy code to run summarization on the cnn_dailymail dataset and report ROUGE scores. The ROUGE-1 score can be used to validate the model implementation.
mpirun -n 8 -allow-run-as-root python examples/summarize.py \
--hf_model_dir ./Smaug-72B-v0.1 \
--engine_dir ./Smaug_72B_tp8 \
--data_type fp16 \
--test_hf \
--hf_device_map_auto \
--test_trt_llm
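To turn the ROUGE-1 validation mentioned above into an automated pass/fail check, the shared summarize.py also accepts accuracy flags. The sketch below assumes those flags; the threshold of 14 is an arbitrary assumption, not a published reference score for Smaug.
# Fail the run if the TRT-LLM ROUGE-1 score drops below the threshold
mpirun -n 8 -allow-run-as-root python examples/summarize.py \
--hf_model_dir ./Smaug-72B-v0.1 \
--engine_dir ./Smaug_72B_tp8 \
--data_type fp16 \
--test_trt_llm \
--check_accuracy \
--tensorrt_llm_rouge1_threshold 14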