# Granite

This document shows how to build and run a Granite 3.0 model in TensorRT LLM.

The TensorRT LLM Granite implementation is based on the LLaMA model with Mixture of Experts (MoE) enabled; the implementation can be found in `llama/model.py`. See the LLaMA example in `examples/models/core/llama` for details.

## Download model checkpoints

First, download the HuggingFace BF16 checkpoints of the Granite 3.0 model.

```bash
HF_MODEL="granite-3.0-8b-instruct" # or granite-3.0-3b-a800m-instruct
# clone the model we want to build
git clone https://huggingface.co/ibm-granite/${HF_MODEL} tmp/hf_checkpoints/${HF_MODEL}
```
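
Cloning large checkpoints from Hugging Face relies on Git LFS. If the clone only produced small pointer files instead of the actual weight files, a common fix (assuming `git-lfs` is installed, which is not covered by the original steps) is:

```bash
# Make sure Git LFS is active, then pull the real weight files.
git lfs install
cd tmp/hf_checkpoints/${HF_MODEL} && git lfs pull && cd -
```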

## Convert weights from HF Transformers to TensorRT LLM format

Set environment variables and create the necessary directory:

```bash
PREC_RAW="bfloat16"
TP=1
mkdir -p tmp/trt_engines
```

### BF16

Convert the weights using the `convert_checkpoint.py` script:

```bash
ENGINE="${HF_MODEL}_${PREC_RAW}_tp${TP}"
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1  # The current checkpoint conversion code requires the legacy path
python3 ../llama/convert_checkpoint.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                       --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                       --dtype ${PREC_RAW} \
                                       --tp_size ${TP} \
                                       --use_embedding_sharing
```
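
As a quick sanity check (not part of the original flow), the converted checkpoint directory should contain a `config.json` plus one `rank*.safetensors` file per TP rank:

```bash
# With TP=1 this should list config.json and rank0.safetensors.
ls tmp/tllm_checkpoints/${ENGINE}
```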


### FP8 PTQ

Notes:

- Currently `quantize.py` does not yet support Expert Parallelism (EP) mode. If needed, use `../llama/convert_checkpoint.py` and specify `--moe_ep_size 1` instead (a sketch of this call follows the quantization command below).
- TensorRT LLM uses static quantization methods, which are expected to be faster at runtime than dynamic quantization methods. This comes at the cost of an offline calibration step during quantization. `batch_size` and `calib_size` can be adjusted to shorten the calibration time. Please refer to `../../../quantization/README.md` for an explanation.

```bash
PREC_QUANT="fp8"
ENGINE="${HF_MODEL}_${PREC_QUANT}_tp${TP}"
python ../../../quantization/quantize.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                         --dtype ${PREC_RAW} \
                                         --qformat ${PREC_QUANT} \
                                         --kv_cache_dtype ${PREC_QUANT} \
                                         --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                         --batch_size 1 \
                                         --calib_size 128 \
                                         --tp_size ${TP}
```
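
As mentioned in the first note above, EP-related conversions go through the legacy LLaMA converter rather than `quantize.py`. A minimal sketch of that call, reusing the flags from the BF16 conversion together with `--moe_ep_size 1` from the note; the output directory name here is purely illustrative:

```bash
# Sketch only: this follows the legacy (non-quantized) conversion path,
# since quantize.py does not support EP.
python3 ../llama/convert_checkpoint.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                       --output_dir tmp/tllm_checkpoints/${HF_MODEL}_${PREC_RAW}_ep1_tp${TP} \
                                       --dtype ${PREC_RAW} \
                                       --tp_size ${TP} \
                                       --moe_ep_size 1
```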

## Build TensorRT engine

```bash
# Enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable`
# Use --workers to enable parallel build
trtllm-build --checkpoint_dir ./tmp/tllm_checkpoints/${ENGINE} \
             --output_dir ./tmp/trt_engines/${ENGINE} \
             --gpt_attention_plugin ${PREC_RAW} \
             --gemm_plugin ${PREC_RAW} \
             --workers ${TP}
```
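
As the comment above notes, FP8 context FMHA can be enabled for additional acceleration when the checkpoint was quantized to FP8; the same build command with only that flag added would look like:

```bash
# Variant for an FP8-quantized checkpoint; only --use_fp8_context_fmha is added.
trtllm-build --checkpoint_dir ./tmp/tllm_checkpoints/${ENGINE} \
             --output_dir ./tmp/trt_engines/${ENGINE} \
             --gpt_attention_plugin ${PREC_RAW} \
             --gemm_plugin ${PREC_RAW} \
             --use_fp8_context_fmha enable \
             --workers ${TP}
```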

## Run Engine

Test your engine with the `run.py` script:

```bash
mpirun -n ${TP} --allow-run-as-root \
    python ../../../run.py --engine_dir ./tmp/trt_engines/${ENGINE} \
                           --tokenizer_dir tmp/hf_checkpoints/${HF_MODEL} \
                           --max_output_len 20 \
                           --input_text "The future of AI is"
```
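
Beyond `run.py`, the LLaMA example also exercises engines with the summarization script that ships alongside `run.py` in the examples tree; a hedged sketch of such an invocation (script path and flag names follow the LLaMA README and may differ between releases):

```bash
# Sketch only: summarize.py and its flags are assumed to match the LLaMA example.
mpirun -n ${TP} --allow-run-as-root \
    python ../../../summarize.py --test_trt_llm \
                                 --hf_model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                 --engine_dir ./tmp/trt_engines/${ENGINE} \
                                 --data_type bf16  # data type naming here is an assumption, not taken from this README
```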

For more usage examples, see `examples/models/core/llama/README.md`.