Exaone
This document shows how to build and run an Exaone model in TensorRT-LLM.
The TensorRT-LLM Exaone implementation is based on the LLaMA model. The implementation can be found in llama/model.py.
See the LLaMA example examples/llama for details.
Support Matrix
- FP16
- BF16
- INT8 & INT4 Weight-Only
Download model checkpoints
First, download the HuggingFace FP16 checkpoints of the Exaone model.
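The checkpoint repository stores its weights with git-lfs; if git-lfs is not already initialized on your machine (an assumption about your environment; the git-lfs package must be available), enable it first so the clone below pulls the full weight files:
# One-time setup so the clone fetches the actual weight files, not just pointer files.
git lfs install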
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct hf_models/exaone
TensorRT-LLM workflow
Next, we convert the HuggingFace checkpoint and build the TensorRT-LLM engine with trtllm-build.
Convert checkpoint and build TRTLLM engine
As noted above, we use the LLaMA example's convert_checkpoint.py for the Exaone model.
# Build a single-GPU float16 engine from HF weights.
# Build the EXAONE model using a single GPU and FP16.
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/1-gpu \
--output_dir trt_engines/exaone/fp16/1-gpu \
--gemm_plugin auto
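If a single GPU is not enough, the same converter supports tensor parallelism. A sketch with two GPUs follows; the tp_size value and output directories are illustrative, adjust them to your setup.
# Build the EXAONE model using 2-way tensor parallelism and FP16 (illustrative paths).
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16/2-gpu \
--dtype float16 \
--tp_size 2
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/2-gpu \
--output_dir trt_engines/exaone/fp16/2-gpu \
--gemm_plugin auto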
# Build the EXAONE model using a single GPU and apply INT8 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16_wq_8/1-gpu \
--use_weight_only \
--weight_only_precision int8 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16_wq_8/1-gpu \
--output_dir trt_engines/exaone/fp16_wq_8/1-gpu \
--gemm_plugin auto
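INT4 weight-only quantization from the support matrix follows the same recipe; a sketch, with output directories that are just examples:
# Build the EXAONE model using a single GPU and apply INT4 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16_wq_4/1-gpu \
--use_weight_only \
--weight_only_precision int4 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16_wq_4/1-gpu \
--output_dir trt_engines/exaone/fp16_wq_4/1-gpu \
--gemm_plugin auto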
Note: The Exaone model is currently not supported with --load_by_shard.
Run Engine
Test your engine with the run.py script:
python3 ../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir hf_models/exaone \
--engine_dir trt_engines/exaone/fp16/1-gpu
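If you built a tensor-parallel engine as sketched above, launch run.py through MPI with one rank per GPU; the engine path below matches the illustrative 2-GPU build.
mpirun -n 2 --allow-run-as-root python3 ../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir hf_models/exaone \
--engine_dir trt_engines/exaone/fp16/2-gpu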
Run summarization with the summarize.py script:
python ../summarize.py \
--test_trt_llm \
--data_type fp16 \
--hf_model_dir hf_models/exaone \
--engine_dir trt_engines/exaone/fp16/1-gpu
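To compare against the original HuggingFace model, summarize.py can also score the HF checkpoint directly; a sketch, assuming the HF model fits in GPU memory:
# Score the HuggingFace checkpoint on the same summarization task for comparison.
python ../summarize.py \
--test_hf \
--data_type fp16 \
--hf_model_dir hf_models/exaone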
For more examples, see examples/llama/README.md.