Exaone
This document shows how to build and run an Exaone model in TensorRT-LLM.
The TensorRT-LLM Exaone implementation is based on the LLaMA model. The implementation can be found in llama/model.py.
See the LLaMA example examples/llama for details.
Support Matrix
- FP16
- BF16
- INT8 & INT4 Weight-Only
Download model checkpoints
First, download the HuggingFace FP16 checkpoints of the Exaone model.
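The checkpoint repository stores its weights with git-lfs; if git-lfs is not already initialized on your machine (an assumption about your environment; the git-lfs package must be available), enable it first so the clone below pulls the full weight files:
# One-time setup so the clone fetches the actual weight files, not just pointer files.
git lfs install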
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct hf_models/exaone
TensorRT-LLM workflow
Next, we convert the HuggingFace checkpoint and build the TensorRT-LLM engine with trtllm-build.
Convert checkpoint and build TRTLLM engine
As noted above, we use the LLaMA example's convert_checkpoint.py for the Exaone model.
# Build a single-GPU float16 engine from HF weights.
# Build the EXAONE model using a single GPU and FP16.
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/1-gpu \
--output_dir trt_engines/exaone/fp16/1-gpu \
--gemm_plugin auto
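If a single GPU is not enough, the same converter supports tensor parallelism. A sketch with two GPUs follows; the tp_size value and output directories are illustrative, adjust them to your setup.
# Build the EXAONE model using 2-way tensor parallelism and FP16 (illustrative paths).
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16/2-gpu \
--dtype float16 \
--tp_size 2
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/2-gpu \
--output_dir trt_engines/exaone/fp16/2-gpu \
--gemm_plugin auto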
# Build the EXAONE model using a single GPU and apply INT8 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16_wq_8/1-gpu \
--use_weight_only \
--weight_only_precision int8 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16_wq_8/1-gpu \
--output_dir trt_engines/exaone/fp16_wq_8/1-gpu \
--gemm_plugin auto
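INT4 weight-only quantization from the support matrix follows the same recipe; a sketch, with output directories that are just examples:
# Build the EXAONE model using a single GPU and apply INT4 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir hf_models/exaone \
--output_dir trt_models/exaone/fp16_wq_4/1-gpu \
--use_weight_only \
--weight_only_precision int4 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16_wq_4/1-gpu \
--output_dir trt_engines/exaone/fp16_wq_4/1-gpu \
--gemm_plugin auto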
Note: The Exaone model is currently not supported with --load_by_shard.
Run Engine
Test your engine with the run.py script:
python3 ../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir hf_models/exaone \
--engine_dir trt_engines/exaone/fp16/1-gpu
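If you built a tensor-parallel engine as sketched above, launch run.py through MPI with one rank per GPU; the engine path below matches the illustrative 2-GPU build.
mpirun -n 2 --allow-run-as-root python3 ../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir hf_models/exaone \
--engine_dir trt_engines/exaone/fp16/2-gpu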
Run summarization with the summarize.py script:
python ../summarize.py \
--test_trt_llm \
--data_type fp16 \
--hf_model_dir hf_models/exaone \
--engine_dir trt_engines/exaone/fp16/1-gpu
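To compare against the original HuggingFace model, summarize.py can also score the HF checkpoint directly; a sketch, assuming the HF model fits in GPU memory:
# Score the HuggingFace checkpoint on the same summarization task for comparison.
python ../summarize.py \
--test_hf \
--data_type fp16 \
--hf_model_dir hf_models/exaone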
For more examples, see examples/llama/README.md.