EXAONE
This document shows how to build and run an EXAONE model in TensorRT-LLM.
The TensorRT-LLM EXAONE implementation is based on the LLaMA model and can be found in llama/model.py. See the LLaMA example in examples/models/core/llama for details.
Support Matrix
- FP16
- BF16
- Tensor Parallel
- FP8
- INT8 & INT4 Weight-Only
- INT8 SmoothQuant
- INT4 AWQ & W4A8 AWQ
Supported Models
EXAONE-3.0
Download the HuggingFace FP32 checkpoints of the EXAONE-3.0 model. We support the EXAONE-3.0 model family, but here we use only the EXAONE-3.0-7.8B-Instruct model as an example.
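If Git LFS is not already set up, initialize it before cloning; this assumes the checkpoints are stored as LFS objects, as is typical for HuggingFace model repositories:
# One-time setup so large checkpoint files are fetched during git clone.
git lfs install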
export HF_MODEL_DIR=hf_models/exaone
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct $HF_MODEL_DIR
EXAONE-Deep
Download the HuggingFace checkpoints of the EXAONE-Deep model. Here, we use only the EXAONE-Deep-2.4B model as an example. You can follow the same procedure as for EXAONE-3.0 to convert the weights and build the TensorRT engine, as sketched after the clone commands below.
export HF_MODEL_DIR=hf_models/exaone_deep
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B $HF_MODEL_DIR
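For example, a single-GPU FP16 conversion and build for EXAONE-Deep mirrors the EXAONE-3.0 commands in the TRT flow section below; only the output directories (illustrative here) change:
# Convert the EXAONE-Deep HF checkpoint and build a single-GPU FP16 engine.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone_deep/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone_deep/fp16/1-gpu \
--output_dir trt_engines/exaone_deep/fp16/1-gpu \
--gemm_plugin auto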
EXAONE-4.0
Download the HuggingFace checkpoints of the EXAONE-4.0 model. Here, we use only the EXAONE-4.0 model (TODO: replace with the real model name) as an example. Starting with EXAONE-4.0, EXAONE models are supported only on the PyTorch flow.
export HF_MODEL_DIR=hf_models/exaone4
git clone ... $HF_MODEL_DIR (TODO: change ... to the real HF repository)
Usage
The next sections describe how to convert the weights from the HuggingFace (HF) Transformers format to the TensorRT-LLM format. We use convert_checkpoint.py from the LLaMA example for the EXAONE model and then build the engine with trtllm-build.
PyTorch flow
To quickly run EXAONE-4.0 models, you can use examples/llm-api/quickstart_advanced.py:
python ../../../llm-api/quickstart_advanced.py --model_dir hf_models/$MODEL_NAME --disable_kv_cache_reuse
Note: SWA (sliding window attention) currently does not support KV cache reuse. Make sure to disable KV cache reuse when running with SWA.
The output will be like:
TODO: Fill this with real HF checkpoints output
PyTorch flow Quantization
For the PyTorch flow, TRT-LLM supports quantized checkpoints generated by the TensorRT Model Optimizer.
You can either use pre-quantized models from the HF model hub or generate a quantized model yourself and then run it with the commands below:
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model hf_models/$MODEL_NAME --quant fp8 --export_fmt hf
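After quantization, the exported HF-format checkpoint can be run with the same quickstart script. A minimal sketch, where the model path is a placeholder you should replace with the directory the script actually exported the quantized checkpoint to:
# Run from the EXAONE example directory (same location as the quickstart command above).
# <QUANT_MODEL_DIR> is a placeholder for the exported quantized checkpoint directory.
python ../../../llm-api/quickstart_advanced.py --model_dir <QUANT_MODEL_DIR> --disable_kv_cache_reuse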
For more information, please refer to the official TensorRT Model Optimizer documentation.
Troubleshooting
The following error may occur during quantization:
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
This error may indicate an incompatibility between torch.compile() and the HybridCache module of the transformers library. As a result, TensorRT Model Optimizer (ModelOpt) cannot perform PTQ with HybridCache.
Temporarily switching to DynamicCache when creating PTQ models could help address the issue. This can be done by updating the cache_implementation field in the generation_config.json file located in the model checkpoint directory, for example:
# generation_config.json
{
// Change "hybrid" to "dynamic" to run PTQ.
// Revert this to "hybrid" after quantization is complete.
"cache_implementation": "hybrid",
...
}
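As a shell-level alternative to hand-editing the file, the field can be toggled with jq. This assumes jq is installed and that generation_config.json in the checkpoint directory is plain JSON (without comments):
# Switch to DynamicCache for PTQ; set the field back to "hybrid" once quantization is done.
jq '.cache_implementation = "dynamic"' $HF_MODEL_DIR/generation_config.json > /tmp/generation_config.json \
&& mv /tmp/generation_config.json $HF_MODEL_DIR/generation_config.json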
For models with sliding window attention, DynamicCache is less memory-efficient than HybridCache because it retains the entire key-value cache. However, this does not break the model's attention logic, as the cache implementation is separated from the attention computation itself. This trade-off is acceptable for the PTQ process, which is a one-time procedure. Our tests confirm that this workaround does not degrade accuracy on MMLU or GSM8K benchmarks with the default ModelOpt settings.
TRT flow
Convert checkpoint and build TensorRT engine(s)
# Build a single-GPU FP16 engine for the EXAONE model from the HF weights.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/1-gpu \
--output_dir trt_engines/exaone/fp16/1-gpu \
--gemm_plugin auto
# Build the EXAONE model using a single GPU and apply INT8 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/int8_wq/1-gpu \
--use_weight_only \
--weight_only_precision int8 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/int8_wq/1-gpu \
--output_dir trt_engines/exaone/int8_wq/1-gpu \
--gemm_plugin auto
# Build the EXAONE model using a single GPU and apply INT4 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/int4_wq/1-gpu \
--use_weight_only \
--weight_only_precision int4 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/int4_wq/1-gpu \
--output_dir trt_engines/exaone/int4_wq/1-gpu \
--gemm_plugin auto
# Build the EXAONE model using 2-way tensor parallelism and FP16.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/fp16/2-gpu \
--tp_size 2 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/2-gpu \
--output_dir trt_engines/exaone/fp16/2-gpu \
--gemm_plugin auto
Note: The EXAONE model is not supported with --load_by_shard.
FP8 Post-Training Quantization
The examples below use the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the Modelopt toolkit is installed (see examples/quantization/README.md).
# Build the EXAONE model using a single GPU and apply FP8 quantization.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir trt_models/exaone/fp8/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/fp8/1-gpu \
--output_dir trt_engines/exaone/fp8/1-gpu \
--gemm_plugin auto
SmoothQuant
The examples below use the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the Modelopt toolkit is installed (see examples/quantization/README.md).
# Build the EXAONE model using a single GPU and apply INT8 SmoothQuant.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat int8_sq \
--output_dir trt_models/exaone/int8_sq/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/int8_sq/1-gpu \
--output_dir trt_engines/exaone/int8_sq/1-gpu \
--gemm_plugin auto
Groupwise quantization (AWQ)
The examples below use the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the Modelopt toolkit is installed (see examples/quantization/README.md).
# Build the EXAONE model using a single GPU and apply INT4 AWQ.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat int4_awq \
--output_dir trt_models/exaone/int4_awq/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/int4_awq/1-gpu \
--output_dir trt_engines/exaone/int4_awq/1-gpu \
--gemm_plugin auto
W4A16 AWQ with FP8 GEMM (W4A8 AWQ)
For Hopper GPUs, TRT-LLM also supports using FP8 GEMM to accelerate the linear layers. This mode is denoted w4a8_awq in Modelopt and TRT-LLM; compared with W4A16, both the weights and activations are converted to FP8 for the GEMM calculation.
Please make sure your system contains a Hopper GPU before trying the commands below.
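One quick sanity check (supported on recent NVIDIA drivers) is to query the GPU's compute capability; Hopper GPUs report 9.0:
# Hopper parts (e.g. H100) report compute capability 9.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv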
# Build the EXAONE model using a single GPU and apply W4A8 AWQ.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat w4a8_awq \
--output_dir trt_models/exaone/w4a8_awq/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/w4a8_awq/1-gpu \
--output_dir trt_engines/exaone/w4a8_awq/1-gpu \
--gemm_plugin auto
Run Engine
Test your engine with the run.py script:
python3 ../../../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir $HF_MODEL_DIR \
--engine_dir trt_engines/exaone/fp16/1-gpu
# Run with 2 GPUs
mpirun -n 2 --allow-run-as-root \
python3 ../../../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir $HF_MODEL_DIR \
--engine_dir trt_engines/exaone/fp16/2-gpu
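You can also run a summarization test against the engine with the summarize.py script: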
python ../../../summarize.py \
--test_trt_llm \
--data_type fp16 \
--hf_model_dir $HF_MODEL_DIR \
--engine_dir trt_engines/exaone/fp16/1-gpu
For more examples, see examples/models/core/llama/README.md.