EXAONE
This document shows how to build and run an EXAONE model in TensorRT-LLM.
The TensorRT-LLM EXAONE implementation is based on the LLaMA model and can be found in llama/model.py. See the LLaMA example in examples/models/core/llama for details.
Support Matrix
- FP16
- BF16
- Tensor Parallel
- FP8
- INT8 & INT4 Weight-Only
- INT8 SmoothQuant
- INT4 AWQ & W4A8 AWQ
Supported Models
EXAONE-3.0
Download the HuggingFace FP32 checkpoints of the EXAONE-3.0 model. We support the EXAONE-3.0 model family, but here we use only the EXAONE-3.0-7.8B-Instruct model as an example.
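If Git LFS is not already set up, initialize it before cloning; this assumes the checkpoints are stored as LFS objects, as is typical for HuggingFace model repositories:
# One-time setup so large checkpoint files are fetched during git clone.
git lfs install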
export HF_MODEL_DIR=hf_models/exaone
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct $HF_MODEL_DIR
EXAONE-Deep
Download the HuggingFace checkpoints of the EXAONE-Deep model. Here, we use only the EXAONE-Deep-2.4B model as an example. You can follow the same procedure as for EXAONE-3.0 to convert the weights and build the TensorRT engine, as sketched after the clone commands below.
export HF_MODEL_DIR=hf_models/exaone_deep
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B $HF_MODEL_DIR
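For example, a single-GPU FP16 conversion and build for EXAONE-Deep mirrors the EXAONE-3.0 commands in the TRT flow section below; only the output directories (illustrative here) change:
# Convert the EXAONE-Deep HF checkpoint and build a single-GPU FP16 engine.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone_deep/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone_deep/fp16/1-gpu \
--output_dir trt_engines/exaone_deep/fp16/1-gpu \
--gemm_plugin auto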
EXAONE-4.0
Download the HuggingFace checkpoints of the EXAONE-4.0 model. Here, we use only the EXAONE-4.0 model (TODO: replace with the real model name) as an example. Starting with EXAONE-4.0, EXAONE models are supported only on the PyTorch flow.
export HF_MODEL_DIR=hf_models/exaone4
git clone ... $HF_MODEL_DIR (TODO: change ... to the real HF repository)
Usage
The next sections describe how to convert the weights from the HuggingFace (HF) Transformers format to the TensorRT-LLM format. We use convert_checkpoint.py from the LLaMA example for the EXAONE model and then build the engine with trtllm-build.
PyTorch flow
To quickly run EXAONE-4.0 models, you can use examples/llm-api/quickstart_advanced.py:
python ../../../llm-api/quickstart_advanced.py --model_dir hf_models/$MODEL_NAME --disable_kv_cache_reuse
Note: SWA (sliding window attention) currently does not support KV cache reuse. Make sure to disable KV cache reuse when running with SWA.
The output will be like:
TODO: Fill this with real HF checkpoints output
PyTorch flow Quantization
For the PyTorch flow, TRT-LLM supports quantized checkpoints generated by the TensorRT Model Optimizer.
You can either use pre-quantized models from the HF model hub or generate a quantized model yourself and then run it with the commands below:
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model hf_models/$MODEL_NAME --quant fp8 --export_fmt hf
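After quantization, the exported HF-format checkpoint can be run with the same quickstart script. A minimal sketch, where the model path is a placeholder you should replace with the directory the script actually exported the quantized checkpoint to:
# Run from the EXAONE example directory (same location as the quickstart command above).
# <QUANT_MODEL_DIR> is a placeholder for the exported quantized checkpoint directory.
python ../../../llm-api/quickstart_advanced.py --model_dir <QUANT_MODEL_DIR> --disable_kv_cache_reuse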
For more information, please refer to the official TensorRT Model Optimizer documentation.
Troubleshooting
The following error may occur during quantization:
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
This error may indicate an incompatibility between torch.compile() and the HybridCache module of the transformers library. As a result, TensorRT Model Optimizer (ModelOpt) cannot perform PTQ with HybridCache.
Temporarily switching to DynamicCache when creating PTQ models could help address the issue. This can be done by updating the cache_implementation field in the generation_config.json file located in the model checkpoint directory, for example:
# generation_config.json
{
// Change "hybrid" to "dynamic" to run PTQ.
// Revert this to "hybrid" after quantization is complete.
"cache_implementation": "hybrid",
...
}
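As a shell-level alternative to hand-editing the file, the field can be toggled with jq. This assumes jq is installed and that generation_config.json in the checkpoint directory is plain JSON (without comments):
# Switch to DynamicCache for PTQ; set the field back to "hybrid" once quantization is done.
jq '.cache_implementation = "dynamic"' $HF_MODEL_DIR/generation_config.json > /tmp/generation_config.json \
&& mv /tmp/generation_config.json $HF_MODEL_DIR/generation_config.json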
For models with sliding window attention, DynamicCache is less memory-efficient than HybridCache because it retains the entire key-value cache. However, this does not break the model's attention logic, as the cache implementation is separated from the attention computation itself. This trade-off is acceptable for the PTQ process, which is a one-time procedure. Our tests confirm that this workaround does not degrade accuracy on MMLU or GSM8K benchmarks with the default ModelOpt settings.
TRT flow
Convert checkpoint and build TensorRT engine(s)
# Build a single-GPU FP16 engine for the EXAONE model from the HF weights.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/1-gpu \
--output_dir trt_engines/exaone/fp16/1-gpu \
--gemm_plugin auto
# Build the EXAONE model using a single GPU and apply INT8 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/int8_wq/1-gpu \
--use_weight_only \
--weight_only_precision int8 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/int8_wq/1-gpu \
--output_dir trt_engines/exaone/int8_wq/1-gpu \
--gemm_plugin auto
# Build the EXAONE model using a single GPU and apply INT4 weight-only quantization.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/int4_wq/1-gpu \
--use_weight_only \
--weight_only_precision int4 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/int4_wq/1-gpu \
--output_dir trt_engines/exaone/int4_wq/1-gpu \
--gemm_plugin auto
# Build the EXAONE model using 2-way tensor parallelism and FP16.
python ../llama/convert_checkpoint.py \
--model_dir $HF_MODEL_DIR \
--output_dir trt_models/exaone/fp16/2-gpu \
--tp_size 2 \
--dtype float16
trtllm-build \
--checkpoint_dir trt_models/exaone/fp16/2-gpu \
--output_dir trt_engines/exaone/fp16/2-gpu \
--gemm_plugin auto
Note: The EXAONE model is not supported with --load_by_shard.
FP8 Post-Training Quantization
The examples below use the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the Modelopt toolkit is installed (see examples/quantization/README.md).
# Build the EXAONE model using a single GPU and apply FP8 quantization.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir trt_models/exaone/fp8/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/fp8/1-gpu \
--output_dir trt_engines/exaone/fp8/1-gpu \
--gemm_plugin auto
SmoothQuant
The examples below use the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the Modelopt toolkit is installed (see examples/quantization/README.md).
# Build the EXAONE model using a single GPU and apply INT8 SmoothQuant.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat int8_sq \
--output_dir trt_models/exaone/int8_sq/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/int8_sq/1-gpu \
--output_dir trt_engines/exaone/int8_sq/1-gpu \
--gemm_plugin auto
Groupwise quantization (AWQ)
The examples below use the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the Modelopt toolkit is installed (see examples/quantization/README.md).
# Build the EXAONE model using a single GPU and apply INT4 AWQ.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat int4_awq \
--output_dir trt_models/exaone/int4_awq/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/int4_awq/1-gpu \
--output_dir trt_engines/exaone/int4_awq/1-gpu \
--gemm_plugin auto
W4A16 AWQ with FP8 GEMM (W4A8 AWQ)
For Hopper GPUs, TRT-LLM also supports using FP8 GEMM to accelerate the linear layers. This mode is denoted w4a8_awq in Modelopt and TRT-LLM; compared with W4A16, both the weights and activations are converted to FP8 for the GEMM calculation.
Please make sure your system contains a Hopper GPU before trying the commands below.
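One quick sanity check (supported on recent NVIDIA drivers) is to query the GPU's compute capability; Hopper GPUs report 9.0:
# Hopper parts (e.g. H100) report compute capability 9.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv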
# Build the EXAONE model using a single GPU and apply W4A8 AWQ.
python ../../../quantization/quantize.py \
--model_dir $HF_MODEL_DIR \
--dtype float16 \
--qformat w4a8_awq \
--output_dir trt_models/exaone/w4a8_awq/1-gpu
trtllm-build \
--checkpoint_dir trt_models/exaone/w4a8_awq/1-gpu \
--output_dir trt_engines/exaone/w4a8_awq/1-gpu \
--gemm_plugin auto
Run Engine
Test your engine with the run.py script:
python3 ../../../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir $HF_MODEL_DIR \
--engine_dir trt_engines/exaone/fp16/1-gpu
# Run with 2 GPUs
mpirun -n 2 --allow-run-as-root \
python3 ../../../run.py \
--input_text "When did the first world war end?" \
--max_output_len=100 \
--tokenizer_dir $HF_MODEL_DIR \
--engine_dir trt_engines/exaone/fp16/2-gpu
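You can also run a summarization test against the engine with the summarize.py script: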
python ../../../summarize.py \
--test_trt_llm \
--data_type fp16 \
--hf_model_dir $HF_MODEL_DIR \
--engine_dir trt_engines/exaone/fp16/1-gpu
For more examples, see examples/models/core/llama/README.md.