# EXAONE

This document shows how to build and run an [EXAONE](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) model in TensorRT-LLM.

The TensorRT-LLM EXAONE implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py). See the LLaMA example [`examples/models/core/llama`](../llama) for details.

- [EXAONE](#exaone)
  - [Support Matrix](#support-matrix)
  - [Supported Models](#supported-models)
    - [EXAONE-3.0](#exaone-30)
    - [EXAONE-Deep](#exaone-deep)
  - [Usage](#usage)
    - [Convert checkpoint and build TensorRT engine(s)](#convert-checkpoint-and-build-tensorrt-engines)
    - [FP8 Post-Training Quantization](#fp8-post-training-quantization)
    - [SmoothQuant](#smoothquant)
    - [Groupwise quantization (AWQ)](#groupwise-quantization-awq)
      - [W4A16 AWQ with FP8 GEMM (W4A8 AWQ)](#w4a16-awq-with-fp8-gemm-w4a8-awq)
    - [Run Engine](#run-engine)

## Support Matrix

* FP16
* BF16
* Tensor Parallel
* FP8
* INT8 & INT4 Weight-Only
* INT8 SmoothQuant
* INT4 AWQ & W4A8 AWQ

## Supported Models

### EXAONE-3.0

Download the HuggingFace FP32 checkpoints of the EXAONE-3.0 model. The EXAONE-3.0 family is supported, but this example uses only the `EXAONE-3.0-7.8B-Instruct` model.

```bash
export HF_MODEL_DIR=hf_models/exaone
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct $HF_MODEL_DIR
```

### EXAONE-Deep

Download the HuggingFace BF16 checkpoints of the EXAONE-Deep model. This example uses only the `EXAONE-Deep-2.4B` model. The same procedure as for EXAONE-3.0 is used to convert the weights and build the TensorRT engine.

```bash
export HF_MODEL_DIR=hf_models/exaone_deep
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B $HF_MODEL_DIR
```

## Usage

The following section describes how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format. The EXAONE model reuses LLaMA's [convert_checkpoint.py](../llama/convert_checkpoint.py); the converted checkpoint is then compiled into an engine with `trtllm-build`.

### Convert checkpoint and build TensorRT engine(s)

```bash
# Build the EXAONE model using a single GPU and FP16.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/fp16/1-gpu \
    --output_dir trt_engines/exaone/fp16/1-gpu \
    --gemm_plugin auto

# Build the EXAONE model using a single GPU and apply INT8 weight-only quantization.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/int8_wq/1-gpu \
    --use_weight_only \
    --weight_only_precision int8 \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/int8_wq/1-gpu \
    --output_dir trt_engines/exaone/int8_wq/1-gpu \
    --gemm_plugin auto

# Build the EXAONE model using a single GPU and apply INT4 weight-only quantization.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/int4_wq/1-gpu \
    --use_weight_only \
    --weight_only_precision int4 \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/int4_wq/1-gpu \
    --output_dir trt_engines/exaone/int4_wq/1-gpu \
    --gemm_plugin auto
```
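The two weight-only builds above differ only in the value passed to `--weight_only_precision` and in the output directories. As a minimal sketch (directory names simply mirror the examples above), the convert-then-build pattern can be looped over both precisions:

```bash
# Sketch: loop the convert-then-build pattern over both weight-only precisions.
# Paths follow the examples above and can be changed freely.
for PREC in int8 int4; do
    python ../llama/convert_checkpoint.py \
        --model_dir $HF_MODEL_DIR \
        --output_dir trt_models/exaone/${PREC}_wq/1-gpu \
        --use_weight_only \
        --weight_only_precision $PREC \
        --dtype float16

    trtllm-build \
        --checkpoint_dir trt_models/exaone/${PREC}_wq/1-gpu \
        --output_dir trt_engines/exaone/${PREC}_wq/1-gpu \
        --gemm_plugin auto
done
```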
```bash
# Build the EXAONE model using 2-way tensor parallelism and FP16.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/fp16/2-gpu \
    --tp_size 2 \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/fp16/2-gpu \
    --output_dir trt_engines/exaone/fp16/2-gpu \
    --gemm_plugin auto
```

> **NOTE**: The EXAONE model is not supported with `--load_by_shard`.

### FP8 Post-Training Quantization

The example below uses the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process. First, make sure the Modelopt toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

```bash
# Build the EXAONE model using a single GPU and apply FP8 quantization.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir trt_models/exaone/fp8/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/fp8/1-gpu \
    --output_dir trt_engines/exaone/fp8/1-gpu \
    --gemm_plugin auto
```

### SmoothQuant

The example below uses the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process. First, make sure the Modelopt toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

```bash
# Build the EXAONE model using a single GPU and apply INT8 SmoothQuant.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat int8_sq \
    --output_dir trt_models/exaone/int8_sq/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/int8_sq/1-gpu \
    --output_dir trt_engines/exaone/int8_sq/1-gpu \
    --gemm_plugin auto
```

### Groupwise quantization (AWQ)

The example below uses the NVIDIA Modelopt (AlgorithMic Model Optimization) toolkit for the model quantization process. First, make sure the Modelopt toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

```bash
# Build the EXAONE model using a single GPU and apply INT4 AWQ.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat int4_awq \
    --output_dir trt_models/exaone/int4_awq/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/int4_awq/1-gpu \
    --output_dir trt_engines/exaone/int4_awq/1-gpu \
    --gemm_plugin auto
```

#### W4A16 AWQ with FP8 GEMM (W4A8 AWQ)

For Hopper GPUs, TRT-LLM also supports employing FP8 GEMM to accelerate the linear layers. This mode is noted as `w4a8_awq` in both Modelopt and TRT-LLM; the weights and activations are converted from W4A16 to FP8 for the GEMM calculation. Please make sure your system contains a Hopper GPU before trying the commands below.

```bash
# Build the EXAONE model using a single GPU and apply W4A8 AWQ.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat w4a8_awq \
    --output_dir trt_models/exaone/w4a8_awq/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/w4a8_awq/1-gpu \
    --output_dir trt_engines/exaone/w4a8_awq/1-gpu \
    --gemm_plugin auto
```

### Run Engine

Test your engine with the [run.py](../../../run.py) script:

```bash
python3 ../../../run.py \
    --input_text "When did the first world war end?" \
    --max_output_len=100 \
    --tokenizer_dir $HF_MODEL_DIR \
    --engine_dir trt_engines/exaone/fp16/1-gpu
```
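The same command works for any of the engines built above; only `--engine_dir` changes. For example, to test the FP8 engine (assuming it was built in the FP8 section):

```bash
# Run the FP8 engine built earlier; only --engine_dir differs from the FP16 run.
python3 ../../../run.py \
    --input_text "When did the first world war end?" \
    --max_output_len=100 \
    --tokenizer_dir $HF_MODEL_DIR \
    --engine_dir trt_engines/exaone/fp8/1-gpu
```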
To run on the 2-way tensor-parallel engine with 2 GPUs, and to check summarization accuracy with the `summarize.py` script:

```bash
# Run with 2 GPUs.
mpirun -n 2 --allow-run-as-root \
    python3 ../../../run.py \
    --input_text "When did the first world war end?" \
    --max_output_len=100 \
    --tokenizer_dir $HF_MODEL_DIR \
    --engine_dir trt_engines/exaone/fp16/2-gpu

# Summarization test with the single-GPU FP16 engine.
python ../../../summarize.py \
    --test_trt_llm \
    --data_type fp16 \
    --hf_model_dir $HF_MODEL_DIR \
    --engine_dir trt_engines/exaone/fp16/1-gpu
```

For more examples, see [`examples/models/core/llama/README.md`](../llama/README.md).