# EXAONE

This document shows how to build and run an [EXAONE](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) model in TensorRT-LLM.

The TensorRT-LLM EXAONE implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py). See the LLaMA example [`examples/models/core/llama`](../llama) for details.

- [EXAONE](#exaone)
  - [Support Matrix](#support-matrix)
  - [Supported Models](#supported-models)
    - [EXAONE-3.0](#exaone-30)
    - [EXAONE-Deep](#exaone-deep)
    - [EXAONE-4.0](#exaone-40)
  - [Usage](#usage)
    - [PyTorch flow](#pytorch-flow)
      - [PyTorch flow Quantization](#pytorch-flow-quantization)
    - [TRT flow](#trt-flow)
      - [Convert checkpoint and build TensorRT engine(s)](#convert-checkpoint-and-build-tensorrt-engines)
      - [FP8 Post-Training Quantization](#fp8-post-training-quantization)
      - [SmoothQuant](#smoothquant)
      - [Groupwise quantization (AWQ)](#groupwise-quantization-awq)
        - [W4A16 AWQ with FP8 GEMM (W4A8 AWQ)](#w4a16-awq-with-fp8-gemm-w4a8-awq)
      - [Run Engine](#run-engine)

## Support Matrix

* FP16
* BF16
* Tensor Parallel
* FP8
* INT8 & INT4 Weight-Only
* INT8 SmoothQuant
* INT4 AWQ & W4A8 AWQ

## Supported Models

### EXAONE-3.0

Download the HuggingFace FP32 checkpoints of the EXAONE-3.0 model. We support the EXAONE-3.0 model family, but this example uses only the `EXAONE-3.0-7.8B-Instruct` model.

```bash
export HF_MODEL_DIR=hf_models/exaone
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct $HF_MODEL_DIR
```

### EXAONE-Deep

Download the HuggingFace checkpoints of the EXAONE-Deep model. Here, we only use the `EXAONE-Deep-2.4B` model for the example. We can use the same procedure as EXAONE-3.0 to convert the weights and build the TensorRT engine.

```bash
export HF_MODEL_DIR=hf_models/exaone_deep
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B $HF_MODEL_DIR
```

### EXAONE-4.0

Download the HuggingFace checkpoints of the EXAONE-4.0 model. Here, we only use the `EXAONE-4.0-32B` model for the example. EXAONE-4.0 models are supported only through the PyTorch flow.

```bash
export HF_MODEL_DIR=hf_models/exaone4
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B $HF_MODEL_DIR
```

## Usage

### PyTorch flow

To quickly run EXAONE-4.0 models, you can use [examples/llm-api/quickstart_advanced.py](../../../llm-api/quickstart_advanced.py):

```bash
python ../../../llm-api/quickstart_advanced.py --model_dir hf_models/$MODEL_NAME --disable_kv_cache_reuse
```

Sliding window attention (SWA) currently does not support KV cache reuse, so make sure to disable KV cache reuse when running with SWA.

The output will look like:

```bash
TODO: Fill this with real HF checkpoints output
```

#### PyTorch flow Quantization

For the PyTorch flow, TRT-LLM supports the quantized format generated by the [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). You can either use pre-quantized models from the HF model hub, or generate a quantized model yourself and then run it, for example with the commands below:

```bash
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model hf_models/$MODEL_NAME --quant fp8 --export_fmt hf
```

For more information, please refer to the official [TensorRT Model Optimizer documentation](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
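
If you prefer driving the PyTorch flow from Python instead of `quickstart_advanced.py`, the snippet below is a minimal sketch using the TensorRT-LLM LLM API. The checkpoint path is illustrative (point it at `$HF_MODEL_DIR` or at a ModelOpt-exported directory), and disabling KV cache block reuse via `KvCacheConfig` is assumed to match the `--disable_kv_cache_reuse` flag shown above; adjust for your TensorRT-LLM version.

```python
# Minimal sketch: run an EXAONE HF checkpoint (original or ModelOpt-quantized)
# through the TensorRT-LLM LLM API (PyTorch flow).
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# SWA models do not support KV cache reuse (see the note above), so disable block reuse.
kv_cache_config = KvCacheConfig(enable_block_reuse=False)

llm = LLM(
    model="hf_models/exaone4",  # illustrative path; use $HF_MODEL_DIR or a quantized export
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["When did the first world war end?"],
    SamplingParams(max_tokens=100),
)
for output in outputs:
    print(output.outputs[0].text)
```
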
##### Troubleshooting

The following error may occur during quantization:

```bash
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
```

This error may indicate an incompatibility between `torch.compile()` and the `HybridCache` module of the transformers library. As a result, [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt) cannot perform PTQ with `HybridCache`. Temporarily switching to `DynamicCache` when creating PTQ models could help address the issue. This can be done by updating the `cache_implementation` field in the `generation_config.json` file located in the model checkpoint directory, for example:

```json
# generation_config.json
{
  // Change "hybrid" to "dynamic" to run PTQ.
  // Revert this to "hybrid" after quantization is complete.
  "cache_implementation": "hybrid",
  ...
}
```

For models with sliding window attention, `DynamicCache` is less memory-efficient than `HybridCache` because it retains the entire key-value cache. However, this does not break the model's attention logic, as the cache implementation is separated from the attention computation itself. This trade-off is acceptable for the PTQ process, which is a one-time procedure. Our tests confirm that this workaround does not degrade accuracy on the MMLU or GSM8K benchmarks with the default ModelOpt settings.

### TRT flow

This section describes how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format. We use LLaMA's [convert_checkpoint.py](../llama/convert_checkpoint.py) for the EXAONE model and then build the model with `trtllm-build`.

#### Convert checkpoint and build TensorRT engine(s)

```bash
# Build a single-GPU float16 engine from HF weights.
# Build the EXAONE model using a single GPU and FP16.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/fp16/1-gpu \
    --output_dir trt_engines/exaone/fp16/1-gpu \
    --gemm_plugin auto

# Build the EXAONE model using a single GPU and apply INT8 weight-only quantization.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/int8_wq/1-gpu \
    --use_weight_only \
    --weight_only_precision int8 \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/int8_wq/1-gpu \
    --output_dir trt_engines/exaone/int8_wq/1-gpu \
    --gemm_plugin auto

# Build the EXAONE model using a single GPU and apply INT4 weight-only quantization.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/int4_wq/1-gpu \
    --use_weight_only \
    --weight_only_precision int4 \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/int4_wq/1-gpu \
    --output_dir trt_engines/exaone/int4_wq/1-gpu \
    --gemm_plugin auto

# Build the EXAONE model using 2-way tensor parallelism and FP16.
python ../llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/exaone/fp16/2-gpu \
    --tp_size 2 \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/exaone/fp16/2-gpu \
    --output_dir trt_engines/exaone/fp16/2-gpu \
    --gemm_plugin auto
```

> **NOTE**: The EXAONE model is not supported with `--load_by_shard`.

#### FP8 Post-Training Quantization

The examples below use the NVIDIA ModelOpt (AlgorithMic Model Optimization) toolkit for the model quantization process. First make sure the ModelOpt toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

```bash
# Build the EXAONE model using a single GPU and apply FP8 quantization.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir trt_models/exaone/fp8/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/fp8/1-gpu \
    --output_dir trt_engines/exaone/fp8/1-gpu \
    --gemm_plugin auto
```

#### SmoothQuant

The examples below use the NVIDIA ModelOpt (AlgorithMic Model Optimization) toolkit for the model quantization process. First make sure the ModelOpt toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

```bash
# Build the EXAONE model using a single GPU and apply INT8 SmoothQuant.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat int8_sq \
    --output_dir trt_models/exaone/int8_sq/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/int8_sq/1-gpu \
    --output_dir trt_engines/exaone/int8_sq/1-gpu \
    --gemm_plugin auto
```

#### Groupwise quantization (AWQ)

The examples below use the NVIDIA ModelOpt (AlgorithMic Model Optimization) toolkit for the model quantization process. First make sure the ModelOpt toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

```bash
# Build the EXAONE model using a single GPU and apply INT4 AWQ.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat int4_awq \
    --output_dir trt_models/exaone/int4_awq/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/int4_awq/1-gpu \
    --output_dir trt_engines/exaone/int4_awq/1-gpu \
    --gemm_plugin auto
```

##### W4A16 AWQ with FP8 GEMM (W4A8 AWQ)

For Hopper GPUs, TRT-LLM also supports employing FP8 GEMM to accelerate the linear layers. This mode is denoted `w4a8_awq` in ModelOpt and TRT-LLM; both weights and activations are converted from W4A16 to FP8 for the GEMM calculation. Please make sure your system contains a Hopper GPU before trying the commands below.

```bash
# Build the EXAONE model using a single GPU and apply W4A8 AWQ.
python ../../../quantization/quantize.py \
    --model_dir $HF_MODEL_DIR \
    --dtype float16 \
    --qformat w4a8_awq \
    --output_dir trt_models/exaone/w4a8_awq/1-gpu

trtllm-build \
    --checkpoint_dir trt_models/exaone/w4a8_awq/1-gpu \
    --output_dir trt_engines/exaone/w4a8_awq/1-gpu \
    --gemm_plugin auto
```

#### Run Engine

Test your engine with the [run.py](../../../run.py) script:

```bash
python3 ../../../run.py \
    --input_text "When did the first world war end?" \
    --max_output_len=100 \
    --tokenizer_dir $HF_MODEL_DIR \
    --engine_dir trt_engines/exaone/fp16/1-gpu

# Run with 2 GPUs
mpirun -n 2 --allow-run-as-root \
    python3 ../../../run.py \
        --input_text "When did the first world war end?" \
        --max_output_len=100 \
        --tokenizer_dir $HF_MODEL_DIR \
        --engine_dir trt_engines/exaone/fp16/2-gpu

python ../../../summarize.py \
    --test_trt_llm \
    --data_type fp16 \
    --hf_model_dir $HF_MODEL_DIR \
    --engine_dir trt_engines/exaone/fp16/1-gpu
```

For more examples, see [`examples/models/core/llama/README.md`](../llama/README.md).
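
As a complement to `run.py` and `summarize.py`, the sketch below shows one way to drive a built engine directly from Python with TensorRT-LLM's `ModelRunner`. It is a minimal, hedged sketch: the engine and tokenizer paths come from the commands above, but the exact `ModelRunner` arguments can vary across TensorRT-LLM versions, so treat `run.py` as the reference implementation.

```python
# Minimal sketch: generate from a built EXAONE engine with ModelRunner.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "trt_engines/exaone/fp16/1-gpu"  # engine built above
tokenizer_dir = "hf_models/exaone"            # $HF_MODEL_DIR

# EXAONE-3.0 checkpoints ship custom tokenizer code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, trust_remote_code=True)

runner = ModelRunner.from_dir(engine_dir=engine_dir)

prompt = "When did the first world war end?"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids[0].to(torch.int32)

output_ids = runner.generate(
    batch_input_ids=[prompt_ids],
    max_new_tokens=100,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
)

# output_ids has shape [batch, num_beams, seq_len]; strip the prompt before decoding.
generated = output_ids[0, 0, prompt_ids.shape[0]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
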