# GPT-J
This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using TensorRT-LLM and run it on a single GPU.
## Overview
The TensorRT-LLM GPT-J implementation can be found in [`tensorrt_llm/models/gptj/model.py`](../../tensorrt_llm/models/gptj/model.py). The TensorRT-LLM GPT-J example
code is located in [`examples/gptj`](./). There are three main files in that folder:
* [`build.py`](./build.py) to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the GPT-J model,
* [`run.py`](./run.py) to run the inference on an input text,
* [`summarize.py`](./summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset using the model.
## Support Matrix
* FP16
* FP8
* INT4 Weight-Only
* FP8 KV cache
## Usage
### 1. Download weights from HuggingFace (HF) Transformers
```bash
# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b gptj_model
pushd gptj_model && \
rm -f pytorch_model.bin && \
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
popd
# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```
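A quick, optional sanity check that the files landed where the build step expects them (paths assume the commands above were run from this directory):
```bash
# The model directory should contain the config and the full-precision checkpoint;
# the tokenizer files sit next to it.
ls -lh gptj_model/config.json gptj_model/pytorch_model.bin vocab.json merges.txt
```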
### 2. Build TensorRT engine(s)
TensorRT-LLM builds TensorRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build the engine(s) using
dummy weights.
Examples of build invocations:
```bash
# Build a float16 engine using HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model 2>&1 | tee build.log

# Build a float16 engine using dummy weights, useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine_dummy_weights 2>&1 | tee build.log

# Build an int4 weight only quantization engine using awq int4 weight only quantized weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --use_weight_only \
                 --per_group \
                 --weight_only_precision=int4 \
                 --model_dir=awq_int4_weight_only_quantized_models 2>&1 | tee build.log
```
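After any of the builds above completes, the serialized engine and its configuration are written to the directory passed via `--output_dir`; a quick optional check:
```bash
# The output directory should now contain the TensorRT engine file(s) and a config.json.
ls -lh gptj_engine
```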
#### FP8 Post-Training Quantization
The examples below use the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the AMMO toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).
Then quantize the HF GPT-J weights as follows. After the script runs successfully, the output is in `.npz` format, e.g. `quantized_fp8/gptj_tp1_rank0.npz`,
which stores the FP8 scaling factors.
```bash
# Quantize HF GPT-J 6B checkpoint into FP8 format
python quantize.py --model_dir gptj_model \
                   --dtype float16 \
                   --qformat fp8 \
                   --export_path ./quantized_fp8 \
                   --calib_size 512

# Build GPT-J 6B using original HF checkpoint + PTQ scaling factors
python build.py --model_dir gptj_model \
                --quantized_fp8_model_path ./quantized_fp8/gptj_tp1_rank0.npz \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --enable_two_optimization_profiles \
                --output_dir gptj_engine_fp8_quantized \
                --enable_fp8 \
                --fp8_kv_cache \
                --strongly_typed
```
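If the quantization step succeeded, the exported file can be spot-checked before building (an optional check, assuming `numpy` is available in the environment; the array names inside the archive are determined by the AMMO export):
```bash
ls -lh quantized_fp8/gptj_tp1_rank0.npz
# List a few of the stored arrays.
python3 -c "import numpy as np; print(np.load('quantized_fp8/gptj_tp1_rank0.npz').files[:5])"
```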
#### Fused MultiHead Attention (FMHA)
You can enable the FMHA kernels for GPT-J by adding `--enable_context_fmha` to the invocation of `build.py`. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.
If the default fp16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can try enabling fp32 accumulation with `--enable_context_fmha_fp32_acc`. However, a performance drop is expected.
Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` has to be used together with `--use_gpt_attention_plugin float16`.
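For example, a build that uses FMHA with fp32 accumulation might look like the following (a minimal sketch: only the FMHA-related flag differs from the float16 example above, and the remaining options are unchanged):
```bash
# FMHA with FP32 accumulation; the GPT attention plugin is required for either FMHA flag.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --enable_context_fmha_fp32_acc \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model
```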
#### FP8 KV cache
One can enable FP8 for the KV cache to reduce its memory footprint and to improve accuracy over an INT8 KV cache. Three options need to be added to the invocation of `build.py` for that (a minimal sketch follows this list):
- `--enable_fp8` enables FP8 GEMMs in the network.
- `--fp8_kv_cache` enables FP8 precision for the KV cache.
- `--quantized_fp8_model_path` provides the path to the quantized model calibrated for FP8. For more details see the [quantization docs](../quantization/README.md).
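These flags are already exercised by the FP8 PTQ example above; in isolation, the FP8-related additions on top of a baseline float16 build look roughly like this (a sketch that reuses the quantized output from the previous step):
```bash
# FP8 GEMMs + FP8 KV cache, using the scaling factors exported by quantize.py.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --model_dir=gptj_model \
                 --output_dir=gptj_engine_fp8 \
                 --enable_fp8 \
                 --fp8_kv_cache \
                 --quantized_fp8_model_path ./quantized_fp8/gptj_tp1_rank0.npz
```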
#### AWQ INT4 weight only quantization
One can enable AWQ INT4 weight-only quantization with these 3 options when building the engine with `build.py`:
- `--use_weight_only` enables weight-only GEMMs in the network.
- `--per_group` enables groupwise weight-only quantization; for the GPT-J example, AWQ is supported with a default group size of 128.
- `--weight_only_precision=int4` sets the precision of weight-only quantization. Only int4 is supported for groupwise weight-only quantization.
Each linear layer in the AWQ INT4 weight-only quantized checkpoint should have 3 parameters (illustrated in the sketch after this list):
1. FP16 `smoothed_weights` (= `weights / pre_quant_scale`) with shape `[n, k]`;
2. FP16 `amax` (the max absolute values of the smoothed weights, taken per group) with shape `[n, k/group_size]`;
3. FP16 `pre_quant_scale` (the smooth scales that the activations are multiplied by) with shape `[k]`.
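The shape relationships above can be illustrated with synthetic tensors (illustrative only, assuming `numpy` is installed; the sizes and names below are hypothetical and not tied to any real checkpoint):
```bash
python3 - <<'EOF'
import numpy as np

n, k, group_size = 1024, 4096, 128          # hypothetical layer sizes
weights = np.random.randn(n, k).astype(np.float16)
pre_quant_scale = (np.random.rand(k) + 0.5).astype(np.float16)   # smooth scales, shape [k]

# smoothed_weights = weights / pre_quant_scale, shape [n, k]
smoothed_weights = (weights / pre_quant_scale).astype(np.float16)

# amax = per-group max of |smoothed_weights|, shape [n, k/group_size]
amax = np.abs(smoothed_weights).reshape(n, k // group_size, group_size).max(axis=-1)

print(smoothed_weights.shape, amax.shape, pre_quant_scale.shape)
EOF
```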
### 3. Run
To run a TensorRT-LLM GPT-J model:
```bash
python3 run.py --max_output_len=50 --engine_dir=gptj_engine
```
## Summarization using the GPT-J model
The following section describes how to run a TensorRT-LLM GPT-J model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT-J model.
As previously explained, the first step is to build the TensorRT engine as described above using HF weights. You also have to install the requirements:
```bash
pip install -r requirements.txt
```
The summarization can be done using the [`summarize.py`](./summarize.py) script as follows:
```bash
# Run the summarization task.
python3 summarize.py --engine_dir gptj_engine \
                     --model_dir gptj_model \
                     --test_hf \
                     --batch_size 1 \
                     --test_trt_llm \
                     --tensorrt_llm_rouge1_threshold 14 \
                     --data_type fp16 \
                     --check_accuracy
```
## Known issues
- You must enable the LayerNorm plugin to build the engine for GPT-J when using TensorRT 8.6; this constraint is removed in TensorRT 9.0. To enable the LayerNorm plugin, add `--use_layernorm_plugin <float16 or float32>` to the `build.py` command (see the `build.py` command examples above and the sketch below).
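For example, on TensorRT 8.6 the float16 build command above would gain one extra flag (a sketch; all other options stay as shown earlier):
```bash
# TensorRT 8.6 only: the LayerNorm plugin must be enabled explicitly.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_layernorm_plugin float16 \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model
```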