# GPT-J
This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using TensorRT-LLM and run it on a single GPU.
## Overview
The TensorRT-LLM GPT-J implementation can be found in [`tensorrt_llm/models/gptj/model.py`](../../tensorrt_llm/models/gptj/model.py). The TensorRT-LLM GPT-J example
code is located in [`examples/gptj`](./). There are three main files in that folder:
* [`build.py`](./build.py) to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the GPT-J model,
* [`run.py`](./run.py) to run the inference on an input text,
* [`summarize.py`](./summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset using the model.
## Support Matrix
* FP16
* FP8
* INT4 Weight-Only
* FP8 KV cache
## Usage
### 1. Download weights from HuggingFace (HF) Transformers
```bash
# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b gptj_model
pushd gptj_model && \
rm -f pytorch_model.bin && \
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
popd
# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```
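A quick, optional sanity check that the files landed where the build step expects them (paths assume the commands above were run from this directory):
```bash
# The model directory should contain the config and the full-precision checkpoint;
# the tokenizer files sit next to it.
ls -lh gptj_model/config.json gptj_model/pytorch_model.bin vocab.json merges.txt
```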
### 2. Build TensorRT engine(s)
TensorRT-LLM builds TensorRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build the engine(s) using
dummy weights.
Examples of build invocations:
```bash
# Build a float16 engine using HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model 2>&1 | tee build.log

# Build a float16 engine using dummy weights, useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine_dummy_weights 2>&1 | tee build.log

# Build an int4 weight only quantization engine using awq int4 weight only quantized weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --use_weight_only \
                 --per_group \
                 --weight_only_precision=int4 \
                 --model_dir=awq_int4_weight_only_quantized_models 2>&1 | tee build.log
```
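After any of the builds above completes, the serialized engine and its configuration are written to the directory passed via `--output_dir`; a quick optional check:
```bash
# The output directory should now contain the TensorRT engine file(s) and a config.json.
ls -lh gptj_engine
```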
#### FP8 Post-Training Quantization
The examples below use the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit for the model quantization process.
First, make sure the AMMO toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).
Then quantize the HF GPT-J weights as follows. After the script runs successfully, the output is in `.npz` format, e.g. `quantized_fp8/gptj_tp1_rank0.npz`,
which stores the FP8 scaling factors.
```bash
# Quantize HF GPT-J 6B checkpoint into FP8 format
python quantize.py --model_dir gptj_model \
                   --dtype float16 \
                   --qformat fp8 \
                   --export_path ./quantized_fp8 \
                   --calib_size 512

# Build GPT-J 6B using original HF checkpoint + PTQ scaling factors
python build.py --model_dir gptj_model \
                --quantized_fp8_model_path ./quantized_fp8/gptj_tp1_rank0.npz \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --enable_two_optimization_profiles \
                --output_dir gptj_engine_fp8_quantized \
                --enable_fp8 \
                --fp8_kv_cache \
                --strongly_typed
```
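If the quantization step succeeded, the exported file can be spot-checked before building (an optional check, assuming `numpy` is available in the environment; the array names inside the archive are determined by the AMMO export):
```bash
ls -lh quantized_fp8/gptj_tp1_rank0.npz
# List a few of the stored arrays.
python3 -c "import numpy as np; print(np.load('quantized_fp8/gptj_tp1_rank0.npz').files[:5])"
```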
#### Fused MultiHead Attention (FMHA)
You can enable the FMHA kernels for GPT-J by adding `--enable_context_fmha` to the invocation of `build.py`. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.
If the default fp16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can try enabling fp32 accumulation with `--enable_context_fmha_fp32_acc`. However, a performance drop is expected.
Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` has to be used together with `--use_gpt_attention_plugin float16`.
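For example, a build that uses FMHA with fp32 accumulation might look like the following (a minimal sketch: only the FMHA-related flag differs from the float16 example above, and the remaining options are unchanged):
```bash
# FMHA with FP32 accumulation; the GPT attention plugin is required for either FMHA flag.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --enable_context_fmha_fp32_acc \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model
```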
#### FP8 KV cache
One can enable FP8 for the KV cache to reduce its memory footprint and to improve accuracy over an INT8 KV cache. Three options need to be added to the invocation of `build.py` for that (a minimal sketch follows this list):
- `--enable_fp8` enables FP8 GEMMs in the network.
- `--fp8_kv_cache` enables FP8 precision for the KV cache.
- `--quantized_fp8_model_path` provides the path to the quantized model calibrated for FP8. For more details see the [quantization docs](../quantization/README.md).
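These flags are already exercised by the FP8 PTQ example above; in isolation, the FP8-related additions on top of a baseline float16 build look roughly like this (a sketch that reuses the quantized output from the previous step):
```bash
# FP8 GEMMs + FP8 KV cache, using the scaling factors exported by quantize.py.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --model_dir=gptj_model \
                 --output_dir=gptj_engine_fp8 \
                 --enable_fp8 \
                 --fp8_kv_cache \
                 --quantized_fp8_model_path ./quantized_fp8/gptj_tp1_rank0.npz
```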
#### AWQ INT4 weight only quantization
One can enable AWQ INT4 weight-only quantization with these 3 options when building the engine with `build.py`:
- `--use_weight_only` enables weight-only GEMMs in the network.
- `--per_group` enables groupwise weight-only quantization; for the GPT-J example, AWQ is supported with a default group size of 128.
- `--weight_only_precision=int4` sets the precision of weight-only quantization. Only int4 is supported for groupwise weight-only quantization.
Each linear layer in the AWQ INT4 weight-only quantized checkpoint should have 3 parameters (illustrated in the sketch after this list):
1. FP16 `smoothed_weights` (= `weights / pre_quant_scale`) with shape `[n, k]`;
2. FP16 `amax` (the max absolute values of the smoothed weights, taken per group) with shape `[n, k/group_size]`;
3. FP16 `pre_quant_scale` (the smooth scales that the activations are multiplied by) with shape `[k]`.
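The shape relationships above can be illustrated with synthetic tensors (illustrative only, assuming `numpy` is installed; the sizes and names below are hypothetical and not tied to any real checkpoint):
```bash
python3 - <<'EOF'
import numpy as np

n, k, group_size = 1024, 4096, 128          # hypothetical layer sizes
weights = np.random.randn(n, k).astype(np.float16)
pre_quant_scale = (np.random.rand(k) + 0.5).astype(np.float16)   # smooth scales, shape [k]

# smoothed_weights = weights / pre_quant_scale, shape [n, k]
smoothed_weights = (weights / pre_quant_scale).astype(np.float16)

# amax = per-group max of |smoothed_weights|, shape [n, k/group_size]
amax = np.abs(smoothed_weights).reshape(n, k // group_size, group_size).max(axis=-1)

print(smoothed_weights.shape, amax.shape, pre_quant_scale.shape)
EOF
```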
### 3. Run
To run a TensorRT-LLM GPT-J model:
```bash
python3 run.py --max_output_len=50 --engine_dir=gptj_engine
```
## Summarization using the GPT-J model
The following section describes how to run a TensorRT-LLM GPT-J model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT-J model.
As previously explained, the first step is to build the TensorRT engine as described above using HF weights. You also have to install the requirements:
```bash
pip install -r requirements.txt
```
The summarization can be done using the [`summarize.py`](./summarize.py) script as follows:
```bash
# Run the summarization task.
python3 summarize.py --engine_dir gptj_engine \
                     --model_dir gptj_model \
                     --test_hf \
                     --batch_size 1 \
                     --test_trt_llm \
                     --tensorrt_llm_rouge1_threshold 14 \
                     --data_type fp16 \
                     --check_accuracy
```
## Known issues
- You must enable the LayerNorm plugin to build the engine for GPT-J when using TensorRT 8.6; this constraint is removed in TensorRT 9.0. To enable the LayerNorm plugin, add `--use_layernorm_plugin <float16 or float32>` to the `build.py` command (see the `build.py` command examples above and the sketch below).
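For example, on TensorRT 8.6 the float16 build command above would gain one extra flag (a sketch; all other options stay as shown earlier):
```bash
# TensorRT 8.6 only: the LayerNorm plugin must be enabled explicitly.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_layernorm_plugin float16 \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model
```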