# GPT-J

This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using TensorRT-LLM and run it on a single GPU.

## Overview

The TensorRT-LLM GPT-J implementation can be found in [`tensorrt_llm/models/gptj/model.py`](../../tensorrt_llm/models/gptj/model.py). The TensorRT-LLM GPT-J example code is located in [`examples/gptj`](./). There are three main files in that folder:

* [`build.py`](./build.py) to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the GPT-J model,
* [`run.py`](./run.py) to run the inference on an input text,
* [`summarize.py`](./summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset using the model.

## Support Matrix

* FP16
* FP8
* INT4 Weight-Only
* FP8 KV cache

## Usage
### 1. Download weights from HuggingFace (HF) Transformers

```bash
# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b gptj_model
pushd gptj_model && \
  rm -f pytorch_model.bin && \
  wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
popd

# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```

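Optionally, run a quick sanity check before building. This is only a sketch; it assumes you ran the commands above from the current directory, so the checkpoint lives under `gptj_model/` and the tokenizer files sit next to it:

```bash
# Verify the full checkpoint (not an LFS pointer) and the tokenizer files are present.
ls -lh gptj_model/pytorch_model.bin gptj_model/config.json vocab.json merges.txt
```
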
### 2. Build TensorRT engine(s)

TensorRT-LLM builds TensorRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build the engine(s) using dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model 2>&1 | tee build.log

# Build a float16 engine using dummy weights, useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine_dummy_weights 2>&1 | tee build.log

# Build an INT4 weight-only quantization engine using AWQ INT4 weight-only quantized weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --use_weight_only \
                 --per_group \
                 --weight_only_precision=int4 \
                 --model_dir=awq_int4_weight_only_quantized_models 2>&1 | tee build.log
```

#### FP8 Post-Training Quantization

The example below uses the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit to quantize the model.

First, make sure the AMMO toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

Now quantize the HF GPT-J weights as follows. After the script runs successfully, the output is an `.npz` file, e.g. `quantized_fp8/gptj_tp1_rank0.npz`, which stores the FP8 scaling factors.

```bash
# Quantize the HF GPT-J 6B checkpoint into FP8 format.
python quantize.py --model_dir gptj_model \
                   --dtype float16 \
                   --qformat fp8 \
                   --export_path ./quantized_fp8 \
                   --calib_size 512

# Build GPT-J 6B using the original HF checkpoint + PTQ scaling factors.
python build.py --model_dir gptj_model \
                --quantized_fp8_model_path ./quantized_fp8/gptj_tp1_rank0.npz \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --enable_two_optimization_profiles \
                --output_dir gptj_engine_fp8_quantized \
                --enable_fp8 \
                --fp8_kv_cache \
                --strongly_typed
```

#### Fused MultiHead Attention (FMHA)

You can enable the FMHA kernels for GPT-J by adding `--enable_context_fmha` to the invocation of `build.py`. Note that it is disabled by default because of possible accuracy issues caused by the use of Flash Attention.

If the default FP16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can try FP32 accumulation by adding `--enable_context_fmha_fp32_acc` instead. However, a performance drop is expected.

Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` has to be used together with `--use_gpt_attention_plugin float16`.

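As a sketch of the FP32-accumulation variant, reusing the float16 build example from the Usage section (the only assumption here is that `--enable_context_fmha_fp32_acc` is passed in place of `--enable_context_fmha`):

```bash
# Same float16 build as above, but with FP32 accumulation in the context FMHA kernels.
python3 build.py --dtype=float16 \
                 --enable_context_fmha_fp32_acc \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model 2>&1 | tee build.log
```
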
#### FP8 KV cache

You can enable FP8 for the KV cache to reduce the memory footprint of the KV cache and to improve accuracy over an INT8 KV cache. Three options need to be added to the invocation of `build.py`, as combined in the sketch after this list:

- `--enable_fp8` enables FP8 GEMMs in the network.
- `--fp8_kv_cache` enables FP8 precision for the KV cache.
- `--quantized_fp8_model_path` provides the path to the quantized model calibrated for FP8. For more details, see the [quantization docs](../quantization/README.md).

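The sketch below simply restates the relevant flags from the FP8 post-training quantization example above; the checkpoint path and output directory are the ones used in that example:

```bash
# Build with FP8 GEMMs and an FP8 KV cache, using the calibrated FP8 scaling factors.
python build.py --model_dir gptj_model \
                --quantized_fp8_model_path ./quantized_fp8/gptj_tp1_rank0.npz \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --enable_fp8 \
                --fp8_kv_cache \
                --output_dir gptj_engine_fp8_quantized
```
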
#### AWQ INT4 weight only quantization

You can enable AWQ INT4 weight-only quantization with these three options when building the engine with `build.py` (see the third build invocation above for a complete example):

- `--use_weight_only` enables weight-only GEMMs in the network.
- `--per_group` enables groupwise weight-only quantization. For the GPT-J example, AWQ is supported with a default group size of 128.
- `--weight_only_precision=int4` sets the precision of the weight-only quantization. Only INT4 is supported for groupwise weight-only quantization.

Each linear layer in the AWQ INT4 weight-only quantized weights should have three parameters:

1. FP16 `smoothed_weights` (= `weights / pre_quant_scale`) with shape `[n, k]`;
2. FP16 `amax` (the maximum absolute values of the smoothed weights) with shape `[n, k/group_size]`;
3. FP16 `pre_quant_scale` (the smoothing scales that the activations are multiplied by) with shape `[k]`.

### 3. Run

To run a TensorRT-LLM GPT-J model:

```bash
python3 run.py --max_output_len=50 --engine_dir=gptj_engine
```

## Summarization using the GPT-J model

The following section describes how to run a TensorRT-LLM GPT-J model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT-J model.

The first step is to build the TensorRT engine from HF weights, as described above. You also have to install the requirements:

```bash
pip install -r requirements.txt
```

The summarization can be done using the [`summarize.py`](./summarize.py) script as follows:

```bash
# Run the summarization task.
python3 summarize.py --engine_dir gptj_engine \
                     --model_dir gptj_model \
                     --test_hf \
                     --batch_size 1 \
                     --test_trt_llm \
                     --tensorrt_llm_rouge1_threshold 14 \
                     --data_type fp16 \
                     --check_accuracy
```

## Known issues

- When using TensorRT 8.6, you must enable the LayerNorm plugin to build the engine for GPT-J; this constraint is removed in TensorRT 9.0. To enable the LayerNorm plugin, add `--use_layernorm_plugin <float16 or float32>` to the `build.py` invocation (see the `build.py` command examples above and the sketch below).

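For example, on TensorRT 8.6 the float16 build from the Usage section would additionally pass the LayerNorm plugin flag. This is only a sketch; apart from `--use_layernorm_plugin`, every flag is taken from the earlier example:

```bash
# TensorRT 8.6 only: the LayerNorm plugin must be enabled to build GPT-J engines.
python3 build.py --dtype=float16 \
                 --use_layernorm_plugin float16 \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --remove_input_padding \
                 --output_dir=gptj_engine \
                 --model_dir=gptj_model 2>&1 | tee build.log
```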