# GPT-NeoX
This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using TensorRT LLM and run it on a single GPU, or on a single node with multiple GPUs.
- [GPT-NeoX](#gpt-neox)
  - [Overview](#overview)
  - [Support Matrix](#support-matrix)
  - [Usage](#usage)
    - [1. Download weights from HuggingFace (HF) Transformers](#1-download-weights-from-huggingface-hf-transformers)
    - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
    - [4. Summarization using the GPT-NeoX model](#4-summarization-using-the-gpt-neox-model)
  - [Apply groupwise quantization (GPTQ)](#apply-groupwise-quantization-gptq)
    - [1. Download weights from HuggingFace (HF)](#1-download-weights-from-huggingface-hf)
    - [2. Generating quantized weights](#2-generating-quantized-weights)
    - [3. Convert weights from HF Transformers to TensorRT LLM format](#3-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [4. Build TensorRT engine(s)](#4-build-tensorrt-engines)
    - [5. Summarization using the GPT-NeoX model](#5-summarization-using-the-gpt-neox-model)
## Overview
The TensorRT LLM GPT-NeoX implementation can be found in [`tensorrt_llm/models/gptneox/model.py`](../../tensorrt_llm/models/gptneox/model.py). The TensorRT LLM GPT-NeoX example code is located in [`examples/models/contrib/gptneox`](./). There is one main file:
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
* [`../../../run.py`](../../../run.py) to run the inference on an input text;
* [`../../../summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.
## Support Matrix
* FP16
* INT8 Weight-Only
* INT4 GPTQ
* Tensor Parallel
## Usage
The TensorRT LLM GPT-NeoX example code is located in [examples/models/contrib/gptneox](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### 1. Download weights from HuggingFace (HF) Transformers
Please install the required packages first:
```bash
pip install -r requirements.txt
```
```bash
# Weights & config
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
```
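Note that the 20B checkpoint is stored with Git LFS. If the clone above only fetches small pointer files instead of the actual weights, install and enable Git LFS and pull again (a minimal sketch; the exact setup depends on your environment):
```bash
# Enable Git LFS, then fetch the large weight files into the cloned repo
git lfs install
git -C gptneox_model lfs pull
```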
### 2. Convert weights from HF Transformers to TensorRT LLM format
If you want to use INT8 weight-only quantization, simply add the `--use_weight_only` flag.
```bash
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/

# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --tp_size 2 \
                              --workers 2 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/

# Single GPU with int8 weight only
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/

# With 2-way Tensor Parallel with int8 weight only
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --tp_size 2 \
                              --workers 2 \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/
```
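After conversion, each output directory should contain a `config.json` describing the model plus one safetensors file per rank. A quick sanity check, assuming the single-GPU FP16 path above:
```bash
ls ./gptneox/20B/trt_ckpt/fp16/1-gpu/
# Expect something like: config.json  rank0.safetensors
```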
### 3. Build TensorRT engine(s)
```bash
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/fp16/1-gpu/

# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/fp16/2-gpu/

# Single GPU with int8 weight only
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/

# With 2-way Tensor Parallel with int8 weight only
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/
```
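Before running the full summarization benchmark, you can sanity-check an engine with the shared [`run.py`](../../../run.py) script. A minimal sketch using the single-GPU FP16 engine built above (the prompt text is arbitrary):
```bash
python3 ../../../run.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                        --tokenizer_dir gptneox_model \
                        --input_text "GPT-NeoX is a" \
                        --max_output_len 50
```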
### 4. Summarization using the GPT-NeoX model
The following section describes how to run a TensorRT LLM GPT-NeoX model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT-NeoX model.
```bash
# Single GPU
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16

# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16

# Single GPU with int8 weight only
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16

# With 2-way Tensor Parallel with int8 weight only
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
```
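As noted above, the same script can also run the summarization with the HF GPT-NeoX model for comparison. A sketch, assuming your copy of `summarize.py` exposes the `--test_hf` flag:
```bash
# Run the HF baseline only (no TensorRT engine required)
python3 ../../../summarize.py --test_hf \
                              --hf_model_dir gptneox_model \
                              --data_type fp16
```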
## Apply groupwise quantization (GPTQ)
### 1. Download weights from HuggingFace (HF)
```bash
# Weights & config
sh get_weights.sh
```
### 2. Generating quantized weights
In this example, the weights are quantized using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa). Note that the `--act-order` parameter, which controls whether to apply the activation-order GPTQ heuristic, is **not supported** by TensorRT LLM.
```bash
sh gptq_convert.sh
```
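Quantizing the 20B model can take a while. Once the script finishes, verify that the quantized safetensors file is present (this assumes `gptq_convert.sh` writes its output into the model directory, as step 3 below expects):
```bash
ls -lh ./gptneox_model/gptneox-20b-4bit-gs128.safetensors
```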
### 3. Convert weights from HF Transformers to TensorRT LLM format
To apply groupwise GPTQ quantization, additional command-line flags need to be passed to `convert_checkpoint.py`. Here, the `--quant_ckpt_path` flag specifies the safetensors file produced by the `gptq_convert.sh` script.
```bash
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/

# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --tp_size 2 \
                              --workers 2 \
                              --quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/
```
### 4. Build TensorRT engine(s)
The command to build TensorRT engines from the GPTQ-quantized checkpoints does not change:
```bash
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/

# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/
```
### 5. Summarization using the GPT-NeoX model
The command to run summarization with the GPTQ-quantized model also does not change:
```bash
# Single GPU
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16

# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
```