# GPT-NeoX

This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using TensorRT LLM and how to run it on a single GPU or on a single node with multiple GPUs.

- [GPT-NeoX](#gpt-neox)
  - [Overview](#overview)
  - [Support Matrix](#support-matrix)
  - [Usage](#usage)
    - [1. Download weights from HuggingFace (HF) Transformers](#1-download-weights-from-huggingface-hf-transformers)
    - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
    - [4. Summarization using the GPT-NeoX model](#4-summarization-using-the-gpt-neox-model)
  - [Apply groupwise quantization GPTQ](#apply-groupwise-quantization-gptq)
    - [1. Download weights from HuggingFace (HF)](#1-download-weights-from-huggingface-hf)
    - [2. Generating quantized weights](#2-generating-quantized-weights)
    - [3. Convert weights from HF Transformers to TensorRT LLM format](#3-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [4. Build TensorRT engine(s)](#4-build-tensorrt-engines)
    - [5. Summarization using the GPT-NeoX model](#5-summarization-using-the-gpt-neox-model)

## Overview

The TensorRT LLM GPT-NeoX implementation can be found in [`tensorrt_llm/models/gptneox/model.py`](../../tensorrt_llm/models/gptneox/model.py). The TensorRT LLM GPT-NeoX example code is located in [`examples/models/contrib/gptneox`](./). There is one main file:

* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.

In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:

* [`../../../run.py`](../../../run.py) to run inference on an input text (a sanity-check example is shown after the engine-build step below);
* [`../../../summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.

## Support Matrix

* FP16
* INT8 Weight-Only
* INT4 GPTQ
* Tensor Parallel

## Usage

The TensorRT LLM GPT-NeoX example code is located in [examples/models/contrib/gptneox](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.

### 1. Download weights from HuggingFace (HF) Transformers

Please install the required packages first:

```bash
pip install -r requirements.txt
```

Then download the weights and config from HuggingFace:

```bash
# Weights & config
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
```

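Note: the GPT-NeoX-20B weight files on HuggingFace are stored with Git LFS, so Git LFS should be enabled before running the `git clone` command above; otherwise only small pointer files are fetched. A one-time setup, assuming the `git-lfs` package is already installed on the system:

```bash
# Enable Git LFS once per machine so that `git clone` downloads the real weight files
git lfs install
```
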
### 2. Convert weights from HF Transformers to TensorRT LLM format

If you want to use INT8 weight-only quantization, simply add the `--use_weight_only` flag.

```bash
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/
# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --tp_size 2 \
                              --workers 2 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/
# Single GPU with INT8 weight-only quantization
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/
# With 2-way Tensor Parallel and INT8 weight-only quantization
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --tp_size 2 \
                              --workers 2 \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/
```

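After conversion, each output directory should contain a `config.json` plus one `rank<i>.safetensors` file per tensor-parallel rank (exact file names may differ slightly between TensorRT LLM versions). A quick check for the 2-GPU FP16 checkpoint converted above:

```bash
# Expect config.json, rank0.safetensors and rank1.safetensors (names may vary by version)
ls ./gptneox/20B/trt_ckpt/fp16/2-gpu/
```
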
### 3. Build TensorRT engine(s)

```bash
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/fp16/1-gpu/
# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/fp16/2-gpu/
# Single GPU with INT8 weight-only quantization
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/
# With 2-way Tensor Parallel and INT8 weight-only quantization
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/
```

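Before moving on to the summarization task, you can sanity-check an engine with the shared [`run.py`](../../../run.py) script mentioned in the Overview. A minimal sketch using the FP16 single-GPU engine built above; the prompt and output length are only illustrative:

```bash
# Quick smoke test of the FP16 single-GPU engine; prompt and --max_output_len are arbitrary
python3 ../../../run.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                        --tokenizer_dir gptneox_model \
                        --input_text "GPT-NeoX-20B is a language model that" \
                        --max_output_len 64
```
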
### 4. Summarization using the GPT-NeoX model

The following section describes how to run a TensorRT LLM GPT-NeoX model to summarize the articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation. The script can also perform the same summarization using the HF GPT-NeoX model.

```bash
# Single GPU
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16
# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
# Single GPU with INT8 weight-only quantization
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16
# With 2-way Tensor Parallel and INT8 weight-only quantization
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
```

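For a quicker smoke test, the shared script can limit how many batches are evaluated. The `--max_ite` flag below is an assumption about the shared `summarize.py` script and may be named differently in your release:

```bash
# Evaluate only a few batches instead of the full dataset (--max_ite assumed to be available)
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16 \
                              --max_ite 5
```
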
## Apply groupwise quantization GPTQ

### 1. Download weights from HuggingFace (HF)

```bash
# Weights & config
sh get_weights.sh
```

### 2. Generating quantized weights

In this example, the weights are quantized using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa). Note that the `--act-order` parameter, which controls whether the activation-order GPTQ heuristic is applied, is **not supported** by TensorRT LLM.

```bash
sh gptq_convert.sh
```

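When the script finishes, the quantized weights should be available as a single safetensors file next to the model; the conversion step below expects the path `./gptneox_model/gptneox-20b-4bit-gs128.safetensors`:

```bash
# Verify that the quantized checkpoint referenced in the next step exists
ls -lh ./gptneox_model/gptneox-20b-4bit-gs128.safetensors
```
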
### 3. Convert weights from HF Transformers to TensorRT LLM format

To apply groupwise quantization GPTQ, additional command-line flags need to be passed to `convert_checkpoint.py`.
The `--quant_ckpt_path` flag specifies the safetensors file produced by the `gptq_convert.sh` script.

```bash
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/
# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --tp_size 2 \
                              --workers 2 \
                              --quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/
```

### 4. Build TensorRT engine(s)

The command to build TensorRT engines from the GPTQ checkpoints does not change:

```bash
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/
# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/
```

### 5. Summarization using the GPT-NeoX model

The command to run summarization with the GPTQ-quantized model also does not change:

```bash
# Single GPU
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16
# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
```