# GPT-NeoX

This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using TensorRT LLM and run it on a single GPU or on a single node with multiple GPUs.

- [GPT-NeoX](#gpt-neox)
  - [Overview](#overview)
  - [Support Matrix](#support-matrix)
  - [Usage](#usage)
    - [1. Download weights from HuggingFace (HF) Transformers](#1-download-weights-from-huggingface-hf-transformers)
    - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
    - [4. Summarization using the GPT-NeoX model](#4-summarization-using-the-gpt-neox-model)
  - [Apply groupwise quantization GPTQ](#apply-groupwise-quantization-gptq)
    - [1. Download weights from HuggingFace (HF)](#1-download-weights-from-huggingface-hf)
    - [2. Generating quantized weights](#2-generating-quantized-weights)
    - [3. Convert weights from HF Transformers to TensorRT LLM format](#3-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [4. Build TensorRT engine(s)](#4-build-tensorrt-engines)
    - [5. Summarization using the GPT-NeoX model](#5-summarization-using-the-gpt-neox-model)

## Overview

The TensorRT LLM GPT-NeoX implementation can be found in [`tensorrt_llm/models/gptneox/model.py`](../../tensorrt_llm/models/gptneox/model.py). The TensorRT LLM GPT-NeoX example code is located in [`examples/models/contrib/gptneox`](./). There is one main file:

* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.

In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:

* [`../../../run.py`](../../../run.py) to run the inference on an input text;
* [`../../../summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.

## Support Matrix

* FP16
* INT8 Weight-Only
* INT4 GPTQ
* Tensor Parallel

## Usage

The TensorRT LLM GPT-NeoX example code is located in [`examples/models/contrib/gptneox`](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.

### 1. Download weights from HuggingFace (HF) Transformers

Please install the required packages first:

```bash
pip install -r requirements.txt
```

```bash
# Weights & config
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
```

### 2. Convert weights from HF Transformers to TensorRT LLM format

To use INT8 weight-only quantization, simply add the `--use_weight_only` flag.

```bash
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/

# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --tp_size 2 \
                              --workers 2 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/

# Single GPU with INT8 weight-only quantization
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/

# With 2-way Tensor Parallel and INT8 weight-only quantization
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --tp_size 2 \
                              --workers 2 \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/
```
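Before building engines, you can optionally sanity-check the conversion output. As a rough guide (the exact file layout may vary across TensorRT LLM versions), a converted checkpoint directory typically contains a `config.json` plus one safetensors file per rank:

```bash
# Inspect the converted single-GPU checkpoint; the listing below is
# illustrative and may differ by TensorRT LLM version.
ls ./gptneox/20B/trt_ckpt/fp16/1-gpu/
# config.json  rank0.safetensors
```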
### 3. Build TensorRT engine(s)

```bash
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/fp16/1-gpu/

# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/fp16/2-gpu/

# Single GPU with INT8 weight-only quantization
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/

# With 2-way Tensor Parallel and INT8 weight-only quantization
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/
```

### 4. Summarization using the GPT-NeoX model

The following section describes how to run a TensorRT LLM GPT-NeoX model to summarize the articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script computes the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and uses the `ROUGE-1` score to validate the implementation. The script can also perform the same summarization using the HF GPT-NeoX model.

```bash
# Single GPU
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16

# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16

# Single GPU with INT8 weight-only quantization
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16

# With 2-way Tensor Parallel and INT8 weight-only quantization
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
```

## Apply groupwise quantization GPTQ

### 1. Download weights from HuggingFace (HF)

```bash
# Weights & config
sh get_weights.sh
```

### 2. Generating quantized weights

In this example, the weights are quantized using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa). Note that the `--act-order` parameter, which controls whether the activation-order GPTQ heuristic is applied, is **not supported** by TensorRT LLM.

```bash
sh gptq_convert.sh
```
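Before moving on, it is worth confirming that quantization produced the 4-bit, group-size-128 safetensors file that the next step consumes (the path below is the one this example assumes throughout):

```bash
# Verify the GPTQ output exists; this exact path is passed to
# --quant_ckpt_path in the next step.
ls -lh ./gptneox_model/gptneox-20b-4bit-gs128.safetensors
```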
### 3. Convert weights from HF Transformers to TensorRT LLM format

To apply groupwise quantization GPTQ, additional command-line flags need to be passed to `convert_checkpoint.py`. The `--quant_ckpt_path` flag specifies the safetensors file produced by the `gptq_convert.sh` script.

```bash
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/

# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --tp_size 2 \
                              --workers 2 \
                              --quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/
```

### 4. Build TensorRT engine(s)

The command to build TensorRT engines does not change when applying GPTQ:

```bash
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --output_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/

# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/ \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --max_input_len 924 \
             --max_seq_len 1024 \
             --workers 2 \
             --output_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/
```

### 5. Summarization using the GPT-NeoX model

The command to run summarization with the GPTQ-quantized model also does not change:

```bash
# Single GPU
python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/ \
                              --test_trt_llm \
                              --hf_model_dir gptneox_model \
                              --data_type fp16

# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../../../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/ \
                                  --test_trt_llm \
                                  --hf_model_dir gptneox_model \
                                  --data_type fp16
```
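As a quick functional check beyond summarization, any of the engines built above can also be used for free-form generation with the shared `run.py` script mentioned in the Overview. A minimal sketch, assuming the FP16 single-GPU engine from earlier and using the HF checkout as the tokenizer source:

```bash
# Generate a short completion: --engine_dir points at the trtllm-build
# output and --tokenizer_dir at the original HF model directory.
python3 ../../../run.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                        --tokenizer_dir gptneox_model \
                        --input_text "GPT-NeoX is a" \
                        --max_output_len 32
```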