# GPT

This document explains how to build the [GPT](https://huggingface.co/gpt2) model using TensorRT-LLM and run it on a single GPU, a single node with
multiple GPUs or multiple nodes with multiple GPUs.

## Overview

The TensorRT-LLM GPT implementation can be found in [`tensorrt_llm/models/gpt/model.py`](../../tensorrt_llm/models/gpt/model.py). The TensorRT-LLM GPT example code is located in [`examples/gpt`](./). There are two main files:

* [`hf_gpt_convert.py`](./hf_gpt_convert.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers)
  format to the [FasterTransformer (FT)](https://github.com/NVIDIA/FasterTransformer) format,
* [`build.py`](./build.py) to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the GPT model.

In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:

* [`../run.py`](../run.py) to run the inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.

## Support Matrix

* FP16
* FP8
* Inflight Batching
* PAGED_KV_CACHE
* FP8 KV CACHE
* Tensor Parallel
* STRONGLY TYPED

## Usage

The next two sections describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers)
format to the FT format. You can skip those two sections if you already have weights in the
FT format.

Note, also, that if your weights are neither in HF Transformers nor in FT format, you will need to convert them to the FT format. A script like
[`hf_gpt_convert.py`](./hf_gpt_convert.py) can serve as a starting point.

### 1. Download weights from HuggingFace Transformers

```bash
# Weights & config
rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd
```

### 2. Convert weights from HF Transformers to FT format

TensorRT-LLM can directly load weights from FT. The [`hf_gpt_convert.py`](./hf_gpt_convert.py) script allows you to convert weights from HF Transformers
format to FT format.

```bash
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16
```

This script uses multiple processes to speed up writing the model to disk. This may saturate your RAM depending on the model you are exporting.
If that happens, you can reduce the number of processes with `--processes <num_processes>`. Set it to 1 for minimal RAM usage.

### 3. Build TensorRT engine(s)

TensorRT-LLM builds TensorRT engine(s) using a checkpoint in FT format. The checkpoint directory provides the model's weights, architecture configuration
and custom tokenizer if specified. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) using random weights. When building with
random weights, you can use command-line arguments to modify the architecture: `--n_layer, --n_head, --n_embd, --hidden_act, --no_bias, ...`
Also, note that the number of TensorRT engines depends on the number of GPUs that will be used to run inference.

The [`build.py`](./build.py) script requires a single GPU to build the TensorRT engine(s). However, if you have more than one GPU in your system (of the same
model), you can enable parallel builds to accelerate the engine building process. For that, add the `--parallel_build` argument to the build command. Please
note that for the moment, the `parallel_build` feature cannot take advantage of more than a single node.

Examples of build invocations:

```bash
# Build a single-GPU float16 engine using FT weights.
# Enable the special TensorRT-LLM GPT Attention plugin (--use_gpt_attention_plugin) to increase runtime performance.
# It is recommended to use --remove_input_padding along with --use_gpt_attention_plugin for better performance.
python3 build.py --model_dir=./c-model/gpt2/1-gpu --use_gpt_attention_plugin --remove_input_padding

# Build 8-GPU GPT-175B float16 engines using dummy weights, useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --world_size=8 \
                 --log_level=verbose \
                 --n_layer=96 \
                 --n_embd=12288 \
                 --n_head=96 \
                 --max_batch_size=256 \
                 --dtype float16 \
                 --remove_input_padding \
                 --use_gpt_attention_plugin \
                 --enable_context_fmha \
                 --use_gemm_plugin \
                 --output_dir=gpt_175b 2>&1 | tee build.log

# Build 16-GPU GPT-530B float16 engines using dummy weights, useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --world_size=16 \
                 --log_level=info \
                 --n_layer=105 \
                 --n_embd=20480 \
                 --n_head=128 \
                 --max_batch_size=128 \
                 --max_input_len=128 \
                 --max_output_len=20 \
                 --dtype float16 \
                 --remove_input_padding \
                 --use_gpt_attention_plugin \
                 --enable_context_fmha \
                 --use_gemm_plugin \
                 --output_dir=gpt_530b 2>&1 | tee build.log
```

#### Fused MultiHead Attention (FMHA)

You can enable the FMHA kernels for GPT by adding `--enable_context_fmha` to the invocation of `build.py`.

If the default fp16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can try enabling fp32 accumulation by adding `--enable_context_fmha_fp32_acc`. However, a performance drop is expected.

Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` has to be used together with `--use_gpt_attention_plugin float16`.

#### In-flight batching and paged KV cache

If one wants to use [in-flight batching in the C++ runtime](../../docs/in_flight_batching.md), the engine must be built accordingly.
In-flight batching is enabled by adding `--use_inflight_batching` to the invocation of `build.py`.
Note that in-flight batching in the C++ runtime works only with the attention plugin `--use_gpt_attention_plugin=float16`, paged KV cache `--paged_kv_cache` and packed data `--remove_input_padding`.
Adding `--use_inflight_batching` will enable these three flags if they are not already enabled. It is possible to choose a different precision for `--use_gpt_attention_plugin` if the flag is provided separately.
One can additionally control the size of the block in the paged KV cache using `--tokens_per_block=N`.

### 4. Run

#### Single node, single GPU

To run a TensorRT-LLM GPT model on a single GPU, you can use `python3`:

```bash
# Run the GPT-350M model on a single GPU.
python3 ../run.py --max_output_len=8 --no_add_special_tokens
```

#### Single node, multiple GPUs

To run a model using multiple GPUs on a single node, you can use `mpirun` as:

```bash
# Run the GPT-175B model on a single node using multiple GPUs.
mpirun -np 8 python3 ../run.py --max_output_len=8 --engine_dir=gpt_175b --no_add_special_tokens
```

#### Multiple nodes, multiple GPUs using [Slurm](https://slurm.schedmd.com)

To run a model using multiple nodes, you should use a cluster manager like `Slurm`. The following section shows how to configure
TensorRT-LLM to execute on two nodes using Slurm.

We start by preparing an `sbatch` script called `tensorrt_llm_run.sub`. That script contains the following code (you must replace
the `<REPLACE ...>` strings with your own values):

```bash
#!/bin/bash
#SBATCH -o logs/tensorrt_llm.out
#SBATCH -e logs/tensorrt_llm.error
#SBATCH -J <REPLACE WITH YOUR JOB's NAME>
#SBATCH -A <REPLACE WITH YOUR ACCOUNT's NAME>
#SBATCH -p <REPLACE WITH YOUR PARTITION's NAME>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00

sudo nvidia-smi -lgc 1410,1410

srun --mpi=pmix \
     --container-image <image> \
     --container-mounts <path>:<path> \
     --container-workdir <path> \
     --output logs/tensorrt_llm_%t.out \
     --error logs/tensorrt_llm_%t.error python3 -u ../run.py --max_output_len=8 --engine_dir <engine_dir> --no_add_special_tokens
```

Then, submit the job using:

```bash
sbatch tensorrt_llm_run.sub
```

You might have to contact your cluster's administrator to help you customize the above script.

## GPT Variant - SantaCoder

SantaCoder extends the existing GPT model with a multi-query attention mechanism. The following example shows how to build a 4-GPU engine and run a simple prompt to generate the implementation of `print_hello_world()`.

The main differences in this example are:

1. In the model conversion step, `hf_gpt_convert.py` requires the extra option `--model santacoder` to convert the checkpoint correctly.
2. In the engine execution step, `../run.py` requires `--tokenizer_dir ./santacoder` to decode the output ids correctly.

```bash
git clone https://huggingface.co/bigcode/santacoder

python3 hf_gpt_convert.py -p 8 --model santacoder -i ./santacoder -o ./c-model/santacoder --tensor-parallelism 4 --storage-type float16

python3 build.py \
    --model_dir ./c-model/santacoder/4-gpu \
    --remove_input_padding \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir santacoder_outputs_tp4 \
    --world_size 4

mpirun -np 4 python3 ../run.py --engine_dir santacoder_outputs_tp4 --tokenizer_dir ./santacoder --input_text "def print_hello_world():" --max_output_len 20 --no_add_special_tokens
```

## GPT Variant - StarCoder

For StarCoder, the steps are similar except that `santacoder` is swapped with `starcoder`.

```bash
git clone https://huggingface.co/bigcode/starcoder

python3 hf_gpt_convert.py -p 8 --model starcoder -i ./starcoder -o ./c-model/starcoder --tensor-parallelism 4 --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/4-gpu \
    --remove_input_padding \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp4 \
    --world_size 4

mpirun -np 4 python3 ../run.py --engine_dir starcoder_outputs_tp4 --tokenizer_dir ./starcoder --input_text "def print_hello_world():" --max_output_len 20 --no_add_special_tokens
```

## Summarization using the GPT model

The following section describes how to run a TensorRT-LLM GPT model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT model.

As previously explained, the first step is to convert from an HF checkpoint and build the TensorRT engines.

```bash
# Load the GPT2 weights from the HF hub.
pip install -r requirements.txt
rm -rf gpt2 && git clone https://huggingface.co/gpt2
pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2/resolve/main/pytorch_model.bin && popd

# Convert the weights to FT format.
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2/fp16 --tensor-parallelism 1 --storage-type float16

# Build the model.
python3 build.py --model_dir=./c-model/gpt2/fp16/1-gpu \
                 --remove_input_padding \
                 --use_gpt_attention_plugin \
                 --enable_context_fmha \
                 --use_gemm_plugin \
                 --max_batch_size 8 \
                 --max_input_len 924 \
                 --max_output_len 100 \
                 --output_dir trt_engine/gpt2/fp16/1-gpu/ \
                 --hidden_act gelu
```

The summarization can be done using the [`../summarize.py`](../summarize.py) script as follows:

```bash
# Run the summarization task.
python3 ../summarize.py --engine_dir trt_engine/gpt2/fp16/1-gpu \
                        --hf_model_dir gpt2 \
                        --test_trt_llm \
                        --test_hf \
                        --batch_size 1 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=14 \
                        --no_add_special_tokens
```

## SmoothQuant

This section explains how to use SmoothQuant on GPT models with TensorRT-LLM.

### Overview

[SmoothQuant](https://arxiv.org/abs/2211.10438) is a post-training quantization
(PTQ) method to quantize LLM models to INT8 for faster inference. As explained
in the article, SmoothQuant modifies a model to enable INT8 quantization
without significantly altering the accuracy.

#### Model Transformation

An LLM model is made of multiple matrix-multiplication operations (or GEMMs): `Y
= XW` where `X`, of shape `[n, k]`, holds the activations (produced at run-time)
and `W`, of shape `[k, m]`, holds the learned weights. `Y`, of shape `[n, m]`, is
the matrix product of `X` and `W`.

SmoothQuant introduces scaling along the `k` dimension by defining a vector of
strictly positive coefficients `s`: `Y = X diag(s)^{-1} diag(s) W`. We now have
`Y = X'W'` where `X' = X diag(s)^{-1}` and `W' = diag(s) W`. This
transformation is introduced so the quantization behaves better. In *normal*
models, `X` tends to be ill-conditioned: it has mostly small-magnitude
coefficients, but also some outliers that make quantization difficult.
Conversely, the re-scaled `X'` is better suited for INT8 conversion.

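
To make the transformation concrete, here is a small numpy sketch (illustrative only, not the TensorRT-LLM implementation) that checks the identity `Y = X'W'` and shows how a smoothing scale, chosen here with the `alpha` heuristic from the SmoothQuant paper, shrinks the activation outliers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 4, 8, 6

# Activations with a couple of large outlier channels, plus weights.
X = rng.normal(size=(n, k)) * np.array([1, 1, 1, 50, 1, 1, 1, 20])
W = rng.normal(size=(k, m))

# Smoothing scale along the k dimension (alpha = 0.5, as in the paper).
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_prime = X / s           # X' = X diag(s)^{-1}
W_prime = s[:, None] * W  # W' = diag(s) W

# The product is unchanged: Y = XW = X'W'.
assert np.allclose(X @ W, X_prime @ W_prime)

# The re-scaled activations have a much smaller dynamic range.
print(np.abs(X).max(), np.abs(X_prime).max())
```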
In this example, we only replace Attention's QKV and MLP's FC1 GEMMs with their
SmoothQuant'd versions since that is sufficient to maintain the accuracy of the
GPT model. During inference, `X'` is computed by fusing the channel-wise
multiplication by `diag(s)^{-1}` with the preceding layernorm's lambda and beta
parameters. `W'` is pre-computed and doesn't need additional modification
during inference.

#### INT8 inference

The INT8 quantization scheme used in TensorRT-LLM theoretically works on any
GPT model. However, SmoothQuant'd models tend to produce more accurate results
with reduced precision.

INT8 inference modifies GEMMs `Y = XW` so that both `X` and `W` use INT8. The
matrix multiplication is sped up because of the smaller weight size and the fast matrix
product computation provided by NVIDIA *Tensor Cores* operating on INT8 inputs.

During inference, `X` is transformed from its standard floating point (fp)
values: `X_{i8} <- X_{fp} * s_x`. This scaling puts `X` values in the INT8
range: `[-128, 127]`. Similarly, `W` is scaled, `W_{i8} <- W_{fp} * s_w`, but that
operation is done at model export time; there is no need for subsequent operations at
run-time.

The optimized TensorRT-LLM GEMM implementation for SmoothQuant does the integer
matrix multiplication `Y_{i32} <- X_{i8} W_{i8}` and rescales the result to its
original range `Y_{fp} <- Y_{i32} * (s_x)^{-1} * (s_w)^{-1}`. Note that
`Y_{i32}` isn't stored in memory; the re-scaling happens in the GEMM's epilogue
and only `Y_{fp}` gets saved.

By default, `s_x` and `s_w` are single-value coefficients. This is the
*per-tensor* mode. Values for `s_x` and `s_w` are static, estimated at model
export time.

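
The following numpy sketch (illustrative only) walks through that per-tensor flow: quantize `X` and `W` with their static scales, run the matrix multiplication with INT32 accumulation, then rescale the result back to floating point in the epilogue:

```python
import numpy as np

rng = np.random.default_rng(0)
X_fp = rng.normal(size=(4, 8)).astype(np.float32)
W_fp = rng.normal(size=(8, 6)).astype(np.float32)

# Per-tensor static scales chosen so that values map into [-128, 127].
s_x = 127.0 / np.abs(X_fp).max()
s_w = 127.0 / np.abs(W_fp).max()

# Quantize: X_{i8} <- X_{fp} * s_x (at run-time), W_{i8} <- W_{fp} * s_w (at export time).
X_i8 = np.clip(np.round(X_fp * s_x), -128, 127).astype(np.int8)
W_i8 = np.clip(np.round(W_fp * s_w), -128, 127).astype(np.int8)

# Integer GEMM with INT32 accumulation, rescaled back to the original range.
Y_i32 = X_i8.astype(np.int32) @ W_i8.astype(np.int32)
Y_fp = Y_i32.astype(np.float32) / (s_x * s_w)

print(np.abs(Y_fp - X_fp @ W_fp).max())  # small quantization error
```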
TensorRT-LLM also supports more elaborate modes:

- per-channel: `s_w` is a fixed vector of size `[1, m]`. For that,
  TensorRT-LLM loads the adequately scaled version of `W_{i8}` at model
  construction time.
- per-token: `s_x` is a vector of size `[n, 1]` determined at run-time, based
  on the per-token (a.k.a. per-row) absolute maximum of `X`.

Users can mix and match the per-channel and per-token options. Both tend to
increase the accuracy of the model at the cost of a slightly increased latency. The sketch below shows the corresponding scale shapes.
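
As a rough numpy sketch of how those finer-grained scales differ in shape from the per-tensor ones (again illustrative, not the actual TensorRT-LLM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 4, 8, 6
X_fp = rng.normal(size=(n, k)).astype(np.float32)
W_fp = rng.normal(size=(k, m)).astype(np.float32)

# Per-channel: one static scale per output channel of W, shape [1, m].
s_w = 127.0 / np.abs(W_fp).max(axis=0, keepdims=True)

# Per-token: one scale per row of X, computed at run-time, shape [n, 1].
s_x = 127.0 / np.abs(X_fp).max(axis=1, keepdims=True)

X_i8 = np.clip(np.round(X_fp * s_x), -128, 127).astype(np.int8)
W_i8 = np.clip(np.round(W_fp * s_w), -128, 127).astype(np.int8)

# The INT32 result is rescaled element-wise by the outer product of the scales.
Y_i32 = X_i8.astype(np.int32) @ W_i8.astype(np.int32)
Y_fp = Y_i32.astype(np.float32) / (s_x * s_w)  # broadcasting: [n, 1] * [1, m]

print(np.abs(Y_fp - X_fp @ W_fp).max())  # small quantization error
```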
### Usage

#### SmoothQuant a HF model, export weights & scales for TensorRT-LLM

For SmoothQuant, [`hf_gpt_convert.py`](./hf_gpt_convert.py) features a
`--smoothquant, -sq` option. It must be set to a decimal value in `[0, 1]` and
corresponds to the `alpha` parameter in the [SmoothQuant
paper](https://arxiv.org/abs/2211.10438). Setting `-sq` will smooth the model
as explained in [model transformation](#model-transformation) and export the
scaling factors needed for INT8 inference.

Example:

```bash
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2-smooth --smoothquant 0.5 -t float16
```

#### Build TensorRT engine(s)

[`build.py`](./build.py) adds new options to support INT8 inference of SmoothQuant models.

`--use_smooth_quant` is the starting point of INT8 inference. By default, it
will run the model in the _per-tensor_ mode, as explained in [INT8
inference](#int8-inference).

Then, you can add any combination of `--per_token` and `--per_channel` to get the corresponding behaviors.

Examples of build invocations:

```bash
# Build model for SmoothQuant in the _per_tensor_ mode.
python3 build.py --model_dir=./c-model/gpt2-smooth/1-gpu \
                 --use_gpt_attention_plugin \
                 --use_smooth_quant

# Build model for SmoothQuant in the _per_token_ + _per_channel_ mode.
python3 build.py --model_dir=./c-model/gpt2-smooth/1-gpu \
                 --use_gpt_attention_plugin \
                 --use_smooth_quant \
                 --per_token \
                 --per_channel
```

Note that the GPT attention plugin is required to be enabled for SmoothQuant for now.

### INT8 KV Cache, export weights & scales for TensorRT-LLM

For INT8 KV cache, [`hf_gpt_convert.py`](./hf_gpt_convert.py) features a
`--calibrate-kv-cache, -kv` option. Setting `-kv` will calibrate the model and
export the scaling factors needed for INT8 KV cache inference.

Example:

```bash
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --calibrate-kv-cache -t float16
```

#### Build TensorRT engine(s)

[`build.py`](./build.py) adds new options to support INT8 KV cache.
`--int8_kv_cache` forces the KV cache to INT8. The INT8 KV cache can be used with or without the GPT attention plugin.

Example of a build invocation with the GPT attention plugin:

```bash
# Build model for GPT with int8 kv cache.
python3 build.py --model_dir=./c-model/gpt2/1-gpu \
                 --int8_kv_cache --remove_input_padding --use_gpt_attention_plugin float16
```

Example of a build invocation without the GPT attention plugin:

```bash
python3 build.py --model_dir=./c-model/gpt2/1-gpu --int8_kv_cache
```

## GPT-Next

NVIDIA has released a GPT-like model with some architectural improvements, which you can find here: [https://huggingface.co/nvidia/GPT-2B-001](https://huggingface.co/nvidia/GPT-2B-001).
This architecture is also supported by TensorRT-LLM.

### 1. Download weights from HuggingFace Transformers

```bash
wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
```

### 2. Convert weights from NeMo Checkpoint to FT format

TensorRT-LLM can convert a `.nemo` checkpoint to generic binary files with the [`nemo_ckpt_convert.py`](./nemo_ckpt_convert.py) script. For example:

```bash
python3 nemo_ckpt_convert.py -i GPT-2B-001_bf16_tp1.nemo -o ./c-model/gpt-next-2B --tensor-parallelism 1 --storage-type bfloat16
```

### 3. Build TensorRT engine(s)

```bash
# Build a single-GPU bfloat16 engine using FT weights.
# --use_gpt_attention_plugin must be set for GPT-Next since rotary positional embeddings (RoPE) are only supported by the GPT attention plugin at this time.
python3 build.py --model_dir=./c-model/gpt-next-2B/1-gpu \
                 --dtype bfloat16 \
                 --remove_input_padding \
                 --use_gpt_attention_plugin

# Build GPT-Next architecture engines using dummy weights, useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --vocab_size=256000 \
                 --n_layer=24 \
                 --n_embd=2048 \
                 --n_head=16 \
                 --max_batch_size=256 \
                 --dtype float16 \
                 --no_bias \
                 --hidden_act swiglu \
                 --rotary_pct 0.5 \
                 --remove_input_padding \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --output_dir=gpt-next-2B
```

### 4. Run

```bash
# Run the GPT-Next model on a single GPU. Use the custom tokenizer.
python3 ../run.py --max_output_len=8 \
                  --vocab_file=./c-model/gpt-next-2B/1-gpu/tokenizer.model \
                  --no_add_special_tokens
```

## Prompt-tuning

For efficient fine-tuning, the NeMo framework allows you to learn virtual tokens to accomplish a downstream task. For more details, please read the
NeMo documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/prompt_learning.html).

TensorRT-LLM supports inference with those virtual tokens. To enable it, pass the prompt embedding table's maximum size at build time with
`--max_prompt_embedding_table_size N`. For example:

```bash
# Build a GPT-Next model with prompt-tuning enabled.
python3 build.py --model_dir=./c-model/gpt-next-8B/1-gpu --remove_input_padding --use_gpt_attention_plugin --max_prompt_embedding_table_size 100
```

You can now export the learned embedding table with:

```bash
python3 nemo_prompt_convert.py -i email_composition.nemo -o email_composition.npy
```

It'll give you a summary of the different tasks in the table, which you can specify at runtime.

Finally, you can run inference on pre-defined tokens:

```bash
python3 ../run.py --input_file input.csv --prompt_table email_composition.npy --tasks 0 --max_output_len=8 --vocab_file=./c-model/gpt-next-8B/1-gpu/tokenizer.model --no_add_special_tokens
```

## Tensor Parallelism for Embedding Lookup Table

Since the embedding lookup table can be several gigabytes in size, we can distribute this weight across multiple GPUs in order to reduce the memory consumption per GPU.

### 1. Enable this feature

To enable this feature, add the flag `--use_parallel_embedding` to `build.py`.

### 2. Choose the dimension for tensor parallelism

Assuming the size of the embedding lookup table is (vocab\_size \* hidden\_size), we can shard it along the vocab\_size dimension (`--embedding_sharding_dim 0`) or the hidden\_size dimension (`--embedding_sharding_dim 1`), as sketched below.

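
As a quick illustration of what each choice means for the per-GPU shard shapes, here is a plain numpy sketch (not TensorRT-LLM code; the sizes are made up):

```python
import numpy as np

vocab_size, hidden_size, tp_size = 50257, 4096, 2

# A dummy embedding lookup table of shape (vocab_size, hidden_size),
# padded so that the vocab dimension divides evenly across ranks.
padded_vocab = ((vocab_size + tp_size - 1) // tp_size) * tp_size
table = np.zeros((padded_vocab, hidden_size), dtype=np.float16)

# --embedding_sharding_dim 0: shard along vocab_size, each rank holds a slice of the rows.
vocab_shards = np.split(table, tp_size, axis=0)
print(vocab_shards[0].shape)   # (25129, 4096)

# --embedding_sharding_dim 1: shard along hidden_size, each rank holds a slice of the columns.
hidden_shards = np.split(table, tp_size, axis=1)
print(hidden_shards[0].shape)  # (50258, 2048)
```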
2.1 To shard the embedding lookup table along the hidden\_size dimension, set the flags `--use_parallel_embedding --embedding_sharding_dim 1`. Here is an example:

```bash
python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype float16 --world_size=2 --remove_input_padding --use_gpt_attention_plugin float16 --parallel_build --max_input_len 1000 \
                 --use_parallel_embedding --embedding_sharding_dim 1 \
                 --output_dir=trt_engine/gpt2/float16/2-gpu
```

2.2 To shard the embedding lookup table along the vocab\_size dimension, set the flags `--use_parallel_embedding --embedding_sharding_dim 0`.

Meanwhile, we provide a lookup plugin to support tensor parallelism on the vocab\_size dimension.

- An example of sharding along the vocab\_size dimension with the lookup plugin:

```bash
python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype float16 --world_size=2 --remove_input_padding --use_gpt_attention_plugin float16 --parallel_build --max_input_len 1000 \
                 --use_parallel_embedding --embedding_sharding_dim 0 --use_lookup_plugin float16 \
                 --output_dir=trt_engine/gpt2/float16/2-gpu
```

- An example of sharding along the vocab\_size dimension without the lookup plugin:

```bash
python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype float16 --world_size=2 --remove_input_padding --use_gpt_attention_plugin float16 --parallel_build --max_input_len 1000 \
                 --use_parallel_embedding --embedding_sharding_dim 0 \
                 --output_dir=trt_engine/gpt2/float16/2-gpu
```

### 3. Embedding sharing

In some models, the embedding lookup table is used by both the embedding() and lm_head() layers. Sharing the embedding lookup table can reduce memory consumption.

With the flag `--use_embedding_sharing` for `build.py`, we will try to enable this feature. However, it only takes effect when the following criteria are met:
- The weight is shared between the two layers. If there is a separate weight for the lm_head() layer, we cannot enable it.
- For the multi-process case, `--use_parallel_embedding` must be set, and sharing is only supported when the embedding lookup table is sharded along the vocab dimension (`--embedding_sharding_dim 0`, which is the default value), which minimizes the overall communication cost.
- For TensorRT 9.0, the engine size is expected to be reduced when the lookup and GEMM plugins are enabled.

Here is an example of using the embedding parallelism and sharing features:

```bash
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 2 --storage-type bfloat16

python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype bfloat16 --world_size=2 --remove_input_padding --use_gpt_attention_plugin --use_gemm_plugin --parallel_build --max_input_len 1000 --use_parallel_embedding --embedding_sharding_dim 0 --use_lookup_plugin --use_embedding_sharing --output_dir=trt_engine/gpt2/bfloat16/2-gpu

mpirun -np 2 python3 ../summarize.py --engine_dir trt_engine/gpt2/bfloat16/2-gpu --hf_model_dir gpt2 --batch_size 10 --test_trt_llm --check_accuracy --tensorrt_llm_rouge1_threshold=14 --dataset_path ./dataset --no_add_special_tokens
```

### Run MultiLoRA with the NeMo checkpoint

```bash
git clone https://huggingface.co/nvidia/GPT-2B-001

python3 examples/gpt/nemo_ckpt_convert.py -i GPT-2B-001_bf16_tp1.nemo -o gpt-2b-fp16-weights-tp1-pp1 -tp 1 -p 4 -t float16

python3 examples/gpt/build.py --model_dir=gpt-2b-fp16-weights-tp1-pp1/1-gpu \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_inflight_batching \
    --paged_kv_cache \
    --output_dir gpt-2b-trt-fp16-tp1-pp1-test \
    --use_lora_plugin \
    --lora_target_modules attn_qkv \
    --max_batch_size 4 \
    --max_beam_width 2 \
    --max_input_len 512 \
    --max_output_len 50 \
    --log_level verbose

# Run inference directly from the NeMo LoRA checkpoint.
# --lora_task_uids correspond to the index of the models given with --lora_dir. -1 means no LoRA.

python3 examples/run.py --max_output_len=20 \
    --use_py_session \
    --vocab_file=gpt-2b-fp16-weights-tp1-pp1/1-gpu/tokenizer.model \
    --engine_dir gpt-2b-trt-fp16-tp1-pp1-test/ \
    --lora_dir gpt2b_lora-900.nemo gpt2b_lora-stories.nemo \
    --lora_task_uids 0 -1 1 \
    --lora_ckpt_source "nemo" \
    --no_add_special_tokens \
    --input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:" "You hold the job title in the Wizarding World of Harry Potter where you say random words looking for spells" "You hold the job title in the Wizarding World of Harry Potter where you say random words looking for spells"
```

#### Example output

* Note that in this case the adapters have only been trained for a few epochs, so the result quality is poor.

```
Input [Text 0]: "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:"
Output [Text 0 Beam 0]: "He surprised the Canadians on May 28 in what became known as the Battle of Jumonville"
Input [Text 1]: "You hold the job title in the Wizarding World of Harry Potter where you say random words looking for spells"
Output [Text 1 Beam 0]: ".

The game is played with a deck of cards, and the player who has the most"
Input [Text 2]: "You hold the job title in the Wizarding World of Harry Potter where you say random words looking for spells"
Output [Text 2 Beam 0]: ".

You are a wizard who is a wizard.

You are a wizard who is"
```