# OPT

This document explains how to build the [OPT](https://huggingface.co/docs/transformers/model_doc/opt) model using TensorRT-LLM and run it on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.

## Overview

The TensorRT-LLM OPT implementation can be found in [`tensorrt_llm/models/opt/model.py`](../../tensorrt_llm/models/opt/model.py). The TensorRT-LLM OPT example code is located in [`examples/opt`](./). There is one file:

* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format

In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:

* [`../run.py`](../run.py) to run the inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.

## Support Matrix

* FP16
* INT8 & INT4 Weight-Only
* Tensor Parallel

## Usage

The next two sections describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.

### 1. Download weights from HuggingFace Transformers

Make sure `git-lfs` is properly installed before downloading the checkpoints.

```bash
pip install -r requirements.txt && sudo apt-get install git-lfs
```

There are four different checkpoints available. Use one of the following commands to fetch the checkpoint you are interested in.

```bash
# OPT-125M
git-lfs clone https://huggingface.co/facebook/opt-125m

# OPT-350M
git-lfs clone https://huggingface.co/facebook/opt-350m

# OPT-2.7B
git-lfs clone https://huggingface.co/facebook/opt-2.7b

# OPT-66B
git-lfs clone https://huggingface.co/facebook/opt-66b
```

### 2. Convert weights from HF Transformers to TensorRT-LLM format

```bash
# OPT-125M
python3 convert_checkpoint.py --model_dir ./opt-125m \
        --dtype float16 \
        --output_dir ./opt/125M/trt_ckpt/fp16/1-gpu/

# OPT-350M
python3 convert_checkpoint.py --model_dir ./opt-350m \
        --dtype float16 \
        --output_dir ./opt/350M/trt_ckpt/fp16/1-gpu/

# OPT-2.7B
python3 convert_checkpoint.py --model_dir ./opt-2.7b \
        --dtype float16 \
        --output_dir ./opt/2.7B/trt_ckpt/fp16/1-gpu/

# OPT-66B
python3 convert_checkpoint.py --model_dir ./opt-66b \
        --dtype float16 \
        --tp_size 4 \
        --output_dir ./opt/66B/trt_ckpt/fp16/4-gpu/ \
        --workers 2
```

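The support matrix above also lists INT8 and INT4 weight-only quantization. As a minimal sketch (assuming `convert_checkpoint.py` exposes the common TensorRT-LLM weight-only options `--use_weight_only` and `--weight_only_precision`; check `python3 convert_checkpoint.py --help` for the exact flags), an INT8 weight-only conversion could look like:

```bash
# OPT-125M with INT8 weight-only quantization (flag names assumed, verify with --help)
python3 convert_checkpoint.py --model_dir ./opt-125m \
        --dtype float16 \
        --use_weight_only \
        --weight_only_precision int8 \
        --output_dir ./opt/125M/trt_ckpt/int8_wo/1-gpu/
```

The resulting checkpoint is then built into an engine with the same `trtllm-build` command shown in the next step, pointed at the new checkpoint directory.
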
### 3. Build TensorRT engine(s)

```bash
# OPT-125M
trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/1-gpu/ \
        --gemm_plugin float16 \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/125M/trt_engines/fp16/1-gpu/

# OPT-350M
trtllm-build --checkpoint_dir ./opt/350M/trt_ckpt/fp16/1-gpu/ \
        --gemm_plugin float16 \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/350M/trt_engines/fp16/1-gpu/

# OPT-2.7B
trtllm-build --checkpoint_dir ./opt/2.7B/trt_ckpt/fp16/1-gpu/ \
        --gemm_plugin float16 \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/2.7B/trt_engines/fp16/1-gpu/

# OPT-66B
trtllm-build --checkpoint_dir ./opt/66B/trt_ckpt/fp16/4-gpu/ \
        --gemm_plugin float16 \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/66B/trt_engines/fp16/4-gpu/ \
        --workers 2
```

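After the engines are built, [`../run.py`](../run.py) mentioned in the overview can be used for a quick inference check on an input text. A minimal sketch, assuming the shared script's usual `--engine_dir`, `--tokenizer_dir`, `--max_output_len`, and `--input_text` arguments (see `python3 ../run.py --help`):

```bash
# Run inference with the OPT-125M engine (argument names assumed, verify with --help)
python3 ../run.py --engine_dir ./opt/125M/trt_engines/fp16/1-gpu/ \
        --tokenizer_dir ./opt-125m \
        --max_output_len 32 \
        --input_text "Deep learning is"
```
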
### 4. Summarization using the OPT model

The following section describes how to run a TensorRT-LLM OPT model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF OPT model.

```bash
# OPT-125M
python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/1-gpu/ \
        --test_hf \
        --batch_size 1 \
        --test_trt_llm \
        --hf_model_dir opt-125m \
        --data_type fp16 \
        --check_accuracy \
        --tensorrt_llm_rouge1_threshold=14

# OPT-350M
python3 ../summarize.py --engine_dir ./opt/350M/trt_engines/fp16/1-gpu/ \
        --test_hf \
        --batch_size 1 \
        --test_trt_llm \
        --hf_model_dir opt-350m \
        --data_type fp16 \
        --check_accuracy \
        --tensorrt_llm_rouge1_threshold=20

# OPT-2.7B
python3 ../summarize.py --engine_dir ./opt/2.7B/trt_engines/fp16/1-gpu/ \
        --test_hf \
        --batch_size 1 \
        --test_trt_llm \
        --hf_model_dir opt-2.7b \
        --data_type fp16 \
        --check_accuracy \
        --tensorrt_llm_rouge1_threshold=20

# OPT-66B
mpirun -n 4 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/66B/trt_engines/fp16/4-gpu/ \
            --batch_size 1 \
            --test_trt_llm \
            --hf_model_dir opt-66b \
            --data_type fp16 \
            --check_accuracy \
            --tensorrt_llm_rouge1_threshold=20
```

#### Fused MultiHead Attention (FMHA)

You can enable the FMHA kernels for OPT by adding `--context_fmha enable` to the invocation of `trtllm-build`. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.

If the default FP16 accumulation used by the FMHA kernels does not meet the accuracy requirement, you can enable FP32 accumulation by adding `--context_fmha_fp32_acc enable`. However, a performance drop is expected.

Note that `--context_fmha enable` / `--context_fmha_fp32_acc enable` must be used together with `--gpt_attention_plugin float16`.

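As a sketch, the OPT-125M engine from step 3 could be rebuilt with FMHA enabled by combining the flags above with the original build command (flag spellings taken from the notes above; check `trtllm-build --help` for the exact options):

```bash
# Rebuild OPT-125M with the fused multi-head attention kernels enabled
trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/1-gpu/ \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16 \
        --context_fmha enable \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/125M/trt_engines/fp16-fmha/1-gpu/
# Add --context_fmha_fp32_acc enable to the command above if FP16 accumulation is not accurate enough.
```
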
## Tensor Parallelism for Embedding Lookup Table

Since the embedding lookup table can be several gigabytes in size, we can distribute this weight across multiple GPUs in order to reduce the memory consumption per GPU.

### 1. Enable this feature

To enable this feature, add the flag `--use_parallel_embedding` to `convert_checkpoint.py`, as shown in the examples below.

### 2. Choose the dimension for tensor parallelism

Assuming the size of the embedding lookup table is (vocab\_size \* hidden\_size), we can shard it along the vocab\_size (`--embedding_sharding_dim 0`) or the hidden\_size (`--embedding_sharding_dim 1`) dimension.

2.1 To shard the embedding lookup table along the hidden\_size dimension, set the flags `--use_parallel_embedding --embedding_sharding_dim 1`. Here is an example:

```bash
python3 convert_checkpoint.py --model_dir ./opt-125m \
        --dtype float16 \
        --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
        --tp_size 2 \
        --use_parallel_embedding \
        --embedding_sharding_dim 1
```

2.2 To shard the embedding lookup table along the vocab\_size dimension, set the flags `--use_parallel_embedding --embedding_sharding_dim 0`.

In addition, a lookup plugin is provided to support tensor parallelism along the vocab\_size dimension.

- An example of sharding along the vocab\_size dimension with the lookup plugin:

```bash
python3 convert_checkpoint.py --model_dir ./opt-125m \
        --dtype float16 \
        --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
        --tp_size 2 \
        --use_parallel_embedding \
        --embedding_sharding_dim 0

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
        --gemm_plugin float16 \
        --lookup_plugin float16 \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
        --workers 2

mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
            --batch_size 1 \
            --test_trt_llm \
            --hf_model_dir opt-125m \
            --data_type fp16 \
            --check_accuracy \
            --tensorrt_llm_rouge1_threshold=14
```

- An example of sharding along the vocab\_size dimension without the lookup plugin:

```bash
trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
        --gemm_plugin float16 \
        --max_batch_size 8 \
        --max_input_len 924 \
        --max_output_len 100 \
        --output_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
        --workers 2
```