# BERT and BERT Variants
This document explains how to build the BERT family of models, specifically BERT and RoBERTa, using TensorRT-LLM. It also describes how to run inference on a single GPU and on two GPUs.
## Overview
The TensorRT-LLM BERT family implementation can be found in `tensorrt_llm/models/bert/model.py`. The TensorRT-LLM BERT family example code is located in `examples/bert`. There are two main files in that folder:

- `build.py` to build the TensorRT engine(s) needed to run the model,
- `run.py` to run inference on an input text.
## Build and run on a single GPU
TensorRT-LLM converts HuggingFace BERT family models into TensorRT engine(s). To build a TensorRT engine, use:

```bash
python3 build.py [--model <model_name> --dtype <data_type> ...]
```
Supported `model_name` options include: `BertModel`, `BertForQuestionAnswering`, `BertForSequenceClassification`, `RobertaModel`, `RobertaForQuestionAnswering`, and `RobertaForSequenceClassification`, with `BertModel` as the default.
Some examples are as follows:

```bash
# Build BertModel
python3 build.py --model BertModel --dtype=float16 --log_level=verbose

# Build RobertaModel
python3 build.py --model RobertaModel --dtype=float16 --log_level=verbose

# Build BertModel with the TensorRT-LLM BERT Attention plugin for enhanced runtime performance
python3 build.py --dtype=float16 --log_level=verbose --use_bert_attention_plugin float16

# Build BertForSequenceClassification with the TensorRT-LLM remove-input-padding knob for enhanced runtime performance
python3 build.py --model BertForSequenceClassification --remove_input_padding --use_bert_attention_plugin float16
```
The following command can be used to run the model on a single GPU:

```bash
python3 run.py
```
If the model was built with the `--remove_input_padding` knob, run it with the following command instead:

```bash
python3 run_remove_input_padding.py
```
## Fused MultiHead Attention (FMHA)
You can enable the FMHA kernels for BERT by adding `--enable_context_fmha` to the invocation of `build.py`. Note that FMHA is disabled by default because of possible accuracy issues due to the use of Flash Attention.

If the default fp16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can enable fp32 accumulation by adding `--enable_context_fmha_fp32_acc` instead. However, a performance drop is expected with fp32 accumulation.

Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` has to be used together with `--use_bert_attention_plugin float16`.
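For example, build invocations enabling FMHA might look like the following (a sketch combining the flags above; exact flag combinations may vary across versions):

```bash
# FMHA with the default fp16 accumulation
python3 build.py --dtype=float16 --use_bert_attention_plugin float16 --enable_context_fmha

# FMHA with fp32 accumulation (more accurate, but slower)
python3 build.py --dtype=float16 --use_bert_attention_plugin float16 --enable_context_fmha_fp32_acc
```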
## Remove input padding
The remove-input-padding feature is enabled by adding `--remove_input_padding` to the build command.

When input padding is removed, the tokens of different sequences are packed together, which reduces both the amount of computation and the memory consumption. For more details, see this document; a minimal sketch of the packing idea is shown below.

Currently, this feature is only enabled for the `BertForSequenceClassification` model.
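The following is a purely illustrative sketch of the packing idea in plain NumPy, not the actual TensorRT-LLM implementation; the padding token id and shapes are hypothetical. Padded sequences are flattened into one packed token stream plus per-sequence offsets:

```python
# Illustrative sketch of remove-input-padding (not TensorRT-LLM code):
# padded sequences are flattened into one packed token stream plus
# per-sequence length metadata.
import numpy as np

PAD_ID = 0  # hypothetical padding token id

padded = np.array([
    [101, 2023, 2003, 102, 0, 0],   # 4 real tokens, 2 pad
    [101, 7592,  102,   0, 0, 0],   # 3 real tokens, 3 pad
])

mask = padded != PAD_ID
packed_tokens = padded[mask]        # all real tokens, back to back
seq_lengths = mask.sum(axis=1)      # real length of each sequence
cu_seqlens = np.concatenate([[0], np.cumsum(seq_lengths)])  # offsets

print(packed_tokens)  # [ 101 2023 2003  102  101 7592  102]
print(cu_seqlens)     # [0 4 7]
```

The kernels then operate only on the packed tokens, using the offsets to know where each sequence begins and ends, so no computation or memory is spent on padding.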
## Build and run on two GPUs
The following two commands can be used to build the TensorRT engines needed to run BERT on two GPUs. The first command builds the engine for the first GPU; the second command builds the engine for the second GPU. For example, to build `BertForQuestionAnswering` for two GPUs, run:

```bash
python3 build.py --model BertForQuestionAnswering --world_size=2 --rank=0
python3 build.py --model BertForQuestionAnswering --world_size=2 --rank=1
```
The following command can be used to run inference on two GPUs. It uses MPI with `mpirun`:

```bash
mpirun -n 2 python3 run.py
```
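Conceptually, `--world_size` and `--rank` control how the model weights are partitioned across GPUs. The sketch below is purely illustrative NumPy, not TensorRT-LLM code, and the layer sizes are hypothetical; it shows the idea of column-splitting one linear layer's weight so each rank holds only its slice:

```python
# Illustrative tensor-parallel weight sharding (not TensorRT-LLM code):
# each rank keeps only its column slice of a linear layer's weight matrix.
import numpy as np

world_size = 2
hidden_size, ffn_size = 768, 3072   # hypothetical BERT-base-like sizes

full_weight = np.random.randn(hidden_size, ffn_size).astype(np.float32)

for rank in range(world_size):
    # Rank r owns output features [r*ffn/ws, (r+1)*ffn/ws)
    shard = np.split(full_weight, world_size, axis=1)[rank]
    print(f"rank {rank}: shard shape {shard.shape}")  # (768, 1536)
```

Each engine built with a given `--rank` embeds only that rank's shard, which is why one engine must be built per GPU before launching the job with `mpirun`.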