# BERT and BERT Variants
This document explains how to build the BERT model family, specifically the [BERT](https://huggingface.co/docs/transformers/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) models, using TensorRT-LLM. It also describes how to run inference on a single GPU and on two GPUs.

## Overview

The TensorRT-LLM BERT family implementation can be found in [`tensorrt_llm/models/bert/model.py`](../../tensorrt_llm/models/bert/model.py). The TensorRT-LLM BERT family example code is located in [`examples/bert`](./). There are two main files in that folder:

* [`convert_checkpoint.py`](./convert_checkpoint.py) converts the BERT model into the TensorRT-LLM checkpoint format.
* [`run.py`](./run.py) runs inference on an input text.
## Convert Weights
The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts weights from the HuggingFace format to the TensorRT-LLM format. Prepare the HuggingFace checkpoint files before running the conversion script; one way to fetch a checkpoint is sketched below.
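
If you do not already have a local checkpoint, the following sketch shows one way to fetch one. The `huggingface_hub` CLI and the `deepset/bert-base-cased-squad2` repository are illustrative choices, not requirements of this example; any compatible BERT checkpoint directory works.

```bash
# Illustrative only: download a question-answering BERT checkpoint from the HuggingFace Hub.
# Assumes the huggingface_hub CLI is installed (pip install -U "huggingface_hub[cli]").
huggingface-cli download deepset/bert-base-cased-squad2 --local-dir ./bert-base-cased-squad2
export hf_model_dir=./bert-base-cased-squad2
```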
Use `--model_dir` to specify the HuggingFace checkpoint directory.

Use `--model` to specify the target BERT model; the supported options are `BertModel`, `BertForQuestionAnswering`, and `BertForSequenceClassification`. Please note that if you choose `BertModel`, [`convert_checkpoint.py`](./convert_checkpoint.py) will ignore the BERT model class specified in the HuggingFace config file and issue a warning to remind you.

Use `--output_dir` to specify the directory for the converted checkpoint and its configuration. The default value is `./tllm_checkpoint`. This directory is used in the subsequent engine-building phase.

Take BertForQuestionAnswering as an example:
```bash
export hf_model_dir=<HuggingFace_Model_Path>
export model_name='bertqa'
export model='BertForQuestionAnswering'
export dtype='float16'

# convert
python convert_checkpoint.py \
    --model $model \
    --model_dir $hf_model_dir \
    --output_dir ${model_name}_${dtype}_tllm_checkpoint \
    --dtype $dtype

# convert tp=2
python convert_checkpoint.py \
    --model $model \
    --model_dir $hf_model_dir \
    --output_dir ${model_name}_${dtype}_tllm_checkpoint \
    --dtype $dtype \
    --tp_size 2
```
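After conversion, listing the checkpoint directory is a quick sanity check. The file names shown are illustrative; typically there is a `config.json` plus one weights file per rank (two files after the `--tp_size 2` conversion).

```bash
# Illustrative sanity check of the converted checkpoint directory.
ls ${model_name}_${dtype}_tllm_checkpoint
# Expected (illustrative): config.json plus per-rank weight files,
# e.g. rank0.safetensors (and rank1.safetensors after a tp_size=2 conversion).
```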
## Build TensorRT engine(s)
TensorRT-LLM builds TensorRT engine(s) from the converted checkpoint of a HuggingFace BERT family model. The basic build command is:
```bash
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir ${model_name}_engine_outputs
```
Besides the basic engine build, TensorRT-LLM provides the following features, enabled by adding flags to the basic build command:

- To use the BERT attention plugin, add `--bert_attention_plugin` to the command.
- To remove input padding, add `--remove_input_padding=enable` and `--bert_attention_plugin` to the command. Please note that the remove-input-padding feature requires the BERT attention plugin.
- To use FMHA kernels, add `--context_fmha=enable`, or `--bert_context_fmha_fp32_acc=enable` to enable FP32 accumulation. Note that these two flags should be used together with `--bert_attention_plugin`; a build variant using them is sketched after the example below.

Continuing the BertForQuestionAnswering example:
```bash
# Build TensorRT engine for BertForQuestionAnswering model, with remove_input_padding enabled.
# TP=1 and TP=2 share the same build command.
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir=${model_name}_engine_outputs \
             --remove_input_padding=enable \
             --bert_attention_plugin=${dtype} \
             --max_batch_size 8 \
             --max_input_len 512
```
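The example above enables the BERT attention plugin and removes input padding. As a rough sketch (not part of the original example), the FMHA flag mentioned earlier could be added as follows; the output directory name is only an illustrative choice.

```bash
# Illustrative variant: additionally enable FMHA kernels for the context phase.
# Reuses the checkpoint and variables from the conversion step above.
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir ${model_name}_fmha_engine_outputs \
             --remove_input_padding=enable \
             --bert_attention_plugin=${dtype} \
             --context_fmha=enable \
             --max_batch_size 8 \
             --max_input_len 512
```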
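Either build writes its engine(s) and a configuration file into the chosen output directory; listing it is a quick sanity check. The file names shown are illustrative; typically there is a `config.json` plus one engine file per rank.

```bash
# Illustrative sanity check of the built engine directory.
ls ${model_name}_engine_outputs
# Expected (illustrative): config.json plus per-rank engine files, e.g. rank0.engine.
```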
## Run TensorRT engine(s)
Run a TensorRT-LLM BERT model using the engine(s) generated by the build command above.

Note that during model deployment, only the TensorRT engine files are needed; the previously downloaded model checkpoint and the converted weights can be removed.

[`run.py`](./run.py) provides an example of performing inference and decoding the output. By default, it uses the task-specific dataset as input text, for example ['squad_v2'](https://huggingface.co/datasets/rajpurkar/squad_v2) for BertForQuestionAnswering.

To run the TensorRT engine, the basic command is:
```bash
# --hf_model_dir is used for loading the tokenizer.
python run.py --engine_dir ./${model_name}_engine_outputs \
              --hf_model_dir $hf_model_dir
```
Please note that:

- To use remove input padding, add `--remove_input_padding` to the command. This flag tells the runtime how to process the input and decode the output.
- To compare the results with the HuggingFace model, add `--run_hf_test` to the command. The runtime will load the HF model from `hf_model_dir` and compare the results. Refer to [`run.py`](./run.py) for more details.

Continuing the BertForQuestionAnswering example:
```bash
# Run TensorRT engine
python run.py \
    --engine_dir ./${model_name}_engine_outputs \
    --hf_model_dir=$hf_model_dir \
    --remove_input_padding \
    --run_hf_test

# Run TP=2 inference
mpirun -n 2 \
    python run.py \
        --engine_dir ./${model_name}_engine_outputs \
        --hf_model_dir=$hf_model_dir \
        --remove_input_padding \
        --run_hf_test
```