# BERT and BERT Variants
This document explains how to build the BERT model family, specifically the [BERT](https://huggingface.co/docs/transformers/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) models, using TensorRT-LLM. It also describes how to run inference on a single GPU and on two GPUs.

## Overview

The TensorRT-LLM BERT family implementation can be found in [`tensorrt_llm/models/bert/model.py`](../../tensorrt_llm/models/bert/model.py). The TensorRT-LLM BERT family example code is located in [`examples/bert`](./). There are two main files in that folder:

* [`convert_checkpoint.py`](./convert_checkpoint.py) converts the BERT model into the TensorRT-LLM checkpoint format.
* [`run.py`](./run.py) runs inference on an input text.
## Convert Weights
The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts weights from the HuggingFace format to the TensorRT-LLM format. Prepare the HuggingFace checkpoint files before running the conversion script; one way to fetch a checkpoint is sketched below.
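
If you do not already have a local checkpoint, the following sketch shows one way to fetch one. The `huggingface_hub` CLI and the `deepset/bert-base-cased-squad2` repository are illustrative choices, not requirements of this example; any compatible BERT checkpoint directory works.

```bash
# Illustrative only: download a question-answering BERT checkpoint from the HuggingFace Hub.
# Assumes the huggingface_hub CLI is installed (pip install -U "huggingface_hub[cli]").
huggingface-cli download deepset/bert-base-cased-squad2 --local-dir ./bert-base-cased-squad2
export hf_model_dir=./bert-base-cased-squad2
```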
Use `--model_dir` to specify the HuggingFace checkpoint directory.

Use `--model` to specify the target BERT model; the supported options are `BertModel`, `BertForQuestionAnswering`, and `BertForSequenceClassification`. Please note that if you choose `BertModel`, [`convert_checkpoint.py`](./convert_checkpoint.py) will ignore the BERT model class specified in the HuggingFace config file and issue a warning to remind you.

Use `--output_dir` to specify the directory for the converted checkpoint and its configuration. The default value is `./tllm_checkpoint`. This directory is used in the subsequent engine-building phase.

Take BertForQuestionAnswering as an example:
```bash
export hf_model_dir=<HuggingFace_Model_Path>
export model_name='bertqa'
export model='BertForQuestionAnswering'
export dtype='float16'

# convert
python convert_checkpoint.py \
    --model $model \
    --model_dir $hf_model_dir \
    --output_dir ${model_name}_${dtype}_tllm_checkpoint \
    --dtype $dtype

# convert tp=2
python convert_checkpoint.py \
    --model $model \
    --model_dir $hf_model_dir \
    --output_dir ${model_name}_${dtype}_tllm_checkpoint \
    --dtype $dtype \
    --tp_size 2
```
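After conversion, listing the checkpoint directory is a quick sanity check. The file names shown are illustrative; typically there is a `config.json` plus one weights file per rank (two files after the `--tp_size 2` conversion).

```bash
# Illustrative sanity check of the converted checkpoint directory.
ls ${model_name}_${dtype}_tllm_checkpoint
# Expected (illustrative): config.json plus per-rank weight files,
# e.g. rank0.safetensors (and rank1.safetensors after a tp_size=2 conversion).
```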
## Build TensorRT engine(s)
TensorRT-LLM builds TensorRT engine(s) from the converted checkpoint of a HuggingFace BERT family model. The basic build command is:
```bash
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir ${model_name}_engine_outputs
```
Besides the basic engine build, TensorRT-LLM provides the following features, enabled by adding flags to the basic build command:

- To use the BERT attention plugin, add `--bert_attention_plugin` to the command.
- To remove input padding, add `--remove_input_padding=enable` and `--bert_attention_plugin` to the command. Please note that the remove-input-padding feature requires the BERT attention plugin.
- To use FMHA kernels, add `--context_fmha=enable`, or `--bert_context_fmha_fp32_acc=enable` to enable FP32 accumulation. Note that these two flags should be used together with `--bert_attention_plugin`; a build variant using them is sketched after the example below.

Continuing the BertForQuestionAnswering example:
```bash
# Build TensorRT engine for BertForQuestionAnswering model, with remove_input_padding enabled.
# TP=1 and TP=2 share the same build command.
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir=${model_name}_engine_outputs \
             --remove_input_padding=enable \
             --bert_attention_plugin=${dtype} \
             --max_batch_size 8 \
             --max_input_len 512
```
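The example above enables the BERT attention plugin and removes input padding. As a rough sketch (not part of the original example), the FMHA flag mentioned earlier could be added as follows; the output directory name is only an illustrative choice.

```bash
# Illustrative variant: additionally enable FMHA kernels for the context phase.
# Reuses the checkpoint and variables from the conversion step above.
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir ${model_name}_fmha_engine_outputs \
             --remove_input_padding=enable \
             --bert_attention_plugin=${dtype} \
             --context_fmha=enable \
             --max_batch_size 8 \
             --max_input_len 512
```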
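Either build writes its engine(s) and a configuration file into the chosen output directory; listing it is a quick sanity check. The file names shown are illustrative; typically there is a `config.json` plus one engine file per rank.

```bash
# Illustrative sanity check of the built engine directory.
ls ${model_name}_engine_outputs
# Expected (illustrative): config.json plus per-rank engine files, e.g. rank0.engine.
```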
## Run TensorRT engine(s)
Run a TensorRT-LLM BERT model using the engine(s) generated by the build command above.

Note that during model deployment, only the TensorRT engine files are needed; the previously downloaded model checkpoint and the converted weights can be removed.

[`run.py`](./run.py) provides an example of performing inference and decoding the output. By default, it uses the task-specific dataset as input text, for example ['squad_v2'](https://huggingface.co/datasets/rajpurkar/squad_v2) for BertForQuestionAnswering.

To run the TensorRT engine, the basic command is:
```bash
# --hf_model_dir is used for loading the tokenizer.
python run.py --engine_dir ./${model_name}_engine_outputs \
              --hf_model_dir $hf_model_dir
```
Please note that:

- To use remove input padding, add `--remove_input_padding` to the command. This flag tells the runtime how to process the input and decode the output.
- To compare the results with the HuggingFace model, add `--run_hf_test` to the command. The runtime will load the HF model from `hf_model_dir` and compare the results. Refer to [`run.py`](./run.py) for more details.

Continuing the BertForQuestionAnswering example:
```bash
# Run TensorRT engine
python run.py \
    --engine_dir ./${model_name}_engine_outputs \
    --hf_model_dir=$hf_model_dir \
    --remove_input_padding \
    --run_hf_test

# Run TP=2 inference
mpirun -n 2 \
    python run.py \
        --engine_dir ./${model_name}_engine_outputs \
        --hf_model_dir=$hf_model_dir \
        --remove_input_padding \
        --run_hf_test
```