# Phi

This document explains how to build the [Phi](https://huggingface.co/microsoft/phi-2) model using TensorRT-LLM and run it on a single GPU.

Note that both [Phi-2](https://huggingface.co/microsoft/phi-2) and [Phi-1.5](https://huggingface.co/microsoft/phi-1_5/) are supported.

## Overview

The TensorRT-LLM Phi implementation can be found in [`tensorrt_llm/models/phi/model.py`](../../tensorrt_llm/models/phi/model.py). The TensorRT-LLM Phi example code is located in [`examples/phi`](./). There is one main file:

* [`build.py`](./build.py) to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the Phi model.

In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:

* [`../run.py`](../run.py) to run the inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.

## Support Matrix

* FP16
* BF16
* Tensor Parallel

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/microsoft/phi-2
```
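
Since Phi-1.5 is supported as well, you can clone it the same way if you prefer the smaller model:

```bash
# Phi-1.5 checkpoint (supported alongside Phi-2).
git clone https://huggingface.co/microsoft/phi-1_5
```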

### 2. Build TensorRT engine(s)

TensorRT-LLM builds TensorRT engine(s) using an HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build the engine(s) using dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using a single GPU and HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --output_dir=phi_engine \
                 --model_dir=phi_model 2>&1 | tee build.log

# Build a bfloat16 engine using a single GPU and HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=bfloat16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin bfloat16 \
                 --use_gemm_plugin bfloat16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --output_dir=phi_engine \
                 --model_dir=phi_model 2>&1 | tee build.log

# Build a float16 engine using a single GPU and dummy weights.
# Using dummy weights is useful for performance tests.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --output_dir=phi_engine_dummy_weights 2>&1 | tee build.log

# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --world_size=2 \
                 --output_dir=phi_engine_tp2 \
                 --model_dir=phi_model 2>&1 | tee build_tp2.log
```
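
Assuming the build succeeds, the output directory contains the serialized engine(s) plus the build configuration; the exact file names vary by TensorRT-LLM version, so the listing below is only a quick sanity check:

```bash
# Inspect the build output (expect one or more serialized engine files and a config JSON).
ls -lh phi_engine
```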

#### Fused MultiHead Attention (FMHA)

You can enable the FMHA kernels by adding `--enable_context_fmha` to the invocation of `build.py`. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.

If the default fp16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can try enabling fp32 accumulation by adding `--enable_context_fmha_fp32_acc` instead. However, it may have an impact on performance.

Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` must be used together with `--use_gpt_attention_plugin float16`.
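
For example, on top of the float16 single-GPU build shown above, an FMHA-enabled invocation could look like this (`phi_engine_fmha` is an illustrative output directory name):

```bash
# Rebuild the float16 engine with FMHA enabled; swap in
# --enable_context_fmha_fp32_acc if fp16 accumulation is not accurate enough.
python3 build.py --dtype=float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --enable_context_fmha \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --output_dir=phi_engine_fmha \
                 --model_dir=phi_model
```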

### 3. Run

To run a TensorRT-LLM Phi model using the engines generated by `build.py`:

```bash
# For a single GPU
python3 ../run.py --max_output_len=50 --engine_dir=phi_engine --tokenizer_dir=phi_model

# For 2-way tensor parallelism
mpirun -n 2 --allow-run-as-root python3 ../run.py --max_output_len=50 --engine_dir=phi_engine_tp2 --tokenizer_dir=phi_model
```
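
The shared `run.py` script also accepts a custom prompt; assuming your checkout exposes the `--input_text` flag (as the shared example runner does in recent releases), you can pass your own input:

```bash
# Run inference on a custom prompt instead of the script's default input.
python3 ../run.py --max_output_len=50 \
                  --engine_dir=phi_engine \
                  --tokenizer_dir=phi_model \
                  --input_text "Explain what a TensorRT engine is."
```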

### 4. Summarization using the Phi model

The following section describes how to run a TensorRT-LLM Phi model to summarize the articles from the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF Phi model.

As previously explained, the first step is to build the TensorRT engine as described above using HF weights. You also have to install the requirements:

```bash
pip install -r requirements.txt
```

The summarization can be done using the [`../summarize.py`](../summarize.py) script as follows:

```bash
# Run the summarization task using a TensorRT-LLM model and a single GPU.
python3 ../summarize.py --engine_dir phi_engine \
                        --hf_model_dir phi_model \
                        --batch_size 1 \
                        --test_trt_llm \
                        --tensorrt_llm_rouge1_threshold 14 \
                        --data_type fp16 \
                        --check_accuracy 2>&1 | tee summary_trt_llm.log

# Run the summarization task using an HF model and a single GPU.
python3 ../summarize.py --engine_dir phi_engine \
                        --hf_model_dir phi_model \
                        --batch_size 1 \
                        --test_hf \
                        --tensorrt_llm_rouge1_threshold 14 \
                        --data_type fp16 \
                        --check_accuracy 2>&1 | tee summary_hf.log

# Run the summarization task using a TensorRT-LLM model and 2-way tensor parallelism.
mpirun -n 2 --allow-run-as-root \
    python3 ../summarize.py --engine_dir phi_engine_tp2 \
                            --hf_model_dir phi_model \
                            --batch_size 1 \
                            --test_trt_llm \
                            --tensorrt_llm_rouge1_threshold 14 \
                            --data_type fp16 \
                            --check_accuracy 2>&1 | tee summary_trt_llm_tp2.log
```
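
Since both runs tee their output to log files, you can compare the reported metrics afterwards; the grep pattern below is illustrative and depends on the exact log format emitted by `summarize.py`:

```bash
# Pull the ROUGE score lines out of the TensorRT-LLM and HF summarization logs.
grep -i "rouge" summary_trt_llm.log summary_hf.log
```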