# MPT

This document explains how to build the [MPT](https://huggingface.co/mosaicml/mpt-7b) model using TensorRT-LLM and run it on a single GPU, as well as on a single node with multiple GPUs.

## Overview

Currently we use `tensorrt_llm.models.GPTLMHeadModel` to build TRT engines for MPT models.

Conversion to float16, float32, and bfloat16 is supported; just change the `data_type` flag (`-t` in the examples below) to the precision you want.
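
For example, a bfloat16 conversion changes only the `-t` value (a sketch using the conversion script from step 1 below; the `bf16` output path is illustrative):

```bash
# bfloat16 conversion: same script as in step 1, different -t value.
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/bf16/ -t bfloat16
```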

## Support Matrix

* FP16
* FP8
* INT8 & INT4 Weight-Only
* FP8 KV CACHE
* Tensor Parallel
* STRONGLY TYPED
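
Most of the features above are toggled through `build.py` flags at engine-build time. As a sketch, an INT8 weight-only build might look like the following; the `--use_weight_only` and `--weight_only_precision` flag names are borrowed from other TensorRT-LLM example `build.py` scripts and are an assumption for this one:

```bash
# Hypothetical INT8 weight-only build (flag names assumed from other
# TensorRT-LLM example build.py scripts; verify with build.py --help).
python3 build.py --model_dir=./ft_ckpts/mpt-7b/fp16/1-gpu \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --use_weight_only \
                 --weight_only_precision int8 \
                 --output_dir ./trt_engines/mpt-7b/int8-wo/1-gpu
```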

#### MPT 7B

### 1. Convert weights from HF Transformers to FT format

The [`convert_hf_mpt_to_ft.py`](./convert_hf_mpt_to_ft.py) script allows you to convert weights from HF Transformers format to FT format.

```bash
# Single-GPU float16 conversion.
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/fp16/ -t float16

# 4-way tensor-parallel float32 conversion.
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/fp32/ --tensor_parallelism 4 -t float32
```

`--tensor_parallelism 4` converts the checkpoint to FT format with 4-way tensor parallelism.
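
Judging from the `--model_dir` paths used in the next step, the converter writes one subdirectory per tensor-parallel size under the output directory (layout inferred from this document; exact file names depend on the converter):

```bash
# The converter writes one subdirectory per tensor-parallel size
# (inferred from the --model_dir paths used in step 2).
ls ./ft_ckpts/mpt-7b/fp16    # -> 1-gpu
ls ./ft_ckpts/mpt-7b/fp32    # -> 4-gpu
```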

### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build a single-GPU float16 engine using FT weights.
python3 build.py --model_dir=./ft_ckpts/mpt-7b/fp16/1-gpu \
                 --max_batch_size 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --output_dir ./trt_engines/mpt-7b/fp16/1-gpu

# Build 4-GPU MPT-7B float32 engines.
# Enable several TensorRT-LLM plugins to increase runtime performance; this also helps with build time.
python3 build.py --world_size=4 \
                 --parallel_build \
                 --max_batch_size 64 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --model_dir ./ft_ckpts/mpt-7b/fp32/4-gpu \
                 --output_dir=./trt_engines/mpt-7b/fp32/4-gpu
```
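
A successful build should leave serialized engines in each `--output_dir`; a quick sanity check (the exact file names, e.g. a `config.json` plus one engine file per rank, are an assumption based on other TensorRT-LLM examples, not verified here):

```bash
# List the build outputs; expect a build config plus one engine per rank
# (file naming is an assumption based on other TensorRT-LLM examples).
ls ./trt_engines/mpt-7b/fp16/1-gpu
ls ./trt_engines/mpt-7b/fp32/4-gpu
```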

### 3. Run TRT engine to check if the build was correct

```bash
# Run the single-GPU MPT-7B engine.
python run.py --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu/ --max_output_len 10

# Run the 4-GPU MPT-7B engine on a sample input prompt.
mpirun -n 4 --allow-run-as-root python run.py --engine_dir ./trt_engines/mpt-7b/fp32/4-gpu/ --max_output_len 10
```
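
To try your own prompt, the `run.py` scripts in other TensorRT-LLM examples accept an `--input_text` flag; whether this example's `run.py` exposes the same flag is an assumption to verify:

```bash
# Hypothetical custom-prompt run: --input_text is assumed from other
# TensorRT-LLM example run.py scripts; check this example's run.py --help.
python run.py --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu/ \
              --max_output_len 10 \
              --input_text "What is the capital of France?"
```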

#### MPT 30B

The same commands can be adapted to convert MPT 30B to the TRT-LLM format. Below is an example that builds a 4-way tensor-parallel MPT-30B fp16 TRT engine.

### 1. Convert weights from HF Transformers to FT format

The [`convert_hf_mpt_to_ft.py`](./convert_hf_mpt_to_ft.py) script allows you to convert weights from HF Transformers format to FT format.

```bash
# 4-way tensor-parallel float16 conversion of MPT-30B.
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-30b -o ./ft_ckpts/mpt-30b/fp16/ --tensor_parallelism 4 -t float16
```

`--tensor_parallelism 4` converts the checkpoint to FT format with 4-way tensor parallelism.

### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build 4-GPU MPT-30B float16 engines.
# MPT uses ALiBi, which the GPT attention plugin supports, so the plugin is
# enabled here (together with the GEMM plugin) to increase runtime performance.
python3 build.py --world_size=4 \
                 --parallel_build \
                 --max_batch_size 64 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --model_dir ./ft_ckpts/mpt-30b/fp16/4-gpu \
                 --output_dir=./trt_engines/mpt-30b/fp16/4-gpu
```

### 3. Run TRT engine to check if the build was correct

```bash
# Run the 4-GPU MPT-30B engine on a sample input prompt.
mpirun -n 4 --allow-run-as-root python run.py --engine_dir ./trt_engines/mpt-30b/fp16/4-gpu/ --max_output_len 10
```