MPT

This document explains how to build the MPT model using TensorRT-LLM and run it on a single GPU, as well as on a single node with multiple GPUs.

Overview

Currently we use tensorrt_llm.models.GPTLMHeadModel to build the TRT engine for MPT models. float16, float32, and bfloat16 conversions are supported; just set the data type flag to the one you want.
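
For example, reusing the conversion flags shown in the steps below, a bfloat16 conversion could look like the following sketch (the output path is only a suggestion, and it assumes the -t flag also accepts bfloat16, as the overview suggests):

# Illustrative example: convert MPT-7B weights to bfloat16 instead of float32
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/bf16/ --tensor_parallelism 1 -t bfloat16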

MPT 7B

1. Convert weights from HF Transformers to FT format

The convert_hf_mpt_to_ft.py script allows you to convert weights from HF Transformers format to FT format.

python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/fp32/ --tensor_parallelism 4 -t float32

--tensor_parallelism 4 is used to convert to FT format with 4-way tensor parallelism
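
For a single-GPU engine (such as the fp16 example in the next step), convert without tensor parallelism. A minimal sketch, assuming the same flags and that the script appends a 1-gpu subdirectory to the output path as it does above:

# Single-GPU fp16 conversion, producing ./ft_ckpts/mpt-7b/fp16/1-gpu
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/fp16/ --tensor_parallelism 1 -t float16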

2. Build TensorRT engine(s)

Examples of build invocations:

# Build a single-GPU float16 engine using FT weights.
# ALiBi is not supported by the GPT attention plugin, so we can't use that plugin to increase runtime performance
python3 build.py --model_dir=./ft_ckpts/mpt-7b/fp16/1-gpu \
                 --dtype float16 \
                 --use_gemm_plugin float16 \
                 --world_size 1 \
                 --output_dir ./trt_engines/mpt-7b/fp16/1-gpu

# Build 4-GPU MPT-7B float32 engines
# Enable the gemm plugin to increase runtime performance; it also helps with build time.
python3 build.py --world_size=4 \
                 --log_level=verbose \
                 --max_batch_size 64 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --dtype float32 \
                 --use_gemm_plugin float32 \
                 --model_dir ./ft_ckpts/mpt-7b/fp32/4-gpu \
                 --output_dir=./trt_engines/mpt-7b/fp32/4-gpu 2>&1 | tee build.log
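
After the build finishes, you can sanity-check that the engines were written out. Exact file names depend on the build configuration, but a 4-way build should produce one engine per rank plus a config file:

# List the generated engine files
ls -lh ./trt_engines/mpt-7b/fp32/4-gpu/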

3. Run TRT engine to check if the build was correct

# Run the 4-GPU MPT-7B TRT engine on a sample input prompt
mpirun -n 4 --allow-run-as-root python run.py --engine_dir ./trt_engines/mpt-7b/fp32/4-gpu/ --max_output_len 10
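
If you built the single-GPU fp16 engine above instead, the same script can be run without mpirun, for example:

# Run the single-GPU MPT-7B fp16 engine on the sample prompt
python run.py --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu/ --max_output_len 10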

4. Run HF model to compare outputs for the same input

# Run the MPT-7B fp32 HF model on the same input prompt to compare outputs
# Try different prompts by setting --input_text
python run_hf.py --data_type fp32 --max_output_len 10 --model_dir mosaicml/mpt-7b
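
For example, to try a custom prompt (the prompt text here is just a placeholder):

# Override the default prompt with --input_text
python run_hf.py --data_type fp32 --max_output_len 10 --model_dir mosaicml/mpt-7b --input_text "What is the capital of France?"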

MPT 30B

The same commands can be adapted to convert MPT-30B to the TRT-LLM format. Below is an example of building a 4-way tensor-parallel MPT-30B fp16 TRT engine.

1. Convert weights from HF Transformers to FT format

The convert_hf_mpt_to_ft.py script allows you to convert weights from HF Transformers format to FT format.

python convert_hf_mpt_to_ft.py -i mosaicml/mpt-30b \
                               -o ./ft_ckpts/mpt-30b/fp16/ \
                               --tensor_parallelism 4 -t float16

--tensor_parallelism 4 is used to convert to FT format with 4-way tensor parallelism

2. Build TensorRT engine(s)

Examples of build invocations:

# Build 4-GPU MPT-30B float16 engines
# ALiBi is not supported by the GPT attention plugin, so we can't use that plugin to increase runtime performance
python3 build.py --world_size=4 \
                 --max_batch_size 64 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --dtype float16 \
                 --use_gemm_plugin float16 \
                 --model_dir ./ft_ckpts/mpt-30b/fp16/4-gpu \
                 --output_dir=./trt_engines/mpt-30b/fp16/4-gpu

3. Run TRT engine to check if the build was correct

# Run the 4-GPU MPT-30B TRT engine on a sample input prompt
mpirun -n 4 --allow-run-as-root python run.py --engine_dir ./trt_engines/mpt-30b/fp16/4-gpu/ --max_output_len 10

4. Run HF model to compare outputs for same input

# Run the MPT-30B fp16 HF model on the same input prompt to compare outputs
# Try different prompts by setting --input_text
python run_hf.py --data_type fp16 --max_output_len 10 --model_dir mosaicml/mpt-30b