# MPT

This document explains how to build the MPT model using TensorRT-LLM and how to run it on a single GPU, or on a single node with multiple GPUs.

## Overview

Currently, we use `tensorrt_llm.models.GPTLMHeadModel` to build TensorRT engines for MPT models. Conversion to `float16`, `float32`, and `bfloat16` is supported; just set the data-type flag (`-t`) accordingly.
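
For example, a bfloat16 conversion differs only in the value passed to `-t` (a minimal sketch; the output path `./ft_ckpts/mpt-7b/bf16/` is an illustrative choice):

```bash
# Same converter as in the sections below; only the -t value changes.
# The output directory ./ft_ckpts/mpt-7b/bf16/ is an assumed example location.
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/bf16/ -t bfloat16
```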

## Support Matrix

  * FP16
  * FP8
  * INT8 & INT4 Weight-Only
  * FP8 KV CACHE
  * Tensor Parallel
  * MHA, MQA & GQA
  * STRONGLY TYPED

## MPT 7B

### 1. Convert weights from HF Transformers to FT format

The `convert_hf_mpt_to_ft.py` script allows you to convert weights from HF Transformers format to FT format.

```bash
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/fp16/ -t float16

python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o ./ft_ckpts/mpt-7b/fp32/ --tensor_parallelism 4 -t float32
```

`--tensor_parallelism 4` is used to convert to FT format with 4-way tensor parallelism.

### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build a single-GPU float16 engine using FT weights.
python3 build.py --model_dir=./ft_ckpts/mpt-7b/fp16/1-gpu \
                 --max_batch_size 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --output_dir ./trt_engines/mpt-7b/fp16/1-gpu

# Build 4-GPU MPT-7B float32 engines.
# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --world_size=4 \
                 --parallel_build \
                 --max_batch_size 64 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --model_dir ./ft_ckpts/mpt-7b/fp32/4-gpu \
                 --output_dir=./trt_engines/mpt-7b/fp32/4-gpu
```

### 3. Run TRT engine to check if the build was correct

```bash
python run.py --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu/ --max_output_len 10

# Run the 4-GPU MPT-7B TRT engine on a sample input prompt
mpirun -n 4 --allow-run-as-root python run.py --engine_dir ./trt_engines/mpt-7b/fp32/4-gpu/ --max_output_len 10
```
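
`run.py` can also take a custom prompt and tokenizer; the flags below are the same ones used in the Replit Code example later in this document, while the prompt text and tokenizer location are illustrative assumptions:

```bash
# Custom prompt on the single-GPU FP16 engine (sketch).
# --input_text and --tokenizer are the flags used in the Replit Code example below;
# the prompt and the tokenizer location (mosaicml/mpt-7b) are assumptions.
python run.py --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu/ \
              --max_output_len 32 \
              --input_text "The capital of France is" \
              --tokenizer mosaicml/mpt-7b
```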

## MPT 30B

The same commands can be used to convert MPT 30B to the TensorRT-LLM format. Below is an example of building an MPT-30B FP16 engine with 4-way tensor parallelism.

### 1. Convert weights from HF Transformers to FT format

The `convert_hf_mpt_to_ft.py` script allows you to convert weights from HF Transformers format to FT format.

```bash
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-30b -o ./ft_ckpts/mpt-30b/fp16/ --tensor_parallelism 4 -t float16
```

`--tensor_parallelism 4` is used to convert to FT format with 4-way tensor parallelism.

### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build 4-GPU MPT-30B float16 engines
python3 build.py --world_size=4 \
                 --parallel_build \
                 --max_batch_size 64 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --model_dir ./ft_ckpts/mpt-30b/fp16/4-gpu \
                 --output_dir=./trt_engines/mpt-30b/fp16/4-gpu
```

### 3. Run TRT engine to check if the build was correct

```bash
# Run the 4-GPU MPT-30B TRT engine on a sample input prompt
mpirun -n 4 --allow-run-as-root python run.py --engine_dir ./trt_engines/mpt-30b/fp16/4-gpu/ --max_output_len 10
```

## Replit Code V-1.5 3B

The same commands can be used to convert Replit Code V-1.5 3B to the TensorRT-LLM format. Below is an example of building a Replit Code V-1.5 3B bfloat16 engine with 2-way tensor parallelism.

### 1. Convert weights from HF Transformers to FT format

The `convert_hf_mpt_to_ft.py` script allows you to convert weights from HF Transformers format to FT format.

```bash
python convert_hf_mpt_to_ft.py -i ./replit-code-v1_5-3b -o ./ft_ckpts/replit-code-v1_5-3b/bf16/ --tensor_parallelism 2 -t bfloat16
```

`--tensor_parallelism 2` is used to convert to FT format with 2-way tensor parallelism.

### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build 2-GPU Replit Code V-1.5 3B bfloat16 engines
python3 build.py --world_size=2 \
                 --parallel_build \
                 --max_batch_size 16 \
                 --max_input_len 512 \
                 --max_output_len 64 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --model_dir ./ft_ckpts/replit-code-v1_5-3b/bf16/2-gpu \
                 --output_dir=./trt_engines/replit-code-v1_5-3b/bf16/2-gpu
```

Here is the partial output of the above command.

```
[11/15/2023-02:47:50] [TRT] [I] Total Activation Memory: 738233344
[11/15/2023-02:47:51] [TRT] [I] Total Weights Memory: 3523622456
[11/15/2023-02:47:51] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 8316, GPU 5721 (MiB)
[11/15/2023-02:47:51] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +64, now: CPU 8316, GPU 5785 (MiB)
[11/15/2023-02:47:51] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 192 MiB, GPU 3361 MiB
[11/15/2023-02:47:51] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +3361, now: CPU 0, GPU 3361 (MiB)
[11/15/2023-02:47:51] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 12851 MiB
[11/15/2023-02:47:51] [TRT-LLM] [I] Total time of building gpt_bfloat16_tp2_rank1.engine: 00:00:04
[11/15/2023-02:47:51] [TRT-LLM] [I] Serializing engine to trt_engines/replit-code-v1_5-3b/bf16/2-gpu/gpt_bfloat16_tp2_rank1.engine...
[11/15/2023-02:48:02] [TRT-LLM] [I] Engine serialized. Total time: 00:00:10
[11/15/2023-02:48:02] [TRT-LLM] [I] Timing cache serialized to model.cache
[11/15/2023-02:48:02] [TRT-LLM] [I] Total time of building all 2 engines: 00:01:21
```

### 3. Run TRT engine to check if the build was correct

```bash
# Run 2-GPU Replit Code V-1.5 3B TRT engine on a sample input prompt
mpirun -n 2 --allow-run-as-root python run.py --engine_dir ./trt_engines/replit-code-v1_5-3b/bf16/2-gpu/ --max_output_len 64 --input_text "def fibonacci" --tokenizer ./replit-code-v1_5-3b/
```

Here is the output of the above command.

Input: "def fibonacci"
Output: "(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))"

## FP8 Post-Training Quantization

The example below uses the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit for the model quantization process.

First make sure the AMMO toolkit is installed (see examples/quantization/README.md).

After successfully running the script, the output should be in .npz format, e.g. `quantized_fp8/mpt_tp1_rank0.npz`, where the FP8 scaling factors are stored.

```bash
# Quantize MPT 7B into FP8 and export a single-rank checkpoint
python examples/quantization/quantize.py --model_dir mosaicml/mpt-7b \
                                         --dtype float16 \
                                         --qformat fp8 \
                                         --export_path ./quantized_fp8

# Build an MPT-7B FP8 engine using the FT checkpoint + PTQ scaling factors from the single-rank quantized checkpoint
python build.py --model_dir ft_ckpts/mpt-7b/fp16 \
                --quantized_fp8_model_path ./quantized_fp8/mpt_tp1_rank0.npz \
                --use_gpt_attention_plugin \
                --use_gemm_plugin \
                --output_dir trt_engines/mpt-7b/fp8/1-gpu/ \
                --remove_input_padding \
                --enable_fp8 \
                --fp8_kv_cache
```
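
As with the FP16 engine earlier, the resulting FP8 engine can be sanity-checked with `run.py` (a minimal sketch mirroring the run commands above):

```bash
# Quick single-GPU check of the FP8 engine; flags mirror the earlier FP16 run.
python run.py --engine_dir ./trt_engines/mpt-7b/fp8/1-gpu/ --max_output_len 10
```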