# MPT

This document explains how to build the [MPT](https://huggingface.co/mosaicml/mpt-7b) model using TensorRT-LLM and run it on a single GPU, as well as on a single node with multiple GPUs.

## Overview

The TensorRT-LLM MPT implementation can be found in [`tensorrt_llm/models/mpt/model.py`](../../tensorrt_llm/models/mpt/model.py). The TensorRT-LLM MPT example code is located in [`examples/mpt`](./). There is one main file:

* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:

* [`../run.py`](../run.py) to run the inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.

## Support Matrix

* FP16
* FP8 (with FP8 KV Cache)
* INT8 & INT4 Weight-Only
* INT8 SmoothQuant
* INT4 AWQ
* Tensor Parallel
* MHA, MQA & GQA
* STRONGLY TYPED

### MPT 7B

The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TensorRT-LLM checkpoints.

#### 1.1 Convert from HF Transformers in FP16/FP32

```bash
# Generate FP16 checkpoints.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp16/ --dtype float16

# Generate FP32 checkpoints with TP=4.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp32_tp4/ --dtype float32 --tp_size 4
```

#### 1.2 Convert from HF Transformers with weight-only quantization

```bash
# Use int8 weight-only quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_wo/ --use_weight_only

# Use int4 weight-only quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_wo/ --use_weight_only --weight_only_precision int4
```

#### 1.3 Convert from HF Transformers with SmoothQuant quantization

```bash
# Use int8 smoothquant (weight and activation) quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_sq/ --smoothquant 0.5
```

#### 1.4 Convert from HF Transformers with INT8 KV cache quantization

```bash
# Use int8 kv cache quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp16_int8kv/ --dtype float16 --calibrate_kv_cache
```

***INT8 KV cache can be used together with SmoothQuant or weight-only quantization at the same time.***

***The AMMO toolkit can now be used to perform all of these quantization modes, as shown in the following sections.***

First, make sure the AMMO toolkit is installed (see [examples/quantization/README.md](/examples/quantization/README.md#preparation)).

#### 1.5 AWQ weight-only quantization with AMMO

```bash
# INT4 AWQ quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_awq/ --qformat int4_awq
```

#### 1.6 FP8 Post-Training Quantization with AMMO

```bash
# FP8 quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp8/ --qformat fp8 --kv_cache_dtype fp8
```

#### 1.7 Weight-only quantization with AMMO

```bash
# INT8 weight-only quantization using AMMO with TP=2.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_wo/ --qformat int8_wo --tp_size 2

# INT4 weight-only quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_wo/ --qformat int4_wo
```

#### 1.8 SmoothQuant and INT8 KV cache with AMMO

```bash
# Use int8 smoothquant quantization with int8 KV cache.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/sq_int8kv/ --qformat int8_sq --kv_cache_dtype int8
```

***INT8 KV cache can also be combined with weight-only quantization at the same time.***

#### 2.1 Build TensorRT engine(s)

All of the checkpoints generated by `convert_checkpoint.py` or `quantize.py` (AMMO) share the same build command.

```bash
# Build a single-GPU float16 engine from the TensorRT-LLM checkpoint.
trtllm-build --checkpoint_dir ./ft_ckpts/mpt-7b/fp16/ \
             --max_batch_size 32 \
             --max_input_len 1024 \
             --max_output_len 512 \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --workers 1 \
             --output_dir ./trt_engines/mpt-7b/fp16/1-gpu
```
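
As a quick sanity check of the single-GPU FP16 engine built above, the shared [`../run.py`](../run.py) script can be invoked directly (no `mpirun` is needed for a single GPU). This is a minimal sketch mirroring the multi-GPU run commands further below; the output length is arbitrary.

```bash
# Run the single-GPU MPT-7B FP16 engine on a sample input prompt.
python ../run.py --max_output_len 10 \
                 --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu \
                 --tokenizer_dir mosaicml/mpt-7b
```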
### MPT 30B

The same commands can be used to convert MPT 30B to the TensorRT-LLM format. Below is an example of building an MPT-30B FP16 4-way tensor-parallel TRT engine.

#### 1. Convert weights from HF Transformers to TensorRT-LLM format

The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TensorRT-LLM format.

```bash
python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ft_ckpts/mpt-30b/fp16_tp4/ --tp_size 4 --dtype float16
```

#### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build 4-GPU MPT-30B float16 engines.
trtllm-build --checkpoint_dir ./ft_ckpts/mpt-30b/fp16_tp4 \
             --max_batch_size 32 \
             --max_input_len 1024 \
             --max_output_len 512 \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --workers 4 \
             --output_dir ./trt_engines/mpt-30b/fp16_tp4
```

#### 3. Run TRT engine to check if the build was correct

```bash
# Run the 4-GPU MPT-30B TRT engine on a sample input prompt.
mpirun -n 4 --allow-run-as-root \
    python ../run.py --max_output_len 10 \
                     --engine_dir ./trt_engines/mpt-30b/fp16_tp4 \
                     --tokenizer_dir mosaicml/mpt-30b
```

### Replit Code V-1.5 3B

The same commands can be used to convert [Replit Code V-1.5 3B](https://huggingface.co/replit/replit-code-v1_5-3b) to the TensorRT-LLM format. Below is an example of building a Replit Code V-1.5 3B bfloat16 2-way tensor-parallel TRT engine.

#### 1. Convert weights from HF Transformers to TensorRT-LLM format

The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TensorRT-LLM format.

```bash
python convert_checkpoint.py --model_dir ./replit-code-v1_5-3b --output_dir ./ft_ckpts/replit-code-v1_5-3b/bf16_tp2/ --tp_size 2 --dtype bfloat16
```

#### 2. Build TensorRT engine(s)

Examples of build invocations:

```bash
# Build 2-GPU Replit Code V-1.5 3B bfloat16 engines.
trtllm-build --checkpoint_dir ./ft_ckpts/replit-code-v1_5-3b/bf16_tp2 \
             --max_batch_size 32 \
             --max_input_len 1024 \
             --max_output_len 512 \
             --gpt_attention_plugin bfloat16 \
             --gemm_plugin bfloat16 \
             --workers 2 \
             --output_dir ./trt_engines/replit-code-v1_5-3b/bf16_tp2
```

#### 3. Run TRT engine to check if the build was correct

```bash
# Run the 2-GPU Replit Code V-1.5 3B TRT engine on a sample input prompt.
mpirun -n 2 --allow-run-as-root \
    python ../run.py --max_output_len 64 \
                     --input_text "def fibonacci" \
                     --engine_dir ./trt_engines/replit-code-v1_5-3b/bf16_tp2 \
                     --tokenizer_dir ./replit-code-v1_5-3b/
```

Here is the output of the above command.

```bash
Input: "def fibonacci"
Output: "(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))"
```
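
As mentioned in the Overview, the shared [`../summarize.py`](../summarize.py) script can be used to evaluate any of the MPT engines built above on the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. Below is a minimal sketch for the single-GPU MPT-7B FP16 engine; the flag names follow the shared example scripts and may differ slightly across TensorRT-LLM versions.

```bash
# Summarize cnn_dailymail articles with the single-GPU MPT-7B FP16 engine.
# --test_trt_llm evaluates the TensorRT-LLM engine; ROUGE scores are reported at the end of the run.
python ../summarize.py --test_trt_llm \
                       --hf_model_dir mosaicml/mpt-7b \
                       --data_type fp16 \
                       --engine_dir ./trt_engines/mpt-7b/fp16/1-gpu
```

For multi-GPU engines, the same command can be wrapped with `mpirun -n <tp_size> --allow-run-as-root`, mirroring the `run.py` examples above.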