Deepseek-v1

This document shows how to build and run the Deepseek-v1 model in TensorRT-LLM.

Prerequisite

First, download the Deepseek-v1 weights from Hugging Face: https://huggingface.co/deepseek-ai/deepseek-moe-16b-base.

git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-moe-16b-base deepseek_moe_16b
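
Alternatively, if Git LFS is not available, the same weights can be fetched with the Hugging Face CLI (a sketch; assumes the huggingface_hub package is installed, and downloads into the directory the commands below expect):

# Download the HF weights without Git LFS.
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/deepseek-moe-16b-base --local-dir ./deepseek_moe_16b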

Hardware

The Deepseek-v1 model requires a single GPU with 80 GB of memory.
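
To verify that the GPU meets this requirement before building, nvidia-smi can report the total memory per device:

# Print the name and total memory of each visible GPU.
nvidia-smi --query-gpu=name,memory.total --format=csv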

Overview

The TensorRT-LLM Deepseek-v1 implementation can be found in tensorrt_llm/models/deepseek_v1/model.py. The TensorRT-LLM Deepseek-v1 example code is located in examples/deepseek_v1. There is one main file:

  • convert_checkpoint.py to convert the Deepseek-v1 model into the TensorRT-LLM checkpoint format.

In addition, there are three shared files in the parent folder examples that can be used for inference and evaluation: ../run.py, ../summarize.py, and ../mmlu.py.

Support Matrix

  • FP16
  • Tensor Parallel
  • FP8

Usage

The TensorRT-LLM Deepseek-v1 example code is located in examples/deepseek_v1. It takes PyTorch weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.

Build TensorRT engine(s)

Below are the step-by-step instructions for running Deepseek-v1 with TensorRT-LLM.

First, the checkpoint is converted to the TensorRT-LLM checkpoint format by applying convert_checkpoint.py. After that, the TensorRT engine(s) can be built from the TensorRT-LLM checkpoint.

# Build the bfloat16 engine from Deepseek-v1 HF weights.
python convert_checkpoint.py --model_dir ./deepseek_moe_16b/ \
                            --output_dir ./trtllm_checkpoint_deepseek_v1_1gpu_bf16 \
                            --dtype bfloat16 \
                            --tp_size 1
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v1_1gpu_bf16 \
            --output_dir ./trtllm_engines/deepseek_v1/bf16/tp1 \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --moe_plugin bfloat16
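
If more than one GPU is available, the same two steps extend to tensor parallelism; below is a sketch assuming 2 GPUs (the checkpoint and engine paths are illustrative). trtllm-build then emits one engine per rank:

# Hypothetical 2-GPU variant: shard the checkpoint with --tp_size 2.
python convert_checkpoint.py --model_dir ./deepseek_moe_16b/ \
                            --output_dir ./trtllm_checkpoint_deepseek_v1_2gpu_bf16 \
                            --dtype bfloat16 \
                            --tp_size 2
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v1_2gpu_bf16 \
            --output_dir ./trtllm_engines/deepseek_v1/bf16/tp2 \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --moe_plugin bfloat16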

Then, test the engine with the run.py script:

python ../run.py --engine_dir ./trtllm_engines/deepseek_v1/bf16/tp1 \
                --tokenizer_dir ./deepseek_moe_16b/ \
                --max_output_len 32 \
                --top_p 0 \
                --input_text "The president of the United States is person who"

FP8 Quantization

The ../quantization/quantize.py script can be used to quantize the model and export a TensorRT-LLM checkpoint.

# Deepseek-v1: single gpu, fp8 quantization
python ../quantization/quantize.py --model_dir deepseek_moe_16b \
        --dtype float16 \
        --qformat fp8 \
        --kv_cache_dtype fp8 \
        --output_dir trt_ckpt/deepseek_moe_16b/fp8/1-gpu \
        --calib_size 512

# Deepseek-v1: single-gpu engine with fp8 quantization, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir ./trt_ckpt/deepseek_moe_16b/fp8/1-gpu \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --output_dir ./trt_engines/fp8/1-gpu/
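
The FP8 engine can then be tested with the same run.py invocation as before (a sketch mirroring the bfloat16 example above):

python ../run.py --engine_dir ./trt_engines/fp8/1-gpu/ \
                --tokenizer_dir ./deepseek_moe_16b/ \
                --max_output_len 32 \
                --top_p 0 \
                --input_text "The president of the United States is a person who"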

Credits

This Deepseek-v1 model example exists thanks to the community contribution of @akhoroshev (https://github.com/akhoroshev)!