mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Grok-1

This document shows how to build and run the Grok-1 model in TensorRT-LLM on a single GPU, on a single node with multiple GPUs, and on multiple nodes with multiple GPUs.

Prerequisite

First, clone the official Grok-1 code repository with the command below and install its dependencies.

git clone https://github.com/xai-org/grok-1.git /path/to/folder

Then download the weights by following the instructions in that repository.

Hardware

The Grok-1 model requires a node with at least 8 GPUs, each with 80 GB of memory.
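As a rough sanity check on this requirement, the sketch below estimates how much memory the weights alone occupy per GPU. It assumes about 314 billion parameters (the size xAI has publicly stated for Grok-1, not a figure from this document) and an even split across 8 GPUs. It ignores the KV cache, activations, and workspace buffers, so real usage is higher. The function name is hypothetical.

```python
# Back-of-the-envelope weight-memory estimate for Grok-1 (illustrative only).
# Assumption: ~314e9 parameters, as publicly stated by xAI.
PARAMS = 314e9
GPUS = 8
GPU_MEMORY_GB = 80


def weight_gb_per_gpu(bytes_per_param: float, gpus: int = GPUS) -> float:
    """Weights only, assuming an even split; ignores KV cache and activations."""
    return PARAMS * bytes_per_param / gpus / 1e9


for name, nbytes in (("bfloat16", 2), ("int8 weight-only", 1)):
    per_gpu = weight_gb_per_gpu(nbytes)
    share = per_gpu / GPU_MEMORY_GB
    print(f"{name}: ~{per_gpu:.1f} GB of weights per GPU ({share:.0%} of {GPU_MEMORY_GB} GB)")
```

Under these assumptions, unquantized bfloat16 weights would nearly fill each 80 GB GPU, which is consistent with the build commands in this document passing --use_weight_only.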

Overview

The TensorRT-LLM Grok-1 implementation can be found in tensorrt_llm/models/grok/model.py. The TensorRT-LLM Grok-1 example code is located in examples/models/contrib/grok. There is one main file:

convert_checkpoint.py to convert the Grok-1 model into the TensorRT-LLM checkpoint format.

In addition, there are two shared files in the top-level examples folder for inference and evaluation:

../../../run.py to run the inference on an input text;
../../../summarize.py to summarize the articles in the cnn_dailymail dataset.

Support Matrix

INT8 Weight-Only
Tensor Parallel
STRONGLY TYPED

Usage

The TensorRT-LLM Grok-1 example code is located in examples/models/contrib/grok. It takes the xAI weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.

Build TensorRT engine(s)

Install the required packages first so that the example uses a matching tensorrt_llm version:

pip install -r requirements.txt

Prepare the Grok-1 checkpoint by following the guide at https://github.com/xai-org/grok-1.

TensorRT-LLM Grok-1 builds TensorRT engine(s) from xAI's checkpoints.

Normally trtllm-build requires only a single GPU. If you already have all the GPUs needed for inference, you can speed up engine building by enabling parallel builds with the --workers argument. Note that the --workers feature currently supports only a single node.

The steps below build and run Grok-1 with TensorRT-LLM.

# Convert the official xAI weights to a TensorRT-LLM checkpoint (bfloat16, with weight-only quantization), then build the engines.
python convert_checkpoint.py --model_dir ./tmp/grok-1/ \
                              --output_dir ./tllm_checkpoint_8gpus_bf16 \
                              --dtype bfloat16 \
                              --use_weight_only \
                              --tp_size 8 \
                              --workers 8

trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpus_bf16 \
            --output_dir ./tmp/grok-1/trt_engines/bf16/8-gpus \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --moe_plugin bfloat16 \
            --paged_kv_cache enable \
            --remove_input_padding enable \
            --workers 8


# Run Grok-1 with 8 GPUs
mpirun -n 8 --allow-run-as-root \
    python ../../../run.py \
    --input_text "The answer to life the universe and everything is of course" \
    --engine_dir ./tmp/grok-1/trt_engines/bf16/8-gpus \
    --max_output_len 50 --top_p 1 --top_k 8 --temperature 0.3 \
    --vocab_file ./tmp/grok-1/tokenizer.model
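
Because one engine is built per GPU, it can help to confirm that the engine directory is complete before launching mpirun. The following is a minimal sketch, not part of the example code. It assumes that trtllm-build writes one file per rank named rank<N>.engine; check this against the actual contents of your output directory. The helper name missing_rank_engines is hypothetical.

```python
from pathlib import Path


def missing_rank_engines(engine_dir: str, world_size: int) -> list[str]:
    """Return the expected per-rank engine files that are absent from engine_dir.

    Assumption: engines are named rank0.engine, rank1.engine, and so on.
    """
    root = Path(engine_dir)
    expected = [f"rank{rank}.engine" for rank in range(world_size)]
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Same engine directory and GPU count as in the commands above.
    missing = missing_rank_engines("./tmp/grok-1/trt_engines/bf16/8-gpus", 8)
    print("all engines present" if not missing else f"missing: {missing}")
```

The world size passed here should match both the --tp_size used during checkpoint conversion and the -n value given to mpirun.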