TensorRT-LLMs/examples/models/contrib/grok/README.md

# Grok-1

This document shows how to build and run grok-1 model in TensorRT LLM on both single GPU, single node multi-GPU and multi-node multi-GPU.

- [Grok1](#Grok-1)
  - [Prerequisite](#prerequisite)
  - [Hardware](#hardware)
  - [Overview](#overview)
  - [Support Matrix](#support-matrix)
  - [Usage](#usage)
    - [Build TensorRT engine(s)](#build-tensorrt-engines)

## Prerequisite
First of all, please clone the official grok-1 code repo with below commands and install the dependencies.
```bash
git clone https://github.com/xai-org/grok-1.git /path/to/folder
```
And then downloading the weights per [instructions](https://github.com/xai-org/grok-1?tab=readme-ov-file#downloading-the-weights).

## Hardware
The grok-1 model requires a node with 8x80GB GPU memory(at least).

## Overview

The TensorRT LLM Grok-1 implementation can be found in [tensorrt_llm/models/grok/model.py](../../../../tensorrt_llm/models/grok/model.py). The TensorRT LLM Grok-1 example code is located in [`examples/models/contrib/grok`](./). There is one main file:

* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Grok-1 model into TensorRT LLM checkpoint format.

In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:

* [`../../../run.py`](../../../run.py) to run the inference on an input text;
* [`../../../summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.

## Support Matrix
  * INT8 Weight-Only
  * Tensor Parallel
  * STRONGLY TYPED

## Usage

The TensorRT LLM Grok-1 example code locates at [examples/models/contrib/grok](./). It takes xai weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.

### Build TensorRT engine(s)

Please install required packages first to make sure the example uses matched `tensorrt_llm` version:

```bash
pip install -r requirements.txt
```

Need to prepare the Grok-1 checkpoint by following the guides here https://github.com/xai-org/grok-1.

TensorRT LLM Grok-1 builds TensorRT engine(s) from Xai's checkpoints.

Normally `trtllm-build` only requires single GPU, but if you've already got all the GPUs needed for inference, you could enable parallel building to make the engine building process faster by adding `--workers` argument. Please note that currently `workers` feature only supports single node.


Below is the step-by-step to run Grok-1 with TensorRT LLM.

```bash
# Build the bfloat16 engine from xai official weights.
python convert_checkpoint.py --model_dir ./tmp/grok-1/ \
                              --output_dir ./tllm_checkpoint_8gpus_bf16 \
                              --dtype bfloat16 \
                              --use_weight_only \
                              --tp_size 8 \
                              --workers 8

trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpus_bf16 \
            --output_dir ./tmp/grok-1/trt_engines/bf16/8-gpus \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --moe_plugin bfloat16 \
            --paged_kv_cache enable \
            --remove_input_padding enable \
            --workers 8


# Run Grok-1 with 8 GPUs
mpirun -n 8 --allow-run-as-root \
    python ../../../run.py \
    --input_text "The answer to life the universe and everything is of course" \
    --engine_dir ./tmp/grok-1/trt_engines/bf16/8-gpus \
    --max_output_len 50 --top_p 1 --top_k 8 --temperature 0.3 \
    --vocab_file  ./tmp/grok-1/tokenizer.model
```