# Jais
This document shows how to build the Jais model into runnable engines on a multi-GPU node and perform a summarization task using these engines.

Currently it has been tested on:
- [Jais-13b-chat](https://huggingface.co/core42/jais-13b-chat)
- [Jais-30b-chat-v3](https://huggingface.co/core42/jais-30b-chat-v3)

- [Jais](#jais)
- [Overview](#overview)
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- [Build TensorRT engine(s)](#build-tensorrt-engines)
- [Run inference](#run)
## Overview
The TensorRT LLM support for Jais is based on the GPT model; the implementation can be found in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). Jais closely resembles GPT, except that it uses ALiBi position embedding, embedding scale, SwiGLU activation, and logits scale. We therefore reuse the [GPT example code](../../../gpt) for Jais:
* [`convert_checkpoint.py`](../../../gpt/convert_checkpoint.py) to convert the Jais model into TensorRT LLM checkpoint format.
In addition, there are two shared files in the [`examples`](../../../) folder for inference and evaluation:
* [`../../../run.py`](../../../run.py) to run inference on an input text;
* [`../../../summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.
## Support Matrix
The tested configurations are:
* FP16
* FP8
* Inflight Batching
* Tensor Parallel
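
The usage examples below build FP16 engines. As a rough sketch of the FP8 path (not verified for Jais specifically; it assumes the shared [`quantize.py`](../../../quantization/quantize.py) script supports this GPT variant), the checkpoint would be quantized with `--qformat fp8` before running `trtllm-build`:

```bash
# Hedged FP8 sketch -- verify that quantize.py supports Jais before relying on this.
python3 ../../../quantization/quantize.py --model_dir core42/jais-13b-chat \
        --dtype float16 \
        --qformat fp8 \
        --kv_cache_dtype fp8 \
        --calib_size 512 \
        --output_dir jais-13b-chat/trt_ckpt/fp8/1-gpu
```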
## Usage
This section walks through the whole process: converting the HF models, building TensorRT LLM engines, and finally running summarization.
### Build TensorRT engine(s)
Run the following commands and TRT-LLM will first transform the HF model into its own checkpoint format, then build a TRT engine from that checkpoint:
```bash
# single gpu, dtype float16 for jais-13b-chat
python3 ../../../gpt/convert_checkpoint.py --model_dir core42/jais-13b-chat \
        --dtype float16 \
        --output_dir jais-13b-chat/trt_ckpt/fp16/1-gpu

# 2-way tensor parallelism for jais-30b-chat-v3
python3 ../../../gpt/convert_checkpoint.py --model_dir core42/jais-30b-chat-v3 \
        --dtype float16 \
        --tp_size 2 \
        --output_dir jais-30b-chat-v3/trt_ckpt/fp16/2-gpu
```
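
Each output directory holds a TensorRT LLM checkpoint: a `config.json` plus one safetensors file per rank. A quick sanity check (illustrative listing; the exact file names come from the converter):

```bash
ls jais-13b-chat/trt_ckpt/fp16/1-gpu
# config.json  rank0.safetensors
```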
```bash
# Build a single-GPU float16 engine from the TensorRT LLM checkpoint for jais-13b-chat.
# Enable the special TensorRT LLM GPT Attention plugin (--gpt_attention_plugin) to increase runtime performance.
# It is recommended to use --remove_input_padding along with --gpt_attention_plugin for better performance.
trtllm-build --checkpoint_dir jais-13b-chat/trt_ckpt/fp16/1-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --output_dir jais-13b-chat/trt_engines/fp16/1-gpu

# Build 2-way tensor parallelism engines from the TensorRT LLM checkpoint for jais-30b-chat-v3.
trtllm-build --checkpoint_dir jais-30b-chat-v3/trt_ckpt/fp16/2-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --output_dir jais-30b-chat-v3/trt_engines/fp16/2-gpu
```
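
Each engine directory should likewise contain a `config.json` and one serialized engine per rank (illustrative listing):

```bash
ls jais-30b-chat-v3/trt_engines/fp16/2-gpu
# config.json  rank0.engine  rank1.engine
```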
### Run
The [`../../../run.py`](../../../run.py) script can be used to run inference with the built engine(s).
```bash
python3 ../../../run.py --engine_dir jais-13b-chat/trt_engines/fp16/1-gpu \
        --tokenizer_dir core42/jais-13b-chat \
        --max_output_len 10
```
If the engines are run successfully, you will see output like:
```
......
Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef in Paris before moving to England in 1816"
```
```bash
# The Arabic prompt translates to: "Ibn Battuta was born in 1304 AD; he went"
python3 ../../../run.py --engine_dir jais-13b-chat/trt_engines/fp16/1-gpu \
        --tokenizer_dir core42/jais-13b-chat \
        --max_output_len 8 \
        --input_text "ولد في 1304 ميلادياً ابن بطوطه, لقد ذهب"
```
If the engines are run successfully, you will see output like the following, where the completion translates to "throughout the known world at that time":
```
.....
Input [Text 0]: "ولد في 1304 ميلادياً ابن بطوطه, لقد ذهب"
Output [Text 0 Beam 0]: " في جميع أنحاء العالم المعروف في ذلك الوقت"
```
To run the 2-way tensor parallelism model, launch `run.py` with `mpirun`; the number of MPI ranks (`-np 2`) must match the `tp_size` used at build time:
```bash
mpirun -np 2 \
    python3 ../../../run.py --engine_dir jais-30b-chat-v3/trt_engines/fp16/2-gpu \
            --tokenizer_dir core42/jais-30b-chat-v3 \
            --max_output_len 30
```
If the engines are run successfully, you will see output like:
```
Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef, working in a series of high-end establishments.
Soyer's career took him to work in a number of establishments across Europe,"
```
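
Finally, [`../../../summarize.py`](../../../summarize.py) evaluates the engine on the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. A hedged sketch of such a run, assuming the script's usual flags (`--test_trt_llm`, `--engine_dir`, `--hf_model_dir`, `--data_type`) apply to Jais unchanged:

```bash
python3 ../../../summarize.py --test_trt_llm \
        --engine_dir jais-13b-chat/trt_engines/fp16/1-gpu \
        --hf_model_dir core42/jais-13b-chat \
        --data_type fp16
```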