
# ChatGLM

This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b), [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base), and [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k) models using TensorRT-LLM and run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.

## Overview

The TensorRT-LLM ChatGLM implementation can be found in `tensorrt_llm/models/chatglm/model.py`, and the example code is located in `examples/models/contrib/chatglm3-6b-32k`, which contains one main file. In addition, two shared files in the parent folder `examples` are used for inference and evaluation.

## Support Matrix

| Model Name       | FP16 | FMHA | WO  | SQ  | AWQ | FP8 | TP  | PP  | ST  | C++ | benchmark | IFB |
| :--------------- | :--: | :--: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-------: | :-: |
| chatglm3_6b      |  Y   |  Y   |  Y  |  Y  |  Y  |  Y  |  Y  |     |  Y  |  Y  |     Y     |  Y  |
| chatglm3_6b_base |  Y   |  Y   |  Y  |  Y  |  Y  |  Y  |  Y  |     |  Y  |  Y  |     Y     |  Y  |
| chatglm3_6b_32k  |  Y   |  Y   |  Y  |  Y  |  Y  |  Y  |  Y  |     |  Y  |  Y  |     Y     |  Y  |
* Model Name: the name of the model, the same as the name on HuggingFace
* FMHA: Fused MultiHead Attention (see introduction below)
* WO: Weight-Only Quantization (int8 / int4)
* SQ: Smooth Quantization (int8)
* AWQ: Activation-Aware Weight Quantization (int4)
* FP8: FP8 Quantization
* TP: Tensor Parallel
* PP: Pipeline Parallel
* ST: Strongly Typed
* C++: C++ Runtime
* benchmark: benchmark by Python / C++ Runtime
* IFB: In-flight Batching (see introduction below)
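
The quantization and parallelism columns above are normally selected when the checkpoint is converted and the engine is built. As a minimal sketch only, the snippet below shows how weight quantization (the AWQ column) and tensor parallelism (the TP column) might be requested through the high-level Python `LLM` API; whether this contrib model is accepted by that API, and the exact enum values available, are assumptions to verify against your TensorRT-LLM version.

```python
# Sketch: request INT4-AWQ quantization and 2-way tensor parallelism through the
# high-level LLM API. Treat support for this contrib model as an assumption.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)  # AWQ column

llm = LLM(
    model="chatglm3_6b",      # local HF checkpoint cloned in the Usage section below
    quant_config=quant_config,
    tensor_parallel_size=2,   # TP column
)
```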

## Model comparison

| Name             | nL  | nAH | nKH | nHW | nH   | nF    | nMSL  | nV    | bP2D | bBQKV | bBDense | Comments                                                  |
| :--------------- | --: | --: | --: | --: | ---: | ----: | ----: | ----: | :--: | :---: | :-----: | :-------------------------------------------------------- |
| chatglm3_6b      | 28  | 32  | 2   | 128 | 4096 | 13696 | 8192  | 65024 | N    | Y     | N       | Preprocessing and postprocessing differ from chatglm2_6b  |
| chatglm3_6b_base | 28  | 32  | 2   | 128 | 4096 | 13696 | 32768 | 65024 | N    | Y     | N       |                                                           |
| chatglm3_6b_32k  | 28  | 32  | 2   | 128 | 4096 | 13696 | 32768 | 65024 | N    | Y     | N       | RoPE base = 500000 rather than 10000 in chatglm3_6b       |
* nL: number of layers
* nAH: number of attention heads
* nKH: number of KV heads (less than nAH if multi-query attention is used)
* nHW: head width
* nH: hidden size
* nF: FFN hidden size
* nMSL: max sequence length (input + output)
* nV: vocabulary size
* bP2D: use position_encoding_2d (Y: Yes, N: No)
* bBQKV: use bias for QKV multiplication in self-attention
* bBDense: use bias for Dense multiplication in self-attention
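
Most of these values can be read directly from the `config.json` that ships with each HuggingFace checkpoint. The sketch below prints them for a checkpoint cloned as in the Usage section; the field names follow the chatglm3-6b configuration (`num_layers`, `multi_query_group_num`, `kv_channels`, ...) and should be verified against the checkpoint you actually downloaded.

```python
# Sketch: map the table's abbreviations to HuggingFace config.json fields.
# Field names follow THUDM/chatglm3-6b; verify them for the checkpoint you use.
import json

with open("chatglm3_6b/config.json") as f:  # folder created by the clone commands below
    cfg = json.load(f)

print("nL   (number of layers)    :", cfg["num_layers"])
print("nAH  (attention heads)     :", cfg["num_attention_heads"])
print("nKH  (KV heads)            :", cfg["multi_query_group_num"])
print("nHW  (head width)          :", cfg["kv_channels"])
print("nH   (hidden size)         :", cfg["hidden_size"])
print("nF   (FFN hidden size)     :", cfg["ffn_hidden_size"])
print("nMSL (max sequence length) :", cfg["seq_length"])
print("nV   (vocabulary size)     :", cfg["padded_vocab_size"])
print("bBQKV   (QKV bias)         :", cfg["add_qkv_bias"])
print("bBDense (dense bias)       :", cfg["add_bias_linear"])
```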

## Tokenizer and special tokens comparison

| Name             | Tokenizer        | bos | eos | pad | cls | startofpiece | endofpiece | mask   | smask | gmask |
| :--------------- | :--------------- | :-: | :-: | :-: | :-: | :----------: | :--------: | :----: | :---: | :---: |
| chatglm3_6b      | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            | 130000 |       |       |
| chatglm3_6b_base | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            | 130000 |       |       |
| chatglm3_6b_32k  | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            | 130000 |       |       |
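
These IDs can be checked against the tokenizer bundled with each checkpoint. Below is a minimal sketch, assuming the checkpoints have been cloned as in the Usage section and `transformers` is installed; `trust_remote_code=True` is required because `ChatGLMTokenizer` is defined inside the model repository.

```python
# Sketch: print the special-token IDs of a downloaded checkpoint so they can be
# compared with the table above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chatglm3_6b", trust_remote_code=True)

print("tokenizer class:", type(tokenizer).__name__)
print("bos:", tokenizer.bos_token_id)
print("eos:", tokenizer.eos_token_id)
print("pad:", tokenizer.pad_token_id)
```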

## Usage

The next section describes how to build the engine and run the inference demo.

### 1. Download repo and weights from HuggingFace Transformers

```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm3-6b      chatglm3_6b
git clone https://huggingface.co/THUDM/chatglm3-6b-base chatglm3_6b_base
git clone https://huggingface.co/THUDM/chatglm3-6b-32k  chatglm3_6b_32k
```

For more examples, please refer to `examples/models/core/glm-4-9b/README.md`.
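
As a quick smoke test after downloading the weights, the high-level Python `LLM` API can build an engine and run generation in one step. This is only a sketch: whether this contrib checkpoint is accepted by the high-level API is an assumption, and the full checkpoint-conversion and `trtllm-build` workflow in the glm-4-9b README referenced above remains the reference path.

```python
# Sketch: run a downloaded checkpoint through the high-level LLM API.
# Support for this contrib model via the high-level API is an assumption;
# fall back to the glm-4-9b example workflow if it is not accepted.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="chatglm3_6b")  # local folder cloned above
    prompts = ["What is the capital of France?"]
    sampling = SamplingParams(max_tokens=64, temperature=0.7)
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```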