# ChatGLM
This document explains how to build the ChatGLM2-6B and ChatGLM2-6B-32k models using TensorRT-LLM and how to run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.
- ChatGLM
  - Overview
  - Support Matrix
  - Model comparison
  - Tokenizer and special tokens comparison
  - Usage
    - 1. Download repo and weights from HuggingFace Transformers
    - 2. Convert weights from HF Transformers to TensorRT-LLM format
    - 3. Build TensorRT engine(s)
    - 4. Run inference
    - 5. Run summarization task
    - Weight Only quantization
    - Smooth Quantization (SQ)
    - Activation-aware Weight Quantization (AWQ)
    - FP8 Quantization
  - Benchmark
## Overview
The TensorRT-LLM ChatGLM implementation can be found in `tensorrt_llm/models/chatglm/model.py`.
The TensorRT-LLM ChatGLM example code is located in `examples/models/contrib/chatglm2-6b`. There is one main file:

- `examples/models/core/glm-4-9b/convert_checkpoint.py` to convert a checkpoint from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder `examples` for inference and evaluation:

- `../../../run.py` to run the inference on an input text;
- `../../../summarize.py` to summarize the articles in the cnn_dailymail dataset.
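The commands below are a condensed, illustrative sketch of how these files fit together for chatglm2_6b, assuming they are run from the example folder with the `chatglm2_6b` checkout from the Usage section below; `trt_ckpt` and `trt_engines` are placeholder output directories and flag names may vary between TensorRT-LLM releases. The Usage steps below cover each command in detail.

```bash
# Illustrative end-to-end flow: convert the HF checkpoint, build an engine,
# then run inference and summarization with the shared scripts.
python3 convert_checkpoint.py --model_dir chatglm2_6b \
        --output_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
        --dtype float16
trtllm-build --checkpoint_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
        --output_dir trt_engines/chatglm2_6b/fp16/1-gpu \
        --gemm_plugin float16
python3 ../../../run.py --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
        --tokenizer_dir chatglm2_6b \
        --max_output_len 50
python3 ../../../summarize.py --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
        --hf_model_dir chatglm2_6b \
        --test_trt_llm
```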
## Support Matrix
| Model Name | FP16 | FMHA | WO | SQ | AWQ | FP8 | TP | PP | ST | C++ | benchmark | IFB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chatglm2_6b | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | |
| chatglm2_6b_32k | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | | |
- Model Name: the name of the model, the same as the name on HuggingFace
- FMHA: Fused MultiHead Attention (see introduction below)
- WO: Weight Only Quantization (int8 / int4)
- SQ: Smooth Quantization (int8)
- AWQ: Activation Aware Weight Quantization (int4)
- FP8: FP8 Quantization
- TP: Tensor Parallel
- PP: Pipeline Parallel
- ST: Strongly Typed
- C++: C++ Runtime
- benchmark: benchmark with the Python / C++ runtime
- IFB: In-flight Batching (see introduction below)
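As a rough illustration of how several of these columns map onto command-line options (flag names follow the current example scripts and may change between releases; output paths are placeholders chosen for this sketch):

```bash
# WO (int8 weight-only quantization) combined with TP=2 at conversion time
python3 convert_checkpoint.py --model_dir chatglm2_6b \
        --output_dir trt_ckpt/chatglm2_6b/int8_wo/2-gpu \
        --dtype float16 \
        --use_weight_only --weight_only_precision int8 \
        --tp_size 2

# FMHA and the paged KV cache required for in-flight batching (IFB)
# are build-time switches
trtllm-build --checkpoint_dir trt_ckpt/chatglm2_6b/int8_wo/2-gpu \
        --output_dir trt_engines/chatglm2_6b/int8_wo/2-gpu \
        --gemm_plugin float16 \
        --context_fmha enable \
        --paged_kv_cache enable
```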
## Model comparison
| Name | nL | nAH | nKH | nHW | nH | nF | nMSL | nV | bP2D | bBQKV | bBDense | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chatglm2_6b | 28 | 32 | 2 | 128 | 4096 | 13696 | 32768 | 65024 | N | Y | N | Multi_query_attention, RMSNorm rather than LayerNorm in chatglm_6b |
| chatglm2_6b_32k | 28 | 32 | 2 | 128 | 4096 | 13696 | 32768 | 65024 | N | Y | N | RoPE base=160000 rather than 10000 in chatglm2_6b |
- nL: number of layers
- nAH: number of attention heads
- nKH: number of kv heads (less than nAH if multi_query_attention is used)
- nHW: head width
- nH: hidden size
- nF: FFN hidden size
- nMSL: max sequence length (input + output)
- nV: vocabulary size
- bP2D: use position_encoding_2d (Y: Yes, N: No)
- bBQKV: use bias for QKV multiplication in self-attention
- bBDense: use bias for Dense multiplication in self-attention
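The values in the table can be cross-checked against the downloaded checkpoint. The snippet below is a quick sketch using the field names that appear in THUDM/chatglm2-6b's `config.json` (other checkpoints may name them differently), and assumes the `chatglm2_6b` checkout created in the Usage section below:

```bash
# Print the hyperparameters listed above straight from the HF config
grep -E '"(num_layers|num_attention_heads|multi_query_group_num|kv_channels|hidden_size|ffn_hidden_size|seq_length|padded_vocab_size)"' \
    chatglm2_6b/config.json
```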
## Tokenizer and special tokens comparison
| Name | Tokenizer | bos | eos | pad | cls | startofpiece | endofpiece | mask | smask | gmask |
|---|---|---|---|---|---|---|---|---|---|---|
| chatglm2_6b | ChatGLMTokenizer | 1 | 2 | 0 | | | | | | |
| chatglm2_6b_32k | ChatGLMTokenizer | 1 | 2 | 0 | | | | | | |
## Usage
The following sections describe how to build the engine and run the inference demo.
### 1. Download repo and weights from HuggingFace Transformers
```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm2-6b     chatglm2_6b
git clone https://huggingface.co/THUDM/chatglm2-6b-32k chatglm2_6b_32k
```
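An optional sanity check after cloning: if git-lfs was not active during the clone, the weight files show up as tiny pointer files instead of multi-gigabyte shards. The shard file names below are whatever the Hugging Face repos currently ship and may differ.

```bash
# Each checkout should be several GB; very small sizes indicate LFS pointer files only.
du -sh chatglm2_6b chatglm2_6b_32k
ls -lh chatglm2_6b/pytorch_model-*.bin
```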
For more examples, please refer to `examples/models/core/glm-4-9b/README.md`.