mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

chore: clean some ci of qa test (#3083 )

* move some models to examples/models/contrib

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* update the document

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* remove arctic, blip2, cogvlm, dbrx from qa test list

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* remove tests of dit, mmdit and stdit from qa test

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* remove grok, jais, sdxl, skywork, smaug from qa test list

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* re-organize the glm examples

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix issues after running pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix some typo in glm_4_9b readme

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

2025-03-31 14:30:41 +08:00

5.6 KiB

Raw Blame History

ChatGLM

This document explains how to build the ChatGLM-6B models using TensorRT-LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.

ChatGLM

Overview

The TensorRT-LLM ChatGLM implementation can be found in tensorrt_llm/models/chatglm/model.py. The TensorRT-LLM ChatGLM example code is located in examples/models/contrib/chatglm-6b. There is one main file:

examples/glm-4-9b/convert_checkpoint.py to convert a checkpoint from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder examples for inference and evaluation:

../../../run.py to run the inference on an input text;
../../../summarize.py to summarize the articles in the cnn_dailymail dataset.

Support Matrix

Model Name	FP16	FMHA	WO	SQ	AWQ	FP8	TP	PP	ST	C++	benchmark	IFB
chatglm_6b	Y		Y				Y		Y	Y	Y	Y
glm_10b	Y	Y	Y		Y	Y	Y		Y	Y	Y	Y

Model Name: the name of the model, the same as the name on HuggingFace
FMHA: Fused MultiHead Attention (see introduction below)
WO: Weight Only Quantization (int8 / int4)
SQ: Smooth Quantization (int8)
AWQ: Activation Aware Weight Quantization (int4)
FP8: FP8 Quantization
TP: Tensor Parallel
PP: Pipeline Parallel
ST: Strongly Typed
C++: C++ Runtime
benchmark: benchmark by python / C++ Runtime
IFB: In-flight Batching (see introduction below)

Model comparison

Name	nL	nAH	nKH	nHW	nH	nF	nMSL	nV	bP2D	bBQKV	bBDense	Comments
chatglm_6b	28	32	32	128	4096	16384	2048	130528	Y	Y	Y
glm_10b	48	64	32	64	4096	16384	1024	50304	Y	Y	Y

nL: number of layers
nAH: number of attention heads
nKH: number of kv heads (less than nAH if multi_query_attention is used)
nHW: head width
nH: hidden size
nF: FFN hidden size
nMSL: max sequence length (input + output)
nV: vocabulary size
bP2D: use position_encoding_2d (Y: Yes, N: No)
bBQKV: use bias for QKV multiplication in self-attention
bBDense: use bias for Dense multiplication in self-attention

Tokenizer and special tokens comparison

Name	Tokenizer	bos	eos	pad	cls	startofpiece	endofpiece	mask	smask	gmask
chatglm_6b	ChatGLMTokenizer	130004	130005	3		130004	130005	130000		130001
glm_10b	GLMGPT2Tokenizer	50257	50256	50256	50259	50257	50258	50260	50264	50263

Usage

The next section describe how to build the engine and run the inference demo.

1. Download repo and weights from HuggingFace Transformers

pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm-6b       chatglm_6b
git clone https://huggingface.co/THUDM/glm-10b          glm_10b

# replace tokenization file if using transformers-4.36.1 for model ChatGLM-6B (this might be needless in the future)
cp chatglm_6b/tokenization_chatglm.py chatglm_6b/tokenization_chatglm.py-backup
cp tokenization_chatglm.py chatglm_6b

For more example codes, please refer to the examples/glm-4-9b/README.md.

5.6 KiB Raw Blame History