
# ChatGLM

This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b), [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base), and [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k) models using TensorRT-LLM and run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.

## Overview

The TensorRT-LLM ChatGLM implementation can be found in `tensorrt_llm/models/chatglm/model.py`, and the example code is located in `examples/models/contrib/chatglm3-6b-32k`, which contains one main file. In addition, two shared files in the parent folder `examples` are used for inference and evaluation.

## Support Matrix

| Model Name       | FP16 | FMHA | WO  | SQ  | AWQ | FP8 | TP  | PP  | ST  | C++ | benchmark | IFB |
| :--------------- | :--: | :--: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-------: | :-: |
| chatglm3_6b      |  Y   |  Y   |  Y  |  Y  |  Y  |  Y  |  Y  |     |  Y  |  Y  |     Y     |  Y  |
| chatglm3_6b_base |  Y   |  Y   |  Y  |  Y  |  Y  |  Y  |  Y  |     |  Y  |  Y  |     Y     |  Y  |
| chatglm3_6b_32k  |  Y   |  Y   |  Y  |  Y  |  Y  |  Y  |  Y  |     |  Y  |  Y  |     Y     |  Y  |
* Model Name: the name of the model, the same as the name on HuggingFace
* FMHA: Fused MultiHead Attention (see introduction below)
* WO: Weight-Only Quantization (int8 / int4)
* SQ: Smooth Quantization (int8)
* AWQ: Activation-Aware Weight Quantization (int4)
* FP8: FP8 Quantization
* TP: Tensor Parallel
* PP: Pipeline Parallel
* ST: Strongly Typed
* C++: C++ Runtime
* benchmark: benchmark by Python / C++ Runtime
* IFB: In-flight Batching (see introduction below)
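
The quantization and parallelism columns above are normally selected when the checkpoint is converted and the engine is built. As a minimal sketch only, the snippet below shows how weight quantization (the AWQ column) and tensor parallelism (the TP column) might be requested through the high-level Python `LLM` API; whether this contrib model is accepted by that API, and the exact enum values available, are assumptions to verify against your TensorRT-LLM version.

```python
# Sketch: request INT4-AWQ quantization and 2-way tensor parallelism through the
# high-level LLM API. Treat support for this contrib model as an assumption.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)  # AWQ column

llm = LLM(
    model="chatglm3_6b",      # local HF checkpoint cloned in the Usage section below
    quant_config=quant_config,
    tensor_parallel_size=2,   # TP column
)
```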

## Model comparison

| Name             | nL  | nAH | nKH | nHW | nH   | nF    | nMSL  | nV    | bP2D | bBQKV | bBDense | Comments                                                  |
| :--------------- | --: | --: | --: | --: | ---: | ----: | ----: | ----: | :--: | :---: | :-----: | :-------------------------------------------------------- |
| chatglm3_6b      | 28  | 32  | 2   | 128 | 4096 | 13696 | 8192  | 65024 | N    | Y     | N       | Preprocessing and postprocessing differ from chatglm2_6b  |
| chatglm3_6b_base | 28  | 32  | 2   | 128 | 4096 | 13696 | 32768 | 65024 | N    | Y     | N       |                                                           |
| chatglm3_6b_32k  | 28  | 32  | 2   | 128 | 4096 | 13696 | 32768 | 65024 | N    | Y     | N       | RoPE base = 500000 rather than 10000 in chatglm3_6b       |
* nL: number of layers
* nAH: number of attention heads
* nKH: number of KV heads (less than nAH if multi-query attention is used)
* nHW: head width
* nH: hidden size
* nF: FFN hidden size
* nMSL: max sequence length (input + output)
* nV: vocabulary size
* bP2D: use position_encoding_2d (Y: Yes, N: No)
* bBQKV: use bias for QKV multiplication in self-attention
* bBDense: use bias for Dense multiplication in self-attention
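
Most of these values can be read directly from the `config.json` that ships with each HuggingFace checkpoint. The sketch below prints them for a checkpoint cloned as in the Usage section; the field names follow the chatglm3-6b configuration (`num_layers`, `multi_query_group_num`, `kv_channels`, ...) and should be verified against the checkpoint you actually downloaded.

```python
# Sketch: map the table's abbreviations to HuggingFace config.json fields.
# Field names follow THUDM/chatglm3-6b; verify them for the checkpoint you use.
import json

with open("chatglm3_6b/config.json") as f:  # folder created by the clone commands below
    cfg = json.load(f)

print("nL   (number of layers)    :", cfg["num_layers"])
print("nAH  (attention heads)     :", cfg["num_attention_heads"])
print("nKH  (KV heads)            :", cfg["multi_query_group_num"])
print("nHW  (head width)          :", cfg["kv_channels"])
print("nH   (hidden size)         :", cfg["hidden_size"])
print("nF   (FFN hidden size)     :", cfg["ffn_hidden_size"])
print("nMSL (max sequence length) :", cfg["seq_length"])
print("nV   (vocabulary size)     :", cfg["padded_vocab_size"])
print("bBQKV   (QKV bias)         :", cfg["add_qkv_bias"])
print("bBDense (dense bias)       :", cfg["add_bias_linear"])
```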

## Tokenizer and special tokens comparison

| Name             | Tokenizer        | bos | eos | pad | cls | startofpiece | endofpiece | mask   | smask | gmask |
| :--------------- | :--------------- | :-: | :-: | :-: | :-: | :----------: | :--------: | :----: | :---: | :---: |
| chatglm3_6b      | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            | 130000 |       |       |
| chatglm3_6b_base | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            | 130000 |       |       |
| chatglm3_6b_32k  | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            | 130000 |       |       |
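
These IDs can be checked against the tokenizer bundled with each checkpoint. Below is a minimal sketch, assuming the checkpoints have been cloned as in the Usage section and `transformers` is installed; `trust_remote_code=True` is required because `ChatGLMTokenizer` is defined inside the model repository.

```python
# Sketch: print the special-token IDs of a downloaded checkpoint so they can be
# compared with the table above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chatglm3_6b", trust_remote_code=True)

print("tokenizer class:", type(tokenizer).__name__)
print("bos:", tokenizer.bos_token_id)
print("eos:", tokenizer.eos_token_id)
print("pad:", tokenizer.pad_token_id)
```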

## Usage

The next section describes how to build the engine and run the inference demo.

### 1. Download repo and weights from HuggingFace Transformers

```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm3-6b      chatglm3_6b
git clone https://huggingface.co/THUDM/chatglm3-6b-base chatglm3_6b_base
git clone https://huggingface.co/THUDM/chatglm3-6b-32k  chatglm3_6b_32k
```

For more examples, please refer to `examples/models/core/glm-4-9b/README.md`.
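
As a quick smoke test after downloading the weights, the high-level Python `LLM` API can build an engine and run generation in one step. This is only a sketch: whether this contrib checkpoint is accepted by the high-level API is an assumption, and the full checkpoint-conversion and `trtllm-build` workflow in the glm-4-9b README referenced above remains the reference path.

```python
# Sketch: run a downloaded checkpoint through the high-level LLM API.
# Support for this contrib model via the high-level API is an assumption;
# fall back to the glm-4-9b example workflow if it is not accepted.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="chatglm3_6b")  # local folder cloned above
    prompts = ["What is the capital of France?"]
    sampling = SamplingParams(max_tokens=64, temperature=0.7)
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```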