
# ChatGLM

This document explains how to build the ChatGLM2-6B and ChatGLM2-6B-32k models using TensorRT-LLM and run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.

## Overview

The TensorRT-LLM ChatGLM implementation can be found in `tensorrt_llm/models/chatglm/model.py`. The TensorRT-LLM ChatGLM example code is located in `examples/models/contrib/chatglm2-6b`. There is one main file:

- `convert_checkpoint.py` to convert a checkpoint from the HuggingFace format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder `examples` for inference and evaluation:

- `run.py` to run the inference on an input text;
- `summarize.py` to summarize the articles in the CNN Daily Mail dataset.

## Support Matrix

|   Model Name    | FP16 | FMHA |  WO  |  SQ  | AWQ  | FP8  |  TP  |  PP  |  ST  | C++  | benchmark | IFB  |
| :-------------: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-------: | :--: |
|   chatglm2_6b   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |     Y     |      |
| chatglm2_6b_32k |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |  Y   |           |      |
- Model Name: the name of the model, the same as the name on HuggingFace
- FMHA: Fused MultiHead Attention (see introduction below)
- WO: Weight Only Quantization (int8 / int4)
- SQ: Smooth Quantization (int8)
- AWQ: Activation Aware Weight Quantization (int4)
- FP8: FP8 Quantization
- TP: Tensor Parallel
- PP: Pipeline Parallel
- ST: Strongly Typed
- C++: C++ Runtime
- benchmark: benchmark by python / C++ Runtime
- IFB: In-flight Batching (see introduction below)
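As an illustration of the WO (weight-only quantization) column, an INT8 weight-only conversion could look like the sketch below. The flags follow the common conventions of TensorRT-LLM `convert_checkpoint.py` scripts; exact option names and the output path are assumptions and may differ between versions:

```bash
# Sketch: convert the HuggingFace checkpoint with INT8 weight-only quantization
python3 convert_checkpoint.py \
    --model_dir chatglm2_6b \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir trt_ckpt/chatglm2_6b/int8_wo/1-gpu
```

Weight-only quantization compresses the GEMM weights to int8/int4 while keeping activations in FP16, which mainly saves memory and bandwidth rather than compute.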

## Model comparison

|      Name       |  nL  | nAH  | nKH  | nHW  |  nH  |  nF  | nMSL |  nV   | bP2D | bBQKV | bBDense | Comments                                                           |
| :-------------: | :--: | :--: | :--: | :--: | :--: | :---: | :--: | :---: | :--: | :---: | :-----: | :----------------------------------------------------------------- |
|   chatglm2_6b   |  28  |  32  |  2   | 128  | 4096 | 13696 | 32768 | 65024 |  N   |   Y   |    N    | Multi_query_attention, RMSNorm rather than LayerNorm in chatglm_6b |
| chatglm2_6b_32k |  28  |  32  |  2   | 128  | 4096 | 13696 | 32768 | 65024 |  N   |   Y   |    N    | RoPE base=160000 rather than 10000 in chatglm2_6b                  |
- nL: number of layers
- nAH: number of attention heads
- nKH: number of kv heads (less than nAH if multi_query_attention is used)
- nHW: head width
- nH: hidden size
- nF: FFN hidden size
- nMSL: max sequence length (input + output)
- nV: vocabulary size
- bP2D: use position_encoding_2d (Y: Yes, N: No)
- bBQKV: use bias for QKV multiplication in self-attention
- bBDense: use bias for Dense multiplication in self-attention
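The multi-query attention noted in the table (nKH = 2 instead of nAH = 32) shrinks the KV cache substantially. A rough back-of-the-envelope check, assuming FP16 (2 bytes per value) and the dimensions above:

```bash
#!/bin/sh
# Per-token KV-cache bytes = 2 (K and V) * nL * nKH * nHW * sizeof(fp16)
mqa=$(( 2 * 28 * 2 * 128 * 2 ))    # nKH = 2 (multi-query attention)
mha=$(( 2 * 28 * 32 * 128 * 2 ))   # hypothetical: all 32 heads keep their own KV
echo "MQA: $mqa bytes/token, MHA: $mha bytes/token, ratio: $(( mha / mqa ))x"
# → MQA: 28672 bytes/token, MHA: 458752 bytes/token, ratio: 16x
```

At 32768 tokens of context, that is roughly 0.9 GiB per sequence with MQA versus about 14 GiB without it, which is what makes the 32k variant practical.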

## Tokenizer and special tokens comparison

|      Name       |    Tokenizer     | bos  | eos  | pad  | cls  | startofpiece | endofpiece | mask | smask | gmask |
| :-------------: | :--------------: | :--: | :--: | :--: | :--: | :----------: | :--------: | :--: | :---: | :---: |
|   chatglm2_6b   | ChatGLMTokenizer |  1   |  2   |  0   |      |              |            |      |       |       |
| chatglm2_6b_32k | ChatGLMTokenizer |  1   |  2   |  0   |      |              |            |      |       |       |

## Usage

The following sections describe how to build the engine and run the inference demo.

### 1. Download repo and weights from HuggingFace Transformers

```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm2-6b      chatglm2_6b
git clone https://huggingface.co/THUDM/chatglm2-6b-32k  chatglm2_6b_32k
```
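After downloading the weights, the usual TensorRT-LLM flow is to convert the checkpoint, build an engine, and run it. The sketch below shows these next steps; command names and flags follow common TensorRT-LLM conventions, and the output directories and relative script paths are illustrative assumptions that may vary between versions:

```bash
# Sketch: convert the HuggingFace checkpoint to the TensorRT-LLM format (FP16)
python3 convert_checkpoint.py \
    --model_dir chatglm2_6b \
    --output_dir trt_ckpt/chatglm2_6b/fp16/1-gpu

# Build the TensorRT engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
    --gemm_plugin float16 \
    --output_dir trt_engines/chatglm2_6b/fp16/1-gpu

# Run inference with the shared example script in the parent examples folder
python3 ../../../run.py \
    --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
    --tokenizer_dir chatglm2_6b \
    --max_output_len 50
```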

For more example code, please refer to examples/glm-4-9b/README.md.