
ChatGLM

This document explains how to build the ChatGLM2-6B and ChatGLM2-6B-32k models using TensorRT-LLM and how to run inference on a single GPU, on a single node with multiple GPUs, or on multiple nodes with multiple GPUs.

Overview

The TensorRT-LLM ChatGLM implementation can be found in tensorrt_llm/models/chatglm/model.py. The TensorRT-LLM ChatGLM example code is located in examples/models/contrib/chatglm2-6b. There is one main file:

  • convert_checkpoint.py to convert a checkpoint from the HuggingFace (HF) Transformers format to the TensorRT-LLM format

In addition, there are two shared files in the parent folder examples for inference and evaluation:

  • run.py to run the inference on an input text
  • summarize.py to summarize the articles in the cnn_dailymail dataset
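A sketch of how these shared scripts are typically invoked once an engine has been built. The engine and tokenizer paths below are illustrative assumptions, and the relative path to the scripts depends on where this example sits in your checkout:

# Run inference on a single prompt (paths are hypothetical)
python3 ../run.py --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
                  --tokenizer_dir chatglm2_6b \
                  --input_text "What is TensorRT-LLM?" \
                  --max_output_len 64

# Evaluate by summarizing articles from the cnn_dailymail dataset
python3 ../summarize.py --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
                        --hf_model_dir chatglm2_6b \
                        --test_trt_llm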

Support Matrix

| Model Name      | FP16 | FMHA | WO | SQ | AWQ | FP8 | TP | PP | ST | C++ | benchmark | IFB |
| :-------------- | :--: | :--: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-------: | :-: |
| chatglm2_6b     | Y    | Y    | Y  | Y  | Y   | Y   | Y  |    | Y  | Y   | Y         | Y   |
| chatglm2_6b_32k | Y    | Y    | Y  | Y  | Y   |     | Y  |    | Y  | Y   | Y         | Y   |
  • Model Name: the name of the model, the same as the name on HuggingFace
  • FMHA: Fused Multi-Head Attention (see introduction below)
  • WO: Weight-Only Quantization (int8 / int4); see the build sketch after this list
  • SQ: Smooth Quantization (int8)
  • AWQ: Activation-aware Weight Quantization (int4)
  • FP8: FP8 Quantization
  • TP: Tensor Parallelism
  • PP: Pipeline Parallelism
  • ST: Strongly Typed
  • C++: C++ Runtime
  • benchmark: benchmarking via the Python / C++ runtimes
  • IFB: In-flight Batching (see introduction below)
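As an example of one matrix entry, a weight-only int8 conversion typically looks like the sketch below. The flag names follow the convention used across TensorRT-LLM convert_checkpoint.py scripts and may differ between releases:

# Weight-only (WO) int8 conversion sketch; output path is illustrative
python3 convert_checkpoint.py --model_dir chatglm2_6b \
                              --output_dir trt_ckpt/chatglm2_6b/int8_wo/1-gpu \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int8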

Model comparison

| Name            | nL | nAH | nKH | nHW | nH   | nF    | nMSL  | nV    | bP2D | bBQKV | bBDense | Comments |
| :-------------- | :-: | :-: | :-: | :-: | :--: | :---: | :---: | :---: | :--: | :---: | :-----: | :------- |
| chatglm2_6b     | 28 | 32  | 2   | 128 | 4096 | 13696 | 32768 | 65024 | N    | Y     | N       | Multi-query attention; RMSNorm rather than LayerNorm (vs. chatglm_6b) |
| chatglm2_6b_32k | 28 | 32  | 2   | 128 | 4096 | 13696 | 32768 | 65024 | N    | Y     | N       | RoPE base = 160000 rather than 10000 (vs. chatglm2_6b) |
  • nL: number of layers
  • nAH: number of attention heads
  • nKH: number of KV heads (fewer than nAH when multi-query attention is used)
  • nHW: head width
  • nH: hidden size
  • nF: FFN hidden size
  • nMSL: max sequence length (input + output)
  • nV: vocabulary size
  • bP2D: use position_encoding_2d (Y: Yes, N: No)
  • bBQKV: use bias for QKV multiplication in self-attention
  • bBDense: use bias for Dense multiplication in self-attention
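These values mirror fields in the HF checkpoint's config.json. A quick way to confirm them against your download; the field names below are those used by the ChatGLM2 checkpoints on HuggingFace (an assumption worth verifying):

# Print the relevant hyperparameter fields from the downloaded checkpoint;
# kv_channels corresponds to nHW and multi_query_group_num to nKH.
grep -E '"(num_layers|num_attention_heads|multi_query_group_num|kv_channels|hidden_size|ffn_hidden_size|seq_length|padded_vocab_size)"' chatglm2_6b/config.json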

Tokenizer and special tokens comparison

| Name            | Tokenizer        | bos | eos | pad | cls | startofpiece | endofpiece | mask | smask | gmask |
| :-------------- | :--------------- | :-: | :-: | :-: | :-: | :----------: | :--------: | :--: | :---: | :---: |
| chatglm2_6b     | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            |      |       |       |
| chatglm2_6b_32k | ChatGLMTokenizer | 1   | 2   | 0   |     |              |            |      |       |       |
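The ids above can be checked directly against the downloaded tokenizer. ChatGLMTokenizer is custom code, so trust_remote_code is required; the call may print None for tokens the tokenizer leaves unset (a quick sanity check, not part of the build flow):

# Inspect the special-token ids of the downloaded tokenizer
python3 -c "from transformers import AutoTokenizer; \
tok = AutoTokenizer.from_pretrained('chatglm2_6b', trust_remote_code=True); \
print('bos:', tok.bos_token_id, 'eos:', tok.eos_token_id, 'pad:', tok.pad_token_id)"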

Usage

The following sections describe how to build the engine and run the inference demo.

1. Download the model repositories and weights from HuggingFace Transformers

pip install -r requirements.txt

# git-lfs is required to fetch the large weight files
apt-get update
apt-get install -y git-lfs
git lfs install

# remove any previous downloads, then clone the model(s) we want to build
rm -rf chatglm*
git clone https://huggingface.co/THUDM/chatglm2-6b      chatglm2_6b
git clone https://huggingface.co/THUDM/chatglm2-6b-32k  chatglm2_6b_32k
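The remaining steps are not spelled out in this README. A typical continuation, following the pattern of other TensorRT-LLM examples (directory names and flags below are illustrative assumptions), converts the checkpoint and builds the engine:

# Convert the HF checkpoint to the TensorRT-LLM format
python3 convert_checkpoint.py --model_dir chatglm2_6b \
                              --output_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
                              --dtype float16

# Build the TensorRT engine from the converted checkpoint
trtllm-build --checkpoint_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
             --output_dir trt_engines/chatglm2_6b/fp16/1-gpu \
             --gemm_plugin float16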

For more usage examples, please refer to examples/glm-4-9b/README.md.