# ChatGLM
This document explains how to build the ChatGLM2-6B and ChatGLM2-6B-32k models using TensorRT-LLM and how to run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.
- ChatGLM
  - Overview
  - Support Matrix
  - Model comparison
  - Tokenizer and special tokens comparison
  - Usage
    - 1. Download repo and weights from HuggingFace Transformers
    - 2. Convert weights from HF Transformers to TensorRT-LLM format
    - 3. Build TensorRT engine(s)
    - 4. Run inference
    - 5. Run summarization task
    - Weight Only quantization
    - Smooth Quantization (SQ)
    - Activation-aware Weight Quantization (AWQ)
    - FP8 Quantization
  - Benchmark
## Overview
The TensorRT-LLM ChatGLM implementation can be found in `tensorrt_llm/models/chatglm/model.py`.
The TensorRT-LLM ChatGLM example code is located in `examples/models/contrib/chatglm2-6b`. There is one main file:

- `examples/models/core/glm-4-9b/convert_checkpoint.py` to convert a checkpoint from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder `examples` for inference and evaluation:

- `../../../run.py` to run the inference on an input text;
- `../../../summarize.py` to summarize the articles in the cnn_dailymail dataset.
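The commands below are a condensed, illustrative sketch of how these files fit together for chatglm2_6b, assuming they are run from the example folder with the `chatglm2_6b` checkout from the Usage section below; `trt_ckpt` and `trt_engines` are placeholder output directories and flag names may vary between TensorRT-LLM releases. The Usage steps below cover each command in detail.

```bash
# Illustrative end-to-end flow: convert the HF checkpoint, build an engine,
# then run inference and summarization with the shared scripts.
python3 convert_checkpoint.py --model_dir chatglm2_6b \
        --output_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
        --dtype float16
trtllm-build --checkpoint_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
        --output_dir trt_engines/chatglm2_6b/fp16/1-gpu \
        --gemm_plugin float16
python3 ../../../run.py --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
        --tokenizer_dir chatglm2_6b \
        --max_output_len 50
python3 ../../../summarize.py --engine_dir trt_engines/chatglm2_6b/fp16/1-gpu \
        --hf_model_dir chatglm2_6b \
        --test_trt_llm
```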
## Support Matrix
| Model Name | FP16 | FMHA | WO | SQ | AWQ | FP8 | TP | PP | ST | C++ | benchmark | IFB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chatglm2_6b | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | |
| chatglm2_6b_32k | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | | |
- Model Name: the name of the model, the same as the name on HuggingFace
- FMHA: Fused MultiHead Attention (see introduction below)
- WO: Weight Only Quantization (int8 / int4)
- SQ: Smooth Quantization (int8)
- AWQ: Activation Aware Weight Quantization (int4)
- FP8: FP8 Quantization
- TP: Tensor Parallel
- PP: Pipeline Parallel
- ST: Strongly Typed
- C++: C++ Runtime
- benchmark: benchmark with the Python / C++ runtime
- IFB: In-flight Batching (see introduction below)
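As a rough illustration of how several of these columns map onto command-line options (flag names follow the current example scripts and may change between releases; output paths are placeholders chosen for this sketch):

```bash
# WO (int8 weight-only quantization) combined with TP=2 at conversion time
python3 convert_checkpoint.py --model_dir chatglm2_6b \
        --output_dir trt_ckpt/chatglm2_6b/int8_wo/2-gpu \
        --dtype float16 \
        --use_weight_only --weight_only_precision int8 \
        --tp_size 2

# FMHA and the paged KV cache required for in-flight batching (IFB)
# are build-time switches
trtllm-build --checkpoint_dir trt_ckpt/chatglm2_6b/int8_wo/2-gpu \
        --output_dir trt_engines/chatglm2_6b/int8_wo/2-gpu \
        --gemm_plugin float16 \
        --context_fmha enable \
        --paged_kv_cache enable
```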
## Model comparison
| Name | nL | nAH | nKH | nHW | nH | nF | nMSL | nV | bP2D | bBQKV | bBDense | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chatglm2_6b | 28 | 32 | 2 | 128 | 4096 | 13696 | 32768 | 65024 | N | Y | N | Multi_query_attention, RMSNorm rather than LayerNorm in chatglm_6b |
| chatglm2_6b_32k | 28 | 32 | 2 | 128 | 4096 | 13696 | 32768 | 65024 | N | Y | N | RoPE base=160000 rather than 10000 in chatglm2_6b |
- nL: number of layers
- nAH: number of attention heads
- nKH: number of kv heads (less than nAH if multi_query_attention is used)
- nHW: head width
- nH: hidden size
- nF: FFN hidden size
- nMSL: max sequence length (input + output)
- nV: vocabulary size
- bP2D: use position_encoding_2d (Y: Yes, N: No)
- bBQKV: use bias for QKV multiplication in self-attention
- bBDense: use bias for Dense multiplication in self-attention
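The values in the table can be cross-checked against the downloaded checkpoint. The snippet below is a quick sketch using the field names that appear in THUDM/chatglm2-6b's `config.json` (other checkpoints may name them differently), and assumes the `chatglm2_6b` checkout created in the Usage section below:

```bash
# Print the hyperparameters listed above straight from the HF config
grep -E '"(num_layers|num_attention_heads|multi_query_group_num|kv_channels|hidden_size|ffn_hidden_size|seq_length|padded_vocab_size)"' \
    chatglm2_6b/config.json
```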
## Tokenizer and special tokens comparison
| Name | Tokenizer | bos | eos | pad | cls | startofpiece | endofpiece | mask | smask | gmask |
|---|---|---|---|---|---|---|---|---|---|---|
| chatglm2_6b | ChatGLMTokenizer | 1 | 2 | 0 | | | | | | |
| chatglm2_6b_32k | ChatGLMTokenizer | 1 | 2 | 0 | | | | | | |
## Usage
The following sections describe how to build the engine and run the inference demo.
### 1. Download repo and weights from HuggingFace Transformers
```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm2-6b     chatglm2_6b
git clone https://huggingface.co/THUDM/chatglm2-6b-32k chatglm2_6b_32k
```
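An optional sanity check after cloning: if git-lfs was not active during the clone, the weight files show up as tiny pointer files instead of multi-gigabyte shards. The shard file names below are whatever the Hugging Face repos currently ship and may differ.

```bash
# Each checkout should be several GB; very small sizes indicate LFS pointer files only.
du -sh chatglm2_6b chatglm2_6b_32k
ls -lh chatglm2_6b/pytorch_model-*.bin
```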
For more examples, please refer to `examples/models/core/glm-4-9b/README.md`.