# ChatGLM
This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) models using TensorRT LLM and run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.

- [ChatGLM](#chatglm)
  - [Overview](#overview)
  - [Support Matrix](#support-matrix)
  - [Model comparison](#model-comparison)
  - [Tokenizer and special tokens comparison](#tokenizer-and-special-tokens-comparison)
  - [Usage](#usage)
    - [1. Download repo and weights from HuggingFace Transformers](#1-download-repo-and-weights-from-huggingface-transformers)
    - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
    - [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
      - [Enable plugins](#enable-plugins)
      - [In-flight batching](#in-flight-batching)
    - [4. Run inference](#4-run-inference)
      - [Single node, single GPU](#single-node-single-gpu)
      - [Single node, multi GPU](#single-node-multi-gpu)
    - [5. Run summarization task](#5-run-summarization-task)
    - [Weight Only quantization](#weight-only-quantization)
    - [Smooth Quantization (SQ)](#smooth-quantization-sq)
    - [Activation-aware Weight Quantization (AWQ)](#activation-aware-weight-quantization-awq)
    - [FP8 Quantization](#fp8-quantization)
  - [Benchmark](#benchmark)

## Overview
The TensorRT LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).

The TensorRT LLM ChatGLM example code is located in [`examples/models/contrib/chatglm-6b`](./). There is one main file:

* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.

In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation; a typical invocation of both is sketched after this list:

* [`../../../run.py`](../../../run.py) to run inference on an input text;
* [`../../../summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.

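The commands below are only a rough sketch of how these two scripts are usually invoked once an engine has been built (see the Usage section). The engine directory is a hypothetical placeholder, and flag names can change between TensorRT-LLM versions, so check each script's `--help` before running.

```bash
# Rough sketch only: the engine path is a placeholder and flag names may differ by version.
# Run inference on a single prompt with a previously built engine.
python3 ../../../run.py --input_text "What is deep learning?" \
                        --max_output_len 64 \
                        --tokenizer_dir chatglm_6b \
                        --engine_dir trt_engines/chatglm_6b/fp16/1-gpu

# Evaluate the same engine on the cnn_dailymail summarization task.
python3 ../../../summarize.py --test_trt_llm \
                              --hf_model_dir chatglm_6b \
                              --engine_dir trt_engines/chatglm_6b/fp16/1-gpu
```
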
## Support Matrix
| Model Name | FP16 | FMHA |  WO   |  SQ   |  AWQ  |  FP8  |  TP   |  PP   |  ST   |  C++  | benchmark |  IFB  |
| :--------: | :--: | :--: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :---: |
| chatglm_6b |  Y   |      |   Y   |       |       |       |   Y   |       |   Y   |   Y   |     Y     |   Y   |
| glm_10b    |  Y   |  Y   |   Y   |       |   Y   |   Y   |   Y   |       |   Y   |   Y   |     Y     |   Y   |

* Model Name: the name of the model, the same as the name on HuggingFace
* FMHA: Fused MultiHead Attention (see introduction below)
* WO: Weight Only Quantization (int8 / int4)
* SQ: Smooth Quantization (int8)
* AWQ: Activation Aware Weight Quantization (int4)
* FP8: FP8 Quantization
* TP: Tensor Parallel
* PP: Pipeline Parallel
* ST: Strongly Typed
* C++: C++ Runtime
* benchmark: benchmark by python / C++ Runtime
* IFB: In-flight Batching (see introduction below)

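As a loose illustration of how some of these matrix entries are selected in practice, the sketch below enables weight-only (WO) quantization and 2-way tensor parallelism (TP) at checkpoint-conversion time. The flags mirror other TensorRT-LLM example converters and are assumptions here; consult the converter's `--help` and the quantization sections later in this document for the authoritative commands.

```bash
# Assumption-laden sketch: flag names follow other TensorRT-LLM converters and may differ here.
python3 ../../../glm-4-9b/convert_checkpoint.py \
    --model_dir chatglm_6b \
    --output_dir trt_ckpt/chatglm_6b/int8_wo/2-gpu \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2
```
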
## Model comparison
| Name       |  nL   |  nAH  |  nKH  |  nHW  |  nH   |  nF   | nMSL  |   nV   | bP2D  | bBQKV | bBDense | Comments |
| :--------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :----: | :---: | :---: | :-----: | :------- |
| chatglm_6b |  28   |  32   |  32   |  128  | 4096  | 16384 | 2048  | 130528 |   Y   |   Y   |    Y    |          |
| glm_10b    |  48   |  64   |  32   |  64   | 4096  | 16384 | 1024  | 50304  |   Y   |   Y   |    Y    |          |

* nL: number of layers
* nAH: number of attention heads
* nKH: number of kv heads (less than nAH if multi_query_attention is used)
* nHW: head width
* nH: hidden size
* nF: FFN hidden size
* nMSL: max sequence length (input + output)
* nV: vocabulary size
* bP2D: use position_encoding_2d (Y: Yes, N: No)
* bBQKV: use bias for QKV multiplication in self-attention
* bBDense: use bias for Dense multiplication in self-attention

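A quick way to cross-check the table above against an actual checkpoint is to read its `config.json`. The sketch below uses the key names found in the chatglm_6b configuration (other checkpoints use different keys) and assumes the repository has already been cloned as described in the Usage section.

```bash
# Print the hyperparameters behind the table above (key names assume chatglm_6b's config.json).
python3 - <<'EOF'
import json

with open("chatglm_6b/config.json") as f:
    cfg = json.load(f)

for key in ("num_layers", "num_attention_heads", "hidden_size",
            "inner_hidden_size", "max_sequence_length", "vocab_size",
            "position_encoding_2d"):
    print(f"{key}: {cfg.get(key)}")
EOF
```
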
## Tokenizer and special tokens comparison
| Name       |    Tokenizer     |  bos   |  eos   |  pad  |  cls  | startofpiece | endofpiece |  mask  | smask | gmask  |
| :--------: | :--------------: | :----: | :----: | :---: | :---: | :----------: | :--------: | :----: | :---: | :----: |
| chatglm_6b | ChatGLMTokenizer | 130004 | 130005 |   3   |       |    130004    |   130005   | 130000 |       | 130001 |
| glm_10b    | GLMGPT2Tokenizer | 50257  | 50256  | 50256 | 50259 |    50257     |   50258    | 50260  | 50264 | 50263  |

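The token ids in the table can be reproduced by loading each tokenizer and printing its special tokens, as in the minimal sketch below. It pulls the tokenizer from the HuggingFace Hub; `trust_remote_code=True` is needed because both models ship custom tokenizer code, and on newer `transformers` versions the ChatGLM-6B load may require the tokenization-file replacement described in the Usage section.

```bash
# Minimal sketch: print the special token ids of ChatGLM-6B and compare with the table above.
python3 - <<'EOF'
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
print("bos:", tok.bos_token_id)
print("eos:", tok.eos_token_id)
print("pad:", tok.pad_token_id)
EOF
```
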
## Usage
The following sections describe how to build the engine(s) and run the inference demo.

### 1. Download repo and weights from HuggingFace Transformers
```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# Clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm-6b chatglm_6b
git clone https://huggingface.co/THUDM/glm-10b    glm_10b

# Replace the tokenization file when using transformers-4.36.1 with the ChatGLM-6B model
# (this may become unnecessary in the future)
cp chatglm_6b/tokenization_chatglm.py chatglm_6b/tokenization_chatglm.py-backup
cp tokenization_chatglm.py chatglm_6b
```
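One extra check that is not part of the original steps but can save time: verify that git-lfs actually fetched the multi-gigabyte weight files instead of leaving small pointer files behind. A possible sketch:

```bash
# Optional sanity check (not in the original steps): verify the weights were really fetched.
du -sh chatglm_6b glm_10b            # each should be several GB, not a few hundred KB
# If only small LFS pointer files are present, fetch the real files explicitly:
git -C chatglm_6b lfs pull
git -C glm_10b lfs pull
```
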
For more example code, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).