
ChatGLM-6B

This document explains how to build the ChatGLM-6B model using TensorRT-LLM and how to run it on a single GPU.

Overview

The TensorRT-LLM ChatGLM-6B implementation can be found in tensorrt_llm/models/chatglm6b/model.py. The TensorRT-LLM ChatGLM-6B example code is located in examples/chatglm6b. There are three main files in that folder:

  • build.py to build the TensorRT engine(s) needed to run the model,
  • run.py to run inference with a built engine, and
  • summarize.py to compare the performance and accuracy of the TensorRT-LLM model and the HF model on a summarization task.

Usage

1. Prepare environment and download weights from HuggingFace Transformers

# Install Python dependencies and Git LFS, then download the HF checkpoint.
pip install -r requirements.txt
apt-get update
apt-get install -y git-lfs
git lfs install
git clone https://huggingface.co/THUDM/chatglm-6b pyTorchModel
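
As a quick sanity check that the clone pulled the real weights rather than LFS pointer files, the one-liner below is a minimal sketch: it assumes the transformers package installed via requirements.txt, and it passes trust_remote_code=True because ChatGLM-6B ships custom modeling code.

# Optional: load the tokenizer from the local checkpoint to verify the download.
python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('pyTorchModel', trust_remote_code=True); print('checkpoint OK')"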

2. Build TensorRT engine(s)

  • This ChatGLM-6B example builds the TensorRT engine(s) directly from the HF checkpoint (rather than from FT checkpoints, as the GPT example does).
  • If no checkpoint directory is specified, TensorRT-LLM will build engine(s) using dummy weights.
  • The build.py script requires a single GPU to build the TensorRT engine(s).
  • If your system has more than one GPU of the same model, you can enable parallel builds to accelerate the engine building process.
  • For parallel building, add the --parallel_build argument to the build command (this feature cannot take advantage of more than a single node); see the sketch after the example below.
  • The number of TensorRT engines depends on the number of GPUs that will be used to run inference.

Examples of build invocations:

# Build a single-GPU float16 engine from the HF weights.
# --use_gpt_attention_plugin must be used to handle inputs of different lengths within one batch.
# --use_gemm_plugin, --use_layernorm_plugin, --enable_context_fmha and --enable_context_fmha_fp32_acc can be used to improve accuracy or performance.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16
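
If you build engines for multi-GPU inference, a parallel build might look like the following sketch. --parallel_build comes from the bullets above, while --world_size is an assumed option name borrowed from other TensorRT-LLM examples; verify with python3 build.py --help.

# Hypothetical parallel build of the engines for 2-GPU inference.
# --world_size is an assumed flag name; verify with `python3 build.py --help`.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --world_size 2 \
                 --parallel_build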

Fused MultiHead Attention (FMHA)

  • Use --enable_context_fmha or --enable_context_fmha_fp32_acc to enable FMHA kernels, which can provide better performance and lower GPU memory usage.

  • The switch --use_gpt_attention_plugin float16 must be used when FMHA is enabled.

  • --enable_context_fmha uses an FP16 accumulator, which might reduce accuracy. In that case, --enable_context_fmha_fp32_acc should be used instead to protect accuracy at the cost of a small performance drop; see the example below.
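
For instance, a build that enables FMHA with the FP32 accumulator, combining only flags already shown in this README:

# Build a float16 engine with FMHA enabled, using the FP32 accumulator
# to protect accuracy at the cost of a small performance drop.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --enable_context_fmha_fp32_acc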

In-flight batching and paged KV cache

  • The engine must be built accordingly if in-flight batching in the C++ runtime is to be used.

  • Use --use_inflight_batching to enable In-flight Batching.

  • The switches --use_gpt_attention_plugin=float16, --paged_kv_cache and --remove_input_padding are set as well when in-flight batching is enabled.

  • It is also possible to use --use_gpt_attention_plugin float32 with in-flight batching.

  • The block size of the paged KV cache can additionally be controlled with --tokens_per_block=N; see the sketch below.
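
Putting these switches together, a sketch of a build for in-flight batching; the tokens_per_block value of 64 is only an illustration of the syntax, not a tuned recommendation.

# Build an engine usable with in-flight batching in the C++ runtime.
# --use_inflight_batching also sets --use_gpt_attention_plugin=float16,
# --paged_kv_cache and --remove_input_padding, per the bullets above.
python3 build.py --dtype float16 \
                 --use_gemm_plugin float16 \
                 --use_inflight_batching \
                 --tokens_per_block=64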

3. Run

Single node, single GPU

Run the TensorRT-LLM ChatGLM-6B model on a single GPU:

# Run the ChatGLM-6B model on a single GPU.
python3 run.py
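
Without arguments, run.py uses the defaults hard-coded in the script. A customized invocation could look like the sketch below; --max_output_len and --input_text are assumed option names modeled on other TensorRT-LLM examples, so check python3 run.py --help for the options this script actually exposes.

# Hypothetical customized run; the flag names are assumptions, verify with --help.
python3 run.py --max_output_len 128 \
               --input_text "What is deep learning?"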

Run a comparison of performance and accuracy

# Run the summarization task.
python3 summarize.py
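
summarize.py measures both the TensorRT-LLM engine and the HF model on a summarization task. Below is a sketch that selects what to test; --test_trt_llm and --test_hf are assumed flag names modeled on the summarize.py scripts of other TensorRT-LLM examples, so check python3 summarize.py --help.

# Hypothetical invocation; the flag names are assumptions, verify with --help.
python3 summarize.py --test_trt_llm --test_hf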

Benchmark

  • [TODO] The TensorRT-LLM ChatGLM-6B benchmark is located in benchmarks/