# ChatGLM-6B
This document explains how to build the ChatGLM-6B model using TensorRT-LLM and run it on a single GPU.
## Overview
The TensorRT-LLM ChatGLM-6B implementation can be found in `tensorrt_llm/models/chatglm6b/model.py`.

The TensorRT-LLM ChatGLM-6B example code is located in `examples/chatglm6b`. There are three main files in that folder:

- `build.py` to build the TensorRT engine(s) needed to run the ChatGLM-6B model,
- `run.py` to run the inference on an input text,
- `summarize.py` to summarize the articles in the cnn_dailymail dataset using the model.
## Usage
### 1. Prepare environment and download weights from HuggingFace Transformers
```bash
pip install -r requirements.txt
apt-get update
apt-get install git-lfs
git clone https://huggingface.co/THUDM/chatglm-6b pyTorchModel
```
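Before building, you can optionally sanity-check that the clone above completed and that the checkpoint is readable. A minimal check, assuming the checkpoint was cloned into `pyTorchModel` as in the command above, is to load its config and tokenizer with `transformers`:

```bash
# Optional sanity check: make sure the config and tokenizer of the downloaded
# checkpoint load correctly. trust_remote_code=True is required because
# ChatGLM-6B ships custom modeling/tokenizer code.
python3 -c "from transformers import AutoConfig, AutoTokenizer; \
AutoConfig.from_pretrained('pyTorchModel', trust_remote_code=True); \
AutoTokenizer.from_pretrained('pyTorchModel', trust_remote_code=True); \
print('checkpoint looks OK')"
```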
### 2. Build TensorRT engine(s)
- This ChatGLM-6B example builds the TensorRT engine(s) directly from the HF checkpoint (rather than from FT checkpoints, as the GPT example does).
- If no checkpoint directory is specified, TensorRT-LLM will build the engine(s) with dummy weights.
- The `build.py` script requires a single GPU to build the TensorRT engine(s).
- You can enable parallel builds to accelerate engine building if you have more than one GPU of the same model in your system.
- For parallel building, add the `--parallel_build` argument to the build command (this feature cannot take advantage of more than a single node).
- The number of TensorRT engines depends on the number of GPUs that will be used to run inference.
Examples of build invocations:
```bash
# Build a single-GPU float16 engine from the HF checkpoint.
# --use_gpt_attention_plugin is required to handle inputs of different lengths within a batch.
# --use_gemm_plugin, --use_layernorm_plugin, --enable_context_fmha and
# --enable_context_fmha_fp32_acc can be used to improve accuracy or performance.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16
```
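If you want the engine to be built from the checkpoint downloaded in step 1 and written to a dedicated directory, a variant along the following lines should work. Note that the flag names `--model_dir` and `--output_dir` are assumptions borrowed from other TensorRT-LLM examples; confirm the exact spelling with `python3 build.py --help`.

```bash
# Sketch: build from the HF checkpoint cloned in step 1 and write the engine to
# a dedicated directory. --model_dir / --output_dir are assumed flag names;
# confirm them with `python3 build.py --help`.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --model_dir pyTorchModel \
                 --output_dir trt_engines/chatglm6b/fp16/1-gpu
```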
#### Fused MultiHead Attention (FMHA)
- Use `--enable_context_fmha` or `--enable_context_fmha_fp32_acc` to enable FMHA kernels, which can provide better performance and lower GPU memory occupation (a build sketch follows this list).
- The switch `--use_gpt_attention_plugin float16` must be used when FMHA is enabled.
- `--enable_context_fmha` uses an FP16 accumulator, which might reduce accuracy. In that case, `--enable_context_fmha_fp32_acc` should be used to protect accuracy at the cost of a small performance drop.
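For example, a build that turns on the FP32-accumulator FMHA path only needs the flags listed above added to the earlier build command:

```bash
# Enable FMHA with the FP32 accumulator.
# --use_gpt_attention_plugin float16 is required when FMHA is enabled.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --enable_context_fmha_fp32_acc
```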
#### In-flight batching and paged KV cache
- The engine must be built accordingly if in-flight batching in the C++ runtime will be used.
- Use `--use_inflight_batching` to enable in-flight batching (see the build sketch after this list).
- The switches `--use_gpt_attention_plugin=float16`, `--paged_kv_cache` and `--remove_input_padding` will be set when in-flight batching is enabled.
- It is possible to use `--use_gpt_attention_plugin float32` with in-flight batching.
- The size of the blocks in the paged KV cache can additionally be controlled with `--tokens_per_block=N`.
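Putting those switches together, a build intended for the C++ in-flight batching runtime could look like the sketch below; all flags come from the list above, and the block size of 64 tokens is only an illustrative choice.

```bash
# Sketch: build an engine for in-flight batching in the C++ runtime.
# --use_inflight_batching implies --paged_kv_cache and --remove_input_padding;
# 64 tokens per block is an illustrative value.
python3 build.py --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_inflight_batching \
                 --tokens_per_block=64
```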
### 3. Run
#### Single node, single GPU
Run the TensorRT-LLM ChatGLM-6B model on a single GPU:
```bash
# Run the ChatGLM-6B model on a single GPU.
python3 run.py
```
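`run.py` also accepts options controlling the prompt, the generation length and where the engine is loaded from. The flag names below (`--input_text`, `--max_output_len`, `--engine_dir`) are assumptions based on other TensorRT-LLM examples; check `python3 run.py --help` for the exact options of this example.

```bash
# Sketch: run with an explicit prompt, output length and engine directory.
# Flag names are assumed; confirm them with `python3 run.py --help`.
python3 run.py --input_text "What is the capital of France?" \
               --max_output_len 128 \
               --engine_dir trt_engines/chatglm6b/fp16/1-gpu
```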
Run a comparison of performance and accuracy:
```bash
# Run the summarization task.
python3 summarize.py
```
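`summarize.py` compares the TensorRT-LLM engine against the original HF model on the cnn_dailymail summarization task. The switches below (`--test_trt_llm`, `--test_hf`, `--engine_dir`, `--hf_model_location`) are assumptions borrowed from other TensorRT-LLM examples; verify them with `python3 summarize.py --help`.

```bash
# Sketch: compare the TensorRT-LLM engine with the HF model on cnn_dailymail.
# Flag names are assumed; confirm them with `python3 summarize.py --help`.
python3 summarize.py --test_trt_llm \
                     --test_hf \
                     --engine_dir trt_engines/chatglm6b/fp16/1-gpu \
                     --hf_model_location pyTorchModel
```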
## Benchmark
- [TODO] The TensorRT-LLM ChatGLM-6B benchmark is located in `benchmarks/`