<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
# Testing TensorRT LLM backend
The tests in this CI directory can be run manually to provide more extensive coverage of the backend.
## Run QA Tests
Run the tests within the Triton container:
```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 bash
# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash -x ./test.sh
```
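If anything fails early, a quick sanity check (a minimal sketch, assuming the NVIDIA Container Toolkit is configured on the host) is to confirm GPU visibility and the mount point inside the container:
```bash
# Confirm all GPUs are visible inside the container.
nvidia-smi
# Confirm the backend repository was mounted where the tests expect it.
ls /opt/tritonserver/tensorrtllm_backend/ci
```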
## Run the e2e/benchmark_core_model tests to benchmark
These two tests are run as part of the [L0_backend_trtllm](./L0_backend_trtllm/)
test. Below are the instructions to run them manually.
### Generate the model repository
Follow the instructions in the
[Create the model repository](../README.md#prepare-the-model-repository)
section to prepare the model repository.
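As a minimal sketch of this step (assuming the container mount from the QA section above and a TensorRT LLM engine you have already built), this amounts to copying the example model repository that ships with the backend:
```bash
cd /opt/tritonserver/tensorrtllm_backend
# Use the example inflight-batching model repository as a starting point.
cp -R all_models/inflight_batcher_llm triton_model_repo
```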
### Modify the model configuration
Follow the instructions in the
[Modify the model configuration](../README.md#modify-the-model-configuration)
section to modify the model configuration based on your needs.
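As a rough sketch of this step (parameter names follow `tools/fill_template.py` usage in this repository and may differ across versions; the tokenizer and engine paths are placeholders for your own assets):
```bash
cd /opt/tritonserver/tensorrtllm_backend
# Point the pre/post-processing models at a tokenizer and the core model at the engines.
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:/path/to/tokenizer,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:/path/to/tokenizer,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,engine_dir:/path/to/engines,triton_max_batch_size:64,decoupled_mode:false,max_beam_width:1,batching_strategy:inflight_fused_batching
# Launch Triton with the prepared model repository.
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=triton_model_repo
```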
### End to end test
[End to end test script](../tools/inflight_batcher_llm/end_to_end_test.py) sends
requests to the deployed `ensemble` model.
The `ensemble` model is composed of three models: `preprocessing`, `tensorrt_llm`, and `postprocessing`:
- "preprocessing": This model tokenizes the input, converting prompts (string) to input_ids (list of ints).
- "tensorrt_llm": This model is a wrapper around your TensorRT LLM model and is used for inference.
- "postprocessing": This model de-tokenizes the output, converting output_ids (list of ints) back to outputs (string).
The end-to-end latency includes the total latency of all three parts of the ensemble.
```bash
cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>
```
Expected output:
```
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
```
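As a quick sanity check before benchmarking, a single request can also be sent to the ensemble through Triton's `generate` endpoint (a sketch assuming the server is reachable on the default HTTP port 8000):
```bash
# One-off request to the ensemble; field names match the ensemble's inputs.
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
    '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```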
### benchmark_core_model
[benchmark_core_model script](../tools/inflight_batcher_llm/benchmark_core_model.py)
sends requests directly to the deployed `tensorrt_llm` model. The benchmark_core_model
latency therefore reflects only the inference latency of TensorRT LLM, excluding the
pre/post-processing latency that is usually handled by a third-party library
such as HuggingFace.
```bash
cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>
```
Expected output:
```
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms
```
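Besides replaying a dataset, the script can also synthesize requests from a normal distribution of token lengths via its `token-norm-dist` mode; a sketch (flag names may vary by version):
```bash
cd tools/inflight_batcher_llm
# Benchmark with 100 synthetic requests whose input/output lengths are
# drawn from normal distributions.
python3 benchmark_core_model.py token-norm-dist --num-requests 100 \
    --input-mean 128 --input-stdev 10 --output-mean 20 --output-stdev 2
```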
*Please note that the expected outputs in this document are for reference only; actual performance numbers depend on the GPU you're using.*