# Testing the TensorRT-LLM backend

Tests in this CI directory can be run manually to provide extensive testing.

## Run QA Tests

Run the tests inside the Triton container.

```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash -x ./test.sh
```

## Run the e2e/benchmark_core_model benchmarks

These two tests are run as part of the [L0_backend_trtllm](./L0_backend_trtllm/) test. Below are the instructions to run them manually.

### Generate the model repository

Follow the instructions in the [Create the model repository](../README.md#prepare-the-model-repository) section to prepare the model repository.

### Modify the model configuration

Follow the instructions in the [Modify the model configuration](../README.md#modify-the-model-configuration) section to modify the model configuration as needed.

### End to end test

The [end-to-end test script](../tools/inflight_batcher_llm/end_to_end_test.py) sends requests to the deployed `ensemble` model. The ensemble is composed of three models:

- "preprocessing": This model is used for tokenizing, i.e. converting prompts (strings) into input_ids (lists of ints).
- "tensorrt_llm": This model is a wrapper around your TensorRT-LLM model and is used for inference.
- "postprocessing": This model is used for de-tokenizing, i.e. converting output_ids (lists of ints) back into output text (strings).

The end-to-end latency includes the total latency of all three parts of the ensemble model.

```bash
cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>
```

Expected outputs

```
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
```

### benchmark_core_model

The [benchmark_core_model script](../tools/inflight_batcher_llm/benchmark_core_model.py) sends requests directly to the deployed `tensorrt_llm` model. The benchmark_core_model latency therefore reflects only the inference latency of TensorRT-LLM and does not include the pre/post-processing latency, which is usually handled by a third-party library such as HuggingFace.

```bash
cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>
```

Expected outputs

```
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms
```

*Please note that the expected outputs in this document are for reference only; specific performance numbers depend on the GPU you're using.*
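
Both manual benchmarks above assume a Triton server is already running and serving the prepared model repository. As a rough sketch (assuming the repository's `scripts/launch_triton_server.py` helper, a single-GPU engine, and a model repository located at `triton_model_repo`), the server could be started like this:

```bash
# Sketch only: the paths, world size, and model repository location below are
# assumptions; adjust them to match your engine build and repository layout.
cd /opt/tritonserver/tensorrtllm_backend
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /opt/tritonserver/tensorrtllm_backend/triton_model_repo
```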
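
Once the server is up, sending a single request to the `ensemble` model is a quick sanity check before running the full benchmarks. The sketch below assumes the default HTTP port (8000) and the standard `text_input`/`max_tokens`/`text_output` tensor names; adjust it if your model configuration differs.

```bash
# Sanity check: send one generate request to the deployed ensemble model.
# The port and tensor names are assumptions based on the default configuration.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 32, "bad_words": "", "stop_words": ""}'
```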