mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

C++ Tests

This document explains how to build and run the C++ tests and describes the included resources.

Pytest Scripts

The unit tests can be launched via the Pytest script in test_unit_tests.py. These do not require engines to be built. The Pytest script will also build TRT-LLM.

The Pytest scripts in test_e2e.py and test_multi_gpu.py build TRT-LLM, build the engines, generate the expected outputs, and execute the end-to-end C++ tests, all in one go. test_e2e.py contains the single-device tests and test_multi_gpu.py the multi-device tests.

To get an overview of the tests and their parameterization, call:

pytest tests/integration/defs/cpp/test_unit_tests.py --collect-only
pytest tests/integration/defs/cpp/test_e2e.py --collect-only
pytest tests/integration/defs/cpp/test_multi_gpu.py --collect-only

All tests take the CUDA architecture number of the GPU you wish to use as a parameter, e.g. 90 for Hopper.
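
If the architecture number is not known offhand, it can be derived from the GPU's compute capability by dropping the dot (9.0 becomes 90). The following is a minimal sketch, not part of the test scripts: the helper names cap_to_arch and detect_arch are hypothetical, and detect_arch assumes nvidia-smi is on PATH and that the installed driver supports the compute_cap query field.

```shell
# Strip the dot from a compute capability string, e.g. "9.0" -> "90".
# Pure string manipulation; no GPU is required for this step.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Print the architecture number of the first visible GPU.
# Assumption: nvidia-smi is on PATH and supports the compute_cap field.
detect_arch() {
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    [ -n "$cap" ] || return 1
    cap_to_arch "$cap"
}
```

With these helpers defined, a test ID could be assembled as "tests/integration/defs/cpp/test_unit_tests.py::test_unit_tests[runtime-$(detect_arch)]".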

It is possible to choose unit tests or a single model for end-to-end tests. Example calls could look like this:

export LLM_MODELS_ROOT="/path/to/model_cache"

pytest "tests/integration/defs/cpp/test_unit_tests.py::test_unit_tests[runtime-90]"

pytest "tests/integration/defs/cpp/test_e2e.py::test_model[llama-90]"

pytest "tests/integration/defs/cpp/test_e2e.py::test_benchmarks[gpt-90]"

pytest "tests/integration/defs/cpp/test_multi_gpu.py::TestDisagg::test_symmetric_executor[gpt-mpi_kvcache-90]"

Manual steps

Compile

From the top-level directory call:

CPP_BUILD_DIR=cpp/build
python3 scripts/build_wheel.py -a "80-real;86-real" --build_dir ${CPP_BUILD_DIR}
pip install -r requirements-dev.txt
pip install build/tensorrt_llm*.whl
cd $CPP_BUILD_DIR && make -j$(nproc) google-tests
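
The -a argument above takes a semicolon-separated list of architectures with a -real suffix. As a small convenience sketch (the function name real_arch_list is hypothetical and not part of the build scripts), the list can be assembled from plain architecture numbers:

```shell
# Join architecture numbers into the "-real" list format used above,
# e.g. `real_arch_list 80 86` prints "80-real;86-real".
real_arch_list() {
    out=""
    for arch in "$@"; do
        # Prepend the accumulated list and a ';' only when it is non-empty.
        out="${out:+$out;}${arch}-real"
    done
    printf '%s\n' "$out"
}
```

It could then be passed along as -a "$(real_arch_list 80 86)". Whether a given architecture number is accepted is decided by build_wheel.py, not by this helper.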

Single tests can be executed from CPP_BUILD_DIR/tests. Because CPP_BUILD_DIR is a relative path, the following call works from the top-level directory, e.g.

./$CPP_BUILD_DIR/tests/allocatorTest
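
To run only some of the cases inside one test binary, a filter can be passed on the command line. The sketch below assumes the binaries are GoogleTest executables (suggested by the google-tests target above, but not stated in this document), in which case the standard --gtest_filter flag applies. The wrapper name run_cpp_test is hypothetical.

```shell
# Run one test binary, optionally restricted to a GoogleTest filter pattern.
# Usage: run_cpp_test <binary> [filter]
# Assumption: the binary is a GoogleTest executable.
run_cpp_test() {
    bin=$1
    filter=${2:-*}   # default pattern "*" selects every test case
    if [ ! -x "$bin" ]; then
        echo "test binary not found or not executable: $bin" >&2
        return 1
    fi
    "$bin" --gtest_filter="$filter"
}
```

For example, run_cpp_test ./$CPP_BUILD_DIR/tests/allocatorTest 'SomeSuite.*' would run only the cases of a suite named SomeSuite (a placeholder name). Running the binary with --gtest_list_tests prints the available names.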

End-to-end tests

trtGptModelRealDecoderTest and executorTest require pre-built TensorRT engines, which are loaded in the tests. They also require data files which are stored in cpp/tests/resources/data.

Build engines

Scripts are provided that download the GPT2 and GPT-J models from Hugging Face and convert them to TensorRT engines. The weights and built engines are stored under cpp/tests/resources/models. To build the engines, run the following from the top-level directory:

PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py

It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs:

PYTHONPATH=examples/models/core/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu

Generate expected output

End-to-end tests read inputs and expected outputs from NumPy files located at cpp/tests/resources/data. The expected outputs can be generated using scripts which employ the Python runtime to run the built engines:

PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py
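
The expected outputs depend on the built engines, so for any one model the build script has to run before the generate script. The following sketch prints the two commands for a model in that order, without executing them. The function name prepare_cmds is hypothetical, and the PYTHONPATH entries simply mirror the commands listed in this document.

```shell
# Print, in order, the two commands that prepare test data for one model:
# first build the engines, then generate the expected outputs.
# This is a dry run: nothing is executed.
prepare_cmds() {
    model=$1
    case "$model" in
        gpt|llama) example_dir="examples/models/core/$model" ;;
        medusa|eagle|redrafter) example_dir="examples/$model" ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
    scripts=cpp/tests/resources/scripts
    # Single-quoted format strings keep $PYTHONPATH literal in the output.
    printf 'PYTHONPATH=%s:$PYTHONPATH python3 %s/build_%s_engines.py\n' \
        "$example_dir" "$scripts" "$model"
    printf 'PYTHONPATH=examples:$PYTHONPATH python3 %s/generate_expected_%s_output.py\n' \
        "$scripts" "$model"
}
```

Calling prepare_cmds llama prints the build and generate commands for LLaMA. Once the output looks right, the lines can be run by hand from the top-level directory.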

Generate data with tensor and pipeline parallelism

It is possible to generate tensor and pipeline parallelism data for LLaMA using 4 GPUs. To generate the results, run the following from the top-level directory:

PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu

Run test

After building the engines and generating the expected output, execute the tests:

./$CPP_BUILD_DIR/tests/batch_manager/trtGptModelRealDecoderTest

Run all tests with ctest

To run all tests and produce an XML report, call from the top-level directory:

ctest --test-dir $CPP_BUILD_DIR --output-on-failure --output-junit "cpp-test-report.xml"
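
For a quick summary in a CI log, the failure count can be read back from the XML report. This is a rough sketch: the function name junit_failures is hypothetical, and it assumes the report's testsuite element carries a failures="N" attribute on the same line as the opening tag. A proper XML parser would be more robust.

```shell
# Print the value of the failures="N" attribute of the first <testsuite>
# element found in the given JUnit XML file.
# Assumption: the attribute sits on the same line as the opening tag.
junit_failures() {
    sed -n 's/.*<testsuite[^>]*[[:space:]]failures="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}
```

For example, junit_failures cpp-test-report.xml prints the number of failed tests. Adjust the path to wherever the report was written.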