This MR is preliminary work for implementing the SWA reuse mechanism in the KV cache manager. Please be aware that **no functional change is intended** in this merge request. The purpose of the clean-up is to decouple and remove existing functions so that the upcoming SWA KV cache reuse change is more natural and easier to review.

Right now, (1) streamLLM and (2) beam search with SWA are broken. We do not want to complicate the code base by stacking more features on top of something that does not work. This MR prunes out the related logic and adds assertions, so we can come back later, re-support the broken features, and remove the assertions. Since streamLLM (sink attention) is currently broken, an assertion is added in the `KVCacheManager` constructor to guard the values of `mSinkBlockTokenLength` and `mSinkBubbleLength`, and the compute logic related to them is pruned. Beam search with SWA will still be broken when SWA KV cache reuse is introduced; we will revisit this in the future. On top of this, we should make an effort to update the [support matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md) of the KV cache manager after the SWA KV cache reuse support is merged.

The changes are as follows:

- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken` and `KVCacheManager::removeToken`. The functionality should be decoupled.
- Push the utilities `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from `KVCacheManager` down to `WindowBlockManager`. Functions exposed by `KVCacheManager` should be real utilities that users of the structure can leverage; implementation-detail calls should not exist at this level.
- Simplify the "is shared last context block" logic in `KVCacheManager::addSequence`.

Since no functional change is intended in this merge request, no test case is added. Several comments are added as reminders for future test coverage. For `LlmRequestTest.ParamTest`, `streaming=True` is commented out because sink attention is now guarded by an assertion. In `capacitySchedulerTest`, the `addToken` action on `crossKVCacheManager` is removed because in encoder-decoder models, generation tokens are added only to the decoder, not the encoder.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
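For illustration, below is a minimal sketch of the `addToken`/`removeToken` split described in the change list above. The types and signatures are simplified stand-ins invented for this sketch; they are not the actual TensorRT-LLM `KVCacheManager` interface.

```cpp
// Illustrative sketch only: simplified stand-ins for the real sequence and
// manager types, showing the intent of decoupling the former updateToken
// into two explicit operations.
#include <cstdint>
#include <vector>

struct SequenceSketch
{
    std::vector<std::int32_t> tokens; // tokens currently backed by KV cache blocks
};

class KVCacheManagerSketch
{
public:
    // Grow the sequence by one token; the real manager would allocate or reuse
    // a cache block here when the current block becomes full.
    void addToken(SequenceSketch& seq, std::int32_t tokenId)
    {
        seq.tokens.push_back(tokenId);
    }

    // Shrink the sequence by one token; the real manager would release the
    // trailing cache block here when it becomes empty.
    void removeToken(SequenceSketch& seq)
    {
        if (!seq.tokens.empty())
        {
            seq.tokens.pop_back();
        }
    }
};

int main()
{
    KVCacheManagerSketch manager;
    SequenceSketch sequence;
    manager.addToken(sequence, 42); // e.g. append a newly generated token
    manager.removeToken(sequence);  // e.g. rewind a rejected token
    return 0;
}
```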
# C++ Tests

This document explains how to build and run the C++ tests, and describes the included resources.

## Pytest Scripts
The unit tests can be launched via the Pytest script `test_unit_tests.py`. These tests do not require engines to be built; the Pytest script also builds TRT-LLM.

The Pytest scripts `test_e2e.py` and `test_multi_gpu.py` build TRT-LLM, build engines, generate expected outputs, and execute the end-to-end C++ tests all in one go. They contain single-device and multi-device tests, respectively.
To get an overview of the tests and their parameterization, call:

```bash
pytest tests/integration/defs/cpp/test_unit_tests.py --collect-only
pytest tests/integration/defs/cpp/test_e2e.py --collect-only
pytest tests/integration/defs/cpp/test_multi_gpu.py --collect-only
```
All tests take the CUDA architecture number of the GPU you wish to use as a parameter, e.g. 90 for Hopper.
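If you are unsure which architecture number your GPU corresponds to, one way to check is a small CUDA runtime program such as the hypothetical helper below (not part of the repository); the concatenation of `major` and `minor` from `cudaGetDeviceProperties` is the number the test parameters expect.

```cpp
// Hypothetical helper, not part of the repository: prints the SM architecture
// number of GPU 0 (e.g. "90" on Hopper), which is the value the tests take as a parameter.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop{};
    cudaError_t const err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%d%d\n", prop.major, prop.minor);
    return 0;
}
```

Compile it with `nvcc` (e.g. `nvcc check_arch.cu -o check_arch`); the file name is arbitrary.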
It is possible to choose unit tests or a single model for end-to-end tests. Example calls could look like this:

```bash
export LLM_MODELS_ROOT="/path/to/model_cache"
pytest tests/integration/defs/cpp/test_unit_tests.py::test_unit_tests[runtime-90]
pytest tests/integration/defs/cpp/test_e2e.py::test_model[llama-90]
pytest tests/integration/defs/cpp/test_e2e.py::test_benchmarks[gpt-90]
pytest tests/integration/defs/cpp/test_multi_gpu.py::TestDisagg::test_symmetric_executor[gpt-mpi_kvcache-90]
```
## Manual steps

### Compile
From the top-level directory call:

```bash
CPP_BUILD_DIR=cpp/build
python3 scripts/build_wheel.py -a "80-real;86-real" --build_dir ${CPP_BUILD_DIR}
pip install -r requirements-dev.txt
pip install build/tensorrt_llm*.whl
cd $CPP_BUILD_DIR && make -j$(nproc) google-tests
```
Single tests can be executed from `CPP_BUILD_DIR/tests`, e.g.

```bash
./$CPP_BUILD_DIR/tests/allocatorTest
```
### End-to-end tests

`trtGptModelRealDecoderTest` and `executorTest` require pre-built TensorRT engines, which are loaded in the tests. They also require data files, which are stored in `cpp/tests/resources/data`.
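As a purely hypothetical illustration of that dependency (this fixture does not exist in the repository), an end-to-end test could skip itself when the engines or data files have not been generated yet:

```cpp
// Hypothetical illustration, not an actual test in the repository: a GTest
// fixture that skips end-to-end cases when the pre-built engines or the
// expected-output data are missing.
#include <filesystem>
#include <gtest/gtest.h>

class EndToEndResourcesTest : public ::testing::Test
{
protected:
    void SetUp() override
    {
        namespace fs = std::filesystem;
        // Paths assume the binary is launched from the repository root.
        if (!fs::exists("cpp/tests/resources/models") || !fs::exists("cpp/tests/resources/data"))
        {
            GTEST_SKIP() << "Engines or expected outputs not found; run the build "
                            "and generation scripts described below first.";
        }
    }
};

TEST_F(EndToEndResourcesTest, ResourcesPresent)
{
    SUCCEED(); // reached only when both resource directories exist
}
```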
#### Build engines

Scripts are provided that download the GPT2 and GPT-J models from Huggingface and convert them to TensorRT engines. The weights and built engines are stored under `cpp/tests/resources/models`. To build the engines from the top-level directory:

```bash
PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py
```
It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs:

```bash
PYTHONPATH=examples/models/core/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
```
#### Generate expected output

End-to-end tests read inputs and expected outputs from Numpy files located at `cpp/tests/resources/data`. The expected outputs can be generated using scripts which employ the Python runtime to run the built engines:

```bash
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py
```
#### Generate data with tensor and pipeline parallelism

It is possible to generate tensor and pipeline parallelism data for LLaMA using 4 GPUs. To generate results from the top-level directory:

```bash
PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu
```
#### Run test

After building the engines and generating the expected output, execute the tests:

```bash
./$CPP_BUILD_DIR/tests/batch_manager/trtGptModelRealDecoderTest
```
### Run all tests with ctest

To run all tests and produce an XML report, call:

```bash
cd $CPP_BUILD_DIR && ctest --output-on-failure --output-junit "cpp-test-report.xml"
```