This MR is preliminary work for implementing the SWA reuse mechanism in the KV cache manager. Please be aware that **no functional change is intended** in this merge request. The purpose of the clean-up is to decouple and remove existing functions so that the upcoming SWA KV cache reuse change is more natural and easier to review.

Right now, (1) streamLLM and (2) beam search with SWA are broken. We do not want to complicate the code base by stacking more features on top of something that does not work. This MR prunes out the related logic and adds assertions, so we can come back later, re-support the broken features, and remove the assertions. Since streamLLM (sink attention) is currently broken, an assertion is added in the `KVCacheManager` constructor to guard the values of `mSinkBlockTokenLength` and `mSinkBubbleLength`, and the compute logic related to them is pruned. Beam search with SWA will still be broken when SWA KV cache reuse is introduced; we will revisit this in the future. On top of this, we should make an effort to update the [support matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md) of the KV cache manager after the SWA KV cache reuse support is merged.

The changes are as follows:

- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken` and `KVCacheManager::removeToken`. The functionality should be decoupled.
- Push the utilities `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from `KVCacheManager` down to `WindowBlockManager`. Functions exposed by `KVCacheManager` should be real utilities that users of the structure can leverage; implementation-detail calls should not exist at this level.
- Simplify the "is shared last context block" logic in `KVCacheManager::addSequence`.

Since no functional change is intended in this merge request, no test case is added. Several comments are added as reminders for future test coverage. For `LlmRequestTest.ParamTest`, `streaming=True` is commented out because sink attention is now guarded by an assertion. In `capacitySchedulerTest`, the `addToken` action on `crossKVCacheManager` is removed because in encoder-decoder models, generation tokens are added only to the decoder, not the encoder.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
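For illustration, below is a minimal sketch of the `addToken`/`removeToken` split described in the change list above. The types and signatures are simplified stand-ins invented for this sketch; they are not the actual TensorRT-LLM `KVCacheManager` interface.

```cpp
// Illustrative sketch only: simplified stand-ins for the real sequence and
// manager types, showing the intent of decoupling the former updateToken
// into two explicit operations.
#include <cstdint>
#include <vector>

struct SequenceSketch
{
    std::vector<std::int32_t> tokens; // tokens currently backed by KV cache blocks
};

class KVCacheManagerSketch
{
public:
    // Grow the sequence by one token; the real manager would allocate or reuse
    // a cache block here when the current block becomes full.
    void addToken(SequenceSketch& seq, std::int32_t tokenId)
    {
        seq.tokens.push_back(tokenId);
    }

    // Shrink the sequence by one token; the real manager would release the
    // trailing cache block here when it becomes empty.
    void removeToken(SequenceSketch& seq)
    {
        if (!seq.tokens.empty())
        {
            seq.tokens.pop_back();
        }
    }
};

int main()
{
    KVCacheManagerSketch manager;
    SequenceSketch sequence;
    manager.addToken(sequence, 42); // e.g. append a newly generated token
    manager.removeToken(sequence);  // e.g. rewind a rejected token
    return 0;
}
```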
# C++ Tests

This document explains how to build and run the C++ tests, and describes the included resources.

## Pytest Scripts
The unit tests can be launched via the Pytest script `test_unit_tests.py`. These tests do not require engines to be built; the Pytest script also builds TRT-LLM.

The Pytest scripts `test_e2e.py` and `test_multi_gpu.py` build TRT-LLM, build engines, generate expected outputs, and execute the end-to-end C++ tests all in one go. They contain single-device and multi-device tests, respectively.
To get an overview of the tests and their parameterization, call:

```bash
pytest tests/integration/defs/cpp/test_unit_tests.py --collect-only
pytest tests/integration/defs/cpp/test_e2e.py --collect-only
pytest tests/integration/defs/cpp/test_multi_gpu.py --collect-only
```
All tests take the CUDA architecture number of the GPU you wish to use as a parameter, e.g. 90 for Hopper.
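If you are unsure which architecture number your GPU corresponds to, one way to check is a small CUDA runtime program such as the hypothetical helper below (not part of the repository); the concatenation of `major` and `minor` from `cudaGetDeviceProperties` is the number the test parameters expect.

```cpp
// Hypothetical helper, not part of the repository: prints the SM architecture
// number of GPU 0 (e.g. "90" on Hopper), which is the value the tests take as a parameter.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop{};
    cudaError_t const err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%d%d\n", prop.major, prop.minor);
    return 0;
}
```

Compile it with `nvcc` (e.g. `nvcc check_arch.cu -o check_arch`); the file name is arbitrary.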
It is possible to choose unit tests or a single model for end-to-end tests. Example calls could look like this:

```bash
export LLM_MODELS_ROOT="/path/to/model_cache"
pytest tests/integration/defs/cpp/test_unit_tests.py::test_unit_tests[runtime-90]
pytest tests/integration/defs/cpp/test_e2e.py::test_model[llama-90]
pytest tests/integration/defs/cpp/test_e2e.py::test_benchmarks[gpt-90]
pytest tests/integration/defs/cpp/test_multi_gpu.py::TestDisagg::test_symmetric_executor[gpt-mpi_kvcache-90]
```
## Manual steps

### Compile
From the top-level directory call:

```bash
CPP_BUILD_DIR=cpp/build
python3 scripts/build_wheel.py -a "80-real;86-real" --build_dir ${CPP_BUILD_DIR}
pip install -r requirements-dev.txt
pip install build/tensorrt_llm*.whl
cd $CPP_BUILD_DIR && make -j$(nproc) google-tests
```
Single tests can be executed from `CPP_BUILD_DIR/tests`, e.g.

```bash
./$CPP_BUILD_DIR/tests/allocatorTest
```
### End-to-end tests

`trtGptModelRealDecoderTest` and `executorTest` require pre-built TensorRT engines, which are loaded in the tests. They also require data files, which are stored in `cpp/tests/resources/data`.
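As a purely hypothetical illustration of that dependency (this fixture does not exist in the repository), an end-to-end test could skip itself when the engines or data files have not been generated yet:

```cpp
// Hypothetical illustration, not an actual test in the repository: a GTest
// fixture that skips end-to-end cases when the pre-built engines or the
// expected-output data are missing.
#include <filesystem>
#include <gtest/gtest.h>

class EndToEndResourcesTest : public ::testing::Test
{
protected:
    void SetUp() override
    {
        namespace fs = std::filesystem;
        // Paths assume the binary is launched from the repository root.
        if (!fs::exists("cpp/tests/resources/models") || !fs::exists("cpp/tests/resources/data"))
        {
            GTEST_SKIP() << "Engines or expected outputs not found; run the build "
                            "and generation scripts described below first.";
        }
    }
};

TEST_F(EndToEndResourcesTest, ResourcesPresent)
{
    SUCCEED(); // reached only when both resource directories exist
}
```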
#### Build engines

Scripts are provided that download the GPT2 and GPT-J models from Huggingface and convert them to TensorRT engines. The weights and built engines are stored under `cpp/tests/resources/models`. To build the engines from the top-level directory:

```bash
PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py
```
It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs:

```bash
PYTHONPATH=examples/models/core/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
```
#### Generate expected output

End-to-end tests read inputs and expected outputs from Numpy files located at `cpp/tests/resources/data`. The expected outputs can be generated using scripts which employ the Python runtime to run the built engines:

```bash
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py
```
#### Generate data with tensor and pipeline parallelism

It is possible to generate tensor and pipeline parallelism data for LLaMA using 4 GPUs. To generate results from the top-level directory:

```bash
PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu
```
#### Run test

After building the engines and generating the expected output, execute the tests:

```bash
./$CPP_BUILD_DIR/tests/batch_manager/trtGptModelRealDecoderTest
```
### Run all tests with ctest

To run all tests and produce an XML report, call:

```bash
cd $CPP_BUILD_DIR && ctest --output-on-failure --output-junit "cpp-test-report.xml"
```