C++ Tests

This document explains how to build and run the C++ tests, and describes the included resources.

All-in-one script

The Pytest script test_cpp.py builds TRT-LLM, builds the engines, generates the expected outputs, and executes the C++ tests, all in one go. To get an overview of the tests and their parameterization, call:

pytest tests/integration/defs/test_cpp.py --collect-only

All tests take the CUDA architecture number of the GPU you wish to use as a parameter, e.g. 90 for Hopper.
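
A quick way to look up this number is to query the GPU's compute capability and drop the dot, so 9.0 becomes 90 (this assumes a driver recent enough to support the compute_cap query field):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader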

It is possible to run only the unit tests, or to select a single model for the end-to-end tests. Example calls could look like this:

export LLM_MODELS_ROOT="/path/to/model_cache"

pytest tests/integration/defs/test_cpp.py::test_unit_tests[90]

pytest tests/integration/defs/test_cpp.py::test_model[llama-90]

pytest tests/integration/defs/test_cpp.py::test_benchmarks[gpt-90]

pytest tests/integration/defs/test_cpp.py::test_multi_gpu[90]
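
Because these are ordinary Pytest parameterizations, subsets can also be selected with Pytest's standard -k expression filter; the keywords below are only illustrative:

pytest tests/integration/defs/test_cpp.py -k "llama and 90" --collect-only -q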

Manual steps

Compile

From the top-level directory call:

CPP_BUILD_DIR=cpp/build
python3 scripts/build_wheel.py -a "80-real;86-real" --build_dir ${CPP_BUILD_DIR}
pip install -r requirements-dev.txt
pip install build/tensorrt_llm*.whl
cd $CPP_BUILD_DIR && make -j$(nproc) google-tests
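
If you only need binaries for a single GPU, passing just its architecture to -a can shorten the build considerably, e.g. for a Hopper (SM 90) machine:

python3 scripts/build_wheel.py -a "90-real" --build_dir ${CPP_BUILD_DIR}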

Single tests can be executed from $CPP_BUILD_DIR/tests, e.g.

./$CPP_BUILD_DIR/tests/allocatorTest
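
The test binaries are regular Google Test executables, so the standard gtest flags can be used to list or filter individual test cases (the filter pattern below is only illustrative):

./$CPP_BUILD_DIR/tests/allocatorTest --gtest_list_tests
./$CPP_BUILD_DIR/tests/allocatorTest --gtest_filter="*Allocator*"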

End-to-end tests

gptSessionTest, trtGptModelRealDecoderTest, and executorTest require pre-built TensorRT engines, which are loaded by the tests. They also require data files, which are stored in cpp/tests/resources/data.

Build engines

Scripts are provided that download the GPT2 and GPT-J models from Hugging Face and convert them to TensorRT engines. The weights and built engines are stored under cpp/tests/resources/models. To build the engines from the top-level directory:

PYTHONPATH=examples/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/gptj:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
PYTHONPATH=examples/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py --has_tllm_checkpoint

It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs.

PYTHONPATH=examples/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
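
Before starting the multi-GPU build, you can check that at least 4 GPUs are visible:

nvidia-smi -L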

If model_spec.so cannot be found during engine building, build it manually with:

make -C cpp/build/ modelSpec

Generate expected output

End-to-end tests read inputs and expected outputs from Numpy files located at cpp/tests/resources/data. The expected outputs can be generated using scripts which employ the Python runtime to run the built engines:

PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gptj_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_chatglm_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py

Generate data with tensor and pipeline parallelism

It is possible to generate data with tensor and pipeline parallelism for LLaMA using 4 GPUs. To generate the results, call from the top-level directory:

PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu

Run the tests

After building the engines and generating the expected outputs, execute the tests:

./$CPP_BUILD_DIR/tests/gptSessionTest

Run all tests with ctest

To run all tests and produce an XML report, call

cd $CPP_BUILD_DIR && ctest --output-on-failure --output-junit "cpp-test-report.xml"
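
ctest can also run a subset of the tests selected by a regular expression via -R; the pattern below is only illustrative:

cd $CPP_BUILD_DIR && ctest -R "gpt" --output-on-failure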