C++ Tests
This document explains how to build and run the C++ tests and describes the included resources.
All-in-one script
The pytest script test_cpp.py builds TRT-LLM, builds the engines, generates the expected outputs, and executes the C++ tests all in one go. To get an overview of the tests and their parameterization, call:
pytest tests/integration/defs/test_cpp.py --collect-only
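To narrow down the collected list, pytest's standard -k keyword filter can be combined with --collect-only; the keyword shown here is only an example:
pytest tests/integration/defs/test_cpp.py --collect-only -q -k "llama"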
All tests take the CUDA architecture of the GPU you wish to use as a parameter, e.g. 90 for Hopper.
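If you are unsure which value to pass, the compute capability reported by nvidia-smi (available as a query field on reasonably recent drivers) maps directly onto it, e.g. 9.0 corresponds to 90:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader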
It is possible to choose unit tests or a single model for end-to-end tests. Example calls could look like this:
export LLM_MODELS_ROOT="/path/to/model_cache"
pytest tests/integration/defs/test_cpp.py::test_unit_tests[90]
pytest tests/integration/defs/test_cpp.py::test_model[llama-90]
pytest tests/integration/defs/test_cpp.py::test_benchmarks[gpt-90]
pytest tests/integration/defs/test_cpp.py::test_multi_gpu[90]
Manual steps
Compile
From the top-level directory call:
CPP_BUILD_DIR=cpp/build
python3 scripts/build_wheel.py -a "80-real;86-real" --build_dir ${CPP_BUILD_DIR}
pip install -r requirements-dev.txt
pip install build/tensorrt_llm*.whl
cd $CPP_BUILD_DIR && make -j$(nproc) google-tests
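Building all google-tests can take a while. If only a single test is needed, it is usually possible to build just that target, assuming the CMake target carries the same name as the test binary (the target name below is illustrative):
cd $CPP_BUILD_DIR && make -j$(nproc) allocatorTest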
Single test binaries are located in CPP_BUILD_DIR/tests and can be executed directly, e.g. from the top-level directory:
./$CPP_BUILD_DIR/tests/allocatorTest
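The test binaries are built with GoogleTest, so the test cases inside one binary can be listed and a subset selected with the standard --gtest_list_tests and --gtest_filter flags; the filter pattern here is only an example:
./$CPP_BUILD_DIR/tests/allocatorTest --gtest_list_tests
./$CPP_BUILD_DIR/tests/allocatorTest --gtest_filter="*Buffer*"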
End-to-end tests
gptSessionTest, trtGptModelRealDecoderTest, and executorTest require pre-built TensorRT engines, which are loaded in the tests. They also require data files stored in cpp/tests/resources/data.
Build engines
Scripts are provided that download the GPT2 and GPT-J models from Hugging Face and convert them to TensorRT engines. The weights and built engines are stored under cpp/tests/resources/models. To build the engines from the top-level directory:
PYTHONPATH=examples/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/gptj:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
PYTHONPATH=examples/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py --has_tllm_checkpoint
It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs.
PYTHONPATH=examples/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
If model_spec.so cannot be found during engine building, build it manually with:
make -C cpp/build/ modelSpec
Generate expected output
End-to-end tests read inputs and expected outputs from NumPy files located in cpp/tests/resources/data. The expected outputs can be generated using scripts which employ the Python runtime to run the built engines:
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gptj_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_chatglm_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py
Generate data with tensor and pipeline parallelism
It is possible to generate tensor and pipeline parallelism data for LLaMA using 4 GPUs. To generate results from the top-level directory:
PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu
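This step launches 4 MPI ranks and assumes at least 4 visible GPUs. On machines with more GPUs, the visible devices can be pinned explicitly with the standard CUDA_VISIBLE_DEVICES variable, e.g.:
CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu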
Run test
After building the engines and generating the expected outputs, execute the tests:
./$CPP_BUILD_DIR/tests/gptSessionTest
Run all tests with ctest
To run all tests and produce an XML report, call:
cd $CPP_BUILD_DIR && ctest --output-on-failure --output-junit "cpp-test-report.xml"
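A subset of the registered tests can be run by matching their names with ctest's -R regular expression option; the pattern below is only an example:
cd $CPP_BUILD_DIR && ctest -R "gptSession" --output-on-failure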