C++ Tests
This document explains how to build and run the C++ tests and describes the included resources.
All-in-one script
The pytest script test_cpp.py builds TRT-LLM, builds the engines, generates the expected outputs, and executes the C++ tests all in one go. To get an overview of the tests and their parameterization, call:
pytest tests/integration/defs/test_cpp.py --collect-only
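To narrow down the collected list, pytest's standard -k keyword filter can be combined with --collect-only; the keyword shown here is only an example:
pytest tests/integration/defs/test_cpp.py --collect-only -q -k "llama"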
All tests take the CUDA architecture of the GPU you wish to use as a parameter, e.g. 90 for Hopper.
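If you are unsure which value to pass, the compute capability reported by nvidia-smi (available as a query field on reasonably recent drivers) maps directly onto it, e.g. 9.0 corresponds to 90:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader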
It is possible to choose unit tests or a single model for end-to-end tests. Example calls could look like this:
export LLM_MODELS_ROOT="/path/to/model_cache"
pytest tests/integration/defs/test_cpp.py::test_unit_tests[90]
pytest tests/integration/defs/test_cpp.py::test_model[llama-90]
pytest tests/integration/defs/test_cpp.py::test_benchmarks[gpt-90]
pytest tests/integration/defs/test_cpp.py::test_multi_gpu[90]
Manual steps
Compile
From the top-level directory call:
CPP_BUILD_DIR=cpp/build
python3 scripts/build_wheel.py -a "80-real;86-real" --build_dir ${CPP_BUILD_DIR}
pip install -r requirements-dev.txt
pip install build/tensorrt_llm*.whl
cd $CPP_BUILD_DIR && make -j$(nproc) google-tests
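Building all google-tests can take a while. If only a single test is needed, it is usually possible to build just that target, assuming the CMake target carries the same name as the test binary (the target name below is illustrative):
cd $CPP_BUILD_DIR && make -j$(nproc) allocatorTest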
Single test binaries are located in CPP_BUILD_DIR/tests and can be executed directly, e.g. from the top-level directory:
./$CPP_BUILD_DIR/tests/allocatorTest
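The test binaries are built with GoogleTest, so the test cases inside one binary can be listed and a subset selected with the standard --gtest_list_tests and --gtest_filter flags; the filter pattern here is only an example:
./$CPP_BUILD_DIR/tests/allocatorTest --gtest_list_tests
./$CPP_BUILD_DIR/tests/allocatorTest --gtest_filter="*Buffer*"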
End-to-end tests
gptSessionTest, trtGptModelRealDecoderTest, and executorTest require pre-built TensorRT engines, which are loaded in the tests. They also require data files stored in cpp/tests/resources/data.
Build engines
Scripts are provided that download the GPT2 and GPT-J models from Hugging Face and convert them to TensorRT engines. The weights and built engines are stored under cpp/tests/resources/models. To build the engines from the top-level directory:
PYTHONPATH=examples/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/gptj:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
PYTHONPATH=examples/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py --has_tllm_checkpoint
It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs.
PYTHONPATH=examples/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
If model_spec.so cannot be found during engine building, build it manually with:
make -C cpp/build/ modelSpec
Generate expected output
End-to-end tests read inputs and expected outputs from NumPy files located in cpp/tests/resources/data. The expected outputs can be generated using scripts which employ the Python runtime to run the built engines:
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gptj_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_chatglm_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py
Generate data with tensor and pipeline parallelism
It is possible to generate tensor and pipeline parallelism data for LLaMA using 4 GPUs. To generate results from the top-level directory:
PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu
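This step launches 4 MPI ranks and assumes at least 4 visible GPUs. On machines with more GPUs, the visible devices can be pinned explicitly with the standard CUDA_VISIBLE_DEVICES variable, e.g.:
CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONPATH=examples mpirun -n 4 python3 cpp/tests/resources/scripts/generate_expected_llama_output.py --only_multi_gpu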
Run test
After building the engines and generating the expected outputs, execute the tests:
./$CPP_BUILD_DIR/tests/gptSessionTest
Run all tests with ctest
To run all tests and produce an XML report, call:
cd $CPP_BUILD_DIR && ctest --output-on-failure --output-junit "cpp-test-report.xml"
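A subset of the registered tests can be run by matching their names with ctest's -R regular expression option; the pattern below is only an example:
cd $CPP_BUILD_DIR && ctest -R "gptSession" --output-on-failure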