Description
This folder contains the test definitions consumed by the trt-test-db tool, which selects tests based on system specifications.
Installation
Install trt-test-db using the following command:
pip3 install --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/sw-tensorrt-pypi/simple --ignore-installed trt-test-db==1.8.5+bc6df7
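To verify the installation, you can query the package metadata with pip (standard pip usage, independent of trt-test-db itself):
pip3 show trt-test-db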
Test Definition
Test definitions are stored in YAML files located in ${TRT_LLM_ROOT}/tests/integration/test_lists/test-db/. These files define test conditions and the tests to be executed.
Example YAML Structure
version: 0.0.1
l0_e2e:
- condition:
    terms:
      supports_fp8: true
    ranges:
      system_gpu_count:
        gte: 4
        lte: 4
    wildcards:
      gpu:
      - '*h100*'
      linux_distribution_name: ubuntu*
  tests:
  - examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-8b-enable_fp8]
  - examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-70b-enable_fp8]
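The condition fields compose as shown above: terms match exact values, ranges bound numeric fields, and wildcards glob string fields. As a hedged sketch, a single-GPU A10 entry might look like the following (supports_int8 and the gpu pattern mirror real system keys from the --match-exact example below, but the test name is purely hypothetical):
- condition:
    terms:
      supports_int8: true
    ranges:
      system_gpu_count:
        gte: 1
        lte: 1
    wildcards:
      gpu:
      - '*a10*'
      linux_distribution_name: ubuntu*
  tests:
  - examples/test_gpt.py::test_llm_gpt2_single_gpu  # hypothetical test name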
Generating Test Lists
Use trt-test-db to generate a test list based on the system configuration:
trt-test-db -d /TensorRT-LLM/src/tests/integration/test_lists/test-db \
--context l0_e2e \
--test-names \
--output /TensorRT-LLM/src/l0_e2e.txt \
--match-exact '{"chip":"ga102gl-a","compute_capability":"8.6","cpu":"x86_64","gpu":"A10","gpu_memory":"23028.0","host_mem_available_mib":"989937","host_mem_total_mib":"1031949","is_aarch64":false,"is_linux":true,"linux_distribution_name":"ubuntu","linux_version":"22.04","supports_fp8":false,"supports_int8":true,"supports_tf32":true,"sysname":"Linux","system_gpu_count":"1",...}'
This command generates a test list file (l0_e2e.txt) based on the specified context and system configuration.
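For a machine that satisfies the example condition above (four H100 GPUs with FP8 support), the resulting l0_e2e.txt would simply list the matching test names, one per line:
examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-8b-enable_fp8]
examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-70b-enable_fp8]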
Running Tests
Execute the tests using pytest with the generated test list:
pytest -v --test-list=/TensorRT-LLM/src/l0_e2e.txt --output-dir=/tmp/logs
This command runs the tests specified in the test list and outputs the results to the specified directory.
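When debugging a single failure, an entry from the generated list can also be passed to pytest directly as a standard node ID (this bypasses the test-list machinery; running from the integration test root is an assumption here):
pytest -v "examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-8b-enable_fp8]"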
Additional Information
- The --context parameter in the trt-test-db command specifies which context to search in the YAML files.
- The --match-exact parameter provides system information used to filter tests based on the conditions defined in the YAML files.
- Modify the YAML files to add or update test conditions and test cases as needed.
For more detailed information on trt-test-db and pytest usage, refer to their respective documentation.