Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-25).
This merge request extends SWA KV cache functionality inside the KV cache manager. Previously, the KV cache for sliding window attention (SWA) held only "window size" blocks and reused them cyclically. That design cannot take advantage of additional GPU memory, which caps the maximum batch size and therefore throughput, and it cannot support KV cache reuse. This MR changes the manager to write blocks linearly: as the attention window advances, out-of-window (OOW) blocks are detached. For now, to get a correct feature first, an OOW block is offloaded directly from the primary block pool (GPU memory) to the secondary block pool (host memory); a future change will delegate this block movement to the eviction policy. KV cache reuse for SWA is not implemented in this merge request and will be added in a follow-up. With linear block writing, the maximum number of blocks allocated for a sequence (`GenerationRequest`) is determined by the specified "max sequence length", so the `GenerationRequest` that stores the cache block bookkeeping structure now keeps blocks for "max sequence length" tokens.

Main changes (more context in the MR):

- Remove the "cyclic" concept from the KV cache manager; it originally guarded block reuse there.
- Add a detach mechanism under `KVCacheManager::addToken`. Note that detach is still disabled for SWA when reuse is enabled; a follow-up merge request will address this.
- Enforce "max sequence length" as a non-optional parameter of `KVCacheManager`/`BlockManager`.
- Give every window-size resource pool an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
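The linear-writing-plus-detach behavior described above can be sketched in Python. This is a simplified, hypothetical model: `add_token`, `_detach_oow_blocks`, and the block arithmetic are illustrative only, not the actual TensorRT-LLM `KVCacheManager` API.

```python
class SWABlockBookkeeping:
    """Toy model of per-sequence KV cache blocks for SWA (illustrative only)."""

    def __init__(self, window_size: int, tokens_per_block: int, max_seq_len: int):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        # Linear writing: a sequence may need up to max_seq_len tokens of blocks.
        self.max_blocks = -(-max_seq_len // tokens_per_block)  # ceil division
        self.primary = []    # block ids resident in GPU memory
        self.secondary = []  # block ids offloaded to host memory
        self.num_tokens = 0

    def add_token(self):
        self.num_tokens += 1
        # Allocate a fresh block when the previous one fills up (linear, not cyclic).
        if (self.num_tokens - 1) % self.tokens_per_block == 0:
            block_id = len(self.primary) + len(self.secondary)
            assert block_id < self.max_blocks
            self.primary.append(block_id)
        self._detach_oow_blocks()

    def _detach_oow_blocks(self):
        # A block is fully out of window once even its newest token falls
        # outside the attention window.
        first_in_window = max(0, self.num_tokens - self.window_size)
        while self.primary:
            oldest = self.primary[0]
            last_token_of_block = (oldest + 1) * self.tokens_per_block - 1
            if last_token_of_block < first_in_window:
                # Offload from the primary (GPU) pool to the secondary (host) pool.
                self.secondary.append(self.primary.pop(0))
            else:
                break
```

With a 4-token window and 2 tokens per block, after 8 tokens the two oldest blocks are fully out of window and have been moved to the secondary pool, while the two newest stay in the primary pool.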
Test definition files in this folder:

- l0_a10.yml
- l0_a30.yml
- l0_a100.yml
- l0_b200.yml
- l0_dgx_b200.yml
- l0_dgx_b300.yml
- l0_dgx_h100.yml
- l0_dgx_h200.yml
- l0_gb200_multi_gpus.yml
- l0_gb200_multi_nodes.yml
- l0_gb202.yml
- l0_gb203.yml
- l0_gb300_multi_gpus.yml
- l0_gh200.yml
- l0_h100.yml
- l0_l40s.yml
- l0_perf.yml
- l0_rtx_pro_6000.yml
- l0_sanity_check.yml
- README.md
## Description

This folder contains test definitions consumed by the trt-test-db tool based on system specifications.
## Installation

Install trt-test-db using the following command:

```shell
pip3 install --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/sw-tensorrt-pypi/simple --ignore-installed trt-test-db==1.8.5+bc6df7
```
## Test Definition

Test definitions are stored in YAML files located in `${TRT_LLM_ROOT}/tests/integration/test_lists/test-db/`. These files define test conditions and the tests to be executed.
Example YAML Structure
version: 0.0.1
l0_e2e:
- condition:
terms:
supports_fp8: true
ranges:
system_gpu_count:
gte: 4
lte: 4
wildcards:
gpu:
- '*h100*'
linux_distribution_name: ubuntu*
tests:
- examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-8b-enable_fp8]
- examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[llama-3.1-70b-enable_fp8]
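Conceptually, a condition matches a system when all of its `terms` (exact equality), `ranges` (numeric bounds), and `wildcards` (glob patterns) are satisfied. The sketch below is a hypothetical reimplementation of that semantics for illustration; trt-test-db's actual matching logic may differ in details.

```python
from fnmatch import fnmatch

def condition_matches(condition: dict, system: dict) -> bool:
    # "terms": exact equality on each key.
    for key, want in condition.get("terms", {}).items():
        if system.get(key) != want:
            return False
    # "ranges": numeric gte/lte bounds on each key.
    for key, bounds in condition.get("ranges", {}).items():
        value = float(system.get(key, "nan"))
        if "gte" in bounds and not value >= bounds["gte"]:
            return False
        if "lte" in bounds and not value <= bounds["lte"]:
            return False
    # "wildcards": case-insensitive glob patterns; a list means "any of".
    for key, patterns in condition.get("wildcards", {}).items():
        if isinstance(patterns, str):
            patterns = [patterns]
        value = str(system.get(key, "")).lower()
        if not any(fnmatch(value, p.lower()) for p in patterns):
            return False
    return True
```

For the example condition above, a 4-GPU H100 Ubuntu system with FP8 support would match, while the same system with an A10 GPU would not.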
## Generating Test Lists

Use trt-test-db to generate a test list based on the system configuration:

```shell
trt-test-db -d /TensorRT-LLM/src/tests/integration/test_lists/test-db \
  --context l0_e2e \
  --test-names \
  --output /TensorRT-LLM/src/l0_e2e.txt \
  --match-exact '{"chip":"ga102gl-a","compute_capability":"8.6","cpu":"x86_64","gpu":"A10","gpu_memory":"23028.0","host_mem_available_mib":"989937","host_mem_total_mib":"1031949","is_aarch64":false,"is_linux":true,"linux_distribution_name":"ubuntu","linux_version":"22.04","supports_fp8":false,"supports_int8":true,"supports_tf32":true,"sysname":"Linux","system_gpu_count":"1",...}'
```
This command generates a test list file (l0_e2e.txt) based on the specified context and system configuration.
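When scripting this step, the `--match-exact` JSON can be assembled programmatically rather than typed by hand. A minimal sketch, assuming a subset of the keys shown in the example above (the real tool may expect additional fields):

```python
import json
import shlex

# Hypothetical subset of the system description; key names mirror the
# --match-exact example above.
system = {
    "gpu": "A10",
    "compute_capability": "8.6",
    "supports_fp8": False,
    "system_gpu_count": "1",
    "linux_distribution_name": "ubuntu",
}

# shlex.quote makes the JSON safe to embed in a shell command line.
cmd = (
    "trt-test-db -d tests/integration/test_lists/test-db "
    "--context l0_e2e --test-names --output l0_e2e.txt "
    f"--match-exact {shlex.quote(json.dumps(system))}"
)
```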
## Running Tests

Execute the tests using pytest with the generated test list:

```shell
pytest -v --test-list=/TensorRT-LLM/src/l0_e2e.txt --output-dir=/tmp/logs
```
This command runs the tests specified in the test list and outputs the results to the specified directory.
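If this step is scripted, the invocation can be built programmatically. A minimal sketch; note that `--test-list` and `--output-dir` are assumed to come from the repository's pytest configuration, not from stock pytest:

```python
def build_pytest_command(test_list: str, output_dir: str) -> list:
    """Assemble the pytest invocation shown above as an argv list,
    suitable for subprocess.run()."""
    return ["pytest", "-v", f"--test-list={test_list}", f"--output-dir={output_dir}"]
```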
## Additional Information

- The `--context` parameter in the `trt-test-db` command specifies which context to search in the YAML files.
- The `--match-exact` parameter provides system information used to filter tests based on the conditions defined in the YAML files.
- Modify the YAML files to add or update test conditions and test cases as needed.

For more detailed information on `trt-test-db` and `pytest` usage, refer to their respective documentation.