TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Yueh-Ting (eop) Chen 4882815fa1 [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922)	2025-10-13 09:18:12 -07:00

This MR is a continuation of #6768. In the previous merge request, OOW (out-of-window) blocks were detached only when reuse was not enabled; that is, with reuse enabled, block movement behaved identically for SWA and full attention. This merge request enables OOW block detach when reuse is enabled. The required changes are:
- Let the KV cache manager keep track of which block is used by which sequence
- Remove the restriction that prevents the eviction policy from releasing a non-leaf block

Along the way, bugs in freeChildren and in the offload mechanism under getFreeBlock are resolved, because they would affect the functionality this merge request is trying to achieve.

When a block goes OOW, it is released from the sequence and becomes available to be reclaimed: the eviction policy holds it until another sequence acquires it. On the other hand, we may still want to store the sequence for reuse. To do this safely, block ownership is recorded in WindowBlockManager::getFreeBlock. If the acquired block was originally owned by another sequence that is still live inside the manager, that sequence is invalidated for store for reuse. At the end of a sequence (when removeSequence is called on it), the KV cache manager checks whether any of the sequence's blocks has been reclaimed by another sequence. If none has, the sequence is safe to store for reuse, and the store-for-reuse action is performed.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
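
The ownership bookkeeping described in the commit message above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the C++ implementation in batch_manager: the class ToyWindowBlockManager, its methods, and the FIFO free list standing in for the eviction policy are all hypothetical names chosen for illustration. Only the rule itself (a live sequence loses its eligibility for store-for-reuse once another sequence reclaims one of its blocks) is taken from the commit message.

```python
class ToyWindowBlockManager:
    """Hypothetical toy model of the ownership rule; not the TensorRT-LLM class."""

    def __init__(self, num_blocks):
        # A FIFO list stands in for the eviction policy's pool (assumption).
        self.free_blocks = list(range(num_blocks))
        self.owner = {}      # block id -> sequence id that last acquired it
        self.live = set()    # sequences currently inside the manager
        self.storable = {}   # sequence id -> still safe to store for reuse?
        self.stored = []     # sequences that were stored for reuse

    def add_sequence(self, seq_id):
        self.live.add(seq_id)
        self.storable[seq_id] = True

    def get_free_block(self, seq_id):
        """Acquire a block and record ownership at acquisition time."""
        block = self.free_blocks.pop(0)
        prev = self.owner.get(block)
        # If the block belonged to another sequence that is still live,
        # that sequence can no longer be stored for reuse.
        # (Re-acquiring one's own block is not covered by the commit
        # message; this toy leaves the flag untouched in that case.)
        if prev is not None and prev != seq_id and prev in self.live:
            self.storable[prev] = False
        self.owner[block] = seq_id
        return block

    def detach_oow_block(self, block):
        """The block left the attention window: hand it back to the pool.
        The ownership record is kept so a later reclaim can be detected."""
        self.free_blocks.append(block)

    def remove_sequence(self, seq_id):
        """End of sequence: store for reuse only if no block was reclaimed.
        Releasing the sequence's remaining blocks is omitted for brevity."""
        self.live.discard(seq_id)
        ok = self.storable.pop(seq_id)
        if ok:
            self.stored.append(seq_id)
        return ok


if __name__ == "__main__":
    manager = ToyWindowBlockManager(num_blocks=2)
    manager.add_sequence("A")
    manager.add_sequence("B")
    a0 = manager.get_free_block("A")
    manager.get_free_block("B")
    manager.detach_oow_block(a0)     # A's first block goes out of window
    manager.get_free_block("B")      # B reclaims it while A is still live
    print(manager.remove_sequence("A"), manager.remove_sequence("B"))
```

In this sketch the check happens at acquisition time rather than at release time, mirroring the commit message's choice to record ownership under getFreeBlock: a detached block that nobody reclaims leaves its former owner eligible for reuse.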
batch_manager	[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922)	2025-10-13 09:18:12 -07:00
common	[TRTLLM-6748][feat] add PDL support for more kernels (#7977)	2025-10-11 08:32:05 +08:00
cutlass_extensions/include/cutlass_extensions	[None][feat] GPT-OSS Sm120/Sm121 Support (#7937)	2025-10-06 16:59:06 -04:00
deep_ep	[TRTLLM-6589][feat] Support CUDA graph for DeepEP (#7514)	2025-10-02 10:13:24 -07:00
deep_gemm	[https://nvbugs/5433581][fix] DeepGEMM installation on SBSA (#6588)	2025-08-06 16:44:21 +08:00
executor	[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)	2025-10-04 08:12:24 +08:00
executor_worker	Update TensorRT-LLM (#2792)	2025-02-18 21:27:39 +08:00
kernels	[TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention (#8301)	2025-10-13 05:54:48 -07:00
layers	refactor: Remove enforced sorted order of batch slots (#3502)	2025-07-14 17:23:02 +02:00
nanobind	[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)	2025-10-04 08:12:24 +08:00
plugins	[None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851)	2025-09-25 21:02:35 +08:00
pybind	[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)	2025-10-04 08:12:24 +08:00
runtime	[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)	2025-10-04 08:12:24 +08:00
testing	fix: Improve chunking test and skip empty kernel calls (#5710)	2025-07-04 09:08:15 +02:00
thop	[None][fix] Enable FP8 ContextMLA on GB300 (#8080)	2025-10-10 10:20:46 +08:00
CMakeLists.txt	[#6102][fix] support non-system python installation (#7763)	2025-09-26 10:16:15 +08:00