TensorRT-LLM/cpp/include/tensorrt_llm/batch_manager
Yueh-Ting (eop) Chen cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request extends SWA KV cache functionality inside the KV cache
manager. Before this change, the KV cache for sliding window attention
(SWA) held only a "window size" number of blocks and reused them
cyclically. That design cannot make use of additional GPU memory, which
limits the maximum batch size and therefore throughput, and it also rules
out KV cache reuse.
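To make the difference concrete, here is a toy sketch contrasting the old
cyclic block indexing with the linear indexing this MR introduces. The
helper names are hypothetical, not actual TensorRT-LLM code:

```cpp
// Toy illustration only -- hypothetical names, not actual TensorRT-LLM code.
#include <cstdint>

// Old behavior: only "window size" blocks exist, so writes wrap around and
// overwrite out-of-window history, making block reuse impossible.
std::int32_t cyclicBlockIndex(std::int32_t tokenIdx, std::int32_t tokensPerBlock, std::int32_t windowBlocks)
{
    return (tokenIdx / tokensPerBlock) % windowBlocks;
}

// New behavior: blocks are written linearly; out-of-window blocks get
// detached instead of being overwritten in place.
std::int32_t linearBlockIndex(std::int32_t tokenIdx, std::int32_t tokensPerBlock)
{
    return tokenIdx / tokensPerBlock;
}
```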

In this MR, we change this behavior so that the manager writes blocks
linearly. With linear block writing, as the attention window advances,
out-of-window (OOW) blocks are detached. For now, prioritizing a correct
feature first, we directly offload each OOW block from the primary block
pool (GPU memory) to the secondary block pool (host memory). A future
improvement will delegate this block movement to the eviction policy.
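Below is a minimal sketch of the detach-and-offload idea. The `Block` and
`BlockPool` types and the `detachOutOfWindowBlocks` helper are assumptions
for illustration; the real logic in `kvCacheManager.h` differs in detail:

```cpp
#include <cstdint>
#include <deque>
#include <memory>

struct Block
{
    std::int32_t id;
};

struct BlockPool
{
    // Stand-in for the primary (GPU) -> secondary (host) offload.
    void offloadToSecondary(std::shared_ptr<Block> const& /*block*/) {}
};

// Detach every leading block whose tokens all precede the attention window.
// firstBlockIdx tracks the linear index of the sequence's front block.
void detachOutOfWindowBlocks(std::deque<std::shared_ptr<Block>>& blocks,
    std::int32_t& firstBlockIdx, std::int32_t numTokens, std::int32_t windowSize,
    std::int32_t tokensPerBlock, BlockPool& pool)
{
    std::int32_t const windowStart = numTokens - windowSize;
    // Block i covers tokens [i * tokensPerBlock, (i + 1) * tokensPerBlock).
    while (!blocks.empty() && (firstBlockIdx + 1) * tokensPerBlock <= windowStart)
    {
        pool.offloadToSecondary(blocks.front()); // correctness-first offload
        blocks.pop_front();
        ++firstBlockIdx;
    }
}
```

Offloading to host rather than freeing keeps the feature correct first;
handing the block to the eviction policy later lets that policy decide
whether to keep, offload, or evict it.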

KV cache reuse for SWA is not implemented in this merge request; it will
be added in a follow-up merge request.

With blocks written linearly, the maximum number of blocks allocated for
a sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest`, which stores the cache block
bookkeeping structure, will now keep enough blocks to cover "max sequence
length" tokens.
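As a quick sanity check of that bound, the block count a sequence can need
is the max sequence length divided, rounding up, by the tokens per block.
The values below are illustrative, not defaults:

```cpp
// Illustrative numbers only; tokensPerBlock is configuration-dependent.
#include <cstdint>

constexpr std::int32_t ceilDiv(std::int32_t a, std::int32_t b)
{
    return (a + b - 1) / b;
}

constexpr std::int32_t maxSeqLen = 8192;    // "max sequence length" (assumed value)
constexpr std::int32_t tokensPerBlock = 64; // assumed value
constexpr std::int32_t maxBlocksPerSequence = ceilDiv(maxSeqLen, tokensPerBlock);

static_assert(maxBlocksPerSequence == 128, "8192 tokens / 64 tokens per block");
```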

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; this concept
  originally guarded block reuse inside the KV cache manager.
- Add a detach mechanism, invoked from `KVCacheManager::addToken` (see
  the sketch after this list). Note that detach is still guarded off for
  SWA when reuse is enabled; a follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter to
  `KVCacheManager`/`BlockManager`.
- Give every window-size resource pool an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.
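The sketch referenced above shows roughly where the detach hook sits
relative to `KVCacheManager::addToken`. The free-function shape, the stub
types, and the guard placement are assumptions for illustration:

```cpp
// Hypothetical shape of the addToken-time detach hook; the real code lives
// in KVCacheManager::addToken and differs in structure.
struct GenerationRequest
{
    int numTokens = 0; // block bookkeeping elided
};

void addNewToken(GenerationRequest& seq)
{
    ++seq.numTokens;
}

void detachOutOfWindowBlocks(GenerationRequest& /*seq*/)
{
    // Placeholder for the offload logic sketched earlier.
}

void addToken(GenerationRequest& seq, bool isSWA, bool enableBlockReuse)
{
    addNewToken(seq);
    // Detach remains guarded off for SWA when block reuse is enabled;
    // the follow-up merge request will lift this restriction.
    if (isSWA && enableBlockReuse)
    {
        return;
    }
    detachOutOfWindowBlocks(seq);
}
```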

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
allocateKvCache.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
assignReqSeqSlots.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
cacheTransceiver.h [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659) 2025-09-16 08:43:56 -04:00
capacityScheduler.h fix: max_num_sequences calculation with overlap scheduling (#4532) 2025-06-03 09:31:22 +02:00
common.h open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297) 2024-10-08 12:19:19 +02:00
contextProgress.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
createNewDecoderRequests.h [None][refactor] Simplify decoder state initialization for speculative decoding (#6869) 2025-08-22 18:44:17 +02:00
decoderBuffers.h refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
evictionPolicy.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
guidedDecoder.h refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
handleContextLogits.h refactor: Speculative decoding buffers part 2 (#5316) 2025-06-27 17:41:48 +02:00
handleGenerationLogits.h refactor: Speculative decoding buffers part 2 (#5316) 2025-06-27 17:41:48 +02:00
kvCacheConnector.h [None][feat] KV Cache Connector API (#7228) 2025-08-28 23:09:27 -04:00
kvCacheEventManager.h [TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563) 2025-08-07 14:17:07 +02:00
kvCacheManager.h [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
kvCacheTransferManager.h [None][feat] Nixl support for GDS (#5488) 2025-09-09 13:00:38 +08:00
kvCacheType.h refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384) 2025-06-26 19:45:52 +08:00
kvCacheUtils.h feat: cache reuse support (selective cache transfer) in mla cache formatter (#4749) 2025-06-04 09:56:31 +08:00
llmRequest.h [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553) 2025-09-15 07:26:01 -04:00
logitsPostProcessor.h [None][chore] Mass integration of release/1.0 - 3rd (#7519) 2025-09-08 14:03:04 +08:00
makeDecodingBatchInputOutput.h refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
medusaBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
microBatchScheduler.h [TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625) 2025-05-06 15:06:46 +02:00
pauseRequests.h Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
peftCacheManager.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
peftCacheManagerConfig.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
promptTuningBuffers.h feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380) 2025-04-21 14:31:01 +08:00
rnnStateManager.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
runtimeBuffers.h Revert "feat: nanobind bindings (#5961)" (#6160) 2025-07-18 10:12:54 +08:00
sequenceSlotManager.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
transformerBuffers.h refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384) 2025-06-26 19:45:52 +08:00
updateDecoderBuffers.h refactor: Speculative decoding buffers part 2 (#5316) 2025-06-27 17:41:48 +02:00