TensorRT-LLM/cpp/tensorrt_llm/batch_manager
Yueh-Ting (eop) Chen cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request supports more SWA KV cache functionality inside the
KV cache manager. Before this merge request, the KV cache for sliding
window attention (SWA) held only "window size" blocks and reused them
in a cyclic manner. That design cannot utilize additional GPU memory,
limiting the maximum batch size and, in turn, throughput. It also
cannot support KV cache reuse.

In this MR, we change this behavior and let the manager write blocks
linearly. With linear block writing, blocks that fall out of the
attention window (out-of-window, OOW) are detached as the window moves
on. For now, to land a correct feature first, we directly offload each
OOW block from the primary block pool (GPU memory) to the secondary
block pool (host memory). We will improve this in the future by
delegating the block movement to the eviction policy.
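
A minimal standalone sketch of this behavior, assuming a toy
`SlidingWindowSequence` (the class, its fields, and the offload
stand-in are illustrative, not the actual TRT-LLM API):

```cpp
#include <cstdint>
#include <deque>
#include <iostream>

struct Block
{
    int64_t firstToken;
    int64_t lastToken;
};

class SlidingWindowSequence
{
public:
    SlidingWindowSequence(int64_t windowSize, int64_t tokensPerBlock)
        : mWindowSize(windowSize)
        , mTokensPerBlock(tokensPerBlock)
    {
    }

    void addToken(int64_t tokenIdx)
    {
        // Linear writing: open a new block whenever the current one is full,
        // instead of cyclically overwriting the oldest block.
        if (mBlocks.empty()
            || mBlocks.back().lastToken - mBlocks.back().firstToken + 1 == mTokensPerBlock)
        {
            mBlocks.push_back({tokenIdx, tokenIdx});
        }
        else
        {
            mBlocks.back().lastToken = tokenIdx;
        }
        detachOutOfWindow(tokenIdx);
    }

private:
    void detachOutOfWindow(int64_t newestToken)
    {
        // A block is out of window once even its newest token can no longer
        // be attended to from the newest position.
        int64_t const oldestInWindow = newestToken - mWindowSize + 1;
        while (!mBlocks.empty() && mBlocks.front().lastToken < oldestInWindow)
        {
            std::cout << "offload block [" << mBlocks.front().firstToken << ", "
                      << mBlocks.front().lastToken << "] to secondary pool\n";
            mBlocks.pop_front(); // stand-in for the primary -> secondary offload
        }
    }

    int64_t mWindowSize;
    int64_t mTokensPerBlock;
    std::deque<Block> mBlocks;
};

int main()
{
    SlidingWindowSequence seq(/*windowSize=*/8, /*tokensPerBlock=*/4);
    for (int64_t t = 0; t < 16; ++t)
    {
        seq.addToken(t);
    }
}
```

With window size 8 and 4 tokens per block, block [0, 3] is offloaded at
token 11, the first position whose window no longer covers token 3.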

KV cache reuse for SWA is not implemented in this merge request; it
will be added in a follow-up merge request.

With linear block writing, the maximum number of blocks allocated for a
sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest` that stores the cache block
bookkeeping structure will now keep enough blocks for "max sequence
length" tokens.
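
As a quick check of that bound (the helper name and the example numbers
are illustrative, not taken from the MR):

```cpp
#include <cstdint>

// Ceiling division: the number of blocks needed to bookkeep maxSeqLen tokens.
constexpr int64_t maxBlocksPerSequence(int64_t maxSeqLen, int64_t tokensPerBlock)
{
    return (maxSeqLen + tokensPerBlock - 1) / tokensPerBlock;
}

// e.g. maxSeqLen = 4096, tokensPerBlock = 64 -> 64 blocks bookkept per
// sequence, even though a 1024-token SWA window keeps only 16 of them
// in-window at any one time.
static_assert(maxBlocksPerSequence(4096, 64) == 64);

int main()
{
    return 0;
}
```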

Given the above, the main changes are (more context in the MR):
- Remove "cyclic" concept under the kv cache manager, such concept
  originally guards the block reuse under kv cache manager.
- Add detach mechanism and have it under `KVCacheManager::addToken`.
  Please note that detach is still guarded off for SWA when reuse
  is enabled. A follow-up merge request will proceed to improve this.
- Enforce "max sequence length" to be a non-optional parameter to
  the `KVCacheManager`/`BlockManager`
- Let all window size resource pool get identical proportion of memory
- Fix free memory calculation under `resource_manager.py`
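
The control flow behind the first two bullets, sketched with
illustrative names (`KvCacheManagerSketch`, `KvCacheConfigSketch`, and
their members are assumptions, not the real TRT-LLM signatures):

```cpp
#include <cstdint>

struct KvCacheConfigSketch
{
    int64_t windowSize;     // attention window, in tokens
    int64_t maxSequenceLen; // now a required (non-optional) parameter
    bool enableBlockReuse;
};

class KvCacheManagerSketch
{
public:
    explicit KvCacheManagerSketch(KvCacheConfigSketch config)
        : mConfig(config)
    {
    }

    void addToken(int64_t seqId, int64_t tokenIdx)
    {
        appendLinearly(seqId, tokenIdx); // no more cyclic overwrite

        bool const isSwa = mConfig.windowSize < mConfig.maxSequenceLen;
        if (isSwa && mConfig.enableBlockReuse)
        {
            return; // detach still guarded off here; the follow-up MR lifts this
        }
        detachOutOfWindowBlocks(seqId, tokenIdx);
    }

private:
    void appendLinearly(int64_t, int64_t) { /* write into the newest block */ }
    void detachOutOfWindowBlocks(int64_t, int64_t) { /* offload OOW blocks */ }

    KvCacheConfigSketch mConfig;
};

int main()
{
    KvCacheManagerSketch manager({/*windowSize=*/1024, /*maxSequenceLen=*/4096,
        /*enableBlockReuse=*/true});
    manager.addToken(/*seqId=*/0, /*tokenIdx=*/0);
}
```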

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
utils [None][refactor] Simplify decoder state initialization for speculative decoding (#6869) 2025-08-22 18:44:17 +02:00
allocateKvCache.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
assignReqSeqSlots.cpp [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 (#6537) 2025-08-15 09:52:06 -07:00
cacheFormatter.cpp [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659) 2025-09-16 08:43:56 -04:00
cacheFormatter.h [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659) 2025-09-16 08:43:56 -04:00
cacheTransBuffer.cpp [TRTLLM-7361][feat] KV cache transfer for uneven pp (#7117) 2025-09-08 13:37:46 -04:00
cacheTransBuffer.h [TRTLLM-7361][feat] KV cache transfer for uneven pp (#7117) 2025-09-08 13:37:46 -04:00
cacheTransceiver.cpp [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659) 2025-09-16 08:43:56 -04:00
capacityScheduler.cpp refactor: Scheduling based on KV cache state (#4865) 2025-06-16 08:14:58 +02:00
CMakeLists.txt [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659) 2025-09-16 08:43:56 -04:00
contextProgress.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
createNewDecoderRequests.cpp [None][refactor] Simplify decoder state initialization for speculative decoding (#6869) 2025-08-22 18:44:17 +02:00
dataTransceiver.cpp [TRTLLM-6549][fix] add kv cache time output back (#7798) 2025-09-23 14:12:42 -04:00
dataTransceiver.h [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659) 2025-09-16 08:43:56 -04:00
decoderBuffers.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
encoderBuffers.cpp Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
encoderBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
evictionPolicy.cpp [JIRA-5226219][fix] Fix Bug in KV cache manager (#4596) 2025-05-29 22:03:20 -07:00
guidedDecoder.cpp [TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893) 2025-09-23 09:10:09 +08:00
handleContextLogits.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
handleGenerationLogits.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
kvCacheEventManager.cpp [TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563) 2025-08-07 14:17:07 +02:00
kvCacheManager.cpp [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
kvCacheTransferManager.cpp [None][feat] Nixl support for GDS (#5488) 2025-09-09 13:00:38 +08:00
llmRequest.cpp [None][fix] acceptance rate calculation fix in benchmark_serving (#6746) 2025-08-19 17:29:36 +08:00
logitsPostProcessor.cpp [None][chore] Mass integration of release/1.0 - 3rd (#7519) 2025-09-08 14:03:04 +08:00
loraBuffers.cpp fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399) 2025-05-19 14:25:36 -07:00
loraBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
makeDecodingBatchInputOutput.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
medusaBuffers.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
microBatchScheduler.cpp [nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) (#4621) 2025-05-26 17:10:55 +08:00
mlaCacheFormatter.cpp [TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side (#7624) 2025-09-20 06:15:26 -07:00
mlaCacheFormatter.h [TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side (#7624) 2025-09-20 06:15:26 -07:00
pauseRequests.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
peftCacheManager.cpp [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510) 2025-08-07 09:05:36 +03:00
promptTuningBuffers.cpp perf: Removing initializing ptuning buffers to zero (#4915) 2025-06-09 21:57:21 -04:00
rnnStateBuffers.cpp [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092) 2025-05-14 23:10:04 +02:00
rnnStateBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
rnnStateManager.cpp fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399) 2025-05-19 14:25:36 -07:00
runtimeBuffers.cpp Revert "feat: nanobind bindings (#5961)" (#6160) 2025-07-18 10:12:54 +08:00
scheduledBlocksManager.h refactor: Scheduling based on KV cache state (#4865) 2025-06-16 08:14:58 +02:00
sequenceSlotManager.cpp refactor: Remove enforced sorted order of batch slots (#3502) 2025-07-14 17:23:02 +02:00
transformerBuffers.cpp refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384) 2025-06-26 19:45:52 +08:00
trtEncoderModel.cpp refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtEncoderModel.h refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtGptModel.h refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtGptModelFactory.h refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtGptModelInflightBatching.cpp [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
trtGptModelInflightBatching.h [https://nvbugs/5501557][fix] Fix out-of-bounds vector access for model with multiple layer types (#7636) 2025-09-22 14:28:38 +08:00
updateDecoderBuffers.cpp refactor: Speculative decoding buffers part 2 (#5316) 2025-06-27 17:41:48 +02:00