TensorRT-LLMs/cpp/tensorrt_llm/batch_manager
Yueh-Ting (eop) Chen 4882815fa1
[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922)
This MR is a continuation of #6768. In the previous merge request,
OOW (out-of-window) blocks were detached only when reuse was not
enabled; that is, with reuse enabled, the block movement behavior was
identical between SWA and full attention.

This merge request attempts to enable OOW block detach when reuse is
enabled. The required changes are:

- Let the KV cache manager keep track of which block is used by which
  sequence
- Remove the restriction in the eviction policy so that it can release
  a non-leaf block

Along the way, bugs in freeChildren and in the offload mechanism under
getFreeBlock are fixed, because they affect the functionality this
merge request is trying to achieve.

When a block goes OOW, it is released from the sequence and becomes
available to be reclaimed: the eviction policy holds it until another
sequence acquires it. On the other hand, we still want to be able to
store the sequence for reuse. To achieve this safely, block ownership
is recorded in WindowBlockManager::getFreeBlock. If the acquired block
was originally owned by another sequence that is still live inside the
manager, that sequence is invalidated for store for reuse.

At the end of a sequence (when removeSequence is called on it), the KV
cache manager checks whether any of the sequence's blocks has been
reclaimed by another sequence. If none has, the sequence is safe to
store for reuse and the store-for-reuse action is performed.
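
The ownership bookkeeping described above can be pictured with a toy
model. This is only a sketch, not the code in kvCacheManager.cpp:
ToyBlockManager, detachOutOfWindowBlock, and every member name are
hypothetical, and it assumes blocks and sequences are plain integer ids.
Only getFreeBlock and removeSequence borrow their names from the text.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Toy model only: all names here are hypothetical and do not mirror the
// real WindowBlockManager interface.
using BlockId = int;
using SequenceId = int;

class ToyBlockManager
{
public:
    explicit ToyBlockManager(int numBlocks)
    {
        for (BlockId b = 0; b < numBlocks; ++b)
        {
            mFreeBlocks.push_back(b);
        }
    }

    // Register a live sequence; it starts out eligible for store for reuse.
    void addSequence(SequenceId seq)
    {
        mReusable[seq] = true;
    }

    // Hand out a free block and record the new owner. If the block was last
    // owned by a different sequence that is still live, that sequence can no
    // longer be stored for reuse.
    BlockId getFreeBlock(SequenceId requester)
    {
        assert(!mFreeBlocks.empty());
        BlockId const block = mFreeBlocks.back();
        mFreeBlocks.pop_back();

        auto const owner = mOwner.find(block);
        if (owner != mOwner.end() && owner->second != requester
            && mReusable.count(owner->second) > 0)
        {
            mReusable[owner->second] = false;
        }
        mOwner[block] = requester;
        return block;
    }

    // A block that went out of window is released by its sequence and becomes
    // reclaimable; the ownership record is kept until someone reclaims it.
    void detachOutOfWindowBlock(BlockId block)
    {
        mFreeBlocks.push_back(block);
    }

    // Returns true if none of the sequence's blocks was reclaimed by another
    // sequence, i.e. the sequence would be safe to store for reuse.
    // A real manager would also return the remaining blocks to the free
    // list; that part is omitted here.
    bool removeSequence(SequenceId seq)
    {
        bool const reusable = mReusable.at(seq);
        mReusable.erase(seq);
        return reusable;
    }

private:
    std::vector<BlockId> mFreeBlocks;
    std::unordered_map<BlockId, SequenceId> mOwner;
    std::unordered_map<SequenceId, bool> mReusable;
};
```

In this toy, if sequence 1 detaches an OOW block and sequence 2 then
acquires it through getFreeBlock, removeSequence(1) reports that
sequence 1 is no longer safe to store, while sequence 2 is unaffected.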

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-13 09:18:12 -07:00
utils [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348) 2025-09-27 19:29:30 -04:00
allocateKvCache.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
assignReqSeqSlots.cpp [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 (#6537) 2025-08-15 09:52:06 -07:00
cacheFormatter.cpp [None][feat] perf_metrics endpoint functionality improvement (#8005) 2025-10-02 17:43:25 -07:00
cacheFormatter.h [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348) 2025-09-27 19:29:30 -04:00
cacheTransBuffer.cpp [None][feat] Optimize kv cache transfer TEP (#7613) 2025-09-25 20:20:04 -07:00
cacheTransBuffer.h [TRTLLM-7361][feat] KV cache transfer for uneven pp (#7117) 2025-09-08 13:37:46 -04:00
cacheTransceiver.cpp [TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520) 2025-10-04 08:12:24 +08:00
capacityScheduler.cpp refactor: Scheduling based on KV cache state (#4865) 2025-06-16 08:14:58 +02:00
CMakeLists.txt [TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520) 2025-10-04 08:12:24 +08:00
contextProgress.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
createNewDecoderRequests.cpp [None] [refactor] Minor cleanup and improvements (#7619) 2025-10-03 11:40:06 +02:00
dataTransceiver.cpp [None][fix] Add Lock to protect mReqeustToSession (#8085) 2025-10-10 21:51:50 +08:00
dataTransceiver.h [None][feat] Support for cancelling requests with disaggregation (#8114) 2025-10-02 11:04:26 -07:00
decoderBuffers.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
encoderBuffers.cpp Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
encoderBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
evictionPolicy.cpp [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922) 2025-10-13 09:18:12 -07:00
guidedDecoder.cpp [TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893) 2025-09-23 09:10:09 +08:00
handleContextLogits.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
handleGenerationLogits.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
kvCacheEventManager.cpp [TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563) 2025-08-07 14:17:07 +02:00
kvCacheManager.cpp [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922) 2025-10-13 09:18:12 -07:00
kvCacheTransferManager.cpp [None][feat] Nixl support for GDS (#5488) 2025-09-09 13:00:38 +08:00
llmRequest.cpp [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348) 2025-09-27 19:29:30 -04:00
logitsPostProcessor.cpp [None][chore] Mass integration of release/1.0 - 3rd (#7519) 2025-09-08 14:03:04 +08:00
loraBuffers.cpp fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399) 2025-05-19 14:25:36 -07:00
loraBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
makeDecodingBatchInputOutput.cpp refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) 2025-07-18 12:12:08 +02:00
medusaBuffers.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
microBatchScheduler.cpp [nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) (#4621) 2025-05-26 17:10:55 +08:00
mlaCacheFormatter.cpp [None][feat] perf_metrics endpoint functionality improvement (#8005) 2025-10-02 17:43:25 -07:00
mlaCacheFormatter.h [TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side (#7624) 2025-09-20 06:15:26 -07:00
pauseRequests.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
peftCacheManager.cpp [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510) 2025-08-07 09:05:36 +03:00
promptTuningBuffers.cpp perf: Removing initializing ptuning buffers to zero (#4915) 2025-06-09 21:57:21 -04:00
rnnStateBuffers.cpp [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092) 2025-05-14 23:10:04 +02:00
rnnStateBuffers.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
rnnStateManager.cpp fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399) 2025-05-19 14:25:36 -07:00
runtimeBuffers.cpp Revert "feat: nanobind bindings (#5961)" (#6160) 2025-07-18 10:12:54 +08:00
scheduledBlocksManager.h refactor: Scheduling based on KV cache state (#4865) 2025-06-16 08:14:58 +02:00
sequenceSlotManager.cpp refactor: Remove enforced sorted order of batch slots (#3502) 2025-07-14 17:23:02 +02:00
transformerBuffers.cpp refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384) 2025-06-26 19:45:52 +08:00
trtEncoderModel.cpp refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtEncoderModel.h refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtGptModel.h refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtGptModelFactory.h refactor: remove TrtGptModelOptionalParams (#5165) 2025-06-20 10:31:40 +02:00
trtGptModelInflightBatching.cpp [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
trtGptModelInflightBatching.h [https://nvbugs/5501557][fix] Fix out-of-bounds vector access for model with multiple layer types (#7636) 2025-09-22 14:28:38 +08:00
updateDecoderBuffers.cpp refactor: Speculative decoding buffers part 2 (#5316) 2025-06-27 17:41:48 +02:00