Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-14 06:27:45 +08:00)
This MR is a preliminary step toward implementing the SWA reuse mechanism for the KV cache manager. Please be aware that **no functional change is intended** in this merge request. The purpose of the clean-up is to decouple and remove existing functions so that the upcoming SWA KV cache reuse change is more natural and easier to review.

Right now, (1) streamingLLM and (2) beam search with SWA are broken. We do not want to complicate the code base by stacking more features on top of something that does not work. This MR prunes the related logic and adds assertions, so we can come back later to re-support the broken features and remove the assertions. Since streamingLLM (sink attention) is currently broken, an assertion is added in the `KVCacheManager` ctor to guard the values of `mSinkBlockTokenLength` and `mSinkBubbleLength`, and the compute logic related to them is pruned. Beam search with SWA will remain broken after SWA KV cache reuse is introduced; we will revisit this in the future. On top of this, we should make an effort to update the [supporting matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md) of the KV cache manager after merging the SWA KV cache reuse support.

Changes are listed as follows:

- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken` and `KVCacheManager::removeToken`. The two directions should be decoupled.
- Push the utilities `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from `KVCacheManager` down to `WindowBlockManager`. Functions exposed by `KVCacheManager` should be real utilities that users of the structure can leverage; implementation-detail calls should not live at this level.
- Simplify the "is shared last context block" logic in `KVCacheManager::addSequence`.

Since no functional change is intended in this merge request, no test cases are added. Several comments are added as reminders for future test coverage.
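The first two changes above can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the real `KVCacheManager` (which lives in the batch manager sources and is far more involved): it only shows the shape of the ctor guard on the sink parameters and the split of a single toggle-style `updateToken` into separate `addToken`/`removeToken` entry points.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch; member and class names other than addToken/removeToken
// and the mSink* parameters are illustrative only.
class KVCacheManagerSketch
{
public:
    KVCacheManagerSketch(int32_t sinkBlockTokenLength, int32_t sinkBubbleLength)
    {
        // Sink attention (streamingLLM) is currently unsupported, so the ctor
        // rejects any non-default sink configuration instead of silently
        // computing with it. (The real code uses an assertion macro.)
        if (sinkBlockTokenLength != 0 || sinkBubbleLength != 0)
        {
            throw std::invalid_argument("streamingLLM / sink attention is not supported");
        }
    }

    // Formerly one updateToken(seq, /*addToken=*/bool); decoupled so each
    // direction reads on its own and can be called independently.
    void addToken(std::vector<int32_t>& seq, int32_t token)
    {
        seq.push_back(token);
    }

    void removeToken(std::vector<int32_t>& seq)
    {
        assert(!seq.empty());
        seq.pop_back();
    }
};
```

With the split, callers state their intent directly (`addToken` on generation, `removeToken` on rewind) instead of passing a boolean into one overloaded update path.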
For `LlmRequestTest.ParamTest`, `streaming=True` is commented out because sink attention is now guarded by an assertion. In `capacitySchedulerTest`, the `addToken` call on `crossKVCacheManager` is removed because in an encoder-decoder model, generation tokens are added only to the decoder cache, not to the encoder cache.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
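The `capacitySchedulerTest` change reflects a general encoder-decoder caching asymmetry, sketched below with hypothetical types: the cross-attention KV cache is written once from the encoder output and stays fixed, while the decoder self-attention cache grows by one entry per generated token.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for a per-sequence KV cache; not the real manager API.
struct CacheSketch
{
    std::vector<int32_t> tokens;

    void addToken(int32_t t)
    {
        tokens.push_back(t);
    }
};

// One generation step: only the decoder self-attention cache gains a token;
// the cross-attention cache is read-only during generation.
inline void generateStep(CacheSketch& selfCache, CacheSketch& crossCache, int32_t newToken)
{
    selfCache.addToken(newToken);
    (void) crossCache; // intentionally untouched
}
```

This is why the test no longer mirrors each generated token into the cross KV cache manager: doing so modeled growth that never happens in an encoder-decoder model.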