Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-14 06:27:45 +08:00)
This MR is a preliminary step toward implementing the SWA reuse mechanism for the KV cache manager. Please be aware that **no functional change is intended** in this merge request. The purpose of the clean-up is to decouple and remove existing functions so that the upcoming SWA KV cache reuse change is more natural and easier to review.

Right now, (1) streamingLLM and (2) beam search with SWA are broken. We do not want to complicate the code base by stacking more features on top of something that does not work. This MR prunes the related logic and adds assertions, so we can come back later to re-support the broken features and remove the assertions. Since streamingLLM (sink attention) is currently broken, an assertion is added in the `KVCacheManager` ctor to guard the values of `mSinkBlockTokenLength` and `mSinkBubbleLength`, and the compute logic related to them is pruned. Beam search with SWA will remain broken after SWA KV cache reuse is introduced; we will revisit this in the future. On top of this, we should make an effort to update the [supporting matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md) of the KV cache manager after merging the SWA KV cache reuse support.

Changes are listed as follows:

- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken` and `KVCacheManager::removeToken`. The two directions should be decoupled.
- Push the utilities `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from `KVCacheManager` down to `WindowBlockManager`. Functions exposed by `KVCacheManager` should be real utilities that users of the structure can leverage; implementation-detail calls should not live at this level.
- Simplify the "is shared last context block" logic in `KVCacheManager::addSequence`.

Since no functional change is intended in this merge request, no test cases are added. Several comments are added as reminders for future test coverage.
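The first two changes above can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the real `KVCacheManager` (which lives in the batch manager sources and is far more involved): it only shows the shape of the ctor guard on the sink parameters and the split of a single toggle-style `updateToken` into separate `addToken`/`removeToken` entry points.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch; member and class names other than addToken/removeToken
// and the mSink* parameters are illustrative only.
class KVCacheManagerSketch
{
public:
    KVCacheManagerSketch(int32_t sinkBlockTokenLength, int32_t sinkBubbleLength)
    {
        // Sink attention (streamingLLM) is currently unsupported, so the ctor
        // rejects any non-default sink configuration instead of silently
        // computing with it. (The real code uses an assertion macro.)
        if (sinkBlockTokenLength != 0 || sinkBubbleLength != 0)
        {
            throw std::invalid_argument("streamingLLM / sink attention is not supported");
        }
    }

    // Formerly one updateToken(seq, /*addToken=*/bool); decoupled so each
    // direction reads on its own and can be called independently.
    void addToken(std::vector<int32_t>& seq, int32_t token)
    {
        seq.push_back(token);
    }

    void removeToken(std::vector<int32_t>& seq)
    {
        assert(!seq.empty());
        seq.pop_back();
    }
};
```

With the split, callers state their intent directly (`addToken` on generation, `removeToken` on rewind) instead of passing a boolean into one overloaded update path.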
For `LlmRequestTest.ParamTest`, `streaming=True` is commented out because sink attention is now guarded by an assertion. In `capacitySchedulerTest`, the `addToken` call on `crossKVCacheManager` is removed because in an encoder-decoder model, generation tokens are added only to the decoder cache, not to the encoder cache.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
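The `capacitySchedulerTest` change reflects a general encoder-decoder caching asymmetry, sketched below with hypothetical types: the cross-attention KV cache is written once from the encoder output and stays fixed, while the decoder self-attention cache grows by one entry per generated token.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for a per-sequence KV cache; not the real manager API.
struct CacheSketch
{
    std::vector<int32_t> tokens;

    void addToken(int32_t t)
    {
        tokens.push_back(t);
    }
};

// One generation step: only the decoder self-attention cache gains a token;
// the cross-attention cache is read-only during generation.
inline void generateStep(CacheSketch& selfCache, CacheSketch& crossCache, int32_t newToken)
{
    selfCache.addToken(newToken);
    (void) crossCache; // intentionally untouched
}
```

This is why the test no longer mirrors each generated token into the cross KV cache manager: doing so modeled growth that never happens in an encoder-decoder model.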