TensorRT-LLMs/cpp
Yueh-Ting (eop) Chen 020fed97b6
[TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse (#6767)
This MR is a preliminary MR for implementing the SWA reuse mechanism for
the kv cache manager. Please be aware that **no functional change is
intended** in this merge request. The purpose of the clean-up is to
decouple and remove existing functions for the up-coming SWA KV cache
reuse change to be more natural and easier to review.

Right now, (1) streamLLM, and (2) beam search with SWA, are broken. We
do not want to complicate the code base by stacking more features upon
something that does not work. This MR prunes out the logic and add
assertions so we can come back and re-support the broken feature and
remove the assertion.

Since streamLLM (sink attention) is broken now, assertion is added
under `KVCacheManager` ctor to guard for the value of
`mSinkBlockTokenLength` and `mSinkBubbleLength`. Compute logics relate
to it are pruned.

The beam search with SWA will still be broke when introducing the SWA
KV cache reuse. We will revisit this problem in the future.

On top of this, we should make an effort to update the [supporting matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md)
of the kv cache manager after merging the support of SWA KV cache reuse.

Changes are listed as following:
- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken`
  and `KVCacheManager::removeToken`. The functionality should be
  decoupled.
- Push utility `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from
  `KVCacheManager` down to `WindowBlockManager`. `KVCacheManager`-exposed
  functions should be real utilities that users of the structure can
  leverage. Implementation-detailed function calls should not exist at
  this level.
- Simplify "is shared last context block" logic under
  `KVCacheManager::addSequence`.

Since no functional change is intended in this merge request, no test
case is added. Several comments are added for future test coverage
reminder.

For `LlmRequestTest.ParamTest`, `streaming=True` is commented out
because we guard sink attention with assertion now.

In `capacitySchedulerTest`, `addToken` action to `crossKVCacheManager`
is removed because in encoder-decoder model, generation tokens are
added only to the decoder and not to the encoder.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-08-20 13:57:57 +08:00
..
cmake feat: reduce unnecessary kernel generation (#5476) 2025-07-04 14:37:49 +08:00
include/tensorrt_llm [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse (#6767) 2025-08-20 13:57:57 +08:00
kernels [None][feat] Use Separate QKV Input Layout for Context MLA (#6538) 2025-08-19 22:04:48 +08:00
micro_benchmarks [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE (#6231) 2025-08-08 11:13:42 +08:00
tensorrt_llm [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse (#6767) 2025-08-20 13:57:57 +08:00
tests [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse (#6767) 2025-08-20 13:57:57 +08:00
CMakeLists.txt [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures (#6836) 2025-08-15 11:16:07 +08:00
conandata.yml infra: add conan (#3744) 2025-04-30 11:53:14 -07:00
conanfile.py feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818) 2025-06-08 10:25:18 +08:00
libnuma_conan.py fix cuda driver link issue with driver version less than 12.3 (#5025) 2025-06-10 15:27:39 +08:00