For VSWA scheme, we do not want `kv_cache_cnonfig.max_token` to control
and cap the maximum memory of a block pool because block pool size are
not identical amongst different window sizes. This MR omits the effect
of `kv_cache_config.max_tokens` under `kvCacheManager.cpp` to allow the
setting of block pool size to rely on the window size to share ratio
and the total gpu memory analyzed and fed to the kv cache manager.
Only skipping for VSWA scheme, no extra coverage was added.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
This MR is a continuation of #6768. In the previous merge request,
OOW (out-of-window) blocks are only detached when reuse is not enabled,
that is, the block movement behavior is identical between SWA and full
attention when reuse is enabled.
This merge request attempts to enable OOW block detach when reuse is
enabled. The required changes are:
- Let KV cache manager keep track of which block is used by which
sequence
- Remove restriction for the eviction policy to be able to release a
non-leaf block
Along with the development, bugs inside freeChildren and offload
mechanism under getFreeBlock is resolved because they will affect the
functionality this merge request is trying to achieve.
When a block goes OOW, it is released from the sequence, it will be
available to be reclaimed and the block is held by the eviction policy
for another sequence to acquire upon calling. On the other hand, we
want to potentially store the sequence for reuse. To safely achieve
this, the record of block ownership is done under
WindowBlockManager::getFreeBlock. If the block acquired was originally
owned by another sequence that is live inside the manager, then we
invalidate the sequence for store for reuse.
At the end of a sequence (when removeSequence is called toward it),
the KV cache manager will check if the sequence has all blocks not
reclaimed by another sequence. If so, then the sequence is safe to
be stored for reuse and store for reuse action will be performed.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
By default, we allocate equal proportion shares of memory for all
window sizes (see the else case). With TRTLLM_WINDOW_SIZE_SHARES,
we can override this behavior to adjust the memory share of each
window size. For example, if we have window size of [512, 32768],
then setting TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 will be allocating
40% of the memory to window size 512 and 60% of the memory to window
size 32768.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
This merge request attempts to support more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) only holds "window size" number of blocks
and reuse them in a cyclic manner. We will not be able to utilize more
GPU memory with this design, leading to a limited max batch size
throughput. Additionally, we will not be able to support KV cache reuse
with this design.
In this MR, we change such behavior to let the manager write blocks in
a linear manner. With a linear block writing behavior, as the attention
window moves on, the out-of-window (OOW) blocks will be detached. Right
now for the sake of a correct feature first, we directly offload the
OOW block from the primary block pool (GPU memory) to the secondary
block pool (host memory). We will improve this in the future by
delegating the block movement to the eviction policy.
KV cache reuse for SWA is not developed in this merge request and will
be amended in a follow-up merge request.
Writing the blocks linearly, the maximum number of blocks allocated for
a sequence(`GenerationRequest`) is the "max sequence length" specified.
The `GenerationRequest` that stores the cache block bookkeeping
structure will now keep "max sequence length" tokens of blocks.
Given the above, main changes are (more context in the MR):
- Remove "cyclic" concept under the kv cache manager, such concept
originally guards the block reuse under kv cache manager.
- Add detach mechanism and have it under `KVCacheManager::addToken`.
Please note that detach is still guarded off for SWA when reuse
is enabled. A follow-up merge request will proceed to improve this.
- Enforce "max sequence length" to be a non-optional parameter to
the `KVCacheManager`/`BlockManager`
- Let all window size resource pool get identical proportion of memory
- Fix free memory calculation under `resource_manager.py`
Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
This MR is a preliminary MR for implementing the SWA reuse mechanism for
the kv cache manager. Please be aware that **no functional change is
intended** in this merge request. The purpose of the clean-up is to
decouple and remove existing functions for the up-coming SWA KV cache
reuse change to be more natural and easier to review.
Right now, (1) streamLLM, and (2) beam search with SWA, are broken. We
do not want to complicate the code base by stacking more features upon
something that does not work. This MR prunes out the logic and add
assertions so we can come back and re-support the broken feature and
remove the assertion.
Since streamLLM (sink attention) is broken now, assertion is added
under `KVCacheManager` ctor to guard for the value of
`mSinkBlockTokenLength` and `mSinkBubbleLength`. Compute logics relate
to it are pruned.
The beam search with SWA will still be broke when introducing the SWA
KV cache reuse. We will revisit this problem in the future.
On top of this, we should make an effort to update the [supporting matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md)
of the kv cache manager after merging the support of SWA KV cache reuse.
Changes are listed as following:
- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken`
and `KVCacheManager::removeToken`. The functionality should be
decoupled.
- Push utility `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from
`KVCacheManager` down to `WindowBlockManager`. `KVCacheManager`-exposed
functions should be real utilities that users of the structure can
leverage. Implementation-detailed function calls should not exist at
this level.
- Simplify "is shared last context block" logic under
`KVCacheManager::addSequence`.
Since no functional change is intended in this merge request, no test
case is added. Several comments are added for future test coverage
reminder.
For `LlmRequestTest.ParamTest`, `streaming=True` is commented out
because we guard sink attention with assertion now.
In `capacitySchedulerTest`, `addToken` action to `crossKVCacheManager`
is removed because in encoder-decoder model, generation tokens are
added only to the decoder and not to the encoder.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
No functional change is intended in this MR.
`WindowBlockManager::mCachedBlocksRoot` is now who is responsible
for the bookkeeping of the `KVCacheBlock`, and the `mNextBlocks` is
now the actual hash map that fetches the block.
The `mEnableHashKey` knob and related hashing is removed.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
* implement variable window attention by breaking the block manager into window block managers per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* revert isCyclic to be true if the min attention window is reached, not per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add explanatory comment to mCyclicThreshold
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* load correct gemma config
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* if TYPE_CHECKING
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* pass dtype as well
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* test_gemma variable sliding window attention
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove || mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* turn off request delaying for MaxUtil
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* make comments better
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* windowSizesTotalSum using std::accumulate
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comments
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove assert that kills disagg tests, since it isn't necessary
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. Main is correct
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add Gemma3 to SUPPORTED_HF_ARCHITECTURES
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* support Gemma3
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix kvfactor field for deepseek
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix gemma-3 entries in testlist to include vswa
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* only quantize gemma2 VSWA
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
remove misleading comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix: disable KV cache reuse if using attention sink (#3021)
* fix: disable KV cache reuse if using attention sink
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: disable KV cache reuse if sink bubble
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* add comment
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>