Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Dongxu Yang <dongxuy@nvidia.com>
Co-authored-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Batsheva Black <132911331+BatshevaBlack@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
This MR is a preliminary step toward implementing the SWA reuse
mechanism for the KV cache manager. Please be aware that **no functional
change is intended** in this merge request. The purpose of the clean-up
is to decouple and remove existing functions so that the upcoming SWA KV
cache reuse change is more natural and easier to review.
Right now, (1) streamLLM and (2) beam search with SWA are broken. We
do not want to complicate the code base by stacking more features upon
something that does not work. This MR prunes the broken logic and adds
assertions, so that we can come back later, re-support the broken
features, and remove the assertions.
Since streamLLM (sink attention) is broken now, an assertion is added
in the `KVCacheManager` ctor to guard the values of
`mSinkBlockTokenLength` and `mSinkBubbleLength`. The compute logic
related to them is pruned.
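A minimal sketch of such a constructor guard follows. It is an illustration only: `SketchKVCacheManager` and its accessors are hypothetical, the assumption that both sink lengths must be zero is ours, and an exception stands in for the project's assertion so that the behavior is easy to observe.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the real manager; only the ctor guard is modeled.
class SketchKVCacheManager
{
public:
    SketchKVCacheManager(int sinkBlockTokenLength, int sinkBubbleLength)
        : mSinkBlockTokenLength(sinkBlockTokenLength)
        , mSinkBubbleLength(sinkBubbleLength)
    {
        // Assumption: sink attention is rejected by requiring both values to be zero.
        if (mSinkBlockTokenLength != 0 || mSinkBubbleLength != 0)
        {
            throw std::invalid_argument("sink attention (streamLLM) is not supported");
        }
    }

    int sinkBlockTokenLength() const { return mSinkBlockTokenLength; }
    int sinkBubbleLength() const { return mSinkBubbleLength; }

private:
    int mSinkBlockTokenLength;
    int mSinkBubbleLength;
};
```

With this shape, constructing the manager with any non-zero sink length fails immediately instead of reaching the pruned compute path.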
Beam search with SWA will still be broken when the SWA KV cache reuse
is introduced. We will revisit this problem in the future.
On top of this, we should make an effort to update the [support matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md)
of the KV cache manager after merging the support of SWA KV cache reuse.
The changes are as follows:
- Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken`
and `KVCacheManager::removeToken`, since the two operations should be
decoupled.
- Push the utilities `cacheSequenceBlockOffsets` and `cacheNewBlockOffset`
from `KVCacheManager` down to `WindowBlockManager`. Functions exposed by
`KVCacheManager` should be real utilities that users of the structure
can leverage. Implementation details should not be exposed at this
level.
- Simplify "is shared last context block" logic under
`KVCacheManager::addSequence`.
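
The `addToken` / `removeToken` split can be pictured with the sketch below. It is not the code in this MR: `SketchSequence`, its counters, and the block-boundary rule are hypothetical and only show why two single-purpose functions are easier to reason about than one combined update.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-sequence bookkeeping: token count and allocated block count.
class SketchSequence
{
public:
    explicit SketchSequence(std::size_t tokensPerBlock)
        : mTokensPerBlock(tokensPerBlock)
    {
        assert(tokensPerBlock > 0);
    }

    // Append one token; allocate a block when all current blocks are full.
    void addToken()
    {
        if (mNumTokens % mTokensPerBlock == 0)
        {
            ++mNumBlocks;
        }
        ++mNumTokens;
    }

    // Drop the last token; release the last block once it becomes empty.
    void removeToken()
    {
        assert(mNumTokens > 0);
        --mNumTokens;
        if (mNumTokens % mTokensPerBlock == 0)
        {
            --mNumBlocks;
        }
    }

    std::size_t numTokens() const { return mNumTokens; }
    std::size_t numBlocks() const { return mNumBlocks; }

private:
    std::size_t mTokensPerBlock;
    std::size_t mNumTokens{0};
    std::size_t mNumBlocks{0};
};
```

Each function touches the counters in one direction only, so a caller never has to pass a signed delta or a flag to select the behavior.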
Since no functional change is intended in this merge request, no test
case is added. Several comments are added as reminders for future test
coverage.
For `LlmRequestTest.ParamTest`, `streaming=True` is commented out
because sink attention is now guarded by an assertion.
In `capacitySchedulerTest`, the `addToken` call on `crossKVCacheManager`
is removed because, in an encoder-decoder model, generation tokens are
added only to the decoder and not to the encoder.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
No functional change is intended in this MR.
`WindowBlockManager::mCachedBlocksRoot` is now responsible for the
bookkeeping of `KVCacheBlock`s, and `mNextBlocks` is now the actual
hash map used to fetch a block.
The `mEnableHashKey` knob and the related hashing are removed.
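
A rough sketch of a root-plus-hash-map lookup of this kind is shown below. Only the names `mCachedBlocksRoot` and `mNextBlocks` come from this change; `SketchBlock`, `SketchBlockTree`, `TokenBlockHash`, the token-vector keys, and the `store` / `match` functions are hypothetical and do not mirror the real `WindowBlockManager` interface.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical hasher for the tokens held by one block.
struct TokenBlockHash
{
    std::size_t operator()(std::vector<int> const& tokens) const
    {
        std::size_t seed = tokens.size();
        for (int token : tokens)
        {
            seed ^= std::hash<int>{}(token) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Hypothetical block node: children are found through the mNextBlocks map.
struct SketchBlock
{
    explicit SketchBlock(int blockId) : id(blockId) {}

    int id;
    std::unordered_map<std::vector<int>, std::shared_ptr<SketchBlock>, TokenBlockHash> mNextBlocks;
};

class SketchBlockTree
{
public:
    SketchBlockTree() : mCachedBlocksRoot(std::make_shared<SketchBlock>(-1)) {}

    // Record a chain of full blocks, reusing nodes that already exist.
    void store(std::vector<std::vector<int>> const& blockKeys)
    {
        auto node = mCachedBlocksRoot;
        for (auto const& key : blockKeys)
        {
            auto it = node->mNextBlocks.find(key);
            if (it == node->mNextBlocks.end())
            {
                it = node->mNextBlocks.emplace(key, std::make_shared<SketchBlock>(mNextId++)).first;
            }
            node = it->second;
        }
    }

    // Return the ids of the longest cached prefix of blockKeys.
    std::vector<int> match(std::vector<std::vector<int>> const& blockKeys) const
    {
        std::vector<int> ids;
        auto node = mCachedBlocksRoot;
        for (auto const& key : blockKeys)
        {
            auto it = node->mNextBlocks.find(key);
            if (it == node->mNextBlocks.end())
            {
                break;
            }
            ids.push_back(it->second->id);
            node = it->second;
        }
        return ids;
    }

private:
    std::shared_ptr<SketchBlock> mCachedBlocksRoot;
    int mNextId{0};
};
```

In this sketch every lookup starts at the root and follows one `mNextBlocks` map per block, so no separate global hash key is needed to find a cached block.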
Signed-off-by: eopXD <yuehtingc@nvidia.com>