TensorRT-LLM/examples
Yueh-Ting (eop) Chen cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request adds more SWA KV cache functionality to the KV cache
manager. Before this merge request, the KV cache for sliding window
attention (SWA) held only "window size" worth of blocks and reused them
in a cyclic manner. With that design we cannot utilize more GPU memory,
which limits the maximum batch size and throughput. We also cannot
support KV cache reuse with that design.
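
As a rough illustration, the difference between the two behaviors can be sketched as follows (block size and function names are made up for illustration, not the actual implementation):

```python
# Minimal sketch (illustrative only, not the TensorRT-LLM code) of the two
# block-indexing schemes for a sliding-window-attention (SWA) sequence.
TOKENS_PER_BLOCK = 64  # assumed block granularity


def cyclic_block_index(token_pos: int, window_size: int) -> int:
    """Old behavior: only enough blocks for `window_size` tokens exist,
    so block slots are overwritten in a cyclic manner."""
    blocks_in_window = -(-window_size // TOKENS_PER_BLOCK)  # ceil division
    return (token_pos // TOKENS_PER_BLOCK) % blocks_in_window


def linear_block_index(token_pos: int) -> int:
    """New behavior: blocks are written linearly, so out-of-window blocks
    can be detached and offloaded instead of being overwritten."""
    return token_pos // TOKENS_PER_BLOCK
```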

In this MR, we change this behavior so that the manager writes blocks in
a linear manner. With linear block writing, as the attention window
advances, the out-of-window (OOW) blocks are detached. For now, to get a
correct feature first, we directly offload the OOW blocks from the
primary block pool (GPU memory) to the secondary block pool (host
memory). We will improve this in the future by delegating the block
movement to the eviction policy.
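
A toy version of the detach/offload flow is sketched below (class and field names are hypothetical; the real logic lives in the C++ `KVCacheManager`/`BlockManager` and will eventually delegate block movement to the eviction policy):

```python
from collections import deque


class SwaSequenceBlocks:
    """Toy per-sequence bookkeeping: as the attention window advances, blocks
    that fall fully out of the window are detached from the primary (GPU)
    pool and offloaded to the secondary (host) pool."""

    def __init__(self, window_size: int, tokens_per_block: int = 64):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        self.gpu_blocks = deque()   # (first_token_pos, block_id) held on GPU
        self.host_blocks = []       # block ids offloaded to host memory
        self.num_tokens = 0
        self.next_block_id = 0

    def add_token(self):
        # Allocate a new block whenever the sequence crosses a block boundary.
        if self.num_tokens % self.tokens_per_block == 0:
            self.gpu_blocks.append((self.num_tokens, self.next_block_id))
            self.next_block_id += 1
        self.num_tokens += 1
        self._detach_out_of_window()

    def _detach_out_of_window(self):
        window_start = max(0, self.num_tokens - self.window_size)
        # A block is out of window once its last token precedes the window start.
        while self.gpu_blocks:
            first_pos, block_id = self.gpu_blocks[0]
            if first_pos + self.tokens_per_block <= window_start:
                self.gpu_blocks.popleft()
                self.host_blocks.append(block_id)  # offload GPU -> host
            else:
                break
```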

KV cache reuse for SWA is not implemented in this merge request and will
be added in a follow-up merge request.

With blocks written linearly, the maximum number of blocks allocated for
a sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest` that stores the cache block
bookkeeping structure now keeps enough blocks to cover "max sequence
length" tokens.
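
In other words, the per-sequence block budget is now a function of "max sequence length" rather than the window size (numbers below are illustrative):

```python
def max_blocks_per_sequence(max_seq_len: int, tokens_per_block: int) -> int:
    """With linear writing, a sequence may hold blocks covering up to
    `max_seq_len` tokens, regardless of the SWA window size."""
    return -(-max_seq_len // tokens_per_block)  # ceil division


# e.g. max_seq_len=8192, tokens_per_block=64 -> 128 blocks are bookkept for
# the sequence, even if the attention window covers far fewer tokens.
```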

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; this concept
  originally guarded block reuse inside the KV cache manager.
- Add a detach mechanism and invoke it from `KVCacheManager::addToken`.
  Note that detach is still guarded off for SWA when reuse is enabled;
  a follow-up merge request will improve this.
- Enforce "max sequence length" as a non-optional parameter to the
  `KVCacheManager`/`BlockManager`.
- Give every window-size resource pool an identical proportion of memory
  (see the sketch after this list).
- Fix the free-memory calculation in `resource_manager.py`.
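
A simplified sketch of the "identical proportion" split, done here over block counts rather than memory bytes (function name is illustrative):

```python
def split_evenly_across_window_sizes(
        total_blocks: int, window_sizes: list[int]) -> dict[int, int]:
    """Every window-size resource pool gets an identical share of the budget
    (simplified to block counts for illustration)."""
    share = total_blocks // len(window_sizes)
    return {w: share for w in window_sizes}


# e.g. split_evenly_across_window_sizes(1200, [512, 4096, 32768])
# -> {512: 400, 4096: 400, 32768: 400}
```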

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
apps [TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312) 2025-06-20 03:01:10 +08:00
auto_deploy [#7308] [feat] AutoDeploy: graph-less transformers mode for HF (#7635) 2025-09-18 10:44:24 +08:00
bindings/executor Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
cpp/executor [TRTLLM-7030][fix] Refactor the example doc of dist-serving (#6766) 2025-08-13 17:39:27 +08:00
cpp_library Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
disaggregated [None][chore] Update benchmark script (#7860) 2025-09-23 03:15:42 -07:00
dora Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
draft_target_model Fix: draft target README and set exclude_input_in_output to False (#4882) 2025-06-03 23:45:02 -07:00
eagle doc: remove the outdated features which marked as Experimental (#5995) 2025-08-06 22:01:42 -04:00
infinitebench Update TensorRT-LLM (#1725) 2024-06-04 20:26:32 +08:00
language_adapter Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
llm-api [TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250) 2025-09-22 03:40:02 -07:00
llm-eval/lm-eval-harness chore: update doc by replacing use_cuda_graph with cuda_graph_config (#5680) 2025-07-04 15:39:15 +09:00
lookahead doc: fix path after examples migration (#3814) 2025-04-24 02:36:45 +08:00
medusa feat: adding multimodal (only image for now) support in trtllm-bench (#3490) 2025-04-18 07:06:16 +08:00
models [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
ngram [chore] Clean up quickstart_advanced.py (#6021) 2025-07-21 15:00:59 -04:00
openai_triton Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
python_plugin Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
quantization [#6530][fix] Fix script when using calibration tensors from modelopt (#6803) 2025-08-12 20:41:10 -07:00
redrafter ReDrafter support for Qwen (#4875) 2025-06-28 02:33:10 +08:00
sample_weight_stripping doc: remove the outdated features which marked as Experimental (#5995) 2025-08-06 22:01:42 -04:00
scaffolding [https://nvbugs/5517260][fix] move scaffolding contrib module's import to subdirectory (#7758) 2025-09-17 11:36:33 +08:00
serve [None][chore] Enhance trtllm-serve example test (#6604) 2025-08-06 20:30:35 +08:00
trtllm-eval test: Add LLGuidance test and refine guided decoding (#5348) 2025-06-25 14:12:56 +08:00
wide_ep [None][chore] Update benchmark script (#7860) 2025-09-23 03:15:42 -07:00
constraints.txt [None][chore] Version bump for 1.1.0rc6 (#7824) 2025-09-18 11:13:56 +08:00
eval_long_context.py Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338) 2025-04-08 23:51:27 +08:00
generate_checkpoint_config.py Update TensorRT-LLM (#2562) 2024-12-11 00:31:05 -08:00
generate_xgrammar_tokenizer_info.py Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
hf_lora_convert.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
mmlu.py feat: run mmlu and summarize without engine_dir. (#4056) 2025-05-05 19:35:07 +08:00
run.py [nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974) 2025-07-25 18:10:40 -04:00
summarize.py [refactor] Unify name of NGram speculative decoding (#5937) 2025-07-19 12:59:57 +08:00
utils.py [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00