TensorRT-LLM/tensorrt_llm
Yueh-Ting (eop) Chen cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request adds support for more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) only held a "window size" number of
blocks and reused them in a cyclic manner. With that design we cannot
utilize additional GPU memory, which limits the maximum batch size and
throughput, and we cannot support KV cache reuse.
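To make the difference concrete, here is a minimal sketch (not the
actual manager code; `tokens_per_block` and the helper names are
illustrative assumptions) contrasting cyclic block indexing with linear
block writing:

```python
def cyclic_block_index(token_pos: int, window_size: int,
                       tokens_per_block: int) -> int:
    """Old behavior: only window_size tokens' worth of blocks exist,
    and they are reused cyclically, so older blocks get overwritten."""
    blocks_in_window = window_size // tokens_per_block
    return (token_pos // tokens_per_block) % blocks_in_window


def linear_block_index(token_pos: int, tokens_per_block: int) -> int:
    """New behavior: blocks are written linearly, so each block keeps
    its own data until it is explicitly detached."""
    return token_pos // tokens_per_block


# With window_size=128 and tokens_per_block=32, token 200 lands in
# cyclic block 2 (overwriting old data) but in linear block 6.
assert cyclic_block_index(200, 128, 32) == 2
assert linear_block_index(200, 32) == 6
```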

In this MR, we change this behavior so that the manager writes blocks in
a linear manner. With linear block writing, blocks that fall out of the
attention window (out-of-window, OOW) are detached as the window moves
on. For now, to get the feature correct first, we directly offload OOW
blocks from the primary block pool (GPU memory) to the secondary block
pool (host memory). We will improve this in the future by delegating the
block movement to the eviction policy.
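The sketch below (hypothetical Python names; the real logic lives in the
C++ KV cache manager) illustrates the intended flow: as tokens are
added, blocks that fall out of the attention window are detached from
the primary (GPU) pool and, for now, offloaded to the secondary (host)
pool:

```python
from collections import deque


class SwaSequenceSketch:
    """Toy model of one sequence's block bookkeeping (illustrative only)."""

    def __init__(self, window_size: int, tokens_per_block: int):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        self.num_tokens = 0
        self.primary_blocks = deque()  # resident in GPU memory
        self.secondary_blocks = []     # offloaded to host memory

    def add_token(self) -> None:
        self.num_tokens += 1
        if (self.num_tokens - 1) % self.tokens_per_block == 0:
            # Linear writing: always append a fresh block, never wrap around.
            new_block = (self.num_tokens - 1) // self.tokens_per_block
            self.primary_blocks.append(new_block)
        self._detach_out_of_window()

    def _detach_out_of_window(self) -> None:
        # A block is out of window once none of its tokens can still fall
        # inside the attention window; here it is offloaded directly to host.
        max_primary = -(-self.window_size // self.tokens_per_block) + 1
        while len(self.primary_blocks) > max_primary:
            self.secondary_blocks.append(self.primary_blocks.popleft())
```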

KV cache reuse for SWA is not implemented in this merge request; it will
be added in a follow-up merge request.

With blocks written linearly, the maximum number of blocks allocated for
a sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest`, which holds the cache-block
bookkeeping structure, now keeps blocks for up to "max sequence length"
tokens.
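In other words, the per-sequence block budget is now bounded by the
sequence-length cap rather than by the attention window. A minimal
sketch of that bound (names are illustrative):

```python
def max_blocks_per_sequence(max_seq_len: int, tokens_per_block: int) -> int:
    # Ceiling division: enough blocks to hold max_seq_len tokens.
    return -(-max_seq_len // tokens_per_block)


# e.g. max_seq_len=4096 with 32 tokens per block needs 128 blocks,
# regardless of how small the attention window is.
assert max_blocks_per_sequence(4096, 32) == 128
```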

Given the above, the main changes are (more context in the MR):
- Remove "cyclic" concept under the kv cache manager, such concept
  originally guards the block reuse under kv cache manager.
- Add detach mechanism and have it under `KVCacheManager::addToken`.
  Please note that detach is still guarded off for SWA when reuse
  is enabled. A follow-up merge request will proceed to improve this.
- Enforce "max sequence length" to be a non-optional parameter to
  the `KVCacheManager`/`BlockManager`
- Let all window size resource pool get identical proportion of memory
- Fix free memory calculation under `resource_manager.py`
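A minimal sketch (assumed names and values) of the even memory split
referenced above, where each attention-window size gets the same share
of the free KV cache memory:

```python
def split_memory_evenly(free_bytes: int, window_sizes: list[int]) -> dict[int, int]:
    """Give every window-size pool an identical proportion of memory."""
    share = free_bytes // len(window_sizes)
    return {window: share for window in window_sizes}


# e.g. two window sizes (sketch values) each get half of the free memory
assert split_memory_evenly(96 * 1024**3, [4096, 32768]) == {
    4096: 48 * 1024**3,
    32768: 48 * 1024**3,
}
```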

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
_tensorrt_engine [TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312) 2025-06-20 03:01:10 +08:00
_torch [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
auto_parallel [None][fix] Migrate to new cuda binding package name (#6700) 2025-08-07 16:29:55 -04:00
bench [None][fix] refine backend option handling for commands (#7829) 2025-09-24 10:54:33 +08:00
commands [None][fix] refine backend option handling for commands (#7829) 2025-09-24 10:54:33 +08:00
evaluate [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294) 2025-09-23 16:05:05 -07:00
executor [#7692][fix] recognize RequestError as per-request error in background handler (#7726) 2025-09-24 11:11:17 +08:00
inputs [TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) (#7577) 2025-09-22 19:07:18 -07:00
layers [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629) 2025-08-15 17:15:49 -04:00
llmapi [None][chore] Make sampler type beta. (#7934) 2025-09-23 20:51:39 -07:00
metrics [None][feat] Core Metrics Implementation (#5785) 2025-08-09 02:48:53 -04:00
models [https://nvbugs/5496960][fix] Fix Gemma model forward. (#7509) 2025-09-22 14:28:38 +08:00
plugin feat: Add support for fp8 rowwise quantization (#4876) 2025-06-14 06:37:48 -07:00
quantization [OMNIML-2336][feat] Add NVFP4 x FP8 (#6809) 2025-09-04 09:03:38 -07:00
runtime [None][fix] Migrate to new cuda binding package name (#6700) 2025-08-07 16:29:55 -04:00
scaffolding [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294) 2025-09-23 16:05:05 -07:00
serve [TRTLLM-7182][test] add multi-nodes test for disagg-serving (#7470) 2025-09-24 08:31:56 +08:00
tools [None] [feat] nsys profile output kernel classifier (#7020) 2025-08-23 00:57:37 -04:00
__init__.py [https://nvbugs/5367180][fix] Fix xgrammar import before loading tensorrt_llm binary (#7906) 2025-09-23 00:29:57 -07:00
_common.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
_dlpack_utils.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
_ipc_utils.py [None][fix] Migrate to new cuda binding package name (#6700) 2025-08-07 16:29:55 -04:00
_mnnvl_utils.py [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size (#7331) 2025-09-04 08:10:03 -04:00
_utils.py [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568) 2025-09-16 09:56:18 +08:00
builder.py [TRTLLM-5930][doc] 1.0 Documentation. (#6696) 2025-09-09 12:16:03 +08:00
disaggregated_params.py [TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) (#7577) 2025-09-22 19:07:18 -07:00
functional.py [TRTLLM-6341][feature] Support SWA KV cache reuse (#6768) 2025-09-24 14:28:24 +08:00
graph_rewriting.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
logger.py [None][chore] Mass integration of release/1.0 - 3rd (#7519) 2025-09-08 14:03:04 +08:00
lora_helper.py [TRTLLM-6825][fix] Update lora for phi4-mm (#6817) 2025-08-21 22:00:04 -04:00
lora_manager.py [https://nvbugs/5467232][fix] Fix load_torch_hf_lora to override lora_config.trtllm_modules_to_hf_modules with default only when it has no value (#7132) 2025-08-24 15:00:24 +03:00
mapping.py [TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571) 2025-09-17 09:41:32 +08:00
math_utils.py perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318) 2025-06-26 14:03:56 +08:00
module.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
network.py chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
parameter.py fix:https://nvbugs/5234033 enable starcoder trt-flow with transforme… (#3909) 2025-05-15 11:16:45 +08:00
profiler.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
prompt_adapter_manager.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
python_plugin.py linting(python): Enable ruff on more files (wave 1/N) (#5140) 2025-06-14 19:19:34 +08:00
sampling_params.py [TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580) 2025-09-23 18:48:10 -07:00
scheduling_params.py [None][feat] Add support of scheduling attention dp request (#6246) 2025-08-01 20:38:01 -04:00
serialization.py [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
top_model_mixin.py [None][fix] Refactoring to avoid circular import when importing torch models (#6720) 2025-08-11 18:00:42 -04:00
version.py [None][chore] Version bump for 1.1.0rc6 (#7824) 2025-09-18 11:13:56 +08:00