Commit Graph

41 Commits

Author SHA1 Message Date
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request attempts to support more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) only holds "window size" number of blocks
and reuse them in a cyclic manner. We will not be able to utilize more
GPU memory with this design, leading to a limited max batch size
throughput. Additionally, we will not be able to support KV cache reuse
with this design.

In this MR, we change such behavior to let the manager write blocks in
a linear manner. With a linear block writing behavior, as the attention
window moves on, the out-of-window (OOW) blocks will be detached. Right
now for the sake of a correct feature first, we directly offload the
OOW block from the primary block pool (GPU memory) to the secondary
block pool (host memory). We will improve this in the future by
delegating the block movement to the eviction policy.

KV cache reuse for SWA is not developed in this merge request and will
be amended in a follow-up merge request.

Writing the blocks linearly, the maximum number of blocks allocated for
a sequence(`GenerationRequest`) is the "max sequence length" specified.
The `GenerationRequest` that stores the cache block bookkeeping
structure will now keep "max sequence length" tokens of blocks.

Given the above, main changes are (more context in the MR):
- Remove "cyclic" concept under the kv cache manager, such concept
  originally guards the block reuse under kv cache manager.
- Add detach mechanism and have it under `KVCacheManager::addToken`.
  Please note that detach is still guarded off for SWA when reuse
  is enabled. A follow-up merge request will proceed to improve this.
- Enforce "max sequence length" to be a non-optional parameter to
  the `KVCacheManager`/`BlockManager`
- Let all window size resource pool get identical proportion of memory
- Fix free memory calculation under `resource_manager.py`

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Matthias Jouanneaux
1be7faef37
[TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels (#6904)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-19 20:55:32 +08:00
Shiyu Li
8bdbb48264
[https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. (#7387)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-09-17 07:43:20 +08:00
jmydurant
7deefb3d2b
[TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-09-15 21:43:49 +08:00
Zheng Duan
24fc1f9acf
[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-15 07:26:01 -04:00
Linda
a4312ba743
[https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result (#7672)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-09-10 14:13:46 +01:00
Linda
0566df672d
[TRTLLM-6707][fix] nanobind fix for executor exit call (#7565)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-09-09 14:56:04 +01:00
Tomer Shmilovich
ecc0e687c6
[None][feat] Nixl support for GDS (#5488)
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
Signed-off-by: Guy Lev <glev@nvidia.com>
Co-authored-by: Guy Lev <glev@nvidia.com>
2025-09-09 13:00:38 +08:00
Chuang Zhu
77657a1c12
[TRTLLM-7361][feat] KV cache transfer for uneven pp (#7117)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-08 13:37:46 -04:00
Chang Liu
23500b55c3
[TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse (#7106)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-06 17:58:32 -04:00
Shunkangz
bddf183e15
[None][feat] Add Request specific exception (#6931)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-09-04 18:43:42 -04:00
Enwei Zhu
5ff3a65b23
[TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-03 15:16:11 -07:00
Tian Zheng
1b9c4cc2f7
[None][fix] Fix nanobind failure (#7425)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 17:26:40 -04:00
Tian Zheng
e257cb3533
[None][feat] Support NVFP4 KV Cache (#6244)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 09:24:52 +08:00
Richard Huo
ce580ce4f5
[None][feat] KV Cache Connector API (#7228)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-08-28 23:09:27 -04:00
qixiang-99
b165f8bc97
fix/improve kvcache allocation in PyTorch runtime (#5933)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-08-26 12:40:22 +08:00
Linda
898f37faa0
[None][feat] Enable nanobind as the default binding library (#6608)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-22 09:48:41 +02:00
zhhuang-nv
7e135d2ea7
[None][feat] Use Separate QKV Input Layout for Context MLA (#6538)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-08-19 22:04:48 +08:00
Zero Zeng
953f4fd69e
[None][fix] acceptance rate calculation fix in benchmark_serving (#6746)
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-19 17:29:36 +08:00
Martin Marciniszyn Mehringer
425dad01fd
[None][fix] Clean up linking to CUDA stub libraries in build_wheel.py (#6823)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-18 11:20:51 -04:00
Linda
eb4ed18a63
[None][fix] max_num_sequences argument in nanobind (#6862)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-13 19:16:17 -04:00
Robin Kobus
45c7518032
[None][refactor] Simplify decoder state initialization (#6559)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:44:41 +02:00
QI JUN
8845e0f065
[None][fix] fix ci (#6814) 2025-08-12 02:21:50 -07:00
Liao Lanyu
f7c13a4aa7
[TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp (#6745)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-08-12 16:45:16 +08:00
Martin Marciniszyn Mehringer
9a8195ef88
fix: Ensure that Python stub generation works against libnvidia-ml stubs (#6188)
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-08-11 09:18:17 +02:00
Ziyi Xiong
de472828b9
[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-08-09 23:15:04 +08:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss (#6645)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling (#5170)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow (#5997)
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory (#5034)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Robin Kobus
918fedf952
[None][refactor] Simplify finish reasons handling in DecoderState (#6524)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-02 07:17:43 +02:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler (#6484)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
Michal Guzek
b8719fe96d
[nvbug/5374773] chore: Update nanobind with fail_fast_on_attention_window_too_large changes (#6491)
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-07-31 20:25:29 +01:00
Linda
9a99e6d6d7
fix: integration tests with nanobind (#6326)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-25 09:23:20 +08:00
Linda
60073731ca
fix: bindings unit tests for nanobind (#6221)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-22 14:51:43 +01:00
Linda
3efad2e58c
feat: nanobind bindings (#6185)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Linda
5bff317abf
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Linda
4d071eb2d1
feat: binding type build argument (pybind, nanobind) (#5802)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-11 00:48:50 +09:00