Commit Graph

1553 Commits

Author SHA1 Message Date
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend (#7756)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
xinhe-nv
62563760fb
[None][chore] update chunked prefill cases (#7921)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 15:14:49 +08:00
Pengbo Wang
b890d7fea4
[None][infra] Skip failed test for nvbugs 5537738 (#7946)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 23:48:50 -07:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request adds more SWA KV cache functionality to the KV cache
manager. Before this merge request, the KV cache for sliding window
attention (SWA) held only a "window size" number of blocks and reused
them in a cyclic manner. That design cannot make use of additional GPU
memory, which limits the maximum batch size and throughput. It also
cannot support KV cache reuse.

In this MR, we change that behavior so the manager writes blocks in a
linear manner. With linear block writing, out-of-window (OOW) blocks are
detached as the attention window moves forward. For now, to get a
correct feature in place first, we directly offload each OOW block from
the primary block pool (GPU memory) to the secondary block pool (host
memory). We will improve this in the future by delegating the block
movement to the eviction policy.
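
The linear-write-and-detach behavior described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the TensorRT-LLM implementation: the class `LinearSwaSequence`, its fields, and the exact boundary at which a block counts as out-of-window are all hypothetical.

```python
from collections import deque


class LinearSwaSequence:
    """Toy per-sequence block bookkeeping (hypothetical, not TensorRT-LLM code)."""

    def __init__(self, window_size, tokens_per_block):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        self.num_tokens = 0
        self.primary = deque()  # block indices held in the primary (GPU) pool
        self.secondary = []     # block indices offloaded to the secondary (host) pool

    def add_token(self):
        # Linear writing: open a new block whenever the previous one is full.
        if self.num_tokens % self.tokens_per_block == 0:
            self.primary.append(self.num_tokens // self.tokens_per_block)
        self.num_tokens += 1

        # Position of the oldest token still inside the attention window.
        window_start = max(0, self.num_tokens - self.window_size)

        # Assumed rule: a block is out-of-window once its last token
        # precedes window_start; such blocks are offloaded, oldest first.
        while self.primary and (
            (self.primary[0] + 1) * self.tokens_per_block <= window_start
        ):
            self.secondary.append(self.primary.popleft())
```

With a window of 4 tokens and 2 tokens per block, adding 8 tokens leaves blocks 2 and 3 in the primary pool and moves blocks 0 and 1 to the secondary pool.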

KV cache reuse for SWA is not implemented in this merge request and will
be added in a follow-up merge request.

Because blocks are written linearly, the maximum number of blocks
allocated for a sequence (`GenerationRequest`) is bounded by the
specified "max sequence length". The `GenerationRequest`, which stores
the cache block bookkeeping structure, now keeps enough blocks to hold
"max sequence length" tokens.
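
As a rough illustration of that bound, assuming a fixed number of tokens per block (the helper name and the `tokens_per_block` parameter are hypothetical and not taken from the MR):

```python
def max_blocks_per_sequence(max_seq_len, tokens_per_block):
    """Upper bound on blocks one sequence can hold under linear writing."""
    # Ceiling division: a partially filled last block still occupies a block.
    return (max_seq_len + tokens_per_block - 1) // tokens_per_block
```

Under the earlier cyclic scheme the corresponding bound would depend on the window size rather than on the maximum sequence length.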

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; it originally
  guarded block reuse in the KV cache manager.
- Add a detach mechanism, invoked from `KVCacheManager::addToken`.
  Note that detach is still disabled for SWA when reuse is enabled.
  A follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter of
  `KVCacheManager`/`BlockManager`.
- Give every window size's resource pool an identical proportion of memory.
- Fix the free memory calculation in `resource_manager.py`.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Yuan Tong
70c3b100eb
[#7692][fix] recognize RequestError as per-request error in background handler (#7726)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 11:11:17 +08:00
Lizhi Zhou
e4f1f90202
[https://nvbugs/5477404][chore] unwaive test_disaggregated_single_gpu.py::test_disaggregated_llama_context_capacity (#7857)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 10:31:35 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
Lizhi Zhou
7550251988
[TRTLLM-7182][test] add multi-nodes test for disagg-serving (#7470)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 08:31:56 +08:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Zheng Duan
e3c1a9409f
[TRTLLM-6549][fix] add kv cache time output back (#7798)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-23 14:12:42 -04:00
Yanchao Lu
6a36349964
[None][test] Waive another intermittent OOM test (#7930)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-23 22:34:09 +08:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ruodil
05bec3bf0f
[None][test] rename llm_perf_full to llm_perf_core and add missing cases (#7899)
Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
2025-09-22 23:04:34 -07:00
Pengbo Wang
a4b4ed4535
[None][fix] Fix and add test for TRTLLM MoE backend (#7755)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 11:26:25 +08:00
Pengbo Wang
08cc7a041f
[https://nvbugs/5355128][fix] Add missing wgmma intrinsic for starcoder (#7643)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 10:38:58 +08:00
yunruis
126cd707e3
[None][opt] Add batch waiting when scheduling (#7416)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-09-23 10:27:37 +08:00
Chang Liu
998857bcde
[TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) (#7577)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-22 19:07:18 -07:00
Enwei Zhu
8330d5363a
[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 09:10:09 +08:00
xxi
d471655242
[TRTLLM-7831][feat] Cherry-pick from #7423 Support fp8 block wide ep cherry pick (#7712)
2025-09-23 08:41:38 +08:00
Enwei Zhu
59f57598a7
[https://nvbugs/5504086][fix] Fix MTP vanilla (#7904)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 08:38:28 +08:00
ChristinaZ
be576a3152
[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 08:24:21 +08:00
Jin Li
b5391b4ac6
[https://nvbugs/5516665][fix] Fix CUTLASS moe fake impl errors (#7714)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-22 11:08:39 -07:00
Linda
b1738c3f18
[https://nvbugs/5477359][fix] Removing test waivers (#7877)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-09-22 08:59:13 -07:00
Wanli Jiang
2a30f11d63
[None][chore] Upgrade transformers to 4.56.0 (#7523)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 22:20:16 +08:00
Emma Qiao
324301ccba
[None][infra] Skip failed test for nvbugs 5532023 (#7905)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-22 03:49:44 -07:00
Yechan Kim
f77aca9f2c
[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-09-22 03:40:02 -07:00
Bo Deng
8cf95681e6
[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package (#7766)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-09-22 16:43:35 +08:00
Emma Qiao
d330d0005c
[None][infra] Waive a failed case on main (#7901)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-22 00:37:01 -07:00
xinhe-nv
9c1b75e978
[TRTLLM-7070][feat] add gpt-oss chunked prefill tests (#7779)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-22 00:12:43 -07:00
Wanli Jiang
f5bfd68a50
[https://nvbugs/5509024][fix] Print full parsed outputs and update keywords for multimodal model (#7670)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yi Zhang
f9c9c3f50a
[https://nvbugs/5355219][fix] Fix trtllm moe backend test config and Qwen3 MoE multi node (#7724)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Ivy Zhang
022bc96fb6
[https://nvbugs/5512734][fix] Update kv cache config for maverick (#7710)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
bhsueh_NV
ef557f880b
[https://nvbugs/5437405][fix] cherry-pick PR 7000 (qwen3 235b eagle3 ci) (#7702)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yanchao Lu
5c8b022d1e
[None][ci] Test waives for the release/1.0 branch 09/15 (#7700)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Simeng Liu
99995846b3
[https://nvbugs/5470782][chore] Remove the skip statement in 1.0 rele… (#7573)
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
peaceh-nv
541b7fda89
[https://nvbugs/5503423][waive] Waive Llama3.1-70B-FP8 test on RTX PRO 6000 (#7603)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yan Chunwei
afca2fcbe0
[https://nvbugs/5351244][fix] test_mpi_session (#7501)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yuxian Qiu
2d46dda6a7
[https://nvbugs/5448754][fix] Download HF model for all nodes. (#6824)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Lizhi Zhou
293d9fb612
[https://nvbugs/5448767][fix] disable kv cache reuse for disagg pp>1 tests (#7354)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Stefan Niebler
8aead224fb
[https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow (#7808)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
2025-09-21 22:15:18 -07:00
peaceh-nv
9dc7316b7f
[https://nvbugs/5512556][unwaive] Unwaive DeepSeek PP tests (#7828)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-09-22 10:26:30 +08:00
dongxuy04
9eb8084ca9
[TRTLLM-7008][fix] cherrypick to main Add automatic shared memory delete if already exist (#7727)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-09-21 11:01:51 -07:00
Ziyi Xiong
897c4dd23b
[https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec (#7728)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-21 08:20:48 +08:00
Yan Chunwei
4509d97780
[TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reusing (#7840)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-20 06:24:22 -07:00
Chang Liu
2e317a7db6
[https://nvbugs/5520490][fix] Fix intermittent test failures by avoiding external web data pulls (#7879)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-19 17:24:13 -07:00
Mike Iovine
8030b540ac
[https://nvbugs/5522462][fix] Fix FP8 scout illegal memory access (#7845)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-19 10:30:37 -04:00
pcastonguay
fbe325ce57
[https://nvbugs/5471108][chore] Unwaiving disagg acc test (#7686)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-09-19 08:56:09 -04:00
Yuxian Qiu
7d28acdbf0
[https://nvbugs/5522332][fix] Pin numpy version for Gemma. (cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/7783) (#7797)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 18:50:40 +08:00
Liao Lanyu
18095a7cb8
[https://nvbugs/5503440][fix] Fix potential hang due to wrong type of ZMQ socket and protocol for worker_init_status_queue (#7646)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-09-19 18:13:33 +08:00
xinhe-nv
efb763402f
[None][chore] Add failed cases into waives.txt (#7841)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-19 17:59:47 +08:00