Commit Graph

908 Commits

Author | SHA1 | Message | Date
Lucas Liebenwein
752cc3a8cb
[https://nvbugs/5606166][fix] AutoDeploy: use tuples for cudagraph shape lookup (#8772)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-31 13:59:48 +01:00
Yukun He
a1d912688c
[https://nvbugs/5623960][fix] Compress the warning log of AutoTuner when encountering tactic failures. (#8795)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-31 12:55:56 +08:00
Jin Li
0dac57f2bc
[https://nvbugs/5569534][fix] Warm up with different sizes for more s… (#8515)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-29 22:29:06 -07:00
JunyiXu-nv
6adccd758d
[https://nvbugs/5606268][fix] Separate cuda graph workspace to prevent IMA (#8685)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-10-29 09:43:30 +01:00
sunnyqgg
e9aa8b222f
[https://nvbugs/5556020][fix] cherry-pick fix test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch (#8644)
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-29 15:44:25 +08:00
Yukun He
e04354bc09
[https://nvbugs/5608489][fix] Fix output unpack issues for Llama3/4 NVFP4 models. (#8679)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-28 14:21:47 +08:00
Chang Liu
f4e1cc7b39
[https://nvbugs/5549081][fix] Fix device id assignment for some visio… (#8552)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-10-23 14:06:13 +08:00
Lizhi Zhou
3f82cdbdad
[https://nvbugs/5582277][fix] rework DisaggPPTerminationHandler to fix hang issue (#8519)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-23 09:43:59 +08:00
Kaiyu Xie
c7b06b1b0a
[https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend (cherry-pick #8141) (#8566)
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
Co-authored-by: Sergey Klevtsov <141879860+sklevtsov-nvidia@users.noreply.github.com>
2025-10-22 21:46:59 +08:00
Jin Li
6631791c60
[https://nvbugs/5546510][fix] Move torch.cuda.Stream out of torch com… (#8494)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-22 11:21:58 +08:00
JunyiXu-nv
0acdecb2c3
[https://nvbugs/5569713][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 (#8429)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-10-21 12:37:56 -04:00
mpikulski
f256eb9063
[TRTLLM-8650][fix] beam search request validation (#8433)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-21 10:50:27 +02:00
danielafrimi
a0b7fe9e36
[https://nvbugs/5524714][fix] Fix TP sharding of fused-QKV weight scales in W4A16 AWQ (#8432)
Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com>
2025-10-19 15:27:23 +03:00
Ziyi Xiong
4ad7ef1497
[https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture (#8… (#8344)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-16 10:27:19 +08:00
Patrice Castonguay
7862372ee2
[https://nvbugs/5552889][fix] fix: Prevent empty batch when using attention DP with disagg (#8372)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-10-16 09:11:04 +08:00
amitz-nv
27c6c8466b
[https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 (#8313)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-15 08:24:02 -07:00
amitz-nv
e5476a6b2a
[https://nvbugs/5521949][fix] Update FP8 model with BF16 LoRA test, fix test_bielik_11b_v2_2_instruct_multi_lora (#8324)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-15 05:48:38 -07:00
xiweny
d5b79268e7
[https://nvbugs/5565565] [fix] fp8 wideep support sm103 (#8228)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-15 10:17:08 +08:00
Jin Li
4bac6b337e
[https://nvbugs/5537348][fix] Use device tensor index for MTP (#8062)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-14 05:51:45 -07:00
Ziyi Xiong
9ecc6db5b4
[https://nvbugs/5537878][fix] Reserve an extra slot for padded batch … (#8231)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-13 23:34:22 -07:00
Yechan Kim
3d3d49434a
[https://nvbugs/5547434][fix] Fix Qwen2.5-VL device_path error (#8057)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-13 14:12:27 +08:00
Yukun He
1ca84e1a25
[https://nvbugs/5536131][fix] Fix illegal access issue when scale is not provided in Llama3/4. (#7960)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-07 23:47:00 -07:00
Jin Li
b4e6a1648b
[https://nvbugs/5451280][fix] Reduce memory fraction problem by warmu… (#7999)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-03 18:14:13 -07:00
Enwei Zhu
a64d9b69e5
[None][fix] Fix chunked prefill state of draft request (#8067)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-30 09:51:21 +08:00
sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference (#7363)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
dongfengy
1eb653146a
[https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS (#7911)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-25 12:54:18 -07:00
QI JUN
1529a6f22d
[None][chore] extract weights loading related logic to model loader (#7579)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-25 10:19:22 -07:00
xxi
57ff5f4c0d
[None][fix] fix a bug in wideEP when using DeepEP with num_chunks > 1 (#7954)
Signed-off-by: xxi <xxi@nvidia.com>
2025-09-25 07:53:42 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op (#6815)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Guoming Zhang
202bed4574
[None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd
[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (#7717)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Tao Li @ NVIDIA
44d7c3b245
[https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files (#7813)
Signed-off-by: Tao Li
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Wanli Jiang
22b45ff9c7
[TRTLLM-7758][feat] Phi4-mm image modality inference optimization (#7918)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-25 15:58:29 +08:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine (#7927)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
Leslie Fang
342014069e
[None][chore] Validate features combination (#7630)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-25 08:01:13 +08:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" (#7969)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels (#7821)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Mike Iovine
42c2ec3239
[https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 (#7220)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. (#7874)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Eran Geva
603517f72a
[#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
Necofish
cfbcf9b9e8
[None][feat] Support Seed-OSS model in pytorch backend (#7496)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-24 03:57:12 -07:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend (#7756)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request attempts to support more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) only holds a "window size" number of blocks
and reuses them in a cyclic manner. With that design we cannot utilize
additional GPU memory, which limits the achievable batch size and
throughput, and we cannot support KV cache reuse.

In this MR, we change that behavior so the manager writes blocks in a
linear manner. With linear block writing, out-of-window (OOW) blocks are
detached as the attention window moves forward. For now, to get the
feature correct first, we directly offload each OOW block from the
primary block pool (GPU memory) to the secondary block pool (host
memory). We will improve this in the future by delegating the block
movement to the eviction policy.

KV cache reuse for SWA is not implemented in this merge request and will
be added in a follow-up merge request.

With linear block writing, the maximum number of blocks allocated for a
sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest`, which stores the cache block
bookkeeping structure, now keeps blocks for "max sequence length" tokens.

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; this concept
  originally guarded block reuse in the KV cache manager.
- Add a detach mechanism under `KVCacheManager::addToken` (sketched
  below). Note that detach is still disabled for SWA when reuse is
  enabled; a follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter of the
  `KVCacheManager`/`BlockManager`.
- Give every window-size resource pool an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.
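
Below is a minimal sketch of the linear block writing and OOW detach
behavior described above. It assumes hypothetical `BlockPool` and
`SlidingWindowSequence` classes for illustration only; it is not the
actual `KVCacheManager`/`BlockManager` API.

```python
# Hypothetical sketch of linear block writing with out-of-window (OOW)
# detach; class and method names are illustrative, not the real API.
from itertools import count


class BlockPool:
    """Stand-in for a block pool; the real pools manage GPU or host memory."""

    def __init__(self, name):
        self.name = name
        self._ids = count()
        self.blocks = set()

    def allocate(self):
        block = next(self._ids)
        self.blocks.add(block)
        return block

    def take(self, block, source):
        """Move a block from another pool into this one (offload)."""
        source.blocks.discard(block)
        self.blocks.add(block)


class SlidingWindowSequence:
    """Per-sequence bookkeeping: blocks are written linearly, not cyclically."""

    def __init__(self, window_blocks, primary, secondary):
        self.window_blocks = window_blocks  # blocks covered by the attention window
        self.primary = primary              # primary pool (GPU memory) stand-in
        self.secondary = secondary          # secondary pool (host memory) stand-in
        self.blocks = []                    # bookkeeping keeps every block written

    def add_block(self):
        """Append a new block linearly and detach the block that left the window."""
        self.blocks.append(self.primary.allocate())
        oow_index = len(self.blocks) - self.window_blocks - 1
        if oow_index >= 0:
            # Correctness-first: offload GPU -> host directly; later this
            # movement could be delegated to the eviction policy.
            self.secondary.take(self.blocks[oow_index], source=self.primary)


# Usage: with a window of 2 blocks, after writing 5 blocks linearly,
# 2 blocks remain in GPU memory and 3 have been offloaded to host memory.
gpu_pool, host_pool = BlockPool("gpu"), BlockPool("host")
seq = SlidingWindowSequence(window_blocks=2, primary=gpu_pool, secondary=host_pool)
for _ in range(5):
    seq.add_block()
assert len(gpu_pool.blocks) == 2 and len(host_pool.blocks) == 3
```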

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Ziyi Xiong
31ef03fd82
[https://nvbugs/5528405][fix] Set up draft_tokens before scheduling (#7903)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-24 09:56:17 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Daniel Cámpora
9f1d9b7b18
[None][feat] Use list instead of torch tensor for new tokens in update requests (#7730)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 10:40:08 -04:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ChristinaZ
dd5fb2857a
[None][fix] Re-add the import for allgather that was mistakenly removed. (#7920)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 03:09:48 -07:00