Commit Graph

849 Commits

Author SHA1 Message Date
Wanli Jiang
2a30f11d63
[None][chore] Upgrade transformers to 4.56.0 (#7523)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 22:20:16 +08:00
Yechan Kim
f77aca9f2c
[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-09-22 03:40:02 -07:00
HuiGao-NV
0dac1ddb74
[https://nvbugs/5525849][fix] Cherry-pick to fix mismatch of max seq len between kv cache manager and dummy requests (#7855)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-22 18:07:47 +08:00
Yukun He
ab26d21620 [https://nvbugs/5517023][fix] Pass allreduce strategy and force NCCL on pre-Blackwell arch (#7768)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yi Zhang
f9c9c3f50a [https://nvbugs/5355219][fix] Fix trtllm moe backend test config and Qwen3 MoE multi node (#7724)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
HuiGao-NV
af34c9713a [https://nvbugs/5474169][fix] seq_len mismatch between kv cache manager and graph attn metadata (#7606)
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
HuiGao-NV
123f5cbbf0 [https://nvbugs/5474169][fix]Adjust max seq len for kvcache for memory estimation (#7391)
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Bo Li
a15f08db3d [https://nvbugs/5467548][fix] DeepSeek illegal memory access. (#7298)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Stefan Niebler
8aead224fb
[https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow (#7808)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
2025-09-21 22:15:18 -07:00
dongxuy04
b057fc9593
[None][fix] cherrypick to main: Fix possible mpi broadcast and gather issue on large object (#7854)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-09-22 10:17:23 +08:00
Enwei Zhu
639d4109a7
[None][fix] Disable torch.compile for CapturableGuidedDecoder (#7871)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 10:04:30 +08:00
dongxuy04
9eb8084ca9
[TRTLLM-7008][fix] cherrypick to main Add automatic shared memory delete if already exist (#7727)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-09-21 11:01:51 -07:00
Ziyi Xiong
897c4dd23b
[https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec (#7728)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-21 08:20:48 +08:00
Grzegorz Kwasniewski
8adaf0bb78
[TRTLLM-6342][feat] Support for partial sharding from factory (#7393)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-09-19 09:07:42 -07:00
Matthias Jouanneaux
1be7faef37
[TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels (#6904)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-19 20:55:32 +08:00
Gabriel Wu
0e72e8f7e6
[None][feat] Support EPLB in Qwen3 MoE (#7443)
Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-19 16:45:35 +08:00
QI JUN
f1b362faac
[None][chore] polish error message in cute_dsl_utils.py (#7852)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-19 12:05:11 +08:00
HuiGao-NV
a6370fd143
[https://nvbugs/5481434][feat] cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-19 10:25:21 +08:00
Yuxian Qiu
d6ebcf7c4a
[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) (#7610)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 09:40:49 +08:00
Ziyi Xiong
420f0fbcf5
[https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda (#7790)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-19 08:11:29 +08:00
sunnyqgg
80dd8fe197
[TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle (#7001)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-18 12:05:36 -04:00
Li Min
d921fc3352
[TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-18 21:20:04 +08:00
bhsueh_NV
c65457db8a
[None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" (#7780)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-09-18 20:11:05 +08:00
Wanli Jiang
fe104dc20d
[TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm (#7723)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 17:37:16 +08:00
Stefan Niebler
a55251bf75
[None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod (#7732)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-09-18 10:30:50 +02:00
Wanli Jiang
a7ca0fff54
[TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend (#7207)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 16:26:20 +08:00
Leslie Fang
870cfcf9a0
[None][chore] Remove executor config in create_py_executor (#7599)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-18 14:24:58 +08:00
Li Min
14e455da3e
[None][fix] Fix CI issue for dsl pkg install (#7784)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-09-18 13:58:20 +08:00
Barry Kang
4f0e6b5f96
[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-09-18 13:51:48 +08:00
Ziyi Xiong
28469dbf27
[https://nvbugs/5523080][fix] Correct the batch index in device tensors (#7803)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-18 13:45:37 +08:00
Guoming Zhang
e0423bfaab
[https://nvbugs/5519544][fix] fix invalid expression for disabling pa… (#7806)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-18 12:54:52 +08:00
Yukun He
cd80e0a7f1
[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. (#7529)
tile_tokens_dim directly depends on the num_token, which is a dynamic shape during tuning and inference. When AutoTuner prepares dummy tensors with different num_tokens, it does not update the value of tile_tokens_dim automatically. Therefore, the value stored in the AutoTuner cache is misaligned, which will introduce a lot of cache misses during inference, which hurts perf a lot.

To avoid this issue, we move the calculation of tile_tokens_dim right before kernel launching, so that the value of tile_tokens_dim is always up to date with the num_tokens of the current input tensor used for the kernel runner.

Also, the tile_tokens_dim is calculated based on the number of tokens of a tuned bucket, instead of the original token number. Because we only tune the value for the buckets, not for the raw input token number, to avoid unexpected misalignment between tile_tokens_dim and the token number.

This PR also removes the warmup requests with the extra input shapes, which are triggered in the CUDA graph warmup phase.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-18 10:58:52 +08:00
Lucas Liebenwein
39eb120b96
[#7308] [feat] AutoDeploy: graph-less transformers mode for HF (#7635)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-09-18 10:44:24 +08:00
Netanel Haber
a5cfc8368f
[https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) (#7796)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-09-17 21:27:01 -04:00
William Zhang
2614d71994
[TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-17 08:11:16 -07:00
Kaiyu Xie
62042a9733
[TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Cheng Hang <chang@nvidia.com>
2025-09-17 09:41:32 +08:00
Yukun He
6313c9799c
[https://nvbugs/5488582][fix] Cherry-pick 7495: Avoid unexpected Triton recompilation in DG fused_moe (#7708)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-17 09:00:28 +08:00
Shiyu Li
8bdbb48264
[https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. (#7387)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-09-17 07:43:20 +08:00
HuiGao-NV
a49cfb3e68
[https://nvbugs/5516666][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding (#7737)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
Co-authored-by: Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-17 06:24:20 +08:00
Aurelien Chartier
471723bce1
[None][chore] Remove unused get_quant_scales methods (#7687)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-09-16 12:56:11 -07:00
Lucas Liebenwein
9befd1a72f
[None][chore] AutoDeploy: neat disablement of transforms in pipeline (#7736)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-16 23:31:48 +08:00
bhsueh_NV
8226ef23dc
Revert "[None][feat] support attention dp for qwen3 dense model" (#7765) 2025-09-16 19:09:04 +08:00
Li Min
b278d06481
[TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op (#7632)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-16 14:25:26 +08:00
Void
103b554734
[None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks (#7614)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-16 11:04:26 +08:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
Necofish
96f11b10ae
[None][feat] support attention dp for qwen3 dense model (#7618)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-16 09:33:22 +08:00
Ziyi Xiong
536e8776cd
[TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding (#7651)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-16 07:33:44 +08:00
Izzy Putterman
8097be7e9c
[None][feat] Eagle, use last hidden post norm (#7546)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-09-15 12:23:57 -04:00
jmydurant
7deefb3d2b
[TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-09-15 21:43:49 +08:00
Zheng Duan
24fc1f9acf
[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-15 07:26:01 -04:00