Commit Graph

1892 Commits

Author SHA1 Message Date
Bo Li
8b5ededc83
[TRTLLM-9391][chore] Automatically estimate required workspace. (#9535)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-12-03 12:49:38 +08:00
Suyog Gupta
93871d52b2
[None][chore] AutoDeploy update cuda stream manager for multi-device (#9575)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-12-02 20:43:14 -08:00
heyuhhh
a08eb81cce
[None][feat] Add RocketKV usage doc and e2e accuracy test on LongBenchV2 (#9572)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-12-03 11:33:46 +08:00
Anurag Mukkara
642dfae73a
[https://nvbugs/5698434][fix] Use separate weight mapper for draft (#9607)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-12-02 16:00:22 -08:00
Enwei Zhu
a3455f55c7
[None][chore] Fix trtllm-eval and move GroupedGemmInputsHelper (#9612)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-03 07:55:03 +08:00
Chang Liu
3916d032ec
[None][chore] Remove traceback dump for multimodal input processor (#9634)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-12-03 07:41:03 +08:00
Grzegorz Kwasniewski
0a7a88e74e
[TRTLLM-8946][feat] Improved heuristics to detect shardable regions (#9200)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-02 22:08:19 +01:00
Neta Zmora
a560ba5546
[#9550][feat] AutoDeploy: Add NVFP4 Cutlass MoE kernels (#9551)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-03 01:39:38 +08:00
Shi Xiaowei
227d42e492
[https://nvbugs/5651854][fix] Fix dist-serving perf by clearing CPU affinity (#9549)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-12-03 01:17:03 +08:00
Lucas Liebenwein
e72ce98c0f
[#9150][feat] AutoDeploy: reviewer comments for #9150 (#9527)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-02 12:09:10 -05:00
William Zhang
2dd3ebf037
[#9150][feat] Add code for nano v3 to custom implementation in AD (#9465)
* Why?

We would like to show an alternative to monkey-patching in AutoDeploy.

* What?

This commit builds on the existing custom model implementation for
NemotronH and adds the bits relevant for MoE layers.

Part of #9150.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-12-02 08:56:44 -08:00
Thor Johnsen
95049eea86
[https://nvbugs/5627710][fix] Fix synchronization bugs in KvCacheTransferManager that can cause corrupted blocks (#9056)
Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-12-02 09:10:21 -06:00
Yan Chunwei
b86256eb54
[TRTLLM-9144][fix] enhance RPC robustness (#8711)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-12-02 21:37:59 +08:00
Jin Li
21e3dc11d8
[https://nvbugs/5667774][fix] Refine Piecewise Cuda Graph Condition for DP (#9393)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-12-02 21:09:15 +08:00
brb-nv
be48cdf1d1
[TRTLLM-9466][test] Evaluate helix parallelism with DSV3 Lite (#9597)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-12-02 20:10:07 +08:00
mpikulski
84a1531594
[TRTLLM-9488][feat] use FlashInfer.sampling by default (#9545)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-12-02 16:29:55 +08:00
shuyixiong
1a2118b8fe
[https://nvbugs/5702793][fix] Fix uncontiguous tensor view (#9576)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-12-02 15:41:32 +08:00
Wanli Jiang
5657a00ec0
[FMDL-1328][feat] Add support for nano-v3 and super-v3 with pytorch backend (#9261)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-12-02 13:40:20 +08:00
Guoming Zhang
6fbe87c8b5
[None][chroe] Polish qwen3-next modeling code. (#8902)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-12-02 11:28:35 +08:00
Iman Tabrizian
356a52edf5
[None][feat] Add support for KVCache reuse for DSv32 (#9383)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-12-02 11:14:30 +08:00
Shijie
dcf5c86720
[None][feat] Unify nvfp4 gemm backend (#8963)
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Shijie <jaywan@nvidia.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-12-02 11:03:51 +08:00
Yuening Li
09c840184c
[None][fix] Prevent YAML partial kv_cache_config from incorrectly overriding the complete kv_cache_config (#9262)
Signed-off-by: Yuening Li <62227368+Yuening-wa@users.noreply.github.com>
2025-12-02 10:10:08 +08:00
Eran Geva
c9771ebb99
[#9198][feat] Refactor dist ops in AutoDeploy (#9301)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-12-02 02:36:32 +08:00
Chenghao Zhang
0a2104dce9
[None][feat] AutoDeploy: Use the router gemm op for nemotron MOE (#9500)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-12-01 10:24:31 -08:00
Venky
639c939a4f
[TRTC-1943][feat] Env vars override support in LLM API (#9104)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-12-01 10:04:49 -08:00
brb-nv
f61067cbb5
[None][chore] Defer exposing context parallel configs (#9552)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-12-01 09:50:02 -08:00
Stefan Niebler
f155812eb0
[TRTLLM-6756][feat] Add Beam Search to TorchSampler (#8509)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-12-01 18:48:04 +01:00
Enwei Zhu
90345ad3f3
[None][fix] Skip Allreduce init for Attention DP (#9542)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-01 21:24:40 +08:00
alel
4107254c82
[TRTLLM-6222][feat] Several perf opt for cuteDSL nvf4 gemm (#9428)
Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com>
2025-12-01 18:10:45 +08:00
Yukun He
730eb3d859
[None][fix] Replace hash method with unique_id for cutedsl MoE runners. (#9569)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-12-01 17:02:33 +08:00
Neta Zmora
bc25fff039
[#9496][fix] AutoDeploy: remove auto-tuner from nvfp4_gemm forward (#9497)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-01 10:04:39 +02:00
Fanrong Li
d69bf9f92a
[None][feat] add chat template kwargs support to longbench-v2 (#9544)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-12-01 15:59:13 +08:00
Gaoji Liu
9d2df04a72
[None][doc] fix mtp.py typo (#9307)
Signed-off-by: liugaoji <757394026@qq.com>
2025-11-30 21:55:13 -08:00
Li Min
1797e91dfd
[TRTLLM-6222][feat] Extend cute_dsl_nvfp4_gemm to sm103. (#9543)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-12-01 10:19:36 +08:00
Enwei Zhu
34e2fa5c96
[https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-01 09:10:21 +08:00
heyuhhh
6e470aab72
[None] [feat] Optimize the algorithm part of RocketKV (#9333)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-12-01 09:04:09 +08:00
xxi
c12e67bb66
[TRTLLM-8958][feat] and [TRTLLM-8960]: create ConfigurableMoE and support TRTLLMGenFusedMoE as backend (#9486) 2025-12-01 08:37:07 +08:00
brb-nv
b77f4ffe54
[TRTLLM-5971][feat] Integrate helix parallelism (#9342)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-11-29 15:17:30 -08:00
dominicshanshan
6345074686
[None][chore] Weekly mass integration of release/1.1 -- rebase (#9522)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Vincent Zhang <vinczhang@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Co-authored-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: sunnyqgg <159101675+sunnyqgg@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Vincent Zhang <vcheungyi@163.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Leslie Fang <leslief@nvidia.com>
Co-authored-by: Shunkangz <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-11-29 21:48:48 +08:00
Grzegorz Kwasniewski
cff54fcae3
[#8948][feat] Support custom sharding config (#9143)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-29 05:28:05 +08:00
binghanc
db5b876124
[None][feat] support for more accurate AR calculation (#9323)
Signed-off-by: binghanc <176802681+binghanc@users.noreply.github.com>
2025-11-29 00:34:21 +08:00
Matthias Jouanneaux
f8dd494536
[None][perf] Helix: improve all-to-all perf for large CP size (#9494)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
Co-authored-by: Zheyu Fu <zheyuf@nvidia.com>
2025-11-28 07:24:55 -08:00
mpikulski
e5f39ec7cf
[TRTLLM-9488][feat] add 'disable_flashinfer_sampling' config option (#9454)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-28 13:00:39 +01:00
Robin Kobus
5eae3650c3
[None][fix] Pass checkpoint_format to create_input_processor (#9521)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-28 10:32:29 +01:00
Zhenhuan Chen
7c3bb8534d
[None][chore] Revert "[None][fix] change allreduce workspace dtype to torch.int64 t… (#9538)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2025-11-28 16:45:23 +08:00
Yukun He
60c43a200a
[None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner. (#9211)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-28 13:32:21 +08:00
Lucas Liebenwein
2f8bd6fb36
[#9150][feat] AutoDeploy Nemotron-Flash support (#9504)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-27 18:03:57 +01:00
Enwei Zhu
c2562fc800
[https://nvbugs/5687820][fix] Remove self.abort() in DetokenizedGenerationResult (#9449)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-27 22:54:40 +08:00
Bo Li
62b771877c
[TRTLLM-9389][chore] Refactor AlltoallMethodType. (#9388)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-27 21:09:29 +08:00
Fanrong Li
2d5eadf65f
[None][fix] fix TP support for DeepSeek-V3.2 on hopper (#9484)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-27 21:02:25 +08:00
Zhenhuan Chen
e47927e847
[None][fix] change allreduce workspace dtype to torch.int64 to avoid overflow (#9479)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2025-11-27 17:08:41 +08:00
xxi
f1ed057b4c
[cherry-pick][https://nvbugs/5670793][fix] Solve trtllm-serve launch_disaggregated issue (#9346)
Signed-off-by: xxi <xxi@nvidia.com>
2025-11-27 16:13:58 +08:00
Ziyi Xiong
1dd55d8507
[https://nvbugs/5698581][fix] Init draft tokens for CUDA graph dummy request (#9505)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-27 13:05:37 +08:00
Jiagan Cheng
14762e0287
[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning (#9294)
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-11-27 12:22:01 +08:00
QI JUN
a67d94963e
[None][chore] update comments in llm_args.py (#9472)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-27 11:06:34 +08:00
Aurelien Chartier
f2f197360d
[#9463][feat] Add revision option to trtllm commands (#9498)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-11-27 09:30:01 +08:00
Chenghao Zhang
18fbda5cdb
[None][feat] AutoDeploy: Add A_log fusion for Mamba layers (#9422)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-26 14:39:20 -08:00
Chenghao Zhang
bc7b60e016
[None][feat] AutoDeploy: Remove redundant copies in mamba layers (#9461)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-26 14:38:33 -08:00
Aurelien Chartier
ef7ee6a940
[None][feat] Add environment variable to force spec-dec number of accepted tokens (#9371)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-11-26 07:22:16 -08:00
Chang Liu
b10137fdd5
[None][feat] Support MLA chunked prefill for DeepSeek V3.2 model (#9376)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-26 16:38:25 +08:00
Enwei Zhu
1bf2d750a2
[None][chore] Upgrade CuteDSL to 4.3.0 (#9444)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-26 14:53:09 +08:00
JunyiXu-nv
b7308a4000
[https://nvbugs/5580099][fix] Cherry pick IMA issue fix from release/1.1 (#9032)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-11-26 13:09:06 +08:00
shuyixiong
d8acea1db3
[TRTLLM-9293][feat] Enable partial weight loading to support streaming update weights (#9224)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-11-26 10:59:06 +08:00
Yiqing Yan
1b9edf62c9
[None][chore] Bump version to 1.2.0rc5 (#9455)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-11-26 08:37:53 +08:00
Chuang Zhu
0e9c7f8c07
[https://nvbugs/5685143][fix] avoid cudaFree overlap with cuda graph (#9438)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-11-25 16:20:29 -08:00
Suyog Gupta
e484bec82f
[None][chore] AutoDeploy add multi stream moe pass to default.yaml (#9430)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-25 14:16:13 -08:00
Robin Kobus
32f53910ef
[TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode (#9308)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-25 22:11:51 +01:00
Eran Geva
afc52d7b93
[https://nvbugs/5647400] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. (#9145)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-25 10:56:07 -08:00
mpikulski
899fda9e47
[TRTLLM-9490][feat] use FlashInfer's top_k_sampling_from_probs (#9457)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-25 18:53:53 +01:00
mpikulski
c5f52ab304
[TRTLLM-8376][feat] top-p optimization (removes redundant softmax) (#9411)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-25 18:46:48 +01:00
YueWeng
cc336c4abd
[TRTLLM-8160][feat] Add draft token tree runtime on CDL (#8586)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-11-25 09:40:55 -05:00
Pengyun Lin
fa61825c74
[None][feat] Support custom chat template for tool calling (#9297)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-11-25 22:07:04 +08:00
Tailing Yuan
51ef0379d2
[None][feat] Add a parser to layer-wise benchmarks (#9440)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-11-25 05:45:16 -08:00
Fanrong Li
c36f144591
[None][chore] Fix trtllm-eval for PyTorchLLM (#9427)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-25 04:49:03 -08:00
Yueh-Ting (eop) Chen
a38d91aae2
[https://nvbugs/5537996][fix] Let KV cache manager block initialization be aware whether it is doing a dry run or not (#9093)
Before this commit, the kv cache manager does the same regardless, which causes a mis-calculation in free memory available to allocate for the KV cache manager, hence causing a crash.

This commit fixes this by letting KV cache manager initialization be aware whether it is doing the dry run or not. If it is a dry run, use the max_tokens setting that is already pre-calculated and filled into kv_cache_config.max_tokens.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-11-25 17:27:11 +08:00
Yukun He
e580da4155
[TRTLLM-7963][feat] Cold L2 cache when doing autotune benchmarking. (#8779)
The performance results of some kernels could be easily affected by the warm/cold L2 cache status. To achieve more precise profiling results, the L2 cache is cleared for every execution by the circular buffer method for better benchmarking during autotuning.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-25 15:06:22 +08:00
William Zhang
a4049fc557
[#9413][fix] Minor fixes to nemotron H and custom models in AD (#9416)
* Why?

There were a couple of issues with the recently merged custom model
injection for AutoDeploy + the reference implementation of nemotron
H:
- `d_mlp` was left in despite being mathematically always null (could
  lead to runtime issues during sharding).
- the custom model mapping was inherited by children factories.

* What?

This commit fixes these issues, and refactors the key of the custom
implementation to be based on the name of the configuration class as
well.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-24 20:17:33 -08:00
Suyog Gupta
efd503751f
[#9271][perf] Enable multi-stream MOE optimization in AutoDeploy (#9322)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-24 19:50:10 -08:00
Yuxian Qiu
8a0295015f
[None][chore] Reduce nested nvtx ranges. (#9347)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-11-25 09:58:41 +08:00
bhsueh_NV
1a93583438
[None][feat] Support Yarn on QwQ-32B model (#9059)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: NVJiangShao <91270701+StudyingShao@users.noreply.github.com>
2025-11-25 07:27:28 +08:00
Yibin Li
1ce483c999
[TRTLLM-7967][feat] Adding Starcoder2 PyTorch Backend Support (#8923)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-11-24 11:23:22 -08:00
Yukun He
960851f419
[None][chore] Remove unnecessary log in the short tuning profile (#9387)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-24 12:31:26 +08:00
Yukun He
39076410a8
[https://nvbugs/5676748][fix] Fix mismatched nvfp4 gemm sf shape. (#9336)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-24 12:16:32 +08:00
brb-nv
c045e359a7
[https://nvbugs/5637012][fix] Fix helix unit tests (#9369)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-11-23 19:34:22 -08:00
QI JUN
34a6d2d28f
[TRTLLM-9302][chore] Move build config from BaseLlmArgs to TrtLlmArgs (#9249)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-24 10:54:41 +08:00
Yukun He
c3acf965a6
[TRTLLM-7963][fix] Several improvements of autotuning quality (#9348)
* Skip the shape profile generating process if the profile has already been found in the cache under tuning mode. This is a prerequisite for nested autotuning because host overhead might be included during the profiling of the high-level op.
* Enable the profiling with CUDA graph as the default profiling method.
* Apply a heuristic method to cut off the number of repeat times of profiling according to a few-run time measurement.
2025-11-24 10:38:45 +08:00
Bo Li
fcfec93cad
[TRTLLM-9389][chore] Rename AlltoAll backend names (#9329)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-23 13:52:57 -08:00
JadoTu
0582e54b61
[None][fix] modify qwen3-next sampling stop_tokens (#9331)
Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>
2025-11-23 21:10:09 +08:00
William Zhang
11a0b276fb
[#9230][feat] Slimmed down implementation of nemotron H (#9235)
* Why?

The reference nemotron H code on HuggingFace is out of date,
and therefore bugged, and has several untested code paths.
This makes an already hairy patching system even hairier.

The proposal is to do away with those patches, and replace the
original implementation with one that is heavily slimmed down.

* What?

This PR sets the basis for an alternative path with such a 
slimmed down implementation that:
- fixes bugs in the current HF implementation
- adds no new dependencies to TensorRT-LLM
- does away with unnecessary features for TensorRT-LLM/
  AutoDeploy:
- no training related code (dropout, gradient checkpointing, etc.)
- no caching logic (we want to replace it with our own anyway)
- no attention masking where possible
- reuses existing AD custom ops for mamba SSM update /
   causal conv1d / attention

In order for the above to be usable in the AD apparatus,
`AutoModelForCausalLMFactory` is extended to allow registrations
of custom model implementations.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-23 03:13:32 -08:00
Neta Zmora
3952a61681
[#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation (#9339)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-21 17:05:03 -08:00
Chenghao Zhang
564989865c
[TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT (#9106)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-21 16:05:48 -08:00
Izzy Putterman
eb7792e875
[None][feat] Eagle: PostNorm and multilayer options (#9233)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-11-21 17:39:00 -05:00
Enwei Zhu
13fbd4366a
[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) (#9288)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-21 14:03:38 -08:00
Ziyi Xiong
5df907b388
[https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler (#9321)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-21 10:19:59 -05:00
HuiGao-NV
6dd2fcd7b3
[https://nvbugs/5629833][fix] Don't fill tensors with 0 (#9296)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-11-21 20:50:05 +08:00
mpikulski
cddc7549d1
[TRTLLM-9191][feat] support out-of-tree models in trtllm-serve (#9269)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-21 04:23:47 -08:00
mpikulski
095b6864a8
[TRTLLM-8650][fix] beam search request validation (#8433) (#9228)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-21 04:08:45 -08:00
Yiqing Yan
8cd3b496e9
[None][chore] Bump version to 1.2.0rc4 (#9363)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-11-21 18:28:12 +08:00
xxi
cc0dc7c124
[TRTLLM-8957][feat] create communication related classes (#8968) 2025-11-20 22:32:42 -08:00
Pengyun Lin
eca68e4465 [https://nvbugs/5564465][fix] Overwrite only if default_max_tokens is legal (#8538)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Yukun He
9a79f32f7a [https://nvbugs/5608489][fix] Fix output unpack issues for Llama3/4 NVFP4 models. (#8679)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Lizhi Zhou
33b0b945c7 [https://nvbugs/5582277][fix] rework DisaggPPTerminationHandler to fix hang issue (#8519)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Yan Chunwei
b5f9fff1c1 [https://nvbugs/5569754][fix] trtllm-llmapi-launch port conflict (#8582)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Jin Li
3454eacd74 [https://nvbugs/5546510][fix] Move torch.cuda.Stream out of torch com… (#8494)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
JunyiXu-nv
ee6944bfa2 [https://nvbugs/5569713][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 (#8429)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Liao Lanyu
04ad9f96fa
[https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound (#9300)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-11-20 00:41:00 -08:00
Neta Zmora
1d6fbbf45d
[#9236][feature] Make sharing of activation_type across SW layers more robust (#9238)
C++, Python and Python MoE layer all share the definition of ActivationType.
Currently this is done thru redefinition which is fragile and can break when adding new activation function types.

tensorrt_llm/_torch/utils.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h
=>
tensorrt_llm/layers/moe.py
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-11-20 16:06:58 +08:00
Yukun He
5d118e0326
[None][chore] Revise the description of enable_autotuner. (#9320)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-19 22:59:37 -08:00
Yechan Kim
d5622b2689
[None][fix] Multimodal InputProcessor dummy builder fix (#8916)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-19 22:32:21 -08:00
Chang Liu
79a6c9742b
[None][fix] Use fp32 for indexer weight_proj GEMM (#9243)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-19 21:52:38 -08:00
Neta Zmora
028fc877a5
[#9096][feature] Auto Deploy: configurable fused MoE backend (#9194)
Allow configuring Auto Deploy's MoE/FP8-MoE backend from external yaml config file.

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-19 21:50:22 -08:00
JunyiXu-nv
46dccb5e2d
[None][chore] Prevent negative max_tokens passed into tllm request (#9037)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-11-20 09:58:13 +08:00
Yukun He
b6bced83c0
[TRTLLM-7963][feat] Use CUDAGraph to improve the tuning accuracy for AutoTuner. (#9089)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-20 08:54:29 +08:00
Fanrong Li
d4abb86f3e
[None][fix] fix EPLB for DeepSeek-V3.2-Exp (#9245)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-19 13:45:54 -08:00
Faraz
49c45ebef1
[None][fix] change logging for weight loading on unified memory (#9177)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
2025-11-19 14:31:19 -05:00
NVShreyas
1eae941d77
[#9237][feat] enable iter stats in autodeploy (#9278)
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-11-19 19:29:29 +01:00
NVShreyas
a7c0b54ce7
[None][feat] add specdec to nemotron nas (#8985)
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-11-19 19:28:35 +01:00
Bo Li
d8b05894ee
[None][perf] Adjust select_alltoall_method_type. (#8950)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-19 07:43:55 -08:00
mpikulski
46dd9886bb
[https://nvbugs/5661877][fix] fix test regression in TestBatchedSampling::test_samples (#9215)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-19 01:44:44 -08:00
CarstyYou
ee941ac779
[https://nvbugs/5456493][feat] add fp8 dense for sm120 (#9174)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-19 14:40:34 +08:00
ChristinaZ
941a54c66a
[None][feat] Update the indexer topK (#9255)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-19 11:49:00 +08:00
jellysnack
99ba723e20
[None][fix] logits device and shape issues in dynamic draft path (#9079)
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
2025-11-18 19:22:47 -08:00
Grzegorz Kwasniewski
7905d6c0da
[#9098][feat] Simple sharding latent experts (#9099)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-18 21:14:22 -05:00
Grzegorz Kwasniewski
92f86a50d4
[#9137][feat] Factory sharding as default (#9144)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-18 21:12:03 -05:00
Patrice Castonguay
9b0f45298f
[None][feat] Have ability to cancel disagg request if KV cache resource are exhausted (#9155)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-18 20:59:17 -05:00
Enwei Zhu
7c4777a571
[TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM (#8880)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-18 17:40:12 -08:00
Ziyi Xiong
7c4344b92e
[https://nvbugs/5590408][fix] Exclude num of draft tokens from mMaxSeqLenKv (#9210)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-18 15:41:56 -05:00
Eran Geva
3ac11a6180
[#9152][fix] AutoDeploy fused_allreduce_residual_rmsnorm to support demollm mode (#9197)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-18 22:15:29 +02:00
Chenghao Zhang
f0b68e4c66
[None][feat] AutoDeploy: Perf improvement for small batch size (#9163)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-18 12:11:12 -08:00
Gal Hubara-Agam
36d3d8f608
[None][chore] Print device info in trtllm-bench report (#8584)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2025-11-18 09:00:10 -08:00
Zheyu Fu
c4e02d7f04
[TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). (#8194)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-11-18 11:13:39 -05:00
Robin Kobus
9913dc25ae
[None][refactor] decoding inputs, part 2 (#5799)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-18 14:38:51 +01:00
Chang Liu
8e001dd195
[None][fix] DeepSeek V3.2 indexer RoPE fix (#9232)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-18 20:35:27 +08:00
Lizhi Zhou
07343bb11c
[None][chore] fix a deepseekv3 error when debug mode is on (#9217)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-11-18 01:14:32 -08:00
ruodil
82480346aa
[https://nvbugs/5652552][fix] add printing for llm args (#9205)
Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
2025-11-17 23:58:36 -08:00
Tri Dao
fc088e642c
[None][feat] Support Glm4MoeForCausalLM (#8256)
Signed-off-by: Tri Dao <daominhtri0503@gmail.com>
Co-authored-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-11-18 09:43:21 +08:00
Lucas Liebenwein
6d0a8edbbb
[None][chore] local imports for AutoDeploy in serve and bench (#9199)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-18 08:14:32 +08:00
Robin Kobus
df41f220a2
[TRTLLM-8831][feat] Enable early exit with overlap scheduler (#8587)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-17 18:07:13 +01:00
Mike Iovine
6151a4c9d6
[None][feat] Add simple optimizations for MTP 2-model (#9176)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-17 10:05:39 -05:00
Kaiyu Xie
04be5a704e
[None] [fix] Fix missing ActivationType issue (#9171)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-17 10:43:25 +08:00
Anthony Chang
86cfb3ea7e
[None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound (#9025)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-11-17 10:04:29 +08:00
Jinyang Yuan
6dc70aa0e5
[https://nvbugs/5613089][fix] Fix the rank to access all_rank_chunk_size_list when chunked MoE is used (#8723)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-11-17 10:01:08 +08:00
sunnyqgg
7862b15a65
[TRTLLM-8778][feat] Add tree attention support for blackwell arch (#8975)
Signed-off-by: qgai <qgai@nvidia.com>
2025-11-17 09:01:53 +08:00
Guoming Zhang
e0f69657c7
[None][fix] Update the attention layers counting for Qwen3-next. (#9072)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-11-16 11:52:56 -08:00
JadoTu
3cde84581d
[None][fix] Make the sliced nvfp4 output contiguous (#9123)
Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>
2025-11-15 20:00:54 +08:00
Chenghao Zhang
f6f6e1f25d
[#9102][feat] AutoDeploy: Support fp8 kv cache (#9107)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 23:55:45 -08:00
Lizhi Zhou
8bd779171e
[https://nvbugs/5631254][fix] avoid torch.compile for multiple times (#9135)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-11-13 21:49:52 -08:00
Suyog Gupta
d12cb9436d
[None][feat] Autodeploy add triton configs and optimize mamba prefill (#9083)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-13 19:15:43 -08:00
heyuhhh
f07e9977c6
[None] [feat] Use triton kernels for RocketKV prediction module (#8682)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-11-13 18:51:09 -08:00
Tailing Yuan
cc4c980e03
[None][feat] Add Qwen3-Next to layer-wise benchmarks (#9065)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-11-14 10:03:00 +08:00
JunyiXu-nv
fdb0787e85
[None][chore] Support json_schema in response_format (#8934)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-11-14 09:43:13 +08:00
Erin
44d1c75701
[TRTLLM-8988][feat] Unify MPI & Ray's req/response handling with RPC Client/Server (#8765)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-11-13 17:21:24 -08:00
Neta Zmora
34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires ReLU2 (i.e. squared ReLU) MoE activation function.
The PR adds this and adds an API to set the activation function, in general.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.

The PR also updates the Auto Deploy MoE backend for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
dongxuy04
a370643b26
[None][fix] support topk autotuner input for expert slot per group larger than 32 (#9087)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-11-14 08:37:20 +08:00
Leslie Fang
daa31d78f4
[https://nvbugs/5652552][fix] Log the llm args for main branch (#9120)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-11-14 07:43:21 +08:00
Frida Hou
b51258acdd
[None][autodeploy] fix weight extraction for graph based quantized checkpoints (#9109)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-13 13:14:24 -08:00
Frida Hou
e96a3d294d
[None][autodeploy] minor refactor to rmsnorm transforms (#8657)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-13 13:13:58 -08:00
Jinyang Yuan
12f339f3bf
[None][fix] Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-11-13 09:09:52 -08:00
Ziyi Xiong
a7aaf50541
[TRTLLM-8084][feat] Enhance the overlap shceduler for two-model spec decoding (#8706)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-13 10:20:16 -05:00
William Zhang
121140cfec
[None][fixes] Add tool call parsing fixes and Qwen3 coder parser (#8817)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-13 04:34:38 -08:00
Kaiyu Xie
177ba7b0f1
[None] [fix] Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-11-13 02:25:30 -08:00
Chang Liu
c37924f37b
[None][fix] Clear indexer k cache reference before release cuda memory (#9110)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-12 22:12:53 -08:00
Zhang Ge
49df731b96
[#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
Signed-off-by: ZhangGe6 <sjtu.zg123@gmail.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-11-13 12:14:58 +08:00
QI JUN
d1b003d31e
[TRTLLM-9212][chore] move MoeLoadBalancerConfig to llm_args.py (#9002)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-13 10:47:35 +08:00
Zhenhuan Chen
943b05e2d3
[TRTLLM-9179][feat] add pp_partition to customize each rank's layer number (#9003)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2025-11-13 10:34:17 +08:00
Chenghao Zhang
f1d637ec69
[None][fix] AutoDeploy: Use tmp folder for the load_moe_align (#9101)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-12 14:59:49 -08:00
dongxuy04
9241ccaf27
[None][feat] Enable EPLB for trtllm-gen and cutlass backend (#8886)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-11-12 12:30:27 -08:00
Patrice Castonguay
8a751a0e56
[None][chore] Remove is_disaggregated param in executor request queue (#9049)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-12 13:37:15 -05:00
Fanrong Li
780d4f9dc5
[None][feat] Add MTP>1 support for DS-v3.2 (#9045)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-12 09:56:12 -08:00
Neta Zmora
53491ffdb1
[#9023][feat] reduce AD graph optimization time for non-participating passes (#9024)
Shorten AD graph optimization by 30% (measured on Nemotron-6):

A bug in the transformation interface marked all passes as not clean, regardless of what was reported by the transformation
Fix how the optimization passes report the results of their actions. Many passes report that the graph is not clean even when they didn't participate in the optimization. Each graph cleaning invocation can take several seconds.

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-12 09:05:53 -08:00
Chang Liu
0b81173efa
[TRTLLM-9259][perf] Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-11-11 18:11:00 -08:00
Lucas Liebenwein
aca56097cb
[None][fix] AutoDeploy: update nano3 accuracy test (#9061)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-11 12:26:31 -08:00
QI JUN
524754b6fd
[TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner (#7572)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-11 10:13:45 -08:00
Chenghao Zhang
ec9cf715a2
[None][feat] AutoDeploy: Perf improvement for mamba layers (#8991)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-11 08:27:07 -08:00
Wanli Jiang
ebdd1cc8e0
[TRTLLM-8119][feat] Update doc/tests/chat_template for nano-v2-vlm (#8840)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-11-11 07:48:23 -08:00
mpikulski
20fd305bb6
[None][fix] type annotation (#9071)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-11 07:20:20 -08:00
mpikulski
b151de4a8f
[TRTLLM-8377][test] unit tests for TorchSampler batched sampling (#9012)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-11 07:16:42 -08:00
Guoming Zhang
b894dc2d70
[None][fix] Display the GPU memory information in GiB unit. (#9070)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-11-11 06:24:59 -08:00
mpikulski
979b3ae9ce
[TRTLLM-7723][feat] sampling using FlashInfer.sampling (#8581)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-11 03:21:19 -08:00
Yuxian Qiu
7aeac97e4e
[https://nvbugs/5622938][fix] Use async send_requests_to_next_pp. (#9041)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-11-11 14:19:44 +08:00
Lucas Liebenwein
6bf4e59267
[#8763][feature] AutoDeploy: configurable dtype for caching (#8812)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-10 22:17:14 -08:00
Chang Liu
7ceb5e5ab6
[TRTLLM-9198][perf] Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-11 12:33:30 +08:00
shuyixiong
1ccb799c9a
[None][chore] Relocate rlhf_utils.py (#8938)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-11-10 19:03:23 -08:00
Liao Lanyu
1fd11455d8
[https://nvbugs/5556998][fix] init_hf_modules in worker_main for models with trust_remote=true (#8931)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-11-11 10:30:37 +08:00
Frida Hou
f40e1f7496
[https://nvbugs/5625972][fix] Add context manager to fix FakeTensorProp (#9047)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-10 16:25:58 -08:00
mpikulski
edc91ba819
[None][fix] Improve type annotations on ResourceManager.get_resource_manager (#9013)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-10 15:06:16 +01:00
ChristinaZ
2e7769d1e8
[None][feat] Add customized topk and related unit tests for DSA (#8882)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-10 03:35:35 -08:00
bhsueh_NV
e8d4a56dd0
[None][fix] fix eagle3 accuracy issue on sm120 (#8944)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-11-10 14:02:03 +08:00
Fanrong Li
a7033a9193
[TRTLLM-9001][feat] add TP support for DeepSeek-V3.2 (#8943)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-10 12:16:01 +08:00
mpikulski
533add5056
[TRTLLM-8598][feat] enable n > 1 in OpenAI API with PyTorch backend (#8951)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-07 17:47:35 -08:00
hvagadia
6ff82ea24e
[None][feat] Allow env variable to specify spawn process IPC address (#8922)
Signed-off-by: hvagadia <hvagadia@nvidia.com>
2025-11-07 15:45:57 -08:00
Chang Liu
7081f254cf
[None][perf] Add custom indexer k cache scatter op (#8960)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-07 11:24:26 -08:00
Patrice Castonguay
d8ea0b967f
[None][fix] Moving transfer timeout test to test_llm_pytorch, fixing broken kv transfer timeout (#8892)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-07 07:33:51 -08:00
Yiqing Yan
c836ae5aaa
[None][chore] Bump version to 1.2.0rc3 (#9004)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-11-07 01:24:32 -08:00
mpikulski
5ef65872a3
[None][fix] type annotations in fuse_input_embeds (#8976)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-07 09:04:08 +01:00
Stefan Niebler
326a201473
[https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) (#8794)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-11-07 09:01:15 +01:00
QI JUN
1c6e490894
[TRTLLM-9065][chore] remove PyTorchConfig completely (#8856)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-06 22:37:03 -08:00
Eran Geva
990e674b71
[None][fix] Switch AD AllReduce strategy to NCCL (#8979)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-07 06:49:44 +02:00
xiweny
ee20e679a9
[https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls (#8939)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-11-06 19:57:19 -08:00
Cao Dong
b53961e972
[None][feat] Return logprobs incrementally in torch backend (#8785)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-11-07 10:23:39 +08:00