Commit Graph

735 Commits

Author SHA1 Message Date
Lucas Liebenwein
9879400479
[#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] (#10675)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-18 13:42:30 -05:00
Eran Geva
4d2916d683
[#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size (#10687)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2026-01-18 13:31:01 -05:00
Eran Geva
a11f0dbd61
[#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 (#10697)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2026-01-18 10:42:49 +02:00
Grzegorz Kwasniewski
7bf4dd9f63
[TRTLLM-10318][feat] Fixing Nemotron sharding: support for sharding buffers (#10319)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Lucas <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Lucas <11156568+lucaslie@users.noreply.github.com>
2026-01-17 04:02:06 -05:00
Yukun He
3d16daf696
[None][fix] Fix tmp dir being deleted too early in unit test. (#10740)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-17 13:49:10 +08:00
Frida Hou
069ad68d3c
[None][fix] AutoDeploy: skip mxfp4_moe test unless on Hopper (#10729)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2026-01-16 16:24:37 -05:00
Chenghao Zhang
b6acd96616
[None][fix] AutoDeploy: Fix the nvfp4 fused_moe (#10727)
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
2026-01-16 12:04:40 -08:00
Stefan Niebler
0cfd08745c
[TRTLLM-9735][feat] Add processed logprobs functionality to TorchSampler (#9675)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2026-01-16 10:52:41 -08:00
xxi
ce561b6a8e
[TRTLLM-9111][feat] MoE test refactor: Extend MoE quantization test utilities with comprehensive quant algorithm support (#10691)
Signed-off-by: xxi <xxi@nvidia.com>
2026-01-16 10:26:33 +08:00
heyuhhh
e3f27e06c7
[None][chore] Waive star attention unittests (#10439)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2026-01-16 10:12:32 +08:00
Perkz Zheng
71ccc07d2b
[None][feat] update trtllm-gen to support groupsTokensHeadsQ (#10261)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-15 02:24:25 -05:00
Anish Shanbhag
faa80e73fd
[None][feat] Auto download speculative models from HF for pytorch backend, add speculative_model field alias (#10099)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2026-01-14 21:06:07 -08:00
Lucas Liebenwein
15b43e8a14
[https://nvbugs/5777041][fix] fix AutoDeploy ep sharding test (#10460)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-14 21:53:56 -05:00
Yuxian Qiu
39cefd6125
[None][refactor] Unify the usage of MPIDist and TorchDist. (#10380)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-01-14 14:05:47 +08:00
Leslie Fang
bc119f5644
[None][chore] Add test configurable moe module (#10575)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2026-01-14 07:25:57 +08:00
Frida Hou
bf16fbd86c
[#9283][feat] AutoDeploy: separate rms pattern detection from fusion (#9969)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2026-01-13 14:57:27 -05:00
benzh-2025
6df2c8a074
[None][feat] add fp4 gemm + allreduce (#9729)
Signed-off-by: benzh 
Signed-off-by: benzh-2025
2026-01-13 21:11:13 +08:00
mpikulski
bf7998f1b8
[TRTLLM-9522][test] cover LLM API multi_modal_embeddings (#9963)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-01-12 11:38:22 +01:00
Yechan Kim
8e0d20d901
[TRTLLM-10195][feat] K-EXAONE support (#10355)
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Jaedeok Kim <jaedeokk@nvidia.com>
2026-01-12 00:29:51 +09:00
Chenghao Zhang
38f249b479
[https://nvbugs/5548861][fix] AutoDeploy: Fix the test (#10521)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-01-09 13:30:24 -08:00
Yechan Kim
7295af68ba
[None][fix] Enable AttentionDP on Qwen3-VL and fix test (#10435)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2026-01-10 00:13:26 +09:00
William Zhang
c0ae6bbdbe
[None][feat] EPD for Qwen3 VL (#10470)
* Why?

We would like to support EPD disaggregated serving for Qwen3 VL.

* What?

This commit adds such support, and extends existing unit tests for
correctness checks.

Some minor (protected) interface changes had to be made to the
weight mapper as a side-effect.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-01-08 06:45:54 -05:00
Lucas Liebenwein
30f8455d29
[https://nvbugs/5747878][fix] unwaive llama4 scout tests (#10468)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-07 23:33:45 -05:00
Yuxian Qiu
b85c447ceb
[https://nvbugs/5784543][fix] Setup dist before using autotuner. (#10491)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-01-08 10:32:50 +08:00
Lucas Liebenwein
d736c7f290
[https://nvbugs/5761665][fix] AutoDeploy: handle bugs for 25.12 dlfw upgrade (#10511)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-07 20:16:53 -05:00
Lucas Liebenwein
6095c80e56
[https://nvbugs/5721907][fix] AutoDeploy: improve numerical stability of flashinfer attention test (#10467)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-06 21:11:06 -05:00
Lucas Liebenwein
bb6a3973aa
[https://nvbugs/5732942][fix] AutoDeploy: handle transformers 4.57.1 upgrade fixes (#10466)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-06 19:55:49 -05:00
Mike Iovine
77be1b7572
[https://nvbugs/5749988][fix] Remove redundant qwen3 spec dec test (#10387)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2026-01-06 11:46:34 -05:00
alel
6b8ae6fa81
[None][feat] CuteDSL MOE FC1 Enhancement (#10088)
Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com>
2026-01-06 09:30:43 +08:00
Anthony Chang
225d3a9001
[None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2026-01-05 17:16:12 +01:00
Yukun He
d272f1a9bc
[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-05 15:44:37 +08:00
Yukun He
0937df2c68
[TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-05 13:44:09 +08:00
dongfengy
afc533193d
[None][feat] Support nvfp4 for gptoss (#8956)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2026-01-04 08:57:44 -05:00
Jaedeok Kim
a4dcc6a711
[TRTLLM-10171][fix] Correct attention handling in ModelConfig and KVCacheManager (#10330)
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2026-01-04 06:07:30 -05:00
Izzy Putterman
bdf6953ddc
[None][feat] Eagle: MLA Based Eagle (#9677)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2026-01-02 13:45:07 -05:00
Balaram Buddharaju
4a1b742aa0
[TRTLLM-9467][fix] Fix PP+CP combination with helix parallelism (#10312)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-01-01 13:42:53 -05:00
Lucas Liebenwein
1bbe71b3ed
[#10244][feat] AutoDeploy: separate prefill/decode in flashinfer (#10252)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-31 17:01:24 -05:00
Simeng Liu
84d107b2f0
[https://nvbugs/5717993][fix] Add execution_stream across PyExecutor, KVCacheManager, PeftCacheManager to ensure proper CUDA stream synchronization between KV cache transfer operations and model forward kernels. (#10060)
Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
2025-12-31 09:22:54 -08:00
tcherckez-nvidia
464847c6be
[#9717][chore] Standardize MoE weights interface (#10295)
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
2025-12-31 07:37:18 -05:00
Necofish
73870ae4ad
[None][feat] support Qwen3-VL dense model in pytorch backend (#9060)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-12-31 17:54:26 +09:00
Neta Zmora
966231d29c
[#9626][feat] Add an auto-deploy transform for using cutlass FP4 MoE kernels (#10304)
Add a transform to relace torch.ops.auto_deploy.torch_quant_nvfp4_moe
with the optimized torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused.

Currently generates the wrong results when the number of rows in MoE FC1 weights is not divisible by 128,
so torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused is not set as the default FP4 MoE implementation (i.e. the transform is disabled).

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-29 23:18:15 +02:00
Guoming Zhang
93ac0bc1dc
[TRTLLM-10126][feat] Increase topk upper limit to 22 for NVLinkOneSid… (#10229)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-12-27 22:48:10 +08:00
Neta Zmora
f3f02315df
[None][chore]: small refactoring to auto-deploy MoE operator (#10300)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-25 12:27:11 -05:00
Ziyi Xiong
d8b5aeb061
[https://nvbugs/5652062][fix] Rewind kv_cache and reset draft tokens (#10160)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-12-25 09:13:51 -05:00
ZhichenJiang
46e4af5688
[TRTLLM-9831][perf] Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201)
Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-25 09:04:20 -05:00
Gabriel Wu
1d01214ff0
[None][feat] Drop non-deepgemm fp8 block scale gemm (#10256)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
2025-12-25 14:52:52 +08:00
Neta Zmora
c4b36d31ff
[#10137][feat] AutoDeploy FP8 MoE refactor (#10138)
The trtllm (cutlass) fp8 moe operator performs W3+W1 fusion (concat) during inference and we want to move this fusion to the model optimization time.

The Cutlass MoE kernel is used thru a trtllm torch operator.
Its implementation uses two FC operations (fc1 and fc2) while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3) so when we switch from the torch.moe op to the trtllm.moe op we also change terminology from w1, w2, w3 to fc1, fc2.

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-24 18:58:10 +02:00
shuyixiong
f4f0fe85e9
[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939)
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
2025-12-24 15:27:01 +08:00
Fanrong Li
156f6453dc
[TRTLLM-9798][feat] Change to use new DeepGEMM MQA sm100 kernel for MTP-3 (#10226)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-12-24 14:39:12 +08:00
Balaram Buddharaju
8c1cfc872b
[TRTLLM-9493][feat] Custom AllToAll for helix parallelism (#9986)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-12-23 18:14:30 -08:00