Commit Graph

1988 Commits

Author SHA1 Message Date
Leslie Fang
daa31d78f4
[https://nvbugs/5652552][fix] Log the llm args for main branch (#9120)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-11-14 07:43:21 +08:00
Frida Hou
b51258acdd
[None][autodeploy] fix weight extraction for graph based quantized checkpoints (#9109)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-13 13:14:24 -08:00
Frida Hou
e96a3d294d
[None][autodeploy] minor refactor to rmsnorm transforms (#8657)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-13 13:13:58 -08:00
Jinyang Yuan
12f339f3bf
[None][fix] Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-11-13 09:09:52 -08:00
Ziyi Xiong
a7aaf50541
[TRTLLM-8084][feat] Enhance the overlap scheduler for two-model spec decoding (#8706)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-13 10:20:16 -05:00
William Zhang
121140cfec
[None][fixes] Add tool call parsing fixes and Qwen3 coder parser (#8817)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-13 04:34:38 -08:00
Kaiyu Xie
177ba7b0f1
[None] [fix] Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-11-13 02:25:30 -08:00
Chang Liu
c37924f37b
[None][fix] Clear indexer k cache reference before release cuda memory (#9110)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-12 22:12:53 -08:00
Zhang Ge
49df731b96
[#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
Signed-off-by: ZhangGe6 <sjtu.zg123@gmail.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-11-13 12:14:58 +08:00
QI JUN
d1b003d31e
[TRTLLM-9212][chore] move MoeLoadBalancerConfig to llm_args.py (#9002)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-13 10:47:35 +08:00
Zhenhuan Chen
943b05e2d3
[TRTLLM-9179][feat] add pp_partition to customize each rank's layer number (#9003)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2025-11-13 10:34:17 +08:00
Chenghao Zhang
f1d637ec69
[None][fix] AutoDeploy: Use tmp folder for the load_moe_align (#9101)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-12 14:59:49 -08:00
dongxuy04
9241ccaf27
[None][feat] Enable EPLB for trtllm-gen and cutlass backend (#8886)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-11-12 12:30:27 -08:00
Patrice Castonguay
8a751a0e56
[None][chore] Remove is_disaggregated param in executor request queue (#9049)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-12 13:37:15 -05:00
Fanrong Li
780d4f9dc5
[None][feat] Add MTP>1 support for DS-v3.2 (#9045)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-12 09:56:12 -08:00
Neta Zmora
53491ffdb1
[#9023][feat] reduce AD graph optimization time for non-participating passes (#9024)
Shorten AD graph optimization by 30% (measured on Nemotron-6):

A bug in the transformation interface marked all passes as not clean, regardless of what was reported by the transformation
Fix how the optimization passes report the results of their actions. Many passes report that the graph is not clean even when they didn't participate in the optimization. Each graph cleaning invocation can take several seconds.

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-12 09:05:53 -08:00
Chang Liu
0b81173efa
[TRTLLM-9259][perf] Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-11-11 18:11:00 -08:00
Lucas Liebenwein
aca56097cb
[None][fix] AutoDeploy: update nano3 accuracy test (#9061)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-11 12:26:31 -08:00
QI JUN
524754b6fd
[TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner (#7572)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-11 10:13:45 -08:00
Chenghao Zhang
ec9cf715a2
[None][feat] AutoDeploy: Perf improvement for mamba layers (#8991)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-11 08:27:07 -08:00
Wanli Jiang
ebdd1cc8e0
[TRTLLM-8119][feat] Update doc/tests/chat_template for nano-v2-vlm (#8840)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-11-11 07:48:23 -08:00
mpikulski
20fd305bb6
[None][fix] type annotation (#9071)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-11 07:20:20 -08:00
mpikulski
b151de4a8f
[TRTLLM-8377][test] unit tests for TorchSampler batched sampling (#9012)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-11 07:16:42 -08:00
Guoming Zhang
b894dc2d70
[None][fix] Display the GPU memory information in GiB unit. (#9070)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-11-11 06:24:59 -08:00
mpikulski
979b3ae9ce
[TRTLLM-7723][feat] sampling using FlashInfer.sampling (#8581)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-11 03:21:19 -08:00
Yuxian Qiu
7aeac97e4e
[https://nvbugs/5622938][fix] Use async send_requests_to_next_pp. (#9041)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-11-11 14:19:44 +08:00
Lucas Liebenwein
6bf4e59267
[#8763][feature] AutoDeploy: configurable dtype for caching (#8812)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-10 22:17:14 -08:00
Chang Liu
7ceb5e5ab6
[TRTLLM-9198][perf] Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-11 12:33:30 +08:00
shuyixiong
1ccb799c9a
[None][chore] Relocate rlhf_utils.py (#8938)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-11-10 19:03:23 -08:00
Liao Lanyu
1fd11455d8
[https://nvbugs/5556998][fix] init_hf_modules in worker_main for models with trust_remote=true (#8931)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-11-11 10:30:37 +08:00
Frida Hou
f40e1f7496
[https://nvbugs/5625972][fix] Add context manager to fix FakeTensorProp (#9047)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-10 16:25:58 -08:00
mpikulski
edc91ba819
[None][fix] Improve type annotations on ResourceManager.get_resource_manager (#9013)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-10 15:06:16 +01:00
ChristinaZ
2e7769d1e8
[None][feat] Add customized topk and related unit tests for DSA (#8882)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-10 03:35:35 -08:00
bhsueh_NV
e8d4a56dd0
[None][fix] fix eagle3 accuracy issue on sm120 (#8944)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-11-10 14:02:03 +08:00
Fanrong Li
a7033a9193
[TRTLLM-9001][feat] add TP support for DeepSeek-V3.2 (#8943)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-10 12:16:01 +08:00
mpikulski
533add5056
[TRTLLM-8598][feat] enable n > 1 in OpenAI API with PyTorch backend (#8951)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-07 17:47:35 -08:00
hvagadia
6ff82ea24e
[None][feat] Allow env variable to specify spawn process IPC address (#8922)
Signed-off-by: hvagadia <hvagadia@nvidia.com>
2025-11-07 15:45:57 -08:00
Chang Liu
7081f254cf
[None][perf] Add custom indexer k cache scatter op (#8960)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-07 11:24:26 -08:00
Patrice Castonguay
d8ea0b967f
[None][fix] Moving transfer timeout test to test_llm_pytorch, fixing broken kv transfer timeout (#8892)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-07 07:33:51 -08:00
Yiqing Yan
c836ae5aaa
[None][chore] Bump version to 1.2.0rc3 (#9004)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-11-07 01:24:32 -08:00
mpikulski
5ef65872a3
[None][fix] type annotations in fuse_input_embeds (#8976)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-07 09:04:08 +01:00
Stefan Niebler
326a201473
[https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) (#8794)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-11-07 09:01:15 +01:00
QI JUN
1c6e490894
[TRTLLM-9065][chore] remove PyTorchConfig completely (#8856)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-06 22:37:03 -08:00
Eran Geva
990e674b71
[None][fix] Switch AD AllReduce strategy to NCCL (#8979)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-07 06:49:44 +02:00
xiweny
ee20e679a9
[https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls (#8939)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-11-06 19:57:19 -08:00
Cao Dong
b53961e972
[None][feat] Return logprobs incrementally in torch backend (#8785)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-11-07 10:23:39 +08:00
Chang Liu
1c19fd6868
[https://nvbugspro.nvidia.com/bug/5637012][fix] Bugfix when config is None for MLA (#8978)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-07 09:37:19 +08:00
jthomson04
fcae852cef
[None][fix] Fix KV cache clearing with KV Connector API (#8750)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-11-06 14:28:27 -08:00
Chenghao Zhang
1a78e7a3d6
[None][feat] AutoDeploy: Support Latent MOE for Nemotron (#8955)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-06 12:40:19 -08:00
dhansen-nvidia
ada93f1187
[https://nvbugs/5527655][feat] Add NUMA-aware CPU affinity autoconfig (#8805)
Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
Co-authored-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
2025-11-06 11:59:46 -08:00
Chenghao Zhang
ddf2d010e2
[TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear (#8820)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-06 11:00:10 -08:00
yunruis
51545560da
[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation (#8495)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-11-06 17:39:57 +08:00
Yilin Fan
b7798bfab8
[None][feat] Add trtllm_ prefix for exposed metrics (#8845)
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-11-06 15:27:18 +08:00
Yi Sun
cc12d33393
[None][feat] Deep Research Implemented with Scaffolding (#8452)
Signed-off-by: Yi Sun <yisun0618@gmail.com>
2025-11-06 10:33:28 +08:00
JadoTu
6bbb43f2b9
[None][feat] Add qwen3-next nvfp4 support (#8526)
Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>
2025-11-06 09:45:44 +08:00
Frida Hou
fb7f9831d3
[#8924][fix] Fix AutoDeploy pattern matcher for torch 2.9 (#8920)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-05 13:29:20 -08:00
Lucas Liebenwein
b181568d6f
[TRTLLM-8201][feat] Nemotron H MoE Sharding (#8744)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-05 12:35:29 -08:00
Chang Liu
e57d83c5dc
[TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt (#8771)
2025-11-05 07:57:09 -08:00
Yukun He
b9e5315dfb
[https://nvbugs/5623960][fix] Fix the logger once key issue and further compress log in AutoTuner. (#8873)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-05 15:25:43 +08:00
Shiyu Li
eeb56c2848
[None][feat] MNNVLAllreduce Kernel Refactor (#8018)
Signed-off-by: Shiyu Li <timlee0212@outlook.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-11-05 08:49:47 +08:00
Frida Hou
11ded113cd
[#8389][fix] Update group attention matching to first map to custom torch attention (#8638)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-04 12:00:43 -08:00
shuyixiong
70e4d72ffa
[TRTLLM-8511][feat] Add update_weights and sleep_wakeup support for rl integration (#8302)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
Co-authored-by: Liwei Ma <liweim@nvidia.com>
Co-authored-by: Jonas Yang CN <joyang@nvidia.com>
2025-11-04 10:19:24 -08:00
Bo Li
e4bf29bc66
[None][feat] Integrate MnnvlThroughput into TRTLLM MoE. (#8728)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-04 21:36:29 +08:00
Cao Dong
dddfcdd3bf
[None][fix] Fix bug of undefined py_topk_logprobs_vals (#8789)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-11-04 19:32:59 +08:00
Zhanrui Sun
4de31bece2
[TRTLLM-8994][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 (#8838)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-04 18:59:34 +08:00
CarstyYou
4296c9553d
[TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 (#8844)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-04 18:10:36 +08:00
danielafrimi
2b58dba0f6
[https://nvbugs/5524714][fix] Fix TP sharding of fused-QKV weight scales in W4A16 AWQ (#8432)
Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Patrice Castonguay
65c138108e
[https://nvbugs/5552889][fix] fix: Prevent empty batch when using attention DP with disagg (#8372)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
xiweny
fcac2022e2
[https://nvbugs/5565565] [fix] fp8 wideep support sm103 (#8228)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Yechan Kim
67208f1512
[None][fix] InputProcessor config naming convention fix (#8705)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-03 22:29:21 -08:00
HuiGao-NV
97674c3114
[TRTLLM-8690][feat] add more tensors to share buffers (#8691)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-11-03 21:08:01 -08:00
Yan Chunwei
ed297d7c2e
[None][chore] Optimize perf for the RPC executor and add some profile utilities to llm-api (#8415)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-11-03 17:59:49 -08:00
Matthias Jouanneaux
d0f107e4dd
[TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-11-04 09:06:58 +08:00
Li Min
89336fbf07
[None][fix] Fix cute dsl nvfp4 gemm autotune issue (#8761)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-11-03 22:55:45 +08:00
Yechan Kim
f48968b6cc
[TRTLLM-6928][fix] Refactor multimodal unittest (#8453)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-03 06:01:07 -08:00
Yechan Kim
00c0e6c440
[https://nvbugs/5523315][fix] Fix serve benchmark test (#8255)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-03 00:30:13 -08:00
Cao Dong
2ff772ef71
[None][feat] Add benchmark to DeepConf (#8776)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-11-03 16:05:50 +08:00
yufeiwu-nv
b4d17d1a4c
[TRTLLM-8991][test] Add Llama 3.3 70B model with different performance config (#8753)
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Co-authored-by: Larry Xu <197874197+LarryXFly@users.noreply.github.com>
2025-11-03 13:34:06 +08:00
Chang Liu
f57dc01e6f
[https://nvbugs/5625380][chore] Remove multimodal related fields from decoder llm input (#8846)
2025-11-02 17:44:08 -08:00
Eran Geva
f8778230e3
[#8781][fix] Cache the AllReduce wrapper to avoid re-allocating workspace which caused a hang (#8803)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-02 15:30:39 +02:00
Yan Chunwei
1551ed8e5f
[https://nvbugs/5437384][test] CHERRY-PICK: fix trtllm-llmapi-launch multi tests (#8567)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-11-01 06:49:33 -07:00
Bo Li
4c5a8f4ec6
[None][fix] Rename: slot_count -> invalid_expert_id (#8783)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-01 21:36:59 +08:00
QI JUN
89e0117097
[TRTLLM-8836][chore] Create ModelEngine from LlmArgs (#8600)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-01 05:26:06 -07:00
Fanrong Li
f0dc746738
[TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-31 14:38:31 -07:00
Suyog Gupta
3d0e38e074
[None][perf] AutoDeploy optimize _get_unique_value (#8822)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-31 04:57:10 -07:00
Anthony Chang
852e5060aa
[https://nvbugs/5558117][fix] Allow per-layer quant config from hf_quant_config.json (#8617)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-31 04:41:44 -07:00
Yukun He
1d4a186ace
[https://nvbugs/5623960][fix] Compress the warning log of AutoTuner when encountering tactic failures. (#8793)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-31 11:09:14 +08:00
Yuxian Qiu
025d2926df
[https://nvbugs/5599515][fix] Fix PP bubbles. (#8687)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-31 10:13:56 +08:00
Yilin Fan
f3224ccd32
[None][feat] Add disagg relay time to time breakdown tool (#8465)
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-30 18:21:45 -07:00
Chenghao Zhang
71c5576a44
[TRTLLM-8734][feat] AutoDeploy: Enable the nvfp4 for Nemotron MOE (#8737)
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-30 12:33:08 -07:00
Tailing Yuan
ec31363a86
[None][fix] Layer wise benchmarks: use local models, lint (#8799)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-10-30 09:47:46 -07:00
Tailing Yuan
f9c7786dc8
[None][feat] Add layer wise benchmarks (#8777)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-10-30 20:29:34 +08:00
Anthony Chang
f666ad2f6b
[None][feat] Autotuner can iterate through all tactics for test purposes (#8663)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-30 13:11:25 +01:00
WeiHaocheng
cc286687c4
[None][feat] Refactor scaffolding streaming feature and fix openai wo… (#8622)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-10-30 16:02:40 +08:00
Void
6b755fd9f8
[None][fix] fix runtime error that bf16 input is not quantized to nvfp4 when use bf16 dispatch (#8507)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-10-30 15:06:54 +08:00
Yi Zhang
496b419791
[None][doc] Add doc for torch.compile & piecewise cuda graph (#8527)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
2025-10-29 21:15:46 -07:00
Simeng Liu
834a780655
[https://nvbugs/5599086][fix] Fix FP8 Linear module for spark (#8707)
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-10-29 13:58:19 -07:00
Iman Tabrizian
ae6875fe10
[TRTLLM-8976][feat] Move indexer-k-cache to KVCacheManager (#8699)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-10-29 08:04:26 -07:00
Leslie Fang
451959c60d
[TRTLLM-8763][chore] Deprecate pybind based GuidedDecodingConfig usage in torch backend (#8717)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-10-29 20:37:14 +08:00
Fanrong Li
a21697ead9
[None][fix] fix config loading for DeepSeek-V3.2 in trtllm-bench (#8729)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-10-29 05:17:16 -07:00
kris1025
e2c5a38879
[https://nvbugs/5534574][fix] disable spec decoding forever once the request spec decoding is disabled (#8446)
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-29 19:28:43 +08:00
Yi Zhang
a69bd2a6fa
[https://nvbugs/5550409][fix] Disable torch compile in piecewise attention part to Avoid host overhead (#8708)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
2025-10-29 18:12:58 +08:00
Pengyun Lin
2aade46d18
[TRTLLM-8214][feat] Support Qwen3 tool parser (#8216)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-10-29 15:48:29 +08:00
Chang Liu
5f737b8dbe
[None][perf] Use fp8 quant kernel in DS3.2 indexer module (#8701)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-10-29 12:45:09 +08:00
Cheng Hang
15c293a90b
[None][feat] Enable nvfp4 cuda core for sm120 (#8620)
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-10-29 12:39:03 +08:00
Yechan Kim
bc26f4ce7c
[https://nvbugs/5549829][fix] Qwen2.5-VL TP > 1 + Quantized weight load fix (#8680)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-29 13:38:42 +09:00
Yechan Kim
cf8a1d2ef9
[https://nvbugs/5596377][fix] Fix mm dummy calculation (#8498)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-29 09:45:21 +09:00
Lizhi Zhou
24167d00eb
[TRTLLM-8431][doc] update public doc and example, add etcd auto-scaling tests (#8602)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-28 17:04:53 -07:00
Kaiyu Xie
227c288441
[TRTLLM-8827] [feat] Enable low precision alltoall for Cutlass and TRTLLMGen backends (#8675)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-10-29 07:56:48 +08:00
Mike Iovine
00161b315f
[https://nvbugs/5549111][fix] Fix 2-model overlap scheduler accuracy on very long prompts (#8076)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Michael Iovine <miovine@nvidia.com>
2025-10-28 14:55:34 -07:00
Lucas Liebenwein
0ee71d95ec
[https://nvbugs/5606166][fix] AutoDeploy: use tuples for cudagraph shape lookup (#8658)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-28 10:52:43 -07:00
Anish Shanbhag
a09b38a862
[TRTLLM-8684][chore] Migrate BuildConfig to Pydantic, add a Python wrapper for KVCacheType enum (#8330)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2025-10-28 09:17:26 -07:00
William Zhang
cdc9e5e645
[None][fix] Properly raise error for nemotron H models (#8697)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-10-28 08:59:42 -07:00
Eran Geva
e051a05e6c
[#8694][fix] fix AutoDeploy cuda memory access failure in nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3 (#8696)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-10-28 13:21:43 +02:00
Erin
a966644a71
[None][fix] Change Ray submit() to use async RPC (#8636)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-28 00:56:13 -04:00
gramnarayan
88b0fbc8ff
[#8245][feat] Autodeploy: Guided Decoding Support (#8551)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-28 09:29:57 +08:00
Bo Li
9c4432f8a4
[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. (#7499)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-27 13:23:06 -04:00
nvxuanyuc
d1398c05e6
[None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127)
Signed-off-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-10-27 13:12:31 -04:00
Chenghao Zhang
b9b2802599
[None][feat] Autodeploy: Update the ssm to use slice (#8667)
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
2025-10-27 09:45:20 -07:00
mpikulski
7c8ba71b49
[TRTLLM-8832][feat] fully async _select_generated_logits with tests (#8628)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-27 16:15:32 +01:00
QI JUN
4fd58137a1
[TRTLLM-8933][chore] remove unused update_executor_config function (#8678)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-27 10:00:47 -04:00
Kaiyu Xie
c9b08790c2
[None] [test] Add MNNVL AlltoAll tests to pre-merge (#8601)
2025-10-27 21:39:44 +08:00
Chao Ni
0019d99e6d
[None][test] Add longbench v2 for long context evaluation (#8604)
Signed-off-by: mni <125171826+baize97@users.noreply.github.com>
2025-10-27 20:01:14 +08:00
zhanghaotong
1026069a2b
[None][feat] Add opentelemetry tracing (#5897)
Signed-off-by: Zhang Haotong <zhanghaotong.zht@antgroup.com>
Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Zhang Haotong <zhanghaotong.zht@alibaba-inc.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-10-27 18:51:07 +08:00
Tailing Yuan
858d6437c1
[None][fix] Fix ModelConfig.from_pretrained get quant config file (#8647)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-10-27 11:02:24 +08:00
Jinyang Yuan
0a0f93d4a8
[None][fix] Fix the performance issue of FP8 blockwise grouped GEMM when using attention DP (#8501)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-10-27 10:18:19 +08:00
Chenghao Zhang
a6d20f6f9b
[None][feat] AutoDeploy: Add FP8 MOE for Nemotron (#8599)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-10-25 15:26:45 -04:00
Wanli Jiang
95be56e56b
[TRTLLM-8238][feat] Add EVS support for nano-v2-vlm (#8024)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-10-25 05:43:27 -04:00
Simeng Liu
2b27810198
[https://nvbugs/5494718][fix] Fix Single GPU Multi-node issue and OOM on DGX Spark (#8514)
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-10-24 19:09:07 -07:00
Erin
812bc8c954
[TRTLLM-8513][feat] Add back worker extension (#8482)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-24 20:30:28 -04:00
jthomson04
02081e2390
[None][feat] Support KV Connector with Disagg Prefill Worker (#8246)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-10-24 11:09:06 -07:00
Chang Liu
e47c787dd7
[TRTLLM-8535][feat] Support DeepSeek V3.2 with FP8 + BF16 KV cache/NVFP4 + BF16 KV cache (#8405)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-24 13:40:41 -04:00
Yechan Kim
2d86d6be40
[TRTLLM-8737][feat] Support media_io_kwargs on trtllm-serve (#8528)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-24 12:53:40 -04:00
Aurelien Chartier
cdf0403c64
[None][feat] Pass KvCacheRetentionConfig to torch LlmRequest (#8634)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-24 06:44:34 -07:00
Chuang Zhu
2420918e5b
[TRTLLM-7078][chore] optimal kvcache transfer for VWSA (#7952)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-24 08:58:16 -04:00
Suyog Gupta
f512ddaeef
[None][feat] add skip condition in AutoDeploy's triton fused moe kernel (#8632)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-24 08:46:17 -04:00
Wanli Jiang
f448043d88
[None][feat] Support base64 video input (#8458)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-10-24 10:23:13 +08:00
Zheng Duan
e666a704f5
[None][doc] add visualization of perf metrics in time breakdown tool doc (#8530)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-10-23 22:09:21 -04:00
QI JUN
6ee1c87595
[TRTLLM-8817][chore] Set default value of KvCacheConfig.free_gpu_memory_fraction explicitly (#8561)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-24 08:55:49 +08:00
h-guo18
23920223ab
[#4585][feat] Replace unified attention before export (#8303)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-10-23 18:02:04 -04:00
QI JUN
cc81028547
[TRTLLM-8812][chore] Limit the scope of pybind based CacheTransceiverConfig (#8558)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-23 10:32:09 -04:00
Robin Kobus
3a5845e293
[TRTLLM-8714][fix] update create_input_processor to handle custom checkpoint format (#7811)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-23 10:27:56 +02:00
Shijie
928247a3f9
[https://nvbugs/5451205][feat] Add cuBLASLt NVFP4 GEMM backend support (#7943)
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
2025-10-23 15:55:10 +08:00
Suyog Gupta
2956978da3
[None][feat] Enable rms norm fusion for Nemotron MOE (#8563)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-23 00:09:42 -04:00
sunnyqgg
ea3e0eea51
[TRTLLM-7954][feat] Target model KV cache rellocation (#8421)
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-23 09:36:50 +08:00
Anthony Chang
8a3b870e09
[None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN (#8156)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-23 09:14:18 +08:00
Anish Shanbhag
15de45d782
[TRTLLM-8682][chore] Remove auto_parallel module (#8329)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2025-10-22 20:53:08 -04:00
Leslie Fang
e5865de518
[TRTLLM-8754][chore] Refine PyTorchModelEngine with llm args (#8493)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-10-22 20:03:18 -04:00
Patrice Castonguay
879039f6d5
[https://nvbugs/5429636][feat] Kv transfer timeout (#8459)
Signed-off-by: raayandhar <raayan.dhar@gmail.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <raayan.dhar@gmail.com>
2025-10-22 09:29:02 -04:00
Yan Chunwei
f81caf5491
[None][chore] replace print_colored_debug with logger_debug (#8417)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-10-22 17:54:38 +08:00
Yan Chunwei
3f9dbc76c0
[None][fix] fix rpc unique addr related issue (#8419)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-10-22 04:47:18 -04:00
Yiqing Yan
b04e51291a
[None][chore] Bump version to 1.2.0rc2 (#8562)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-10-22 14:35:05 +08:00
sunnyqgg
90080e0e09
[https://nvbugs/5556020][fix] test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch (#8517)
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-22 09:58:22 +08:00
Leslie Fang
50d4e5bc06
[TRTLLM-8483][chore] Refine scheduler_config and peft_cache_config in create_py_executor (#8451)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-10-22 08:33:48 +08:00
Chenghao Zhang
bac9e8c2ad
[None][feat] AutoDeploy: Add Nemotron MOE support for AutoDeploy (#8469)
2025-10-21 15:32:01 -07:00
Lizhi Zhou
23d5280a90
[TRTLLM-7843][feat] implement disagg cluster auto-scaling (#8215)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-21 17:25:07 -04:00
Lucas Liebenwein
9b54b3bfaf
[None][chore] AutoDeploy: replace HF's deprecated keyword torch_dtype --> dtype (#8510)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-21 17:07:06 -04:00
YueWeng
8dc4aac5b6
[TRTLLM-8160][feat] Add max_total_draft_tokens (#8366)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-10-21 11:11:04 -04:00
Pengyun Lin
a4227cf1b0
[None][feat] Support Qwen3 reasoning parser (#8000)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-10-21 14:08:39 +08:00
Bo Li
ebb62e17d8
[None][feat] Add alltoall to trtllm-gen MoE backend. (#8481)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-10-21 12:42:54 +08:00
mpikulski
87eb5086fb
[None][fix] restore list[list[list[int]]] in add_token (#8502)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-20 22:34:57 -04:00
Yechan Kim
85d5aa7763
[None][feat] Support kv_cahce_reuse for HyperCLOVAX-Vision model (#7789)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-21 11:11:24 +09:00
Suyog Gupta
7050b1ea49
[#8272][feat] Enable chunked prefill for SSMs in AutoDeploy (#8477)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-20 15:31:52 -07:00
Lucas Liebenwein
55c468b218
[#8461][feat] AutoDeploy: trtllm-serve bug fix + unit test (#8462)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-20 16:06:39 -04:00
Pamela Peng
b818a912d7
[https://nvbugs/5540752][fix] Support quantized Phi4 MM models (#8190)
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
2025-10-20 06:36:09 -04:00
mpikulski
97ce0ecefe
[TRTLLM-8436][feat] batched sampling and top-k logprobs improvements (#8398)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-20 11:15:41 +02:00
ChristinaZ
c8b9998acb
[TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-20 10:08:31 +08:00
Bo Deng
dd25595ae8
[TRTLLM-7964][infra] Set nixl to default cache transceiver backend (#7926)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-10-19 19:24:43 +08:00
jthomson04
852316886e
[None][fix] Fix KV event consumption (#6346)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-10-18 15:41:26 -07:00
Lucas Liebenwein
41169fb20c
[None][feat] AutoDeploy: chunked prefill support (#8158)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-18 00:47:35 -07:00
QI JUN
4a8ac8dd62
[TRTLLM-8480][chore] clean create_py_executor API (#8412)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-17 23:52:02 -04:00
Wanli Jiang
58b43a6dab
[None][fix] Fix get_num_tokens_per_image for nano-v2-vlm (#8425)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-10-18 08:51:35 +08:00
Kyle McGill
136e0e6882
[None][feat] Enable CUDA graph support for KvConnectorWorker API (#8275)
Signed-off-by: Kyle McGill <kmcgill@nvidia.com>
Signed-off-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com>
2025-10-17 18:09:03 -04:00
Anish Shanbhag
5ff4f88be6
[TRTLLM-8683][chore] Migrate PluginConfig to Pydantic (#8277)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2025-10-17 16:13:22 -04:00
h-guo18
55fed1873c
[None][chore] AutoDeploy: cleanup old inference optimizer configs (#8039)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-17 15:55:57 -04:00
Grzegorz Kwasniewski
bb7fdcebf4
[TRTLLM-8201][feat] Topological graph helpers (#8457)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-10-17 12:34:19 -04:00
zhhuang-nv
7a2bab93f0
[None][test] Add post merge test for Seed-OSS-36B-Instruct (#8321)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-10-17 02:30:33 -07:00
Tracin
dd06612d0e
[https://nvbugs/5540138][fix] Fix shape error when duplicating kv. (#8390)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-17 10:07:29 +08:00
John Calderon
46ee7acb33
[TRTLLM-6780][fix] Add multimodal data to dummy requests during memory profiling (#7539)
Signed-off-by: John Calderon <johncalesp@gmail.com>
Signed-off-by: John Calderon <jcalderon@nvidia.com>
Signed-off-by: john calderon <jcalderon@nvidia.com>
Signed-off-by: John Calderon <jcalderon@nvidia>
2025-10-16 17:49:22 +02:00
Jin Li
d594c2d0ff
[https://nvbugs/5537348][fix] Use device tensor index for MTP (#8062)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Yechan Kim
9587f099ac
[https://nvbugs/5547434][fix] Fix Qwen2.5-VL device_path error (#8057)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Yukun He
179c7dc501
[https://nvbugs/5536131][fix] Fix illegal access issue when scale is not provided in Llama3/4. (#7960)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Enwei Zhu
57a4ef870a
[None][fix] Fix chunked prefill state of draft request (#8067)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Wangjue Yao
9865d3d770
[None][feat] Support cached tokens for Openai server (#7637)
Signed-off-by: wjueyao <wyao123@terpmail.umd.edu>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-10-16 20:51:37 +08:00
chinamaoge
ee588a73ac
[None][fix] Fix the error where checkpoint_dir is assigned as NONE wh… (#8401)
Signed-off-by: maoge <maoge23@qq.com>
Co-authored-by: maoge <maoge23@qq.com>
2025-10-16 13:37:43 +08:00
Min Yu
0a0159fdd8
[https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend (#7286)
Signed-off-by: Min Yu <171526537+yumin066@users.noreply.github.com>
2025-10-16 11:07:48 +08:00
Cao Dong
e75b4f9f65
[None][feat] Dev DeepConf (#8362)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-10-16 11:01:31 +08:00
Wanli Jiang
ebf0e51206
[TRTLLM-8579][feat] Support quantized model for nano-v2-vlm (#8304)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-10-16 09:44:11 +08:00
Yan Chunwei
206cf31705
[https://nvbugs/5560921][fix] GenerationExecutor RPC (#8209)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-10-16 09:05:22 +08:00
Chuang Zhu
40d129a415
[None][fix] Fix cache buffer size for window (#8320)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-16 09:01:11 +08:00
HuiGao-NV
e265eb5fe9
[None][feat] reuse cudagraph memory pool in normal forward flow (#8095)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-10-16 07:08:44 +08:00
dongfengy
7a0aa64973
[None][fix] Refactor triton paddings (#6980)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-10-15 12:59:01 -07:00
QI JUN
65ec01b257
[TRTLLM-8532][chore] clean warmup method of ModelEngine (#8264)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-15 08:40:58 -07:00
Yukun He
56c20665a9
[TRTLLM-4501][feat] Add input tensor pre-hook function API for the tuning process. (#6924)
Some tunable ops require a more realistic data distribution, for instance, a shape-associated tensor. Thus, a customizable pre-hook function can be declared in the tuning config to modify the input tensor before the tuning process.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-15 21:18:11 +08:00
mpikulski
0510b34588
[TRTLLM-8551][feat] add cache_salt in LLM.generate and refactor test_return_logits.py (#8317)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-15 02:53:57 -07:00
mpikulski
93a4b7f1b6
[None][chore] update torch_dtype -> dtype in 'transformers' (#8263)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-15 17:09:30 +09:00
QI JUN
616d1df7a0
[None][chore] set the default value of max_num_tokens explicitly (#8208)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-14 23:03:02 -07:00
sychen52
6a6124dcb5
[OMNIML-2336][feat] w4a8 nvfp4 fp8 exports scale factor properly (#8180)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
Co-authored-by: Shiyang Chen <shiychen@omniml-a6.nvidia.com>
2025-10-15 13:41:27 +08:00
Lizhi Zhou
22471ecc67
[TRTLLM-7846][feat] implement etcd storage for disagg cluster (#8210)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-14 16:48:41 -04:00
Tailing Yuan
8444a50d3a
[None][fix] Fix is_post_quant_all2all_supported for MNNVL (#8355)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-10-14 11:49:21 -07:00
shuyixiong
6776caaad1
[TRTLLM-8507][fix] Fix ray resource cleanup and error handling in LoRA test (#8175)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-10-14 23:46:30 +08:00
Fanrong Li
0d20a8fd61
[TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
Co-authored-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-10-14 08:23:16 -07:00
Cao Dong
62cea877b1
[None][feat] Move StreamGeneration to scaffolding main directory (#8347)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-10-14 17:16:04 +08:00
Yuxian Qiu
3450fe9944
[None][fix] Fix dummy load format for key models. (#7993)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-14 11:18:39 +08:00
Aurelien Chartier
9bc055faf1
[None][fix] Disable DeepGEMM for Qwen3 MoE Attention layers (#8087)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-13 18:38:47 -07:00
Lucas Liebenwein
22aa4ac08c
[None][feat] AutoDeploy: VLMs with subgraphs + cudagraph/compile (#8203)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-13 17:34:09 -07:00
Zheyu Fu
bac665e650
[TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. (#7283)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-10-13 15:51:14 -07:00
Grzegorz Kwasniewski
ea4658197f
[TRTLLM-6342][feat] Factory TP sharding of quantized models (#8123)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-13 14:04:46 -07:00
Yuxian Qiu
bd740c9ba6
[None][fix] Avoid unnecessary concat in attn_output_gate case. (#8094)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-13 12:59:40 -07:00
Robin Kobus
db8c63b9b1
[TRTLLM-4517] [feat] Additional model outputs (#7206)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-13 15:33:18 +02:00
Cao Dong
d882c92a84
[None][fix] Fix EventLoopShutdownError (#8260)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-10-13 17:31:33 +08:00
Po-Han Huang (NVIDIA)
6fc6f70a68
[https://nvbugs/5441729][test] Fix test_modeling_llama_min_latency.py failures (#7478)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-10-13 15:35:02 +08:00
Leslie Fang
8d1b068b1a
[TRTLLM-8477][chore] Replace KvCacheConfigCpp with KvCacheConfig inside PyExecutor (#8259)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-10-13 14:55:36 +08:00
DylanChen-NV
d6e315e9ff
[None][feat] Add torch compile support for cuda core GEMM OP (#8261)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-10-12 20:57:17 -07:00
amitz-nv
fac47e2826
[https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 (#8063)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-12 12:29:52 -07:00
kris1025
a7ea544dbe
[TRTLLM-7384][feat] enable rejection sampling for CDL (#7731)
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-12 20:38:48 +08:00
Ziyi Xiong
efd4ffa03b
[https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture (#8050)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-11 13:26:55 +08:00
Yilin Fan
2695d70d42
[None][feat] Add request timing breakdown option in benchmark_serving (#8128)
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-10 09:24:54 -07:00
QI JUN
48c15d805c
[https://nvbugs/5558167][fix] update canceled_req_ids correctly for canceled requests (#8207)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-10 18:58:26 +08:00
HuiGao-NV
795a051765
[None][chore] Print log with time for starting to load safetensor weights (#8218)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-10-10 13:54:54 +08:00
mpikulski
7b6803b6e9
[TRTLLM-7769][chore] document the role of 'd2t' (#8174)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-09 13:13:50 -04:00
Lizhi Zhou
fdf29ab8fa
[TRTLLM-7846][feat] Http disagg-cluster management implementation (#7869)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-09 09:44:01 +08:00
dongfengy
9f2a3ae88c
[None][fix] Restrict tinygemm use to certain SMs (#8182)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
2025-10-08 17:55:57 -07:00
mpikulski
8298e93bd8
[TRTLLM-8414][chore] BREAKING CHANGE: refine sampling strategy selection (#8132)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-08 15:46:50 +02:00
Sergey Klevtsov
017583a949
[https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend (#8141)
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-10-07 14:44:54 -07:00
Mike Iovine
7facac077b
[None][fix] Fix MTP illegal memory access (#8161)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-07 14:02:55 -04:00
Faraz
27a5091fcb
[None][feat] GPT-OSS Sm120/Sm121 Support (#7937)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Vincent Huang <vincenth@nvidia.com>
2025-10-06 16:59:06 -04:00
Izzy Putterman
f2657c1ae9
[None][fix] Eagle: Attention DP (#7939)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-10-06 16:52:35 -04:00
mpikulski
98b3af4d4e
[TRTLLM-8413][chore] resolve sampling defaults in OpenAI API backend (#8121)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-06 06:09:43 -07:00
Yan Chunwei
54ab9767b5
[None][chore] fix llmargs conflict (#8152)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-10-06 02:34:27 -07:00
Yan Chunwei
fb51de6c2e
[TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) (#5543)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: chunweiy <chunweiy@nvidia.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: chunweiy <328693+Superjomn@users.noreply.github.com>
2025-10-05 17:28:20 +08:00
Frida Hou
f6654f26a4
[#5255][autodeploy] Update FuseAllreduceResidualRMSNorm to use pattern matcher utility; remove fuse_collective (#7545)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-10-05 01:15:46 -07:00
Frida Hou
744246d316
[None][autodeploy] small refactors on attention matching (#8079)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-10-03 22:00:27 -07:00
Jonas Yang CN
88ea2c4ee9
[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-04 08:12:24 +08:00
Lucas Liebenwein
9d098e3142
[None][feat] AutoDeploy: graph/module inputs with kwargs instead of args (#8137)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-03 16:53:42 -07:00
Michal Guzek
38da871db3
[TRTLLM-6496][feat] Add LoRa Torch tests for the latest NIM model list (#6806)
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-10-03 12:10:48 -07:00
Mike Iovine
ca8291133a
[None][fix] Fix MTP 2-model (#8115)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-10-03 10:13:50 -07:00
Lucas Liebenwein
aaf2c3c2e5
[None][feat] AutoDeploy: compiler backends based on nn.Module (#8126)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-03 12:14:21 -04:00
Ziyi Xiong
7bc2d9e993
[https://nvbugs/5537878][fix] Reserve an extra slot for padded batch (#7998)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-03 08:42:52 -07:00
Suyog Gupta
d8215241d8
[None][feat] AutoDeploy add autotuning when capturing cudagraphs (#8120)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-03 08:33:21 -07:00
Aurelien Chartier
9db4366903
[None][fix] Fix Qwen3 FP8 per-tensor when requesting TRTLLM-GEN MoE backend (#8075)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-03 07:52:52 -07:00
Lucas Liebenwein
5faa5e9dd8
[None][feat] AutoDeploy: dive deeper into token generation bugs + enable_block_reuse (#8108)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-03 04:57:26 -07:00
Nikita Korobov
9b3d7cc3e6
[None][feat] Update TRT-LLM Gen MoE kernels (#7970)
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-10-03 09:22:45 +08:00
Yilin Fan
01423ac183
[None][feat] perf_metrics endpoint functionality improvement (#8005)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-02 17:43:25 -07:00
Grzegorz Kwasniewski
a5b59fd31d
[TRTLLM-6342][bug] Patched incorrect starcoder tp config (#8118)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-10-02 18:41:59 -04:00
Daniel Cámpora
ab433b7228
[None][fix] Fix access to new tokens in sampler. (#7958)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-02 15:41:21 -04:00
Patrice Castonguay
fefa7d8fa3
[None][feat] Support for cancelling requests with disaggregation (#8114)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-10-02 11:04:26 -07:00
dongfengy
6568e565db
[TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss (#7916)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-02 10:47:04 -07:00
yifeizhang-c
34d158b6da
[TRTLLM-6589][feat] Support CUDA graph for DeepEP (#7514)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-10-02 10:13:24 -07:00
mpikulski
fc7f78c400
[TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling (#8110)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-02 10:20:32 +02:00
Chang Liu
726ac07cc0
[https://nvbugs/5549081][fix] Fix device id assignment for some vision models (#8070)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-10-01 23:28:05 -04:00
brb-nv
bd3d0ad233
[TRTLLM-7733][feat] Executor changes to support helix parallelism (#7972)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-01 22:13:03 -04:00
Izzy Putterman
1ad7bc4c78
[None][feat] Draft: Save state first pass (#7012)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-10-01 18:40:55 -04:00
Frida Hou
de99e23696
[#5860][feat] Add ModelOPT INT4 awq fake quant support in AutoDeploy (#7770)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-10-01 13:13:45 -07:00
Yibin Li
d7581bb551
[TRTLLM-8031][feat] Add chunked return_generation_logits logic (#7831)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-10-01 12:47:07 -04:00
Grzegorz Kwasniewski
6fd225833c
[TRTLLM-6342][bug] Fix shape propagation after TP sharding (#7912)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-10-01 11:15:46 -04:00
sychen52
ba8abeab10
[OMNIML-2336][feat] add W4A8 NVFP4 FP8 fused moe (#7968)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-10-01 02:39:33 -04:00
peaceh-nv
808e556c79
[None][fix] : Fix OOM issue when dp padding is enabled (#8052)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-10-01 09:10:00 +08:00
brb-nv
84aa3c981e
[None][chore] Waive failing MNNVL alltoall multi-gpu test (#8106)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-09-30 20:05:42 -04:00
mpikulski
ee5ae49337
[TRTLLM-8269][fix] Revert "do not explicitly pass temperature=0 to select greedy sampling" (#8103)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-30 16:53:49 -04:00
Guoming Zhang
b4be0d2e4c
[None][chore] Refine qwen3-next implementation. (#8064)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-30 15:05:13 -04:00
Yiqing Yan
1560cca227
[None][chore] Bump version to 1.2.0rc1 (#8097)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-30 06:00:25 -04:00
Yechan Kim
948b8b9569
[None][fix] Fix CUDA graph for Qwen2.5-VL (#8047)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-09-30 14:40:03 +08:00
Kaiyu Xie
b0cb9ca50e
[None] [test] Add MNNVL AlltoAll tests to pre-merge (#7466)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-09-29 23:12:24 -04:00
Lucas Liebenwein
dcfd3ef81c
[#4593][feat] AutoDeploy: Linear Attention Support (SSM + causal_conv + Bamba + Nemotron-H) (#8068)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-09-29 22:41:06 -04:00
Cao Dong
62010c0ab7
[None][feat] Return topk logprobs in torch backend (#7976)
Signed-off-by: Cao Dong <87467313+dcaox@users.noreply.github.com>
2025-09-30 09:32:37 +08:00
Cheng Hang
cdce68c3e0
[TRTLLM-6741][fix] Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True (#7891)
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-30 09:24:35 +08:00
mpikulski
31a1a5ff80
[TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling (#7909)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-29 14:52:18 +01:00
bhsueh_NV
38d6e4e60b
[None][feat] Support Qwen3 next (#7892)
Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-29 21:16:07 +08:00
mpikulski
a0d489a8d5
[TRTLLM-7728][perf] improve batched sampling perf for contiguous batches (#7908)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-29 13:32:50 +01:00
Yiqing Yan
560ded5450
[None][chore] Bump version to 1.2.0rc0 (#7941)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-29 17:39:07 +08:00
Gal Hubara-Agam
b2095aa074
[#4674][bugfix] AutoDeploy Fix memory leak in fuse_moe (#7844)
Delete the unstacked weights immediately to save GPU memory. Cleanup occurs automatically after the transformation, but for large models we would run out of memory during the transformation itself.

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2025-09-29 11:01:07 +03:00
Void
7f1e2dba92
[None][fix] only support deepep post quant all2all on nvfp4 (#8041)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-29 14:37:50 +08:00
Tailing Yuan
985b79ca82
[TRTLLM-8348][feat] Speed up concat k and copy k_nope in context phase using torch.compile (#8044)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-09-29 13:28:12 +08:00
Eran Geva
9cea6bfb30
[#7288][feat] Added AutoDeploy backend support to test_perf.py (#7588)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-28 21:21:27 -07:00
Zongfei Jing
e9f26feeb6
[None][chore] Cherry-pick from (#7598) Make low_precision_combine as a llm arg (#7898)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-09-28 22:32:33 -04:00
Yukun He
28b9a81c58
[TRTLLM-4500][feat] Add serialization/deserialization options for AutoTuner profiling cache (#7738)
To achieve determinism for the AutoTuner profiling cache, serialization and deserialization are introduced to store the cache on disk in JSON format. Use TLLM_AUTOTUNER_CACHE_PATH to indicate the path where the cache file should be stored.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-29 07:40:51 +08:00
Guoming Zhang
3ba4bf6e70
[None][chore] Disable concurrent weights loading for _load_weights_im… (#8034)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-28 07:11:16 -04:00
ChristinaZ
95eac2cda7
[https://nvbugs/5537738][fix] Add fp8 post-quant allgather support (#8008)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-28 15:32:45 +08:00
Aurelien Chartier
77b68d9d7d
[https://nvbugs/5461712] [fix] Use DG for Qwen3 Linear layers (#8030)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-09-28 10:33:36 +08:00
Xianjie Qiao
c8f98b3065
[None] [feat] Update disagg gen-only benchmark. (#7917)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-09-28 09:56:56 +08:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00
Frida Hou
a36b48bcab
[#5860][autodeploy] GPT-OSS MXFP4 support (#7451)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-09-26 15:36:06 -07:00
Jhao-Ting Chen
c33f43e13a
[https://nvbugs/5518713][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) (#7856)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-09-26 14:29:32 -07:00
Mike Iovine
d7087015f1
[TRTLLM-8271][fix] Fix CDL overlap scheduling performance (#7971)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-26 16:05:10 -04:00
YueWeng
a4243f0da5
[TRTLLM-6393][feat] add static tree sampling and verification (#7161)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-09-26 13:16:16 -04:00
HuiGao-NV
f4d3be4bbc
[None][feat] Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow (#7669)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-26 07:28:06 -07:00
Tailing Yuan
b11ee868c5
[https://nvbugs/5495789][feat] Optionally disable server GC and worker GC (#7995)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-09-26 21:39:24 +08:00
HuiGao-NV
a9965d84e0
[None][chore] Report NCCL error message but not OOM when NCCL error happens (#8009)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-25 23:07:32 -07:00
peaceh-nv
55ce70060e
[https://nvbugs/5451740][fix] Add DP padding back on SM120 (#7965)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-09-26 13:59:54 +08:00
Lucas Liebenwein
3a96d75a3c
[https://nvbugs/5527956][fix] AutoDeploy: fix IMA due to outdated metadata (#8002)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-25 22:05:55 -07:00
sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference (#7363)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
Yuan Tong
fae83c387b
[#6102][fix] support non-system python installation (#7763)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-26 10:16:15 +08:00
Yanchao Lu
7e2521a7f0
[None][chore] Some clean-ups for CUDA 13.0 dependencies (#7979)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-26 08:46:11 +08:00
dongfengy
1eb653146a
[https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS (#7911)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-25 12:54:18 -07:00
QI JUN
1529a6f22d
[None][chore] extract weights loading related logic to model loader (#7579)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-25 10:19:22 -07:00
xxi
57ff5f4c0d
[None][fix] fix a bug in wideEp use DeepEP with num_chunks > 1 (#7954)
Signed-off-by: xxi <xxi@nvidia.com>
2025-09-25 07:53:42 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op (#6815)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Yueh-Ting (eop) Chen
c5012423f5
[None][chore] Remove developer name in comment (#7981)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-09-25 06:43:38 -07:00
Guoming Zhang
202bed4574 [None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
QI JUN
961418908c [https://nvbugs/5531963][fix] cherry pick #7725 (#7907)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
cb466a846d [None][fix] api stability bug in status label (#7861)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
9d48898def [None][doc] add stable label to all the un-labelled arguments in LLM class (#7863)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e [None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd [https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (#7717)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Tao Li @ NVIDIA
44d7c3b245 [https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files (#7813)
Signed-off-by: Tao Li
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Wanli Jiang
22b45ff9c7
[TRTLLM-7758][feat] Phi4-mm image modality inference optimization (#7918)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-25 15:58:29 +08:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine (#7927)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
Leslie Fang
342014069e
[None][chore] Validate features combination (#7630)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-25 08:01:13 +08:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" (#7969)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels (#7821)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Mike Iovine
42c2ec3239
[https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 (#7220)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. (#7874)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Macrocell
6e5e8b8a3b
[None][fix] fix get_iteration_stats IndexError (#7216)
Signed-off-by: yuhongwei <yumiao.yhw@antgroup.com>
Co-authored-by: yuhongwei <yumiao.yhw@antgroup.com>
2025-09-24 22:43:03 +08:00
Eran Geva
603517f72a
[#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
Necofish
cfbcf9b9e8
[None][feat] Support Seed-OSS model in pytorch backend (#7496)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-24 03:57:12 -07:00
Enwei Zhu
a1a57e83b8
[TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve (#7925)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-24 18:30:23 +08:00
JunyiXu-nv
6654b78c94
[https://nvbugs/5521799][fix] Trim incorrectly generated harmony messages (#7849)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-09-24 16:38:43 +08:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend (#7756)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request extends SWA KV cache functionality inside the KV
cache manager. Before this merge request, the KV cache for sliding
window attention (SWA) held only a "window size" number of blocks and
reused them cyclically. That design cannot make use of additional GPU
memory, which limits the maximum batch size and therefore throughput.
It also cannot support KV cache reuse.

In this MR, the manager instead writes blocks linearly. As the
attention window advances, the out-of-window (OOW) blocks are detached.
For now, to get a correct feature in place first, OOW blocks are
offloaded directly from the primary block pool (GPU memory) to the
secondary block pool (host memory). We will improve this in the future
by delegating the block movement to the eviction policy.

KV cache reuse for SWA is not developed in this merge request and will
be amended in a follow-up merge request.

With linear writing, the maximum number of blocks allocated for a
sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest`, which stores the cache block
bookkeeping structure, now keeps enough blocks for "max sequence
length" tokens.

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; this concept
  originally guarded block reuse in the KV cache manager.
- Add a detach mechanism under `KVCacheManager::addToken`.
  Note that detach is still guarded off for SWA when reuse is
  enabled. A follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter of
  `KVCacheManager`/`BlockManager`
- Give every window-size resource pool an identical proportion of memory
- Fix free memory calculation under `resource_manager.py`
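
The linear-write and detach behavior described above can be sketched
roughly as follows. This is a toy model and not the actual
`KVCacheManager` code: the class `SequenceBlocks`, its fields, and the
rule for when a block counts as out of window are all assumptions made
for illustration, and the reuse guard mentioned above is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceBlocks:
    """Toy per-sequence block bookkeeping (hypothetical names throughout)."""
    window_size: int        # attention window, in tokens
    tokens_per_block: int
    num_tokens: int = 0
    primary: list = field(default_factory=list)    # block ids kept in GPU memory
    secondary: list = field(default_factory=list)  # block ids offloaded to host memory
    next_block_id: int = 0

    def add_token(self) -> None:
        # Linear writing: append a fresh block whenever the previous one is full,
        # instead of cycling back to an old block.
        if self.num_tokens % self.tokens_per_block == 0:
            self.primary.append(self.next_block_id)
            self.next_block_id += 1
        self.num_tokens += 1

        # Detach: block i covers tokens [i * tpb, (i + 1) * tpb). Assumption: it is
        # out of window once its last token lies before the window start.
        window_start = max(0, self.num_tokens - self.window_size)
        while (self.primary
               and (self.primary[0] + 1) * self.tokens_per_block <= window_start):
            self.secondary.append(self.primary.pop(0))


seq = SequenceBlocks(window_size=4, tokens_per_block=2)
for _ in range(8):
    seq.add_token()
```

After eight tokens with a window of four tokens and two tokens per
block, the two oldest blocks sit in the secondary list and the two
newest remain in the primary list.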

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Daniel Cámpora
5ccb2dea33
[None][chore] Make sampler type beta. (#7934)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 20:51:39 -07:00
Yuan Tong
70c3b100eb
[#7692][fix] recognize RequestError as per-request error in background handler (#7726)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 11:11:17 +08:00
Yuan Tong
f050b8d871
[None][fix] refine backend option handling for commands (#7829)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 10:54:33 +08:00
Ziyi Xiong
31ef03fd82
[https://nvbugs/5528405][fix] Set up draft_tokens before scheduling (#7903)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-24 09:56:17 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
Lizhi Zhou
7550251988
[TRTLLM-7182][test] add multi-nodes test for disagg-serving (#7470)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 08:31:56 +08:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Yilin Fan
7d4d6cc9e0
[TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve (cherry-pick) (#7776)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-09-23 09:39:47 -07:00
Daniel Cámpora
9f1d9b7b18
[None][feat] Use list instead of torch tensor for new tokens in update requests (#7730)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 10:40:08 -04:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ChristinaZ
dd5fb2857a
[None][fix] Re-add the import for allgather that was mistakenly removed. (#7920)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 03:09:48 -07:00
Yan Chunwei
3ba19b6ff1
[https://nvbugs/5532023][fix] executor with-statement bug (#7895)
Signed-off-by: chunweiy <chunweiy@nvidia.com>
2025-09-23 02:05:39 -07:00
Enwei Zhu
f882fb86db
[https://nvbugs/5367180][fix] Fix xgrammar import before loading tensorrt_llm binary (#7906)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 00:29:57 -07:00
Yan Chunwei
40820e6711
[None][fix] CHERRY-PICK trtllm-serve yaml loading (#7551) (#7897)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-23 14:56:52 +08:00
Pengbo Wang
5792464d37
[None][fix] Read eos_token_id from generation_config for kimi_k2 (#7120)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 10:47:03 +08:00
yunruis
126cd707e3
[None][opt] Add batch waiting when scheduling (#7416)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-09-23 10:27:37 +08:00
Chang Liu
998857bcde
[TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) (#7577)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-22 19:07:18 -07:00
jianweiwu
9da4203e2e
[None][feat] Add Tencent HunYuanDenseV1 model support (#7081)
Signed-off-by: sorenwu <sorenwu@tencent.com>
Signed-off-by: jianweiwu <sorenwu@tencent.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 09:27:29 +08:00
Tailing Yuan
740340dd17
[https://nvbugs/5522847][fix] Disable GC on disagg server and client (#7858)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-09-23 09:16:55 +08:00
Enwei Zhu
8330d5363a
[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 09:10:09 +08:00
xxi
d471655242
[TRTLLM-7831][feat] Cherry-pick from #7423 Support fp8 block wide ep cherry pick (#7712)
2025-09-23 08:41:38 +08:00
Enwei Zhu
59f57598a7
[https://nvbugs/5504086][fix] Fix MTP vanilla (#7904)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 08:38:28 +08:00
ChristinaZ
be576a3152
[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 08:24:21 +08:00
Jin Li
b5391b4ac6
[https://nvbugs/5516665][fix] Fix CUTLASS moe fake impl errors (#7714)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-22 11:08:39 -07:00
Wanli Jiang
2a30f11d63
[None][chore] Upgrade transformers to 4.56.0 (#7523)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 22:20:16 +08:00
Yechan Kim
f77aca9f2c
[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-09-22 03:40:02 -07:00
HuiGao-NV
0dac1ddb74
[https://nvbugs/5525849][fix] Cherry-pick to fix mismatch of max seq len between kv cache manager and dummy requests (#7855)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-22 18:07:47 +08:00
Yukun He
ab26d21620 [https://nvbugs/5517023][fix] Pass allreduce strategy and force NCCL on pre-Blackwell arch (#7768)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yan Chunwei
ba2864a2c6 [None][doc] Enhance api reference doc by labeling stable APIs (#7751)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yi Zhang
f9c9c3f50a [https://nvbugs/5355219][fix] Fix trtllm moe backend test config and Qwen3 MoE multi node (#7724)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
HuiGao-NV
af34c9713a [https://nvbugs/5474169][fix] seq_len mismatch between kv cache manager and graph attn metadata (#7606)
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yukun He
3cc16c2438 [https://nvbugs/5496960][fix] Fix Gemma model forward. (#7509)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yuxian Qiu
2d46dda6a7 [https://nvbugs/5448754][fix] Download HF model for all nodes. (#6824)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
HuiGao-NV
123f5cbbf0 [https://nvbugs/5474169][fix]Adjust max seq len for kvcache for memory estimation (#7391)
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Bo Li
a15f08db3d [https://nvbugs/5467548][fix] DeepSeek illegal memory access. (#7298)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Stefan Niebler
8aead224fb
[https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow (#7808)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
2025-09-21 22:15:18 -07:00
dongxuy04
b057fc9593
[None][fix] cherrypick to main: Fix possible mpi broadcast and gather issue on large object (#7854)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-09-22 10:17:23 +08:00
Enwei Zhu
639d4109a7
[None][fix] Disable torch.compile for CapturableGuidedDecoder (#7871)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 10:04:30 +08:00
dongxuy04
9eb8084ca9
[TRTLLM-7008][fix] cherrypick to main Add automatic shared memory delete if already exist (#7727)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-09-21 11:01:51 -07:00
Ziyi Xiong
897c4dd23b
[https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec (#7728)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-21 08:20:48 +08:00
Yan Chunwei
4509d97780
[TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reusing (#7840)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-20 06:24:22 -07:00
Grzegorz Kwasniewski
8adaf0bb78
[TRTLLM-6342][feat] Support for partial sharding from factory (#7393)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-09-19 09:07:42 -07:00
Matthias Jouanneaux
1be7faef37
[TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels (#6904)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-19 20:55:32 +08:00
Liao Lanyu
18095a7cb8
[https://nvbugs/5503440][fix] Fix potential hang due to wrong type of ZMQ socket and protocol for worker_init_status_queue (#7646)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-09-19 18:13:33 +08:00
Gabriel Wu
0e72e8f7e6
[None][feat] Support EPLB in Qwen3 MoE (#7443)
Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-19 16:45:35 +08:00
QI JUN
f1b362faac
[None][chore] polish error message in cute_dsl_utils.py (#7852)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-19 12:05:11 +08:00
HuiGao-NV
a6370fd143
[https://nvbugs/5481434][feat] cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-19 10:25:21 +08:00
Yuxian Qiu
d6ebcf7c4a
[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) (#7610)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 09:40:49 +08:00
Ziyi Xiong
420f0fbcf5
[https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda (#7790)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-19 08:11:29 +08:00
sunnyqgg
80dd8fe197
[TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle (#7001)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-18 12:05:36 -04:00
Li Min
d921fc3352
[TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-18 21:20:04 +08:00
bhsueh_NV
c65457db8a
[None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" (#7780)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-09-18 20:11:05 +08:00
Wanli Jiang
fe104dc20d
[TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm (#7723)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 17:37:16 +08:00
Stefan Niebler
a55251bf75
[None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod (#7732)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-09-18 10:30:50 +02:00
Wanli Jiang
a7ca0fff54
[TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend (#7207)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 16:26:20 +08:00
Leslie Fang
870cfcf9a0
[None][chore] Remove executor config in create_py_executor (#7599)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-18 14:24:58 +08:00
mpikulski
1c7f601265
[https://nvbugs/5508890][fix] gen. result cleanup when using PostprocWorker (#7771)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-18 14:01:18 +08:00
Li Min
14e455da3e
[None][fix] Fix CI issue for dsl pkg install (#7784)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-09-18 13:58:20 +08:00
Barry Kang
4f0e6b5f96
[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-09-18 13:51:48 +08:00
Ziyi Xiong
28469dbf27
[https://nvbugs/5523080][fix] Correct the batch index in device tensors (#7803)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-18 13:45:37 +08:00
Guoming Zhang
e0423bfaab
[https://nvbugs/5519544][fix] fix invalid expression for disabling pa… (#7806)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-18 12:54:52 +08:00
Yanchao Lu
f8e811d134
[None][chore] Version bump for 1.1.0rc6 (#7824)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-18 11:13:56 +08:00
Yukun He
cd80e0a7f1
[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. (#7529)
tile_tokens_dim depends directly on num_tokens, which is a dynamic shape during tuning and inference. When AutoTuner prepares dummy tensors with different num_tokens, it does not update tile_tokens_dim accordingly. The value stored in the AutoTuner cache is therefore misaligned, which causes many cache misses during inference and significantly hurts performance.

To avoid this issue, we move the calculation of tile_tokens_dim to right before kernel launch, so that its value always matches the num_tokens of the input tensor currently passed to the kernel runner.

In addition, tile_tokens_dim is calculated from the token count of the tuned bucket rather than the raw input token count. Only the buckets are tuned, so using the bucket's token count avoids unexpected misalignment between tile_tokens_dim and the token number.

This PR also removes the warmup requests with the extra input shapes, which are triggered in the CUDA graph warmup phase.
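
A rough sketch of the just-in-time calculation described above. The
helper names, the power-of-two bucketing, and the clamped
tokens-per-expert heuristic are assumptions chosen for illustration;
they are not taken from the PR.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def tuned_bucket(num_tokens: int) -> int:
    # Assumption: the tuner only profiles power-of-two token counts.
    return next_power_of_2(num_tokens)


def compute_tile_tokens_dim(num_tokens: int, top_k: int, num_experts: int) -> int:
    # Hypothetical heuristic: average tokens per expert for the tuned bucket,
    # rounded up to a power of two and clamped to [8, 64].
    bucket = tuned_bucket(num_tokens)
    tokens_per_expert = max(1, bucket * top_k // num_experts)
    return min(max(next_power_of_2(tokens_per_expert), 8), 64)


def launch_kernel(num_tokens: int, top_k: int, num_experts: int) -> dict:
    # tile_tokens_dim is derived here, right before the (stand-in) launch,
    # rather than being computed once and stored alongside cached tactics.
    tile_tokens_dim = compute_tile_tokens_dim(num_tokens, top_k, num_experts)
    return {"num_tokens": num_tokens, "tile_tokens_dim": tile_tokens_dim}


# Two raw token counts that fall into the same bucket get the same tile size.
a = launch_kernel(100, top_k=8, num_experts=32)
b = launch_kernel(128, top_k=8, num_experts=32)
```

Because the value is derived from the bucket, every input that maps to
the same bucket looks up the same tuned entry.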

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-18 10:58:52 +08:00
Lucas Liebenwein
39eb120b96
[#7308] [feat] AutoDeploy: graph-less transformers mode for HF (#7635)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-09-18 10:44:24 +08:00
Netanel Haber
a5cfc8368f
[https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) (#7796)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-09-17 21:27:01 -04:00
William Zhang
2614d71994
[TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-17 08:11:16 -07:00
Zhenhuan Chen
6983e8a00d
[https://nvbugs/5517260][fix] move scaffolding contrib module's import to subdirectory (#7758)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-09-17 11:36:33 +08:00
Kaiyu Xie
62042a9733
[TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Cheng Hang <chang@nvidia.com>
2025-09-17 09:41:32 +08:00
Yukun He
6313c9799c
[https://nvbugs/5488582][fix] Cherry-pick 7495: Avoid unexpected Triton recompilation in DG fused_moe (#7708)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-17 09:00:28 +08:00
Shiyu Li
8bdbb48264
[https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. (#7387)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-09-17 07:43:20 +08:00
HuiGao-NV
a49cfb3e68
[https://nvbugs/5516666][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding (#7737)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
Co-authored-by: Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-17 06:24:20 +08:00
Aurelien Chartier
471723bce1
[None][chore] Remove unused get_quant_scales methods (#7687)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-09-16 12:56:11 -07:00
Lucas Liebenwein
9befd1a72f
[None][chore] AutoDeploy: neat disablement of transforms in pipeline (#7736)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-16 23:31:48 +08:00
bhsueh_NV
8226ef23dc
Revert "[None][feat] support attention dp for qwen3 dense model" (#7765)
2025-09-16 19:09:04 +08:00
Kaiyu Xie
6eef19297f
[None] [chore] cherry pick changes on slurm scripts from release/1.1.0rc2 (#7750)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-09-16 16:07:13 +08:00
Li Min
b278d06481
[TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op (#7632)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-16 14:25:26 +08:00
Bo Li
3f4e160cba
[None][chore] Fix error when running trtllm-bench without cuda graph. (#7725)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-09-15 20:30:23 -07:00
Void
103b554734
[None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks (#7614)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-16 11:04:26 +08:00
Yanchao Lu
e5cead1eb9
[TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing (#7739)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-16 09:59:18 +08:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
Necofish
96f11b10ae
[None][feat] support attention dp for qwen3 dense model (#7618)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-16 09:33:22 +08:00
Ziyi Xiong
536e8776cd
[TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding (#7651)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-16 07:33:44 +08:00
Izzy Putterman
8097be7e9c
[None][feat] Eagle, use last hidden post norm (#7546)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-09-15 12:23:57 -04:00
jmydurant
7deefb3d2b
[TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-09-15 21:43:49 +08:00
Zheng Duan
24fc1f9acf
[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-15 07:26:01 -04:00
Wanli Jiang
e080294725
[TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" (#7722)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-15 17:19:44 +08:00
Wanli Jiang
fc9f4c9295
[TRTLLM-7918][feat] Support kvcache reuse for phi4mm (#7563)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-15 15:47:00 +08:00
DylanChen-NV
d5df0af017
[https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding (#7122)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-09-15 15:02:34 +08:00
Chang Liu
47e37755a3
[TRTLLM-6903][feat] Support chunked prefill for multimodal models (#6843)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-14 20:10:10 -07:00
Pengyun Lin
c2bc39af63
[TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend (#6097)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-09-12 15:32:34 +08:00
Chang Liu
3a9847eb84
[https://nvbugs/5498165][fix] fix permission error for config file lock (#7656)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-11 10:36:51 +08:00
Dom Brown
fc9d426589
[https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-09-10 18:30:48 +01:00
Leslie Fang
d219a4f225
[None][chore] remove executor config in kv cache creator (#7526)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-10 21:14:44 +08:00
Yiqing Yan
76c5e1a12f
[None][infra] Bump version to 1.1.0rc5 (#7668)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-10 16:06:54 +08:00
Kanghwan
758c22f832
[#7208][fix] Fix config type of MedusaConfig (#7320)
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
2025-09-09 23:25:17 -07:00
Frida Hou
bbb5ae3349
[#5861][autodeploy] Refactor: Quantization Transforms with Inheritance (#7227)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-09-10 13:00:06 +08:00
Zheyu Fu
c353ff342e
[None][feat] Make the should_use_spec_decode logic a bit smarter (#7112)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-10 12:53:59 +08:00
Chang Liu
faa2f46554
[TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next (#7349)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-09 14:51:36 -04:00
Jin Li
d49374bc45
[TRTLLM-7408][feat] Wrap MOE with custom op. (#7277)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-09 12:18:56 -04:00
Richard Huo
dcd110cfac
[None][chore] add TorchLlmArgs to the connector api (#7493)
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
2025-09-09 09:05:59 -04:00
NVJiangShao
cc7593987b
[https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. (#7615)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-09-09 08:58:15 -04:00
tomeras91
6e712dd1cc
[None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture (#7589)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-09-09 11:42:22 +03:00
Linda
9cb5410067
[https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp (#7449)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-09-09 10:26:17 +02:00
William Zhang
c53d1814a7
[None][feat] Extend VLM factory and add Mistral3 factory (#7583)
This commit:

* extends existing factory interfaces to enable Mistral3 in AutoDeploy.
* adds a Mistral3 VLM factory.
* adds various model patches for pixtral (the vision model) and mistral3
  to make the VLM export compliant.
* adjusts checkpoint loading code to take possible parameter name
  conversions into account.
* fixes a sampling bug (the `end_id` needs to be taken into account when
  sampling, but it is not included in the stop words' token IDs).

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-09 02:47:18 -04:00
Guoming Zhang
f53fb4c803
[TRTLLM-5930][doc] 1.0 Documentation. (#6696)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-09 12:16:03 +08:00
zhanghaotong
96af324ff1
[None][fix] Add try-catch in stream generator (#7467)
Signed-off-by: Zhang Haotong <zhanghaotong.zht@antgroup.com>
Co-authored-by: Zhang Haotong <zhanghaotong.zht@antgroup.com>
2025-09-08 16:09:26 -04:00
Chuang Zhu
77657a1c12
[TRTLLM-7361][feat] KV cache transfer for uneven pp (#7117)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-08 13:37:46 -04:00
Leslie Fang
3e0073e86b
[None][chore] remove executor config in instantiate sampler (#7516)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-08 09:02:40 -07:00
Eran Geva
5f2a42b3df
[TRTLLM-6142][feat] AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-08 08:45:58 -04:00
Chang Liu
4a1e13897f
[None][feat] Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-08 07:42:46 -04:00
dominicshanshan
c9dca69e1b
[None][chore] Mass integration of release/1.0 - 3rd (#7519)
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Nave Assaf <55059536+Naveassaf@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: yifeizhang-c <219273404+yifeizhang-c@users.noreply.github.com>
Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: milesial <milesial@users.noreply.github.com>
Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-09-08 14:03:04 +08:00
JunyiXu-nv
504bb7ffa9
[TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API (#7508)
Signed-off-by: Junyi Xu 
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-09-08 11:11:35 +08:00
Yan Chunwei
205c3a144c
[None][chore] expose tokens_per_block into KvCacheConfig (#5911)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-07 21:14:10 -04:00
Netanel Haber
0fee8cd028
[TRTLLM-7153] [feat] Move stop_criteria to sample_async (#7041)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-09-07 17:36:49 +03:00
Raayan Dhar
bae9560e62
[https://nvbugs/5448767][fix] sync termination of requests across PP ranks (#7455)
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-07 08:45:49 -04:00
Mike Iovine
45390402fc
[https://nvbugs/5502352][fix] Fix 2-model CDL path (#7543)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-06 23:53:27 -04:00
Chang Liu
99b98f1374
[TRTLLM-7440][fix] Split fused_input_embed to separate out host sync (#7280)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-06 23:11:39 -04:00
Chang Liu
23500b55c3
[TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse (#7106)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-06 17:58:32 -04:00
QI JUN
12ecb864c2
[None][chore] share input_ids buffers among different cuda graphs (#7236)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-06 17:49:42 -04:00
Anthony Chang
12c66f7610
[None][fix] DeepSeek-R1 W4A8 weight loading issue; fixes regression from #6200 (#7123)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-09-07 00:04:56 +08:00
Lucas Liebenwein
74105a45d9
[#6120][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-05 22:10:48 -04:00
Leslie Fang
9eb3911470
[None][chore] Remove executor_config in create_py_executor_instance (#7463)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-05 20:56:03 +08:00
Robin Kobus
a95d9616ba
[#6186][feat] Introduce QKNormRoPEAttention module (#6830)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-09-05 14:04:41 +02:00
Jin Li
2189a2f3ff
[https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… (#7441)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-05 10:56:21 +08:00
Naveenraj Kamalakannan
58d1036bb1
[#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
2025-09-04 19:46:49 -07:00
Shunkangz
bddf183e15
[None][feat] Add Request specific exception (#6931)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-09-04 18:43:42 -04:00
Rashid Kaleem
89889fb526
[https://nvbugs/5369366] [fix] Report failing requests (#7060)
Signed-off-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
2025-09-04 12:56:23 -07:00
Chang Liu
08a0e06621
[TRTLLM-7410][feat] Support hashing and KV cache reuse for videos (#7360)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-04 14:39:23 -04:00
sychen52
98a1bffb7c
[OMNIML-2336][feat] Add NVFP4 x FP8 (#6809)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-04 09:03:38 -07:00
Enwei Zhu
1745102e72
[TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-04 23:30:14 +08:00