Commit Graph

988 Commits

Author SHA1 Message Date
Yilin Fan
d643aef73c
[Perf] Improve Llama4 performance for small max_seqlen cases (#6306)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-08-09 02:58:31 -04:00
Ye Zhang
bcf5ec0c9a
[None][feat] Core Metrics Implementation (#5785)
Signed-off-by: Ye Zhang <zhysishu@gmail.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-08-09 02:48:53 -04:00
Yibin Li
97787883c3
[TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-08-08 21:40:48 -04:00
dongfengy
d06675071e
[None][fix] WAR GPT OSS on H20 with Triton MOE (#6721)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-08 19:47:09 -04:00
Mike Iovine
90145cf557
[None][feat] Optimize CUDA graph memory usage for spec decode cases (#6718)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-08 13:56:53 -04:00
Wanli Jiang
d45236b253
[TRTLLM-6308][feat] Support Aggregate mode for phi4-mm (#6184)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-08 20:09:26 +08:00
Stefan Niebler
b8f036f264
[TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding (#6665)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-08-08 14:00:33 +02:00
Liao Lanyu
32ad7f3c12
[None][fix] Remove lock related typo in py_executor (#6653)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-08-08 17:48:57 +08:00
JunyiXu-nv
5f45227a93
[https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend (#6690)
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-08 17:48:23 +08:00
Yuxian Qiu
9ff4e75f14
[None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout (#6654)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-08 17:11:41 +08:00
Li Min
d913955952
[TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell (#6616)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-08-08 15:03:48 +08:00
2ez4bz
064eb7a70f
[TRTLLM-5252][fix] Propagate mapping to intermediate layers (#6611)
This commit propagates the mapping to intermediate layers to enable
tensor parallelism (amongst other things) in them.

It also fixes issues with a unit test for TP for pixtral, and adds it to a
test list.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-08-08 01:50:36 -04:00
Enwei Zhu
aee828d98a
[TRTLLM-6854][feat] Enable guided decoding with disagg serving (#6704)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-08 12:10:36 +08:00
zhanghaotong
1cf669496a
[None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference (#6626)
Signed-off-by: 皓聪 <zhanghaotong.zht@alibaba-inc.com>
Co-authored-by: 皓聪 <zhanghaotong.zht@alibaba-inc.com>
2025-08-07 23:44:47 -04:00
NVJiangShao
2f2f5cc72c
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE (#6231)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-08 11:13:42 +08:00
Daniel Cámpora
efca359b66
[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-07 22:19:37 -04:00
Iman Tabrizian
82276167e6
[None][feat] Add NCCL Symmetric Integration for All Reduce (#4500)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-07 17:28:14 -07:00
Haohang Huang
980929e1a9
[https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave (#6708)
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-08-07 15:30:16 -07:00
Yuan Tong
db8dc97b7b
[None][fix] Migrate to new cuda binding package name (#6700)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-07 16:29:55 -04:00
Mike Iovine
e968f98b43
[None][feat] Clean up ngram auto mode, add max_concurrency to configs (#6676)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-07 12:51:47 -04:00
Emma Qiao
3c44b44e45
[None][infra] Fix guardwords (#6711)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-07 21:06:47 +08:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
Enwei Zhu
1b9781e8e7
[TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-07 05:53:48 -04:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss (#6645)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
Yiqing Yan
5fa1914cab
[None][chore] Bump version to 1.1.0rc0 (#6651)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-07 13:39:49 +08:00
Izzy Putterman
7e0158b583
Qwen3: Fix eagle hidden states (#6199)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-08-06 17:05:18 -04:00
Hanjun Cho
80f918cc22
[None][feat] Add Qwen3 MoE support to TensorRT backend (#6470)
Signed-off-by: gkswns0531 <gkswns0531@gmail.com>
Signed-off-by: hanjuncho <gkswns0531@gmail.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-08-06 17:02:35 +08:00
Zongfei Jing
0ff8df95b7
[https://nvbugs/5433581][fix] DeepGEMM installation on SBSA (#6588)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-06 16:44:21 +08:00
Netanel Haber
83ee91e17b
[None][fix] Fix 6522 mpi.pkl5.intracomm.Request has wait not Wait (#6646)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-08-06 14:18:09 +08:00
JunyiXu-nv
13e0214fe0
[TRTLLM-6263][feat] Enable fp8 SwiGLU to minimize host overhead (#6540)
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-06 10:42:19 +08:00
brb-nv
9a01934dbf
[None][feat] Switch to internal version of MMProjector in Gemma3 (#6572)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-05 21:48:23 -04:00
yunruis
3ff4f503ad
[None][opt] ADP schedule balance optimization (#6061)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-08-06 09:38:02 +08:00
Yechan Kim
c17f4984e2
[None][feat] Refactor Llava-Next (#6478)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-05 17:53:53 -07:00
Aurelien Chartier
6da95f29a9
[None][feat] Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-08-05 11:22:32 -07:00
Wanli Jiang
46df8712c8
[https://nvbugs/5355007][fix] Set enable_chunked_context as True by default in trtllm bench (#6582)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-05 11:11:36 -07:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling (#5170)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
liji-nv
dcbfa7e509
[https://nvbugs/5252313][fix] Fix torch compile + MTP (#6554)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-05 10:31:29 -04:00
Venky
61da2daeb4
[TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system (#6464)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-08-05 07:14:24 -07:00
Pengbo Wang @ NVIDIA
c289880afb
[None][fix] fix kimi k2 serving and add test for Kimi-K2 (#6589)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-08-05 18:05:33 +08:00
amitz-nv
dc84695520
[TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-05 11:28:26 +03:00
danielafrimi
ed801ff74b
[None][fix] Remove expand configuration from mamba2 mixer (#6521)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-08-05 04:18:25 -04:00
Haohang Huang
c9eebcb454
[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Signed-off-by: symphonylyh <31998628+symphonylyh@users.noreply.github.com>
2025-08-05 07:47:41 +00:00
kris1025
6a3a921284
[TRTLLM-6685][feat] Add speculative metrics for trt llm bench (#6476)
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-08-04 15:22:57 -07:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow (#5997)
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
brb-nv
87e4e9f468
[None][chore] Add unit test for Gemma3 lora (#6560)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-04 04:56:57 -04:00
Yiqing Yan
3916dbd98b
[None][chore] Bump version to 1.0.0rc6 (#6597)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-04 04:39:15 -04:00
Pengyun Lin
a15e33351d
[None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens (#6259)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-08-04 15:09:51 +08:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory (#5034)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Yechan Kim
ee6ab5be96
chore: add EXAONE4 accuracy test (#6397)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-04 10:14:16 +08:00