Iman Tabrizian
82276167e6
[None][feat] Add NCCL Symmetric Integration for All Reduce ( #4500 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-07 17:28:14 -07:00
Yuan Tong
db8dc97b7b
[None][fix] Migrate to new cuda binding package name ( #6700 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-07 16:29:55 -04:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events ( #6563 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
Enwei Zhu
1b9781e8e7
[TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) ( #6300 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-07 05:53:48 -04:00
peaceh-nv
8ec3b1de10
[None][feat] : Add FP8 context MLA support for SM120 ( #6059 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-08-07 16:16:34 +08:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss ( #6645 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter ( #6510 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
Chuang Zhu
ee471df07c
[None][chore] optimize kv cache transfer for context TEP and gen DEP ( #6657 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-07 11:36:05 +08:00
Zongfei Jing
0ff8df95b7
[ https://nvbugs/5433581 ][fix] DeepGEMM installation on SBSA ( #6588 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-06 16:44:21 +08:00
Ransiki
19b7524ff6
[None][feat] Add vLLM KV Pool support for XQA kernel ( #6013 )
...
Signed-off-by: Ransiki Zhang <ransikiz@nvidia.com>
2025-08-06 09:29:37 +08:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling ( #5170 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
Haohang Huang
c9eebcb454
[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec ( #6379 )
...
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Signed-off-by: symphonylyh <31998628+symphonylyh@users.noreply.github.com>
2025-08-05 07:47:41 +00:00
Chuang Zhu
4d040b50b7
[None][chore] ucx establish connection with zmq ( #6090 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-05 02:50:45 -04:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow ( #5997 )
...
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
Bruce-Lee-LY
8c82ee2803
[fix] xqa precision for fp16/bf16 kv cache ( #6573 )
...
Signed-off-by: Bruce-Lee-LY <yong-li14@tsinghua.org.cn>
Co-authored-by: Bruce-Lee-LY <yong-li14@tsinghua.org.cn>
2025-08-04 14:34:20 +08:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory ( #5034 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Perkz Zheng
03430ed379
[ https://nvbugspro.nvidia.com/bug/5415268 ] fix illegal smem access with chunked attention ( #6401 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-08-04 11:19:58 +08:00
Jhao-Ting Chen
6edaa23c1c
[None][feat] Multi-block mode for Hopper spec dec XQA kernel ( #4416 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-08-03 14:31:33 -07:00
Chuang Zhu
542f552d0b
use cudaSetDevice to create context ,fix nvbug 5394497 ( #6403 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-03 13:32:55 -04:00
Robin Kobus
918fedf952
[None][refactor] Simplify finish reasons handling in DecoderState ( #6524 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-02 07:17:43 +02:00
yunruis
a20ab5cbdb
[ https://nvbugs/5381276 ][fix] fix warning for fused_a_gemm ( #6402 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-08-01 09:37:21 -04:00
Yang Li
ac23f4a80d
[TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops ( #6515 )
...
Signed-off-by: Yang Li <56944310+yali-arch@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-08-01 15:59:09 +08:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler ( #6484 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
Yao Yao
942e080415
[fix] Fix missing fields in xqa kernel cache key ( #6282 )
...
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-08-01 10:41:26 +08:00
Jaedeok Kim
fbee279909
fix: remove duplicate layer multiplication in KV cache size calculation ( #6481 )
...
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-31 22:34:34 -04:00
Michal Guzek
b8719fe96d
[nvbug/5374773] chore: Update nanobind with fail_fast_on_attention_window_too_large changes ( #6491 )
...
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-07-31 20:25:29 +01:00
Enwei Zhu
4b299cb77e
feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 ( #6408 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-31 09:53:52 +08:00
pcastonguay
e7ae5e2824
feat: Add support for disaggregation with pp with pytorch backend ( #6369 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-07-30 09:42:13 -04:00
Zheng Duan
c9ed1ab436
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure ( #6135 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-30 10:39:40 +08:00
Yuan Tong
413a83ff80
fix: compatibility with CUDA < 12.9 on __CUDA_ARCH_SPECIFIC__ macro ( #5917 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-28 16:02:26 +08:00
Void
f172face98
DeepEP LL dispatch FP4 ( #6296 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-28 11:25:42 +08:00
Yukun He
93a0fd0a23
[TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. ( #6205 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-28 09:36:26 +08:00
Jhao-Ting Chen
54f68287fc
fix precompiled multi_query_token kernel not having is_fp8_out hash key ( #6279 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-25 20:45:53 -04:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache ( #5974 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
liji-nv
e07fff4f78
[ https://nvbugs/5340941 ] - fix: Correct custom ops used by Qwen3 Moe … ( #6285 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-25 14:49:45 +08:00
Linda
9a99e6d6d7
fix: integration tests with nanobind ( #6326 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-25 09:23:20 +08:00
Shiyu Li
375f74ecb2
[fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. ( #6237 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-07-25 08:01:40 +08:00
Bo Deng
ff72ca90de
Improve TransferAgentTest.SyncMessage ( #6250 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-24 23:41:36 +08:00
Perkz Zheng
706f421cb0
[Fix] the bug in the trtllm-gen heurisitcf for MLA kernels. ( #6284 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-07-24 23:40:27 +08:00
Zhenhua Wang
62298bc473
perf: customize cublastLt algo for Llamba 3.3 70B TP4 ( #6315 )
...
Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
2025-07-24 23:01:15 +08:00
Zhou Yuxin
0ffcf9a863
Update fmhaRunner.cpp to fix guardwords scan error ( #6327 )
...
Signed-off-by: Zhou Yuxin <yuxinz@nvidia.com>
2025-07-24 18:32:36 +08:00
Zhou Yuxin
fca13b8c95
hopper-style context MLA ( #5713 )
...
Signed-off-by: Yuxin <yuxinz@nvidia.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Rashid K <rkaleem@nvidia.com>
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
Signed-off-by: Po-Wei Wang (Vincent) <poweiw@nvidia.com>
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Clay <ccs96307@gmail.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Julien Debache <julien.debache@hotmail.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com>
Signed-off-by: Yegor Yershov <yegor6741@gmail.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: xsimmons <xsimmons@nvidia.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Signed-off-by: narutolhy <582909902@qq.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
Signed-off-by: William Tambellini <wtambellini@sdl.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
Co-authored-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
Co-authored-by: Zhenhuan Chen <chenzhh3671@gmail.com>
Co-authored-by: Po-Wei (Vincent) <poweiw@nvidia.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Clay <ccs96307@gmail.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Shunkangz <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Julien Debache <jdebache@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
Co-authored-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: DylanChen-NV <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: davidclark-nv <215764518+davidclark-nv@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Yegor <75512761+Wokzy@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Raayan Dhar <58057652+raayandhar@users.noreply.github.com>
Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: xavier-nvidia <xsimmons@nvidia.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com>
Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
Co-authored-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Co-authored-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: narutolhy <582909902@qq.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: wili <98001977+wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: Void <18275976+yilin-void@users.noreply.github.com>
Co-authored-by: William Tambellini <wtambellini@sdl.com>
2025-07-23 14:37:20 +08:00
Perkz Zheng
2193ad3aac
[ https://nvbugs/5387771 ] fix deadlocks due to insufficient numSemaphores ( #6262 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-07-23 11:20:55 +08:00
Linda
60073731ca
fix: bindings unit tests for nanobind ( #6221 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-22 14:51:43 +01:00
WeiHaocheng
fddb7f1141
feat: moe prepare support topk % 4 != 0 ( #5742 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-07-22 10:42:46 +08:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models ( #5444 )
...
Only supports qwen in this PR
2025-07-21 16:11:58 -07:00
Linda
3efad2e58c
feat: nanobind bindings ( #6185 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
Yuening Li
e8c068b4b1
[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow ( #5850 )
...
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-07-21 15:17:35 +08:00
danielafrimi
5300a99bd8
W4A8 GEMM ( #6005 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
amitz-nv
98428f330e
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction ( #5616 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-20 08:00:14 +03:00
Martin Marciniszyn Mehringer
943fd418dd
fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings ( #6189 )
...
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-07-20 10:38:51 +08:00
bhsueh_NV
2e14c8f443
[Fix][Chore][Qwen3] fix bug of using fp4 on sm120 ( #6065 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-20 10:25:25 +08:00
Void
118307c224
DeepEP LL support variable hidden size and tokens num ( #6141 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-20 09:32:41 +08:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support ( #6133 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
Bo Deng
0388ff9083
[ https://nvbugs/5393961 ][fix] record kv-cache size in MLACacheFormatter ( #6181 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-19 05:06:45 +08:00
Stefan Niebler
d475c97c82
[nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound ( #5926 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:54:51 +08:00
Stefan Niebler
6d7874a467
[nvbugs/5369799] fix: Update disaggregation handling in sampler ( #5762 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:40:46 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager ( #6055 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
QI JUN
a95f31e72a
chore: add more log in FmhaDispatcher ( #6170 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-18 16:53:02 +08:00
xavier-nvidia
200ea9ee81
fix TMA error with GEMM+AR on TP=2 ( #6075 )
...
Signed-off-by: Xavier Simmons <xsimmons@nvidia.com>
2025-07-18 10:26:08 +08:00
yifeizhang-c
0155e7a3a1
[TRTLLM-6368] Update deepep dispatch API ( #6037 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-07-18 10:13:31 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Daniel Stokes
ae28b3a664
feat: Add support for benchmarking individual gemms in MOE benchmark ( #6080 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-18 09:00:12 +12:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler ( #6000 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
ChristinaZ
7e033c392e
Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend ( #5919 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-17 12:38:29 +08:00
Shiyu Li
6e1aee6fd6
[fix] Performance Optimization for MNNVL TwoShot Kernel ( #5934 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-07-17 10:49:51 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations ( #5868 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
qsang-nv
8ef8e73002
update spec_dec ( #6079 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-07-16 17:50:43 +08:00
Tomer Shmilovich
0552a02943
BlockManager copy constructor fix ( #5982 )
...
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
2025-07-16 17:33:17 +08:00
Bo Deng
ec3ebae43e
[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 ( #5991 )
...
Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-16 13:54:42 +08:00
Zheng Duan
38db4bc7fb
feat: use session abstraction in data transceiver and cache formatter ( #5611 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-16 13:52:44 +08:00
Jinyang Yuan
e761231c0b
[fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop ( #6053 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-16 00:25:32 +09:00
Daniel Stokes
dd2491f47d
fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse ( #4135 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-15 13:40:42 +12:00
Daniel Stokes
f277afdd93
perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend ( #5986 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-14 14:04:15 -07:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
Perkz Zheng
4a0b7a0cf1
[ https://nvbugspro.nvidia.com/bug/5355054 ] fallback to cubins for fp8 fmha kernels on Ada. ( #5779 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yi Zhang
9cc4e5d50e
[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation ( #5463 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access ( #5676 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
dongxuy04
c04570a506
Use huge page mapping for host accessible memory on GB200 ( #5963 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-07-14 16:11:04 +08:00
Enwei Zhu
ed77ef2ff4
fix: Fix MoE benchmark ( #5966 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-14 15:17:26 +09:00
Yuan Tong
a36ac45c4d
fix: fast redux detection in trtllm gen routing kernel ( #5941 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-13 16:35:07 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
ChristinaZ
c5fb692a7d
Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend ( #5771 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
Zhihan Jiang
682acd40da
[nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. ( #5698 ) ( #5925 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-11 07:51:43 +08:00
Linda
4d071eb2d1
feat: binding type build argument (pybind, nanobind) ( #5802 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-11 00:48:50 +09:00
narutolhy
41ef1ade19
feat:enable kvcache to be reused during request generation ( #4028 )
...
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
Jinyang Yuan
8b9a030a5c
[fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr ( #5900 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-10 20:07:32 +09:00
CarstyYou
dc32f9ae73
[fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm ( #5531 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
QI JUN
e289a98d5a
avoid nesting NCCL group in allgather and reduce scatter OPs ( #5866 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-10 12:32:59 +09:00
peaceh-nv
76c3a12bcb
[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 ( #5636 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-10 09:20:30 +08:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
peaceh-nv
52684d79f7
Fix : fix moe regression for sm120 ( #5823 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-09 21:25:11 +08:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner ( #5764 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
Jhao-Ting Chen
e4c777df7d
Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) ( #5813 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-09 09:26:27 +08:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell ( #5563 )
...
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat]Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Initial CI run failed a single step A30-CPP-3 due to timeout. Rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
Tailing Yuan
ba0aea1da6
Fix a quote error introduced in #5534 ( #5816 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-08 18:48:32 +08:00