Jinyang Yuan
e761231c0b
[fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop ( #6053 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-16 00:25:32 +09:00
Daniel Stokes
dd2491f47d
fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse ( #4135 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-15 13:40:42 +12:00
Daniel Stokes
f277afdd93
perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend ( #5986 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-14 14:04:15 -07:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
Perkz Zheng
4a0b7a0cf1
[ https://nvbugspro.nvidia.com/bug/5355054 ] fallback to cubins for fp8 fmha kernels on Ada. ( #5779 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yi Zhang
9cc4e5d50e
[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation ( #5463 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access ( #5676 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
dongxuy04
c04570a506
Use huge page mapping for host accessible memory on GB200 ( #5963 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-07-14 16:11:04 +08:00
Enwei Zhu
ed77ef2ff4
fix: Fix MoE benchmark ( #5966 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-14 15:17:26 +09:00
Yuan Tong
a36ac45c4d
fix: fast redux detection in trtllm gen routing kernel ( #5941 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-13 16:35:07 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
ChristinaZ
c5fb692a7d
Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend ( #5771 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
Zhihan Jiang
682acd40da
[nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. ( #5698 ) ( #5925 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-11 07:51:43 +08:00
Linda
4d071eb2d1
feat: binding type build argument (pybind, nanobind) ( #5802 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-11 00:48:50 +09:00
narutolhy
41ef1ade19
feat:enable kvcache to be reused during request generation ( #4028 )
...
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
Jinyang Yuan
8b9a030a5c
[fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr ( #5900 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-10 20:07:32 +09:00
CarstyYou
dc32f9ae73
[fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm ( #5531 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
QI JUN
e289a98d5a
avoid nesting NCCL group in allgather and reduce scatter OPs ( #5866 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-10 12:32:59 +09:00
peaceh-nv
76c3a12bcb
[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 ( #5636 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-10 09:20:30 +08:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
peaceh-nv
52684d79f7
Fix : fix moe regression for sm120 ( #5823 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-09 21:25:11 +08:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner ( #5764 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
Jhao-Ting Chen
e4c777df7d
Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) ( #5813 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-09 09:26:27 +08:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell ( #5563 )
...
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat]Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Initial CI run failed a single step A30-CPP-3 due to timeout. Rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
Tailing Yuan
ba0aea1da6
Fix a quote error introduced in #5534 ( #5816 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-08 18:48:32 +08:00
xiweny
eaf8bec88b
fix: Disaggregate serving with attention DP ( #4993 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-07-08 16:15:03 +08:00
JieXin Liang
664bf95892
[fix] improve fp4_block_scale_moe_runner type check ( #5681 )
...
Signed-off-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
2025-07-08 14:32:14 +09:00
davidclark-nv
a1235ee978
[feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces ( #5743 )
...
Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-07-07 13:34:55 -07:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building ( #5534 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00
Daniel Cámpora
1260e2f33f
feat: Optimize TRTLLM Sampler perf single beam single step ( #5550 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-07-07 15:44:47 +02:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow ( #5615 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
ChristinaZ
12d8c7d129
Refactor the topk parallelization part for the routing kernels ( #5567 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-07 15:53:25 +08:00
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch ( #5535 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
Robin Kobus
ae27261094
refactor: decoding inputs ( #5679 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-06 08:21:02 +02:00
Julien Debache
6bddaf6df6
chore: Improve documentation of Kv_block_array ( #5765 )
...
Signed-off-by: Julien Debache <julien.debache@hotmail.com>
2025-07-05 22:25:27 +02:00
jthomson04
1b588f8390
feat: KV events for sliding window attention ( #5580 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-07-05 06:05:20 +08:00
Stefan Niebler
d1112aac37
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow ( #5333 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-05 01:35:13 +09:00
Chuang Zhu
ffc0b8f5da
Cache transceiver support VSWA ( #5505 )
...
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-05 01:18:42 +09:00
Faraz
81c0764012
Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 ( #5724 )
...
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
2025-07-04 16:53:20 +09:00
Robin Kobus
07f9cf1519
fix: Improve chunking test and skip empty kernel calls ( #5710 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-04 09:08:15 +02:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation ( #5476 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Netanel Haber
134b2383ff
[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window ( #5720 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-04 08:16:25 +02:00
Robin Kobus
1a3bd140ed
chore: Remove unused isFullContextRequest method ( #5666 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-03 15:08:09 +02:00
WeiHaocheng
dccbfc8b1e
fix: Set init value for moe expert id ( #5660 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-07-03 07:05:31 -04:00
Jhao-Ting Chen
77082cde38
[ https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op ( #5146 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
Robin Kobus
4cd8543d8c
[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest ( #5489 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-02 10:13:31 +02:00
qixiang-99
ca7b6ec8d8
Feat/pytorch vswa kvcachemanager ( #5151 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Xiaowei Wang
32dfdfba30
feat: fuse w4a8 moe pre-quant scale on Hopper ( #5613 )
...
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2025-07-01 23:02:41 -04:00