Chenghao Zhang
ddf2d010e2
[TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear ( #8820 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-06 11:00:10 -08:00
DylanChen-NV
b275635a9a
[ https://nvbugs/5498478 ][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill ( #8910 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-11-06 07:41:21 -08:00
shuyixiong
c73efe12e7
[None][chore] Use cached model in all ray tests ( #8962 )
...
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-11-06 15:14:15 +01:00
yunruis
51545560da
[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation ( #8495 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-11-06 17:39:57 +08:00
Lucas Liebenwein
7a552c450a
[ https://nvbugs/5606166 ][fix] AutoDeploy: unwaive test for use tuples for cudagraph shape lookup ( #8957 )
...
also updated test waive for another nvbug
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-05 16:27:00 -08:00
Lucas Liebenwein
b181568d6f
[TRTLLM-8201][feat] Nemotron H MoE Sharding ( #8744 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-05 12:35:29 -08:00
Chang Liu
e57d83c5dc
[TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt ( #8771 )
2025-11-05 07:57:09 -08:00
Emma Qiao
31116825b3
[None][infra] Waive failed cases on main 11/05 ( #8936 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-11-04 22:54:45 -08:00
Shiyu Li
eeb56c2848
[None][feat] MNNVLAllreduce Kernel Refactor ( #8018 )
...
Signed-off-by: Shiyu Li <timlee0212@outlook.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-11-05 08:49:47 +08:00
Frida Hou
11ded113cd
[ #8389 ][fix] Update group attention matching to first map to custom torch attention ( #8638 )
...
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-11-04 12:00:43 -08:00
shuyixiong
70e4d72ffa
[TRTLLM-8511][feat] Add update_weights and sleep_wakeup support for rl integration ( #8302 )
...
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
Co-authored-by: Liwei Ma <liweim@nvidia.com>
Co-authored-by: Jonas Yang CN <joyang@nvidia.com>
2025-11-04 10:19:24 -08:00
Bo Li
e4bf29bc66
[None][feat] Integrate MnnvlThroughput into TRTLLM MoE. ( #8728 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-04 21:36:29 +08:00
Zhanrui Sun
4de31bece2
[TRTLLM-8994][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 ( #8838 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-04 18:59:34 +08:00
CarstyYou
4296c9553d
[TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 ( #8844 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-04 18:10:36 +08:00
xiweny
ce23e24123
[ https://nvbugs/5565565 ] [fix] Remove waiver ( #8450 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Yukun He
2225745782
[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising ( #7870 )
...
Because we have encountered some perf regression due to using a one-shot kernel instead of NCCL on A100/H100, it will be beneficial if we can have a solid benchmarking of allreduce Op and analyze the data collected from it.
Implemented new AllreduceOp heuristic:
- Added Linear programming-based heuristic implementation.
- Added LUT-based heuristic implementation and corresponding code generation script.
AllreduceOp minor fixing:
- Fixed a minor issue in AllreduceOp, that the strategy can not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up Dispatching code in AllReduceOp.
This PR will fix the perf gaps reported in:
https://nvbugspro.nvidia.com/bug/5517023
For Deepseek-R1, it shows a performance gain of about 3-4% in concurrency levels of 256 and 512.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Yechan Kim
67208f1512
[None][fix] InputProcessor config naming convention fix ( #8705 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-03 22:29:21 -08:00
Matthias Jouanneaux
d0f107e4dd
[TRTLLM-5966][feat] Helix: add full MLA support for Helix ( #8104 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-11-04 09:06:58 +08:00
Yechan Kim
f48968b6cc
[TRTLLM-6928][fix] Refactor multimodal unittest ( #8453 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-03 06:01:07 -08:00
Chang Liu
f57dc01e6f
[ https://nvbugs/5625380 ][chore] Remove multimodal related fields from decoder llm input ( #8846 )
2025-11-02 17:44:08 -08:00
QI JUN
89e0117097
[TRTLLM-8836][chore] Create ModelEngine from LlmArgs ( #8600 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-11-01 05:26:06 -07:00
Fanrong Li
f0dc746738
[TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache ( #8692 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-31 14:38:31 -07:00
dongfengy
0edba5a7e2
[ https://nvbugs/5474119 ][fix] Re-enable test ( #8809 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-10-31 10:17:58 -07:00
Patrice Castonguay
afa75c9494
[ https://nvbugs/5614506 ][chore] Adding e+p+d e2e test ( #8801 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-10-31 09:52:42 -07:00
HuiGao-NV
1a338e1a05
[None][chore] use cached vila model ( #8788 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-10-30 20:26:45 -07:00
Anthony Chang
f666ad2f6b
[None][feat] Autotuner can iterate through all tactics for test purposes ( #8663 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-30 13:11:25 +01:00
ChristinaZ
13cfd70f57
[None][feat] Add unit tests and revision in block_level kernel for invalid input ( #8718 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-30 16:42:18 +08:00
Iman Tabrizian
ae6875fe10
[TRTLLM-8976][feat] Move indexer-k-cache to KVCacheManager ( #8699 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-10-29 08:04:26 -07:00
kris1025
e2c5a38879
[ https://nvbugs/5534574 ][fix] disable spec decoding forever once the request spec decoding is disabled ( #8446 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-29 19:28:43 +08:00
Stefan Niebler
19ca7b15c7
[ https://nvbugs/5593199 ][test] Enhance beam search tests deterministic dummy model ( #8625 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-10-29 06:12:22 +01:00
Chang Liu
5f737b8dbe
[None][perf] Use fp8 quant kernel in DS3.2 indexer module ( #8701 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-10-29 12:45:09 +08:00
Cheng Hang
15c293a90b
[None][feat] Enable nvfp4 cuda core for sm120 ( #8620 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-10-29 12:39:03 +08:00
Kaiyu Xie
227c288441
[TRTLLM-8827] [feat] Enable low precision alltoall for Cutlass and TRTLLMGen backends ( #8675 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-10-29 07:56:48 +08:00
Mike Iovine
00161b315f
[ https://nvbugs/5549111 ][fix] Fix 2-model overlap scheduler accuracy on very long prompts ( #8076 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Michael Iovine <miovine@nvidia.com>
2025-10-28 14:55:34 -07:00
Lucas Liebenwein
0ee71d95ec
[ https://nvbugs/5606166 ][fix] AutoDeploy: use tuples for cudagraph shape lookup ( #8658 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-28 10:52:43 -07:00
gramnarayan
88b0fbc8ff
[ #8245 ][feat] Autodeploy: Guided Decoding Support ( #8551 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-28 09:29:57 +08:00
Yechan Kim
a6017f6266
[ https://nvbugs/5608723 ][fix] Use local data on multimodal tests and unwaive tests ( #8673 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-28 09:20:02 +09:00
Aurelien Chartier
1401a3c09c
[None][feat] Add FP8 rowwise GEMMs for B200 ( #8332 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-27 16:33:14 -04:00
Bo Li
9c4432f8a4
[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. ( #7499 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-27 13:23:06 -04:00
Chenghao Zhang
b9b2802599
[None][feat] Autodeploy: Update the ssm to use slice ( #8667 )
...
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
2025-10-27 09:45:20 -07:00
mpikulski
7c8ba71b49
[TRTLLM-8832][feat] fully async _select_generated_logits with tests ( #8628 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-27 16:15:32 +01:00
Kaiyu Xie
c9b08790c2
[None] [test] Add MNNVL AlltoAll tests to pre-merge ( #8601 )
2025-10-27 21:39:44 +08:00
Chenghao Zhang
a6d20f6f9b
[None][feat] AutoDeploy: Add FP8 MOE for Nemotron ( #8599 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-10-25 15:26:45 -04:00
Erin
812bc8c954
[TRTLLM-8513][feat] Add back worker extension ( #8482 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-24 20:30:28 -04:00
Chang Liu
e47c787dd7
[TRTLLM-8535][feat] Support DeepSeek V3.2 with FP8 + BF16 KV cache/NVFP4 + BF16 KV cache ( #8405 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-24 13:40:41 -04:00
Yechan Kim
2d86d6be40
[TRTLLM-8737][feat] Support media_io_kwargs on trtllm-serve ( #8528 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-24 12:53:40 -04:00
Shijie
928247a3f9
[ https://nvbugs/5451205 ][feat] Add cuBLASLt NVFP4 GEMM backend support ( #7943 )
...
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
2025-10-23 15:55:10 +08:00
Suyog Gupta
2956978da3
[None][feat] Enable rms norm fusion for Nemotron MOE ( #8563 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-23 00:09:42 -04:00
Lucas Liebenwein
77fa5dfee9
[ https://nvbugs/5604136 ][fix] AutoDeploy: correct import for mxfp4_moe unit test ( #8593 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-22 22:11:18 -04:00
sunnyqgg
ea3e0eea51
[TRTLLM-7954][feat] Target model KV cache rellocation ( #8421 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-23 09:36:50 +08:00