Robin Kobus
dd11e08d26
[ #6187 ][feat] add LayerNorm module ( #6625 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:43:30 +02:00
Jhao-Ting Chen
a060e12041
[ https://nvbugs/5438869 ][fix] Set nvfp4 expert w1 w3 weight scale to the same value if they're not ( #6656 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-08-12 20:47:10 +08:00
Sergey Klevtsov
27fc35175e
[None][feat] CUTLASS MoE FC2+Finalize fusion ( #3294 )
...
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-12 15:56:48 +08:00
Jinyang Yuan
ead89a0e40
[None][perf] Improve the performance of online EPLB on Hopper by better overlapping ( #6624 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-12 09:25:13 +08:00
rakib-hasan
7ab8112450
[None][fix] Refactoring to avoid circular import when importing torch models ( #6720 )
...
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-08-11 18:00:42 -04:00
shaharmor98
14b36e07d7
[TRTLLM-6174][feat] Enable FP32 mamba ssm cache ( #6574 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-08-10 16:27:51 -04:00
dongfengy
d06675071e
[None][fix] WAR GPT OSS on H20 with Triton MOE ( #6721 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-08 19:47:09 -04:00
JunyiXu-nv
5f45227a93
[ https://nvbugs/5437106 ][fix] Fix llama4 scout TRTLLM attn_backend ( #6690 )
...
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-08 17:48:23 +08:00
Li Min
d913955952
[TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell ( #6616 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-08-08 15:03:48 +08:00
NVJiangShao
2f2f5cc72c
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE ( #6231 )
...
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-08 11:13:42 +08:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss ( #6645 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
Zongfei Jing
0ff8df95b7
[ https://nvbugs/5433581 ][fix] DeepGEMM installation on SBSA ( #6588 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-06 16:44:21 +08:00
JunyiXu-nv
13e0214fe0
[TRTLLM-6263][feat] Enable fp8 SwiGLU to minimize host overhead ( #6540 )
...
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-06 10:42:19 +08:00
Aurelien Chartier
6da95f29a9
[None][feat] Add support for fused gate_up_proj scales for FP8 blockwise ( #6496 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-08-05 11:22:32 -07:00
danielafrimi
ed801ff74b
[None][fix] Remove expand configuration from mamba2 mixer ( #6521 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-08-05 04:18:25 -04:00
brb-nv
7447d6ed85
[TRTLLM-6657][feat] Add LoRA support for Gemma3 ( #6371 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-01 09:19:54 -04:00
liji-nv
1daa8c3232
[ https://nvbugs/5340941 ][ https://nvbugs/5375785 ] - fix: Wrap attentio… ( #6355 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-01 07:38:06 -04:00
Zongfei Jing
7bb0a78631
Deepseek R1 FP8 Support on Blackwell ( #6486 )
...
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-01 10:26:28 +08:00
Jinyang Yuan
a427f5bece
[fix] Fix wide EP when using DeepEP with online EPLB ( #6429 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-30 00:13:18 -04:00
peaceh-nv
5b420ad267
Rename layer to comply with deepseek ( #6393 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-30 10:00:48 +08:00
Jinyang Yuan
97f7e12588
[fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency ( #6288 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-28 01:37:11 -04:00
Void
f172face98
DeepEP LL dispatch FP4 ( #6296 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-28 11:25:42 +08:00
liji-nv
3e0fb60e50
[TRTLLM-4279] feat: Multistream initial support for torch compile flow ( #5847 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-21 19:10:22 +08:00
Yuening Li
e8c068b4b1
[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow ( #5850 )
...
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-07-21 15:17:35 +08:00
Jinyang Yuan
88076eecd0
[fix] Fix can_use_alltoall in fused_moe_wide_ep.py ( #6173 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-21 10:53:07 +08:00
danielafrimi
5300a99bd8
W4A8 GEMM ( #6005 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
Void
118307c224
DeepEP LL support variable hidden size and tokens num ( #6141 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-20 09:32:41 +08:00
Bo Li
07e8813984
feat: Remove padding in attention DP. ( #6064 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-18 23:30:34 +08:00
Aurelien Chartier
812243bdd6
feat: add support for Modelopt fp8_pb_wo quantization scheme ( #6106 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-07-18 10:35:12 +08:00
yifeizhang-c
0155e7a3a1
[TRTLLM-6368] Update deepep dispatch API ( #6037 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-07-18 10:13:31 +08:00
chenfeiz0326
fe070a0168
test: Update Llama4 Scout FP4 & FP8 accuracy tests ( #5901 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-07-17 09:41:18 +08:00
Bo Li
fc2347eaf5
chore: Cleanup disable_fp4_allgather. ( #6006 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-16 17:54:36 +08:00
Xiaodong (Vincent) Huang
0523f77b36
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-con… ( #5684 )
...
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
2025-07-15 18:34:21 +03:00
Tailing Yuan
4a26bd6500
Fix: pad DeepEP fp4 recv tensors if empty ( #6048 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-15 23:14:01 +09:00
Lucas Liebenwein
e499f6c44a
[Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils ( #6026 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-07-15 14:31:35 +09:00
Zhenhuan Chen
30608a5e6d
[ https://nvbugs/5355316 ] fix: update torch.compile option to fix triton store_cubin error ( #5865 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-14 17:17:30 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
Void
854655f2f7
deepEP fp4 post quant all2all dispatch ( #5881 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-11 08:18:54 +08:00
CarstyYou
dc32f9ae73
[fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm ( #5531 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
brb-nv
3209b31665
feat: Custom masking utils for Gemma3 VLM ( #5853 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-10 06:18:04 +09:00
Wanli Jiang
3f7cedec7c
Update transformers to 4.53.0 ( #5747 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-09 09:32:24 -07:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
dongxuy04
dd3c736c7e
chore: some refactor on WideEP ( #5727 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-07-09 14:26:57 +08:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building ( #5534 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow ( #5615 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
Xianjie Qiao
089fd55eda
Add dummy all_reduce for kernel breakdown ( #5745 )
...
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-07-05 13:08:58 +09:00
Tailing Yuan
e134a52e07
Perf: reduce DeepEPLowLatency memory and time ( #5712 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-04 14:46:28 +08:00
Rashid Kaleem
2b0c87e613
[ModelLoad] Concurrent load model ( #5291 )
...
Signed-off-by: Rashid K <rkaleem@nvidia.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
2025-07-03 22:18:04 +08:00
tomeras91
7dbecf7272
[TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H ( #5646 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-03 11:07:51 +03:00