Xiwen Yu
f8864b9061
update trtllm gemm
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 23:56:24 +08:00
Xiwen Yu
2c3f4cbeee
Merge remote-tracking branch 'origin/main' into feat/b300_cu13
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 15:53:43 +08:00
sychen52
98a1bffb7c
[OMNIML-2336][feat] Add NVFP4 x FP8 (#6809)
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-04 09:03:38 -07:00
Xiwen Yu
5bd50d477e
update mha cubins and support 103a
...
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
2025-09-02 19:26:24 -07:00
Xiwen Yu
38ef850552
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_0901
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-01 11:46:44 +08:00
Tian Zheng
e257cb3533
[None][feat] Support NVFP4 KV Cache (#6244)
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 09:24:52 +08:00
Xiwen Yu
80ea0628d7
fix cubins
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-25 11:41:25 +08:00
Xiwen Yu
90a9bc463d
fix build error
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 23:03:32 +08:00
Xiwen Yu
808059da34
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 16:13:30 +08:00
Xiwen Yu
fa8b52ed33
fix more sm version check
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:59 +08:00
Xiwen Yu
f4de8840ec
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:48 +08:00
Xiwen Yu
5391191d7f
update tg cubins (temp ver)
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:14:38 +08:00
ChristinaZ
c7269ea93a
[https://nvbugs/5392414][fix] Add customized default routing method (#6818)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-08-21 16:58:41 +08:00
zhhuang-nv
7e135d2ea7
[None][feat] Use Separate QKV Input Layout for Context MLA (#6538)
...
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-08-19 22:04:48 +08:00
Xiwen Yu
4a95d88ce2
revert tlg kernels for ease of merge
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-19 11:44:36 +08:00
ChristinaZ
55f4f2d80c
[None][fix] Fix the macro name (#6983)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-08-18 03:08:32 -04:00
ChristinaZ
1e72721e8c
[None][feat] Add single-block version of the renormalized routing kernel (#6756)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-08-17 13:47:13 +08:00
Perkz Zheng
6037fe3716
[https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels (#6941)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-08-15 23:29:36 +08:00
Xiwen Yu
0bf6a18627
Fix and waive to clean L0
...
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
2025-08-15 04:37:43 -07:00
Perkz Zheng
11d89a3732
[https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue (#6896)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-08-15 08:51:04 +08:00
Perkz Zheng
58f7783ea4
[https://nvbugs/5394685][fix] the bug with spec-decoding + SWA and an accuracy issue related to 2CTA MLA (#6834)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-08-13 13:55:56 -07:00
hlu1
8207d5fd39
[None][feat] Add model gpt-oss (#6645)
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
Xiwen Yu
345c2bceaa
update trtllm-gen sm100f cubins of gemm kernels
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-06 14:25:12 +08:00
Tian Zheng
3a94d80839
Update SM100f cubins
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-08-06 14:25:00 +08:00
Xiwen Yu
5c09dc8304
CUDA 13 breaking changes: C++ compiles successfully
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-06 14:24:40 +08:00
Perkz Zheng
706f421cb0
[Fix] the bug in the trtllm-gen heuristic for MLA kernels (#6284)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-07-24 23:40:27 +08:00
ChristinaZ
7e033c392e
Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-17 12:38:29 +08:00
Yuan Tong
a36ac45c4d
fix: fast redux detection in trtllm gen routing kernel (#5941)
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-13 16:35:07 +08:00
ChristinaZ
c5fb692a7d
Refactor the remaining routing logic of the routing kernels in the MoE TRT-LLM backend (#5771)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
davidclark-nv
a1235ee978
[feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
...
Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-07-07 13:34:55 -07:00
ChristinaZ
12d8c7d129
Refactor the topk parallelization part for the routing kernels (#5567)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-07 15:53:25 +08:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation (#5476)
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Yan Chunwei
a5eff139f1
[TRTLLM-5277] chore: refine llmapi examples for 1.0 (part 1) (#5431)
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-01 19:06:41 +08:00
ChristinaZ
a608b00d38
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-27 20:17:40 +08:00
Anthony Chang
de7cd0de05
fix: MoE autotune fallback failed to query default heuristic (#5520)
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-26 17:28:48 +01:00
ChristinaZ
d135f5993d
Add unit test for routing kernels (#5405)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-26 09:49:11 +08:00
Perkz Zheng
1f292ff2a0
[https://jirasw.nvidia.com/browse/TRTLLM-4645] support multiCtasKvMode for high-throughput MLA kernels (#5426)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-25 16:31:10 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
...
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
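A toy, framework-free sketch of the autotuning pattern described in the commit body above, under stated assumptions: the class and method names here (BatchedGemmRunnerSketch, tune, run) are illustrative stand-ins, not the actual fp8_block_scale_moe_runner op or FP8BlockScaleMoERunner interface. It only mirrors the flow of timing candidate kernel configs once per problem shape and replaying the cached winning configIndex afterwards.

```python
import time

class BatchedGemmRunnerSketch:
    """Illustrative only: caches the best configIndex per problem shape,
    loosely mirroring the autotuner flow described in the commit above."""

    def __init__(self, configs):
        self.configs = configs      # candidate kernels: callables taking a shape
        self.best_index = {}        # problem shape -> cached configIndex

    def tune(self, shape):
        # Time each candidate config once for this shape and remember the winner.
        timings = []
        for index, kernel in enumerate(self.configs):
            start = time.perf_counter()
            kernel(shape)
            timings.append((time.perf_counter() - start, index))
        self.best_index[shape] = min(timings)[1]

    def run(self, shape):
        if shape not in self.best_index:
            self.tune(shape)        # a non-autotuned path would fall back to a default index
        return self.configs[self.best_index[shape]](shape)

# usage with dummy "kernels": the first run tunes, later runs replay the cached winner
runner = BatchedGemmRunnerSketch([lambda s: sum(range(100_000)),
                                  lambda s: sum(range(10))])
runner.run((128, 256))
runner.run((128, 256))
```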
Anthony Chang
4f9fa9f21d
feat: MoE trtllm backend kernel update (#5183)
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-16 14:46:13 +08:00
Perkz Zheng
3d87770e15
[https://nvbugspro.nvidia.com/bug/5295470] support headDim 256 for Blackwell FMHA kernels (#5164)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-13 23:01:01 +08:00
Matthias Jouanneaux
514baf1287
[fix] Fix comment to pass guardwords check (#5191)
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-06-13 15:49:59 +08:00
Matthias Jouanneaux
a0b6c635b1
[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063)
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-06-13 06:00:02 +08:00
Zongfei Jing
6d1f2d0fd7
[TRTLLM-3927][feat] Finalize + Allreduce + add + rmsnorm fusion (#4756)
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-06-10 19:55:16 +08:00
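For reference, a minimal single-process NumPy sketch of the unfused op sequence this commit's fusion combines. The allreduce step is elided (in the real kernel it runs across tensor-parallel ranks between finalize and the add), and all shapes and names below are assumptions for illustration, not the fused kernel's interface.

```python
import numpy as np

def finalize_add_rmsnorm(expert_out, expert_weights, residual, gamma, eps=1e-6):
    """Unfused reference for: MoE finalize (weighted top-k combine)
    -> [allreduce, elided] -> residual add -> rmsnorm."""
    # finalize: [tokens, top_k, hidden] x [tokens, top_k] -> [tokens, hidden]
    combined = np.einsum("tkh,tk->th", expert_out, expert_weights)
    # (tensor-parallel allreduce of `combined` would happen here)
    x = combined + residual                                    # residual add
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma                                   # rmsnorm

# example: 4 tokens, top-2 experts, hidden size 8
out = finalize_add_rmsnorm(np.random.randn(4, 2, 8), np.full((4, 2), 0.5),
                           np.random.randn(4, 8), np.ones(8))
```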
Dom Brown
9c012d5bf8
[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872)
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-09 11:02:48 +01:00
ChristinaZ
f45aff2b7d
Add customized renormalized moe routing kernel for moe cutlass backend (#4955)
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-09 17:38:50 +08:00
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
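A hedged sketch of the memoization idea named in the subject above: the shuffle (permutation) index depends only on shape parameters, so it can be computed once and reused across weight tensors instead of being rebuilt per layer. The permutation, tile size, and function names below are made up for illustration and are not the actual TRT-LLM preprocessing.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def shuffle_index(rows: int, tile: int) -> tuple:
    # Stand-in permutation (tile-strided gather); the real shuffle index
    # for moe_backend=TRTLLM is kernel-specific. Returned as a tuple so
    # lru_cache can store a hashable value.
    return tuple(r for offset in range(tile) for r in range(offset, rows, tile))

def preprocess_weight(weight_rows, tile=4):
    # After the first call for a given (rows, tile), the index is memoized,
    # so repeated weight preprocessing skips recomputing the permutation.
    index = shuffle_index(len(weight_rows), tile)
    return [weight_rows[i] for i in index]

print(preprocess_weight(list(range(8))))  # -> [0, 4, 1, 5, 2, 6, 3, 7]
```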
ChristinaZ
d64af85e8c
Replace memset with data initialization within kernels (#4851)
...
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
2025-06-04 08:56:46 +08:00
Perkz Zheng
a089aa3225
[https://nvbugspro.nvidia.com/bug/5300080] Fix the bug of setting attention_chunk_size and enable chunked attention in the generation phase by default (#4693)
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-03 19:02:57 -04:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
...
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00