Anthony Chang
8a3b870e09
[None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN ( #8156 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-23 09:14:18 +08:00
ChristinaZ
c8b9998acb
[TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) ( #7761 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-20 10:08:31 +08:00
Jhao-Ting Chen
c33f43e13a
[ https://nvbugs/5518713 ][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) ( #7856 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-09-26 14:29:32 -07:00
ChristinaZ
be576a3152
[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend ( #6794 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 08:24:21 +08:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices ( #7568 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss ( #6645 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access ( #5676 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
ChristinaZ
a608b00d38
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) ( #5519 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-27 20:17:40 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner ( #5207 )
...
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
Matthias Jouanneaux
a0b6c635b1
[feat] trtllmGen MoE routing: added support for top groups and top K bounds ( #4063 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-06-13 06:00:02 +08:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins ( #4643 )
...
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend ( #4530 )
...
* MoE TRTLLM backend for Qwen3
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* add extra moe_backend to test
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* address comments
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* conditionally compile kernels on newer archs
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* missing positional arg
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* Update the routing kernels
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Revise usage of TLLM_LOG_ERROR
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Add unit test for Qwen3 moe (trtllm_gen backend)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* improve weight processing speed of moe_backend=TRTLLM; roughly 2x
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* tidy and minor fix
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* temporarily disable accuracy test that has known issue
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
---------
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
Nikita Korobov
fa3879629e
feat: TRT-LLM Gen integration for BMM and MoE refactoring ( #4280 )
...
- Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator.
- Refactors TRT-LLM Gen MoE runner to call to BMM interface
- The accuracy is verified for DeepSeek R1 FP4
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-16 13:31:53 +02:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main ( #4086 )
...
* feat: TRT-LLM Gen FP8 MoE Llama4
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* feat: TRT-LLM Gen llama4 MoE Top1 routing
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
* feat: add per tensor FP8 TRT-LLM Gen GEMMs
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* Update
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Update
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Add guard for routingIndicesClusterKernel
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Guard sm90+ for routingkernels
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Guard sm90+ for routingkernels
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
---------
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
hlu1
31624b079a
feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend ( #3387 )
...
* Add TRT-LLM Gen MOE to Deepseek
fix fused moe rebase bug.
Fix atol in test_fp4_gemm_quantize.py
fix fused moe rebase bug.
Fix FusedMoe.
Disable 2nd routing kernel preexit
Bump routing reduction to fp32
Disable PDL for fc1
[DEBUG] Lift token limit to 16k
[Bugfix] Token limit to 16k + fp32 routing + tanh
Make fp8 tileN 8
Fix FP8 MoE + Remove redundent temp output for FP4
[FP8-only] Avoid wasting CTAs for activation kernel
fix: unblock FP8 weightloading with trtllm-gen
Remove max_token limit for trtllm-gen path
perf: avoid type-conversion and fill_ from aten
Minor fix
Signed-off-by: Hao Lu <haolu@nvidia.com>
* Fix rebase issues
Signed-off-by: Hao Lu <haolu@nvidia.com>
* Fix compile issue
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* CI clean
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
---------
Signed-off-by: Hao Lu <haolu@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-21 10:01:33 +08:00
Kaiyu Xie
9b931c0f63
Update TensorRT-LLM ( #2873 )
2025-03-11 21:13:42 +08:00