TensorRT-LLM/cpp/tensorrt_llm/kernels/cutlass_kernels
Neta Zmora 34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires the ReLU2 (i.e., squared ReLU) MoE activation function.
This PR adds ReLU2 support and, more generally, an API for setting the activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.
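
For reference, ReLU2 is simply a ReLU whose output is squared, i.e. relu2(x) = max(x, 0)^2. A minimal PyTorch sketch of the activation (the `relu2` helper below is illustrative, not the kernel's actual implementation):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: relu2(x) = max(x, 0)**2."""
    r = torch.relu(x)
    return r * r

# Example: apply ReLU2 to an expert's intermediate projection.
h = torch.randn(4, 16)
print(relu2(h).shape)  # torch.Size([4, 16])
```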

The PR also switches the Auto Deploy MoE backend for 16-bit and FP8 from the Triton ops
(`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to the TRTLLM/Cutlass ops (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).
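
For context, a fused MoE op of this kind computes, per token, a top-k routed mixture of two-layer expert MLPs with ReLU2 between the projections. Below is a minimal unfused PyTorch reference of that computation; the function name, weight layout, and softmax-over-top-k routing are illustrative assumptions, not the fused ops' actual interface:

```python
import torch

def moe_relu2_reference(x, router_logits, w_up, w_down, top_k=2):
    """Unfused MoE reference with squared-ReLU experts (illustrative).

    Assumed shapes: x [T, H], router_logits [T, E],
    w_up [E, H, I], w_down [E, I, H].
    """
    topk_vals, topk_idx = router_logits.topk(top_k, dim=-1)
    topk_w = torch.softmax(topk_vals, dim=-1)      # per-token mixing weights
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                    # loop over tokens
        for k in range(top_k):                     # loop over selected experts
            e = topk_idx[t, k]
            h = torch.relu(x[t] @ w_up[e]) ** 2    # ReLU2 activation
            out[t] += topk_w[t, k] * (h @ w_down[e])
    return out

# Example with random weights.
T, H, I, E = 3, 8, 16, 4
x = torch.randn(T, H)
logits = torch.randn(T, E)
out = moe_relu2_reference(x, logits, torch.randn(E, H, I), torch.randn(E, I, H))
print(out.shape)  # torch.Size([3, 8])
```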

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
| Name | Last commit | Date |
| --- | --- | --- |
| allreduce_gemm | fix TMA error with GEMM+AR on TP=2 (#6075) | 2025-07-18 10:26:08 +08:00 |
| fp4_gemm | [None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851) | 2025-09-25 21:02:35 +08:00 |
| fp8_blockscale_gemm | [TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 (#8844) | 2025-11-04 18:10:36 +08:00 |
| fp8_rowwise_gemm | [https://nvbugs/5619396][fix] Add sm103 to CutlassFP8RowwiseGemm (#9042) | 2025-11-10 08:12:14 -08:00 |
| fpA_intB_gemm | [None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851) | 2025-09-25 21:02:35 +08:00 |
| fused_gated_gemm | [None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851) | 2025-09-25 21:02:35 +08:00 |
| include | [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011) | 2025-11-13 16:54:45 -08:00 |
| int8_gemm | [None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851) | 2025-09-25 21:02:35 +08:00 |
| low_latency_gemm | [None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851) | 2025-09-25 21:02:35 +08:00 |
| moe_gemm | [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011) | 2025-11-13 16:54:45 -08:00 |
| python | [None][feat] GPT-OSS Sm120/Sm121 Support (#7937) | 2025-10-06 16:59:06 -04:00 |
| CMakeLists.txt | [TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 (#8844) | 2025-11-04 18:10:36 +08:00 |
| cutlass_heuristic.cpp | [None][feat] GPT-OSS Sm120/Sm121 Support (#7937) | 2025-10-06 16:59:06 -04:00 |
| cutlass_heuristic.h | [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126) | 2025-09-02 21:54:43 -04:00 |
| cutlass_preprocessors.cpp | [TRTLLM-5366][feat] Add support for sm121 (#5524) | 2025-07-08 14:27:00 -07:00 |
| cutlass_preprocessors.h | Update TensorRT-LLM (#1492) | 2024-04-24 14:44:22 +08:00 |
| cutlass_type_conversion.h | chore: cutlass cleanup (#3165) | 2025-04-01 13:57:38 +08:00 |