TensorRT-LLM/cpp/tensorrt_llm/kernels
Neta Zmora 34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires the ReLU2 (i.e., squared ReLU) MoE activation function.
This PR adds it, along with a general API for setting the activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.
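For reference, ReLU2 simply squares the positive part of its input: relu2(x) = max(x, 0)^2. A minimal PyTorch sketch of the activation semantics (illustrative only; the kernels in this PR implement it inside the fused Cutlass MoE path, not as a standalone op):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    # Squared ReLU: relu2(x) = max(x, 0) ** 2
    r = torch.relu(x)
    return r * r
```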

The PR also switches the Auto Deploy MoE backend for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`), as summarized in the mapping below.
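A sketch of the op-name mapping; only the op names come from the PR text, and argument signatures are deliberately omitted since they are not part of this description:

```python
# Auto Deploy MoE ops replaced by this PR: Triton -> TRTLLM/Cutlass.
OP_REPLACEMENTS = {
    "torch.ops.auto_deploy.triton_moe_fused": "torch.ops.auto_deploy.trtllm_moe_fused",
    "torch.ops.auto_deploy.triton_quant_fp8_moe": "torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused",
}
```

Callers that previously dispatched to the Triton ops resolve to the TRTLLM/Cutlass equivalents after this change; the op interfaces themselves are defined in the Auto Deploy backend.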

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
beamSearchKernels Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
causalConv1d fix: fix license bug (#5200) 2025-06-13 18:58:15 +08:00
communicationKernels [None][feat] MNNVLAllreduce Kernel Refactor (#8018) 2025-11-05 08:49:47 +08:00
contextFusedMultiHeadAttention [TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692) 2025-10-31 14:38:31 -07:00
cutlass_kernels [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011) 2025-11-13 16:54:45 -08:00
decoderMaskedMultiheadAttention [None][fix] fix eagle3 accuracy issue on sm120 (#8944) 2025-11-10 14:02:03 +08:00
dsv3MinLatencyKernels [https://nvbugs/5545522][fix] move PREEXIT in UB kernels to fix accuracy issue (#8318) 2025-11-04 16:42:31 +08:00
flashMLA feat: reduce unnecessary kernel generation (#5476) 2025-07-04 14:37:49 +08:00
fusedLayernormKernels [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
groupRmsNormKernels feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
internal_cutlass_kernels [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011) 2025-11-13 16:54:45 -08:00
llama4MinLatencyKernels feat: reduce unnecessary kernel generation (#5476) 2025-07-04 14:37:49 +08:00
lora chore: Stabilize ABI boundary for internal kernel library (#3117) 2025-04-11 15:07:50 +08:00
moeLoadBalance [None][fix] fix EPLB init hang (#8649) 2025-10-28 05:22:49 -04:00
selectiveScan fix: fix license bug (#5200) 2025-06-13 18:58:15 +08:00
speculativeDecoding [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568) 2025-09-16 09:56:18 +08:00
tinygemm2 [TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss (#7916) 2025-10-02 10:47:04 -07:00
trtllmGenKernels [None][fix] support topk autotuner input for expert slot per group larger than 32 (#9087) 2025-11-14 08:37:20 +08:00
unfusedAttentionKernels [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086) 2025-10-14 08:23:16 -07:00
userbuffers [https://nvbugs/5545522][fix] move PREEXIT in UB kernels to fix accuracy issue (#8318) 2025-11-04 16:42:31 +08:00
weightOnlyBatchedGemv [None][feat] Enable nvfp4 cuda core for sm120 (#8620) 2025-10-29 12:39:03 +08:00
attentionMask.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
attentionMask.h
banBadWords.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banBadWords.h
banRepeatNgram.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banRepeatNgram.h
beamSearchKernels.cu [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568) 2025-09-16 09:56:18 +08:00
beamSearchKernels.h [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568) 2025-09-16 09:56:18 +08:00
buildRelativeAttentionBiasKernel.cu
buildRelativeAttentionBiasKernel.h
CMakeLists.txt feat: reduce unnecessary kernel generation (#5476) 2025-07-04 14:37:49 +08:00
cumsumLastDim.cu
cumsumLastDim.h
customAllReduceKernels.cu Cherry pick feat/llama4 to main (#4739) 2025-05-30 05:28:40 +08:00
customAllReduceKernels.h [TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870) 2025-11-04 16:42:31 +08:00
customMoeRoutingKernels.cu [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend (#6794) 2025-09-23 08:24:21 +08:00
customMoeRoutingKernels.h [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend (#6794) 2025-09-23 08:24:21 +08:00
decoderMaskedMultiheadAttention.cu
decoderMaskedMultiheadAttention.h [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
decoderMaskedMultiheadAttentionUtils.h [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
decodingCommon.cu
decodingKernels.cu Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
decodingKernels.h refactor: Improve decoder finalize function (#3077) 2025-03-28 14:33:59 +08:00
delayStream.cu Update (#2978) 2025-03-23 16:39:35 +08:00
delayStream.h Update (#2978) 2025-03-23 16:39:35 +08:00
doraScaling.cu
doraScaling.h
fmhaDispatcher.cpp [TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692) 2025-10-31 14:38:31 -07:00
fmhaDispatcher.h
fusedMoeCommKernels.cu [TRTLLM-6748][feat] add PDL support for more kernels (#7977) 2025-10-11 08:32:05 +08:00
fusedMoeCommKernels.h [TRTLLM-6748][feat] add PDL support for more kernels (#7977) 2025-10-11 08:32:05 +08:00
fusedQKNormRopeKernel.cu [None][feat] Support Qwen3 next (#7892) 2025-09-29 21:16:07 +08:00
fusedQKNormRopeKernel.h [None][feat] Support Yarn on Qwen3 (#6785) 2025-08-17 07:21:29 +08:00
gptKernels.cu [None][feat] Support NVFP4 KV Cache (#6244) 2025-09-01 09:24:52 +08:00
gptKernels.h [None][feat] Support NVFP4 KV Cache (#6244) 2025-09-01 09:24:52 +08:00
groupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
groupGemm.h
helixKernels.cu [TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104) 2025-11-04 09:06:58 +08:00
helixKernels.h [TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104) 2025-11-04 09:06:58 +08:00
indexerKCacheScatter.cu [None][perf] Add custom indexer k cache scatter op (#8960) 2025-11-07 11:24:26 -08:00
IndexerKCacheScatter.h [None][perf] Add custom indexer k cache scatter op (#8960) 2025-11-07 11:24:26 -08:00
indexerTopK.cu [None][feat] Add customized topk and related unit tests for DSA (#8882) 2025-11-10 03:35:35 -08:00
IndexerTopK.h [None][feat] Add customized topk and related unit tests for DSA (#8882) 2025-11-10 03:35:35 -08:00
kvCachePartialCopy.cu [fix] Fix illegal mem access and possible accuracy loss. Cherry-pick … (#5017) 2025-06-09 17:50:57 +08:00
kvCacheUtils.h chore: Improve documentation of Kv_block_array (#5765) 2025-07-05 22:25:27 +02:00
layernormKernels.cu feat: Add support for fp8 rowwise quantization (#4876) 2025-06-14 06:37:48 -07:00
layernormKernels.h feat: Add support for fp8 rowwise quantization (#4876) 2025-06-14 06:37:48 -07:00
logitsBitmask.cu [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481) 2025-09-04 23:30:14 +08:00
logitsBitmask.h [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481) 2025-09-04 23:30:14 +08:00
lookupKernels.cu
lookupKernels.h
lruKernel.cu
lruKernel.h
mambaConv1dKernels.cu feat: Add FP8 support for SM 120 (#3248) 2025-04-14 16:05:41 -07:00
mambaConv1dKernels.h
mlaChunkedPrefill.cu [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477) 2025-09-15 21:43:49 +08:00
mlaChunkedPrefill.cuh [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477) 2025-09-15 21:43:49 +08:00
mlaKernels.cu [TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104) 2025-11-04 09:06:58 +08:00
mlaKernels.h [TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692) 2025-10-31 14:38:31 -07:00
moe_utils.cuh [https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend (#7286) 2025-10-16 11:07:48 +08:00
moeCommKernelsCommon.h [TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP (#6973) 2025-08-24 08:15:29 -04:00
moePrepareKernels.cu [None][fix] Rename: slot_count -> invalid_expert_id (#8783) 2025-11-01 21:36:59 +08:00
moePrepareKernels.h [None][fix] Rename: slot_count -> invalid_expert_id (#8783) 2025-11-01 21:36:59 +08:00
moeTopKFuncs.cuh [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761) 2025-10-20 10:08:31 +08:00
multiHeadAttentionCommon.h [TRTLLM-4629] [feat] trtllm-gen kernels support sm103 (#7570) 2025-09-07 10:04:10 +08:00
noAuxTcKernels.cu [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761) 2025-10-20 10:08:31 +08:00
noAuxTcKernels.h [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761) 2025-10-20 10:08:31 +08:00
penaltyKernels.cu [None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127) 2025-10-27 13:12:31 -04:00
penaltyKernels.h [None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127) 2025-10-27 13:12:31 -04:00
penaltyTypes.h [None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127) 2025-10-27 13:12:31 -04:00
preQuantScaleKernel.cu [https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend (#7286) 2025-10-16 11:07:48 +08:00
preQuantScaleKernel.h [https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend (#7286) 2025-10-16 11:07:48 +08:00
qserveGemm.h
qserveGemmPerChannel.cu
qserveGemmPerGroup.cu
quantization.cu [None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126) 2025-08-27 00:13:13 +08:00
quantization.cuh [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
quantization.h [None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126) 2025-08-27 00:13:13 +08:00
recoverFromRingAtten.cu [https://nvbugs/5503138] [fix] Remove compile warnings (#8167) 2025-10-13 13:24:23 +08:00
recoverFromRingAtten.h Support RingAttention in the BertAttention plugin and the DiT model (#3661) 2025-05-09 08:06:54 +08:00
rmsnormKernels.cu
rmsnormKernels.h
sageAttentionKernels.cu [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568) 2025-09-16 09:56:18 +08:00
sageAttentionKernels.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingAirTopPKernels.cu
samplingTopKKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingTopKKernels.h [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216) 2025-08-07 22:19:37 -04:00
samplingTopPKernels.cu chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
samplingTopPKernels.h
sparseAttentionKernels.cu [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086) 2025-10-14 08:23:16 -07:00
sparseAttentionKernels.h [TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692) 2025-10-31 14:38:31 -07:00
splitkGroupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
splitkGroupGemm.h
stopCriteriaKernels.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
stopCriteriaKernels.h
topkLastDim.cu [None][chore] Mass integration of release/1.0 (#6864) 2025-08-22 09:25:15 +08:00
topkLastDim.h
unfusedAttentionKernels.cu fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
unfusedAttentionKernels.h [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086) 2025-10-14 08:23:16 -07:00
xqaDispatcher.cpp [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086) 2025-10-14 08:23:16 -07:00
xqaDispatcher.h [None][feat] Support NVFP4 KV Cache (#6244) 2025-09-01 09:24:52 +08:00