TensorRT-LLM/cpp/tensorrt_llm/thop
Neta Zmora 34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires the ReLU2 (i.e., squared ReLU) MoE activation function.
This PR adds ReLU2 support and, more generally, an API for setting the MoE activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.
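
For reference, ReLU2 simply squares the output of a standard ReLU. A minimal plain-PyTorch sketch (illustrative only, not the kernel implementation in this PR):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    """ReLU2 (squared ReLU): zero out negatives, then square the result."""
    r = torch.relu(x)
    return r * r
```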

The PR also switches the Auto Deploy MoE backends for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).
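
For intuition, below is a naive plain-PyTorch reference of what a fused MoE op computes when ReLU2 is the expert activation. The tensor shapes, routing normalization, and argument names are assumptions for illustration and do not reflect the actual signatures or semantics of the `torch.ops.auto_deploy.*` ops.

```python
import torch

def reference_moe_relu2(x, router_logits, w1, w2, top_k=2):
    """Naive per-token MoE reference using ReLU2 experts (illustrative only).

    Assumed shapes (not the fused-op signature):
      x:             [tokens, hidden]
      router_logits: [tokens, num_experts]
      w1:            [num_experts, inter, hidden]   first expert projection
      w2:            [num_experts, hidden, inter]   second expert projection
    """
    probs = torch.softmax(router_logits, dim=-1)
    topk_p, topk_i = probs.topk(top_k, dim=-1)        # [tokens, top_k]
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for p, e in zip(topk_p[t], topk_i[t]):
            h = torch.relu(w1[e] @ x[t]) ** 2         # ReLU2 between the two expert GEMMs
            out[t] += p * (w2[e] @ h)
    return out
```

The fused TRTLLM/Cutlass path performs the same per-expert computation with grouped GEMMs rather than per-token Python loops.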

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
allgatherOp.cpp [TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520) 2025-10-04 08:12:24 +08:00
allreduceOp.cpp [None][feat] MNNVLAllreduce Kernel Refactor (#8018) 2025-11-05 08:49:47 +08:00
alltoallOp.cpp [TRTLLM-5966][feat] Helix: add alltoall op (#6815) 2025-09-25 07:18:29 -07:00
attentionOp.cpp [None][fix] Remove unnecessary attention workspace memory check (#9064) 2025-11-12 11:18:50 +08:00
attentionOp.h [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation (#8495) 2025-11-06 17:39:57 +08:00
causalConv1dOp.cpp fix: fix license bug (#5200) 2025-06-13 18:58:15 +08:00
CMakeLists.txt [None][feat] Add customized topk and related unit tests for DSA (#8882) 2025-11-10 03:35:35 -08:00
convertSpecDecodingMaskToPackedMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
cublasFp4ScaledMM.cpp [https://nvbugs/5451205][feat] Add cuBLASLt NVFP4 GEMM backend support (#7943) 2025-10-23 15:55:10 +08:00
cublasScaledMM.cpp [None][feat] GPT-OSS Sm120/Sm121 Support (#7937) 2025-10-06 16:59:06 -04:00
cublasScaledMM.h Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560) 2025-06-14 17:36:22 +08:00
cudaNvfp4MM.cpp [None][feat] Enable nvfp4 cuda core for sm120 (#8620) 2025-10-29 12:39:03 +08:00
cudaScaledMM.cpp [NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029) 2025-07-09 23:16:42 +08:00
customMoeRoutingOp.cpp [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend (#6794) 2025-09-23 08:24:21 +08:00
cutlassScaledMM.cpp refactoring: port customized kernels with public cutlass version (#5027) 2025-06-13 16:19:31 +08:00
dsv3FusedAGemmOp.cpp Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560) 2025-06-14 17:36:22 +08:00
dsv3RopeOp.cpp [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation (#8495) 2025-11-06 17:39:57 +08:00
dsv3RouterGemmOp.cpp Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560) 2025-06-14 17:36:22 +08:00
dynamicDecodeOp.cpp [None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127) 2025-10-27 13:12:31 -04:00
dynamicDecodeOp.h [None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127) 2025-10-27 13:12:31 -04:00
finegrained_mixed_dtype_gemm_thop.cpp W4A8 GEMM (#6005) 2025-07-20 17:34:57 +03:00
finegrained_mixed_dtype_gemm_thop.h W4A8 GEMM (#6005) 2025-07-20 17:34:57 +03:00
fmhaPackMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
fp4BatchedQuantize.cpp [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
fp4BlockScaleMoe.cpp [None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN (#8156) 2025-10-23 09:14:18 +08:00
fp4Gemm.cpp [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568) 2025-09-16 09:56:18 +08:00
fp4GemmTrtllmGen.cpp [OMNIML-2336][feat] Add NVFP4 x FP8 (#6809) 2025-09-04 09:03:38 -07:00
fp4Op.cpp [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
fp4Quantize.cpp [None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126) 2025-08-27 00:13:13 +08:00
fp4Quantize.h [None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126) 2025-08-27 00:13:13 +08:00
fp4xFp8GemmTrtllmGen.cpp [OMNIML-2336][feat] Add NVFP4 x FP8 (#6809) 2025-09-04 09:03:38 -07:00
fp8BatchedGemmTrtllmGen.cpp [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
fp8BlockScaleMoe.cpp [None][feat] Autotuner can iterate through all tactics for test purposes (#8663) 2025-10-30 13:11:25 +01:00
fp8BlockScalingGemm.cpp [TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 (#8844) 2025-11-04 18:10:36 +08:00
fp8Op.cpp [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
fp8Op.h [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
fp8PerTensorScaleMoe.cpp [None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN (#8156) 2025-10-23 09:14:18 +08:00
fp8PerTensorScalingTrtllmGenGemm.cpp [TRTLLM-4629] [feat] trtllm-gen kernels support sm103 (#7570) 2025-09-07 10:04:10 +08:00
fp8Quantize.cpp [None][perf] Use fp8 quant kernel in DS3.2 indexer module (#8701) 2025-10-29 12:45:09 +08:00
fp8RowwiseGemm.cpp [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615) 2025-07-07 18:04:57 +08:00
fusedQKNormRopeOp.cpp [None][feat] Support Yarn on Qwen3 (#6785) 2025-08-17 07:21:29 +08:00
fusedTopkSoftmax.cpp refactoring: port customized kernels with public cutlass version (#5027) 2025-06-13 16:19:31 +08:00
gatherTreeOp.cpp [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092) 2025-05-14 23:10:04 +02:00
groupRmsNormOp.cpp feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
helixPostProcessOp.cpp [TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104) 2025-11-04 09:06:58 +08:00
IndexerKCacheScatterOp.cpp [None][perf] Add custom indexer k cache scatter op (#8960) 2025-11-07 11:24:26 -08:00
IndexerTopKOp.cpp [None][feat] Add customized topk and related unit tests for DSA (#8882) 2025-11-10 03:35:35 -08:00
llama4MinLatency.cpp Cherry pick feat/llama4 to main (#4739) 2025-05-30 05:28:40 +08:00
logitsBitmaskOp.cpp [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481) 2025-09-04 23:30:14 +08:00
loraOp.cpp [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call (#6968) 2025-08-19 15:39:56 +03:00
mambaConv1dOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
mlaPreprocessOp.cpp [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477) 2025-09-15 21:43:49 +08:00
moeAlltoAllMeta.h [None][feat] Integrate MnnvlThroughput into TRTLLM MoE. (#8728) 2025-11-04 21:36:29 +08:00
moeAlltoAllOp.cpp [None][feat] Integrate MnnvlThroughput into TRTLLM MoE. (#8728) 2025-11-04 21:36:29 +08:00
moeCommOp.cpp [None][fix] Rename: slot_count -> invalid_expert_id (#8783) 2025-11-01 21:36:59 +08:00
moeLoadBalanceOp.cpp [TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP (#6973) 2025-08-24 08:15:29 -04:00
moeOp.cpp [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011) 2025-11-13 16:54:45 -08:00
moeUtilOp.cpp [TRTLLM-7319][perf] Fuse slicing into MoE. (#6728) 2025-08-25 16:52:30 -04:00
mtpOp.cpp fix: refactor and fix mtp vanilla (#4762) 2025-06-20 05:23:39 +08:00
mxFp4BlockScaleMoe.cpp [None][feat] Integrate MnnvlThroughput into TRTLLM MoE. (#8728) 2025-11-04 21:36:29 +08:00
mxFp8Quantize.cpp [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00
ncclCommunicatorOp.cpp feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034) 2025-05-16 04:16:53 +08:00
ncclCommunicatorOp.h feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034) 2025-05-16 04:16:53 +08:00
noAuxTcOp.cpp [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761) 2025-10-20 10:08:31 +08:00
parallelDecodeKVCacheUpdateOp.cpp Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
redrafterCurandOp.cpp [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092) 2025-05-14 23:10:04 +02:00
reducescatterOp.cpp [TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520) 2025-10-04 08:12:24 +08:00
relativeAttentionBiasOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
selectiveScanOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
thUtils.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
thUtils.h [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615) 2025-07-07 18:04:57 +08:00
tinygemm2.cpp [None][fix] Restrict tinygemm use to certain SMs (#8182) 2025-10-08 17:55:57 -07:00
userbuffersFinalizeOp.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
userbuffersTensor.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
userbuffersTensor.h feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
virtualMemoryAllocator.cpp [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory (#5034) 2025-08-04 13:51:01 +08:00
weightOnlyQuantGemm.cpp [TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow (#5850) 2025-07-21 15:17:35 +08:00
weightOnlyQuantGemm.h [TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow (#5850) 2025-07-21 15:17:35 +08:00
weightOnlyQuantOp.cpp [None] [feat] Add model gpt-oss (#6645) 2025-08-07 03:04:18 -04:00