TensorRT-LLM/cpp/tensorrt_llm/kernels
Simeng Liu 286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between two
warp-allocation strategies (sketched below):
- GroupRMSNormBaseKernel: allocates warps proportional to the sum of the
  group dimensions (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: allocates warps proportional to the max group
  dimension (better block scheduling in large-batch scenarios)
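
A minimal sketch of the two strategies, assuming 32-lane warps; the
function names are illustrative, not the actual kernel code:

    def warps_base(dims, lanes=32):
        # Base kernel: warps cover the concatenated dimensions of all
        # groups, so small per-group work packs densely onto the SMs.
        return (sum(dims) + lanes - 1) // lanes

    def warps_large_batch(dims, lanes=32):
        # Large-batch kernel: warps sized to the widest group, so blocks
        # stay uniform and schedule evenly when the batch is large.
        return (max(dims) + lanes - 1) // lanes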

Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained on nsys kernel runtime data; a sketch of the
decision rule follows. The default kernel selection is the base kernel.
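
As a rough illustration, the decision reduces to a logistic regression
over features like those named above. Feature choice, weights, and bias
here are placeholders for the per-architecture values fitted offline:

    import math

    def pick_kernel(batch_size, base_warps, large_batch_warps, weights, bias):
        # Hypothetical feature vector; the real model's features and
        # weights come from fitting against nsys kernel runtime data.
        features = [batch_size, base_warps, large_batch_warps]
        z = bias + sum(w * f for w, f in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))  # score for the large-batch kernel
        return "large_batch" if p > 0.5 else "base"  # base is the default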

The Python operator group_rms_norm uses the heuristic by default.
Users can also explicitly select the base or large-batch kernel.
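
Hypothetical usage, assuming group_rms_norm is in scope and accepts a
kernel-selection keyword argument (the actual signature may differ):

    # Heuristic picks the kernel (default behavior).
    out = group_rms_norm(inputs, weights, eps=1e-5)

    # Force a specific kernel.
    out = group_rms_norm(inputs, weights, eps=1e-5, kernel="base")
    out = group_rms_norm(inputs, weights, eps=1e-5, kernel="large_batch")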

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
beamSearchKernels Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
communicationKernels chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
contextFusedMultiHeadAttention fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863) 2025-04-29 09:09:43 +08:00
cutlass_kernels feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
decoderMaskedMultiheadAttention infra: open source XQA kernels (#3762) 2025-04-30 18:05:15 +08:00
flashMLA feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
fusedLayernormKernels update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
groupRmsNormKernels feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
internal_cutlass_kernels feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
lora chore: Stabilize ABI boundary for internal kernel library (#3117) 2025-04-11 15:07:50 +08:00
selectiveScan Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
speculativeDecoding feat: add relaxed acceptance for DS (#3865) 2025-05-01 21:50:36 +08:00
trtllmGenKernels Feat: support exporting softmax statistics and update the kernel-selection heuristic (#4155) 2025-05-12 15:31:46 +08:00
unfusedAttentionKernels feat: add CGA reduction fmha kernels on Blackwell. (#3763) 2025-04-29 10:43:54 +08:00
userbuffers feat: register ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343) 2025-04-14 10:30:23 +08:00
weightOnlyBatchedGemv feat: Add FP8 support for SM 120 (#3248) 2025-04-14 16:05:41 -07:00
attentionMask.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
attentionMask.h Update TensorRT-LLM (#2363) 2024-10-22 20:27:35 +08:00
banBadWords.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banBadWords.h Update TensorRT-LLM (#2008) 2024-07-23 23:05:09 +08:00
banRepeatNgram.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banRepeatNgram.h Update TensorRT-LLM (#1598) 2024-05-14 16:43:41 +08:00
beamSearchKernels.cu Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
beamSearchKernels.h Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
buildRelativeAttentionBiasKernel.cu Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
buildRelativeAttentionBiasKernel.h Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
CMakeLists.txt feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
cumsumLastDim.cu open source 7f370deb0090d885d7518c2b146399ba3933c004 (#2273) 2024-09-30 13:51:19 +02:00
cumsumLastDim.h Update TensorRT-LLM (#1725) 2024-06-04 20:26:32 +08:00
customAllReduceKernels.cu chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
customAllReduceKernels.h Unify two versions of AllReduce custom op (#3032) 2025-04-22 21:58:42 +08:00
decoderMaskedMultiheadAttention.cu Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
decoderMaskedMultiheadAttention.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decoderMaskedMultiheadAttentionUtils.h Update TensorRT-LLM (#2363) 2024-10-22 20:27:35 +08:00
decodingCommon.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decodingKernels.cu Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
decodingKernels.h refactor: Improve decoder finalize function (#3077) 2025-03-28 14:33:59 +08:00
delayStream.cu Update (#2978) 2025-03-23 16:39:35 +08:00
delayStream.h Update (#2978) 2025-03-23 16:39:35 +08:00
doraScaling.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
doraScaling.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
fmhaDispatcher.cpp optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907) 2025-04-29 14:17:07 +08:00
fmhaDispatcher.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
gptKernels.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
gptKernels.h feat: add CGA reduction fmha kernels on Blackwell. (#3763) 2025-04-29 10:43:54 +08:00
groupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
groupGemm.h Update TensorRT-LLM (#2562) 2024-12-11 00:31:05 -08:00
kvCachePartialCopy.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
kvCacheUtils.h Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
layernormKernels.cu Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
layernormKernels.h Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
logitsBitmask.cu bitmask v3 (#3009) 2025-03-26 15:21:29 +08:00
logitsBitmask.h Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
lookupKernels.cu Update TensorRT-LLM (#1639) 2024-05-21 17:51:02 +08:00
lookupKernels.h Update TensorRT-LLM (#1639) 2024-05-21 17:51:02 +08:00
lruKernel.cu Update TensorRT-LLM (#1688) 2024-05-28 20:07:49 +08:00
lruKernel.h Update TensorRT-LLM (#1688) 2024-05-28 20:07:49 +08:00
mambaConv1dKernels.cu feat: Add FP8 support for SM 120 (#3248) 2025-04-14 16:05:41 -07:00
mambaConv1dKernels.h Update TensorRT-LLM (#1954) 2024-07-16 15:30:25 +08:00
mlaKernels.cu feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
mlaKernels.h feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
moeCommKernels.cu feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
moeCommKernels.h feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
multiHeadAttentionCommon.h chore: Stabilize ABI boundary for internal kernel library (#3117) 2025-04-11 15:07:50 +08:00
noAuxTcKernels.cu Update (#2978) 2025-03-23 16:39:35 +08:00
noAuxTcKernels.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
penaltyKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
penaltyKernels.h Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
penaltyTypes.h Update TensorRT-LLM (#1554) 2024-05-07 23:34:28 +08:00
preQuantScaleKernel.cu [#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. (#4089) 2025-05-09 11:57:01 +08:00
preQuantScaleKernel.h [#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. (#4089) 2025-05-09 11:57:01 +08:00
qserveGemm.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
qserveGemmPerChannel.cu Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
qserveGemmPerGroup.cu Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
quantization.cu feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
quantization.cuh feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
quantization.h feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
recoverFromRingAtten.cu Support RingAttention in the BertAttention plugin and the DiT model (#3661) 2025-05-09 08:06:54 +08:00
recoverFromRingAtten.h Support RingAttention in the BertAttention plugin and the DiT model (#3661) 2025-05-09 08:06:54 +08:00
rmsnormKernels.cu Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
rmsnormKernels.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
sageAttentionKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
sageAttentionKernels.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingAirTopPKernels.cu Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
samplingTopKKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingTopKKernels.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
samplingTopPKernels.cu chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
samplingTopPKernels.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
splitkGroupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
splitkGroupGemm.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
stopCriteriaKernels.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
stopCriteriaKernels.h open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297) 2024-10-08 12:19:19 +02:00
topkLastDim.cu Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
topkLastDim.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
unfusedAttentionKernels.cu fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
unfusedAttentionKernels.h fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
xqaDispatcher.cpp optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907) 2025-04-29 14:17:07 +08:00
xqaDispatcher.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00