TensorRT-LLM/cpp/tensorrt_llm/thop
Simeng Liu 286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture (see the sketch below). The models
for Compute Capability 9.x and 10.x are trained on nsys kernel runtime data.
The base kernel is the default selection.
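
A minimal sketch of the selection logic described above, assuming a logistic
regression over batch size and the two warp allocations. The allocation
formulas, feature set, weights, and bias are illustrative placeholders, not
the trained models shipped for CC 9.x/10.x:

    import math

    WARP_SIZE = 32  # elements per warp; an assumption for this sketch

    def allocated_warps(dims: list[int], batch_size: int) -> tuple[int, int]:
        # Base kernel: warps proportional to the sum of the group dimensions.
        base = batch_size * math.ceil(sum(dims) / WARP_SIZE)
        # Large-batch kernel: warps proportional to the max dimension.
        large = batch_size * math.ceil(max(dims) / WARP_SIZE)
        return base, large

    def pick_large_batch_kernel(dims: list[int], batch_size: int,
                                num_sms: int,
                                weights=(0.01, -0.05, 0.05),
                                bias=-1.0) -> bool:
        base, large = allocated_warps(dims, batch_size)
        # Normalize warp counts by SM count as a proxy for scheduling
        # efficiency on the current GPU.
        feats = (batch_size, base / num_sms, large / num_sms)
        logit = bias + sum(w * f for w, f in zip(weights, feats))
        # sigmoid(logit) > 0.5 iff logit > 0; the base kernel stays the
        # default unless the model predicts a large-batch win.
        return logit > 0.0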

The Python operator group_rms_norm uses the heuristic by default.
Users can also explicitly select the base or large-batch kernel.
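
A hypothetical call pattern for the operator; the registration namespace,
op name, and signature below are assumptions for illustration, not the
confirmed TensorRT-LLM API:

    import torch

    # Two tensors normalized as one group, e.g. Q and K after projection.
    a = torch.randn(128, 4096, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(128, 1024, dtype=torch.bfloat16, device="cuda")
    wa = torch.ones(4096, dtype=torch.bfloat16, device="cuda")
    wb = torch.ones(1024, dtype=torch.bfloat16, device="cuda")

    # Heuristic kernel selection (the default). Dedicated base and
    # large-batch entry points would pin a specific kernel, assuming the
    # operator exposes that choice as described above.
    out_a, out_b = torch.ops.trtllm.group_rms_norm([a, b], [wa, wb], 1e-5)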

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
allgatherOp.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
allreduceOp.cpp feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
attentionOp.cpp [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
CMakeLists.txt chore: Clean up the legacy DeepseekAllreduceFusionOp. (#4081) 2025-05-09 10:20:41 +08:00
convertSpecDecodingMaskToPackedMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
cublasScaledMM.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
cutlassScaledMM.cpp feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
dynamicDecodeOp.cpp Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338) 2025-04-08 23:51:27 +08:00
dynamicDecodeOp.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
fmhaPackMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
fp4BatchedQuantize.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
fp4BlockScaleMoe.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp4Gemm.cpp feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
fp4GemmTrtllmGen.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp4Op.cpp feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
fp4Quantize.cpp feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp4Quantize.h feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp8BatchedGemmTrtllmGen.cpp feat: Adding FP8 BMM from Codegen (#3541) 2025-04-16 10:37:15 +02:00
fp8BlockScaleMoe.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp8BlockScalingGemm.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
fp8Op.cpp feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp8Op.h feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp8PerTensorScaleMoe.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp8PerTensorScalingTrtllmGenGemm.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp8Quantize.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
fusedTopkSoftmax.cpp feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
gatherTreeOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
groupRmsNormOp.cpp feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
logitsBitmaskOp.cpp Update (#2978) 2025-03-23 16:39:35 +08:00
loraOp.cpp added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455) 2025-04-17 12:48:27 +08:00
mambaConv1dOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
moeCommOp.cpp feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
moeOp.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
mtpOp.cpp feat: add relaxed acceptance for DS (#3865) 2025-05-01 21:50:36 +08:00
ncclCommunicatorOp.cpp Update TensorRT-LLM (#941) 2024-01-23 23:22:35 +08:00
ncclCommunicatorOp.h Update TensorRT-LLM (#787) 2024-01-02 17:54:32 +08:00
noAuxTcOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
parallelDecodeKVCacheUpdateOp.cpp Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
redrafterCurandOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
reducescatterOp.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
relativeAttentionBiasOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
selectiveScanOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
thUtils.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
thUtils.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
userbuffersFinalizeOp.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
userbuffersTensor.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
userbuffersTensor.h feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
weightOnlyQuantOp.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00