TensorRT-LLM/cpp/tensorrt_llm/thop
Simeng Liu 286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture (see the sketch below). The models
for Compute Capability 9.x and 10.x are trained on nsys kernel runtime data.
The base kernel is the default selection.
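
A minimal sketch of the selection logic described above, assuming a logistic
regression over batch size and the two warp allocations. The allocation
formulas, feature set, weights, and bias are illustrative placeholders, not
the trained models shipped for CC 9.x/10.x:

    import math

    WARP_SIZE = 32  # elements per warp; an assumption for this sketch

    def allocated_warps(dims: list[int], batch_size: int) -> tuple[int, int]:
        # Base kernel: warps proportional to the sum of the group dimensions.
        base = batch_size * math.ceil(sum(dims) / WARP_SIZE)
        # Large-batch kernel: warps proportional to the max dimension.
        large = batch_size * math.ceil(max(dims) / WARP_SIZE)
        return base, large

    def pick_large_batch_kernel(dims: list[int], batch_size: int,
                                num_sms: int,
                                weights=(0.01, -0.05, 0.05),
                                bias=-1.0) -> bool:
        base, large = allocated_warps(dims, batch_size)
        # Normalize warp counts by SM count as a proxy for scheduling
        # efficiency on the current GPU.
        feats = (batch_size, base / num_sms, large / num_sms)
        logit = bias + sum(w * f for w, f in zip(weights, feats))
        # sigmoid(logit) > 0.5 iff logit > 0; the base kernel stays the
        # default unless the model predicts a large-batch win.
        return logit > 0.0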

The Python operator group_rms_norm uses the heuristic by default.
Users can also explicitly select the base or large-batch kernel.
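
A hypothetical call pattern for the operator; the registration namespace,
op name, and signature below are assumptions for illustration, not the
confirmed TensorRT-LLM API:

    import torch

    # Two tensors normalized as one group, e.g. Q and K after projection.
    a = torch.randn(128, 4096, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(128, 1024, dtype=torch.bfloat16, device="cuda")
    wa = torch.ones(4096, dtype=torch.bfloat16, device="cuda")
    wb = torch.ones(1024, dtype=torch.bfloat16, device="cuda")

    # Heuristic kernel selection (the default). Dedicated base and
    # large-batch entry points would pin a specific kernel, assuming the
    # operator exposes that choice as described above.
    out_a, out_b = torch.ops.trtllm.group_rms_norm([a, b], [wa, wb], 1e-5)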

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
allgatherOp.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
allreduceOp.cpp feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
attentionOp.cpp [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
CMakeLists.txt chore: Clean up the legacy DeepseekAllreduceFusionOp. (#4081) 2025-05-09 10:20:41 +08:00
convertSpecDecodingMaskToPackedMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
cublasScaledMM.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
cutlassScaledMM.cpp feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
dynamicDecodeOp.cpp Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338) 2025-04-08 23:51:27 +08:00
dynamicDecodeOp.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
fmhaPackMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
fp4BatchedQuantize.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
fp4BlockScaleMoe.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp4Gemm.cpp feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
fp4GemmTrtllmGen.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp4Op.cpp feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
fp4Quantize.cpp feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp4Quantize.h feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp8BatchedGemmTrtllmGen.cpp feat: Adding FP8 BMM from Codegen (#3541) 2025-04-16 10:37:15 +02:00
fp8BlockScaleMoe.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp8BlockScalingGemm.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
fp8Op.cpp feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp8Op.h feat: Fallback to NCCL for various patterns when input size is large. (#4080) 2025-05-08 11:13:13 -07:00
fp8PerTensorScaleMoe.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp8PerTensorScalingTrtllmGenGemm.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
fp8Quantize.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
fusedTopkSoftmax.cpp feat: support add internal cutlass kernels as subproject (#3658) 2025-05-06 11:35:07 +08:00
gatherTreeOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
groupRmsNormOp.cpp feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
logitsBitmaskOp.cpp Update (#2978) 2025-03-23 16:39:35 +08:00
loraOp.cpp added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455) 2025-04-17 12:48:27 +08:00
mambaConv1dOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
moeCommOp.cpp feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
moeOp.cpp Cherry-pick trtllm-gen from feat/llama4 to main (#4086) 2025-05-08 14:13:01 -07:00
mtpOp.cpp feat: add relaxed acceptance for DS (#3865) 2025-05-01 21:50:36 +08:00
ncclCommunicatorOp.cpp Update TensorRT-LLM (#941) 2024-01-23 23:22:35 +08:00
ncclCommunicatorOp.h Update TensorRT-LLM (#787) 2024-01-02 17:54:32 +08:00
noAuxTcOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
parallelDecodeKVCacheUpdateOp.cpp Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
redrafterCurandOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
reducescatterOp.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
relativeAttentionBiasOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
selectiveScanOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
thUtils.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
thUtils.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
userbuffersFinalizeOp.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
userbuffersTensor.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
userbuffersTensor.h feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
weightOnlyQuantOp.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00