Mirror of https://github.com/NVIDIA/TensorRT-LLM.git
* feat: Add heuristic for GroupRMSNorm kernel selection.

  Implements a logistic regression model to dynamically select between:

  - GroupRMSNormBaseKernel: allocates warps proportional to the sum of dimensions (better SM occupancy in most cases)
  - GroupRMSNormLargeBatch: allocates warps proportional to the max dimension (better block scheduling in large-batch scenarios)

  The selection heuristic considers batch size, allocated warps, and scheduling efficiency on the current GPU architecture. Models for Compute Capability 9.x and 10.x are trained on nsys kernel runtime data. The default kernel selection is the base kernel. The Python operator group_rms_norm uses the heuristic by default; users can also explicitly select the base or large-batch kernel.

  Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address review comments.

  Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
| File |
|---|
| CMakeLists.txt |
| groupRmsNormKernels.cu |
| groupRmsNormKernels.h |
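
To make the selection logic described in the commit message concrete, here is a minimal host-side C++ sketch of a logistic-regression heuristic over batch size, allocated warps, and scheduling efficiency, with separate models for SM 9.x and 10.x and a fallback to the base kernel elsewhere. All names, coefficients, and the exact feature set are illustrative assumptions, not the implementation shipped in groupRmsNormKernels.cu.

```cpp
// Illustrative sketch only: struct names, function name, and coefficients are
// hypothetical; the real trained weights live in the kernel sources above.
#include <array>
#include <cmath>

enum class GroupRMSNormKernel { Base, LargeBatch };

// Per-architecture logistic-regression model (weights are made-up placeholders).
struct HeuristicModel {
    std::array<float, 3> weights; // [batch_size, allocated_warps, scheduling_efficiency]
    float bias;
};

// Returns LargeBatch when sigmoid(w . x + b) > 0.5, otherwise Base.
inline GroupRMSNormKernel selectGroupRmsNormKernel(
    int batchSize, int allocatedWarps, float schedulingEfficiency, int smMajor)
{
    // Hypothetical models for Compute Capability 9.x and 10.x; any other
    // architecture falls back to the base kernel, matching the stated default.
    HeuristicModel const sm90{{0.012f, -0.003f, 1.5f}, -2.0f};
    HeuristicModel const sm100{{0.010f, -0.002f, 1.8f}, -2.2f};

    HeuristicModel const* model = nullptr;
    if (smMajor == 9)
        model = &sm90;
    else if (smMajor == 10)
        model = &sm100;
    else
        return GroupRMSNormKernel::Base;

    float const z = model->weights[0] * static_cast<float>(batchSize)
        + model->weights[1] * static_cast<float>(allocatedWarps)
        + model->weights[2] * schedulingEfficiency + model->bias;
    float const p = 1.0f / (1.0f + std::exp(-z));
    return p > 0.5f ? GroupRMSNormKernel::LargeBatch : GroupRMSNormKernel::Base;
}
```

A caller would evaluate this once per launch configuration and dispatch to the chosen kernel; exposing an explicit override (heuristic, base, or large-batch) alongside the heuristic mirrors the user-facing choice described in the commit message.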