TensorRT-LLMs/cpp/tensorrt_llm/kernels/groupRmsNormKernels
Simeng Liu bb766eca0a
feat: Reduce branch overhead in groupRMSNorm kernels (#4067)
* Fix race condition with sm < 90 and avoid all threads in one warp writing to the same shared memory.

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-08 00:55:27 +08:00
CMakeLists.txt feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) 2025-05-02 13:25:30 +08:00
groupRmsNormKernels.cu feat: Reduce branch overhead in groupRMSNorm kernels (#4067) 2025-05-08 00:55:27 +08:00
groupRmsNormKernels.h feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) 2025-05-02 13:25:30 +08:00