TensorRT-LLMs/tensorrt_llm/_torch
Simeng Liu 286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained base on nsys kernel runtime data.
The default kernel selection is the base kernel.

The python operator group_rms_norm will use the heuristic by default.
User can pick to use the base or large batch kernels as well.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
..
attention_backend fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227) 2025-05-12 23:25:54 +08:00
auto_deploy [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake (#4199) 2025-05-12 11:11:40 -04:00
compilation [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
custom_ops feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
distributed chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse (#4169) 2025-05-09 12:51:02 +08:00
models refactor: Allow models to override apply_qk_norm. (#4078) 2025-05-12 19:38:24 +08:00
modules feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
peft feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
pyexecutor fix: reshape token_ids for lp in torch backend (#4239) 2025-05-13 08:43:47 +08:00
speculative [fix] Fix relaxed acceptance to support enabling it in context phase (#4126) 2025-05-09 14:11:14 +08:00
__init__.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
autotuner.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
llm.py test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references (#3069) 2025-03-26 18:14:35 +08:00
metadata.py feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
model_config.py feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
pipeline_interface.py chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
utils.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00