TensorRT-LLMs/tensorrt_llm
Simeng Liu 286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.
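
The selection logic described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the feature set, the per-thread vector width, and every weight below are hypothetical placeholders, and the names (`LogisticModel`, `select_kernel`, `warps_for`) are invented for the example. The actual coefficients come from nsys measurements that are not shown here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

WARP_SIZE = 32
ELEMS_PER_THREAD = 8  # assumed vector width per thread (hypothetical)


def warps_for(dim: int) -> int:
    """Warps needed to cover one input of hidden size `dim`."""
    return math.ceil(dim / (WARP_SIZE * ELEMS_PER_THREAD))


def base_kernel_warps(dims: Sequence[int]) -> int:
    """Base kernel: warps proportional to the sum of the dimensions."""
    return sum(warps_for(d) for d in dims)


def large_batch_warps(dims: Sequence[int]) -> int:
    """Large-batch kernel: warps proportional to the max dimension."""
    return max(warps_for(d) for d in dims)


@dataclass(frozen=True)
class LogisticModel:
    """Hypothetical per-architecture logistic regression model."""
    weights: Sequence[float]
    bias: float

    def probability(self, features: Sequence[float]) -> float:
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))


def select_kernel(batch_size: int,
                  dims: Sequence[int],
                  model: Optional[LogisticModel] = None,
                  threshold: float = 0.5) -> str:
    """Return "base" or "large_batch".

    With no model for the current architecture, fall back to the base
    kernel, mirroring the default described in the commit message.
    """
    if model is None:
        return "base"
    # Assumed features: log2 of batch size and the two warp counts.
    features = (math.log2(batch_size),
                base_kernel_warps(dims),
                large_batch_warps(dims))
    if model.probability(features) > threshold:
        return "large_batch"
    return "base"


# Placeholder weights for illustration only; not trained values.
example_model = LogisticModel(weights=(1.0, 0.0, 0.0), bias=-10.0)
small = select_kernel(4, [512, 256], example_model)
large = select_kernel(4096, [512, 256], example_model)
```

With these placeholder weights the decision depends only on batch size, so the small batch resolves to the base kernel and the large batch to the large-batch kernel. A trained model would also weigh the warp counts and scheduling efficiency.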

The Python operator group_rms_norm uses the heuristic by default.
Users can also explicitly select the base or large-batch kernel.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.
2025-05-13 08:52:53 +08:00
Name | Last commit | Date
_torch | feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) | 2025-05-13 08:52:53 +08:00
auto_parallel | fix: Fix NVLink version decoding. (#3996) | 2025-05-06 13:56:50 +08:00
bench | [TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. (#3875) | 2025-05-09 23:20:52 +08:00
commands | feat: add kv cache aware router (#3831) | 2025-05-12 07:23:57 -04:00
evaluate | [TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946) | 2025-05-08 19:35:23 +08:00
executor | chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732) | 2025-05-07 13:20:25 +08:00
inputs | [TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API (#3985) | 2025-05-07 11:06:23 +08:00
layers | Support RingAttention in the BertAttention plugin and the DiT model (#3661) | 2025-05-09 08:06:54 +08:00
llmapi | feat: add kv cache aware router (#3831) | 2025-05-12 07:23:57 -04:00
models | Support RingAttention in the BertAttention plugin and the DiT model (#3661) | 2025-05-09 08:06:54 +08:00
plugin | fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988) | 2025-05-05 11:33:25 +08:00
quantization | chore: bump version to 0.19.0 (#3598) (#3841) | 2025-04-29 16:57:22 +08:00
runtime | Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) | 2025-05-12 22:32:29 +02:00
scaffolding | [TRTLLM-4911] feat(scaffolding): make sampling_params only setable by controller (#4151) | 2025-05-12 15:29:09 +08:00
serve | feat: add kv cache aware router (#3831) | 2025-05-12 07:23:57 -04:00
tools | test: Fix breaking Phi3 multimodal tests (#3544) | 2025-04-15 08:02:34 +08:00
__init__.py | fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928) | 2025-04-29 11:26:13 +08:00
_common.py | Update (#2978) | 2025-03-23 16:39:35 +08:00
_dlpack_utils.py | feat: Add MNNVL MoE A2A support (#3504) | 2025-04-25 17:29:08 +08:00
_ipc_utils.py | fix: Proper error bubbling for PyExecutor (#3321) | 2025-04-15 14:49:46 +08:00
_mnnvl_utils.py | feat: Add MNNVL MoE A2A support (#3504) | 2025-04-25 17:29:08 +08:00
_utils.py | [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) | 2025-05-09 11:04:01 +08:00
builder.py | chore: remove usernames from comments (#3291) | 2025-04-05 13:44:28 +08:00
disaggregated_params.py | Update TensorRT-LLM (#2936) | 2025-03-18 21:25:19 +08:00
functional.py | Support RingAttention in the BertAttention plugin and the DiT model (#3661) | 2025-05-09 08:06:54 +08:00
graph_rewriting.py | Update TensorRT-LLM (#2755) | 2025-02-11 03:01:00 +00:00
logger.py | Update TensorRT-LLM (#2873) | 2025-03-11 21:13:42 +08:00
lora_manager.py | feat: support multi lora adapters and TP (#3885) | 2025-05-08 23:45:45 +08:00
mapping.py | Add smart router for moe (#3641) | 2025-04-23 12:21:59 +08:00
module.py | Update (#2978) | 2025-03-23 16:39:35 +08:00
network.py | chore: remove usernames from comments (#3291) | 2025-04-05 13:44:28 +08:00
parameter.py | Update TensorRT-LLM (#2873) | 2025-03-11 21:13:42 +08:00
profiler.py | test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483) | 2025-04-22 07:38:16 +08:00
prompt_adapter_manager.py | Update TensorRT-LLM (#2333) | 2024-10-15 15:28:40 +08:00
python_plugin.py | Update TensorRT-LLM (#2755) | 2025-02-11 03:01:00 +00:00
sampling_params.py | feat: Support the Structural Tag in guided decoding (#4066) | 2025-05-12 17:24:50 +08:00
top_model_mixin.py | Update TensorRT-LLM (#2053) | 2024-07-30 21:25:01 +08:00
version.py | chore: bump version to 0.20.0rc2 (#3949) | 2025-04-30 11:44:43 +08:00