TensorRT-LLM/tensorrt_llm/_torch
Simeng Liu 873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups in proportion to the last dimension of each input, which improves launch efficiency and reduces per-tensor launch overhead.
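For reference, here is a minimal unfused sketch of the computation the fused kernel performs, assuming a standard RMSNorm formulation; the helper name `group_rms_norm_ref` and the optional `weights`/`eps` arguments are illustrative and not the op's actual signature:
```python
import torch

def group_rms_norm_ref(inputs, weights=None, eps=1e-6):
    """Unfused reference: each input is RMS-normalized independently over its
    last dimension; all inputs must share the same batch (leading) dimension."""
    outputs = []
    for i, x in enumerate(inputs):
        x_f = x.float()  # accumulate in fp32 for numerical stability
        rms = torch.rsqrt(x_f.pow(2).mean(dim=-1, keepdim=True) + eps)
        y = x_f * rms
        if weights is not None:
            y = y * weights[i].float()  # optional per-input scale (gamma)
        outputs.append(y.to(x.dtype))
    return outputs

# Example: two inputs sharing a batch dimension but with different hidden sizes.
a = torch.randn(8, 4096, dtype=torch.float16)
b = torch.randn(8, 1024, dtype=torch.float16)
a_n, b_n = group_rms_norm_ref([a, b])
```
The fused kernel produces the same results in a single launch instead of one RMSNorm launch per tensor.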

This PR provides two implementations:
- GroupRMSNormKernel: optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future PR will implement heuristic-based kernel selection and expose a unified interface.
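As an illustration of what that heuristic dispatch could look like, a hedged sketch follows; the wrapper name, the threshold value, and the placeholder kernel functions are assumptions standing in for the real custom-op bindings:
```python
import torch

def _rms_norm(x, w, eps):
    # Placeholder math standing in for the fused CUDA kernels.
    rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x.float() * rms * w.float()).to(x.dtype)

def _small_batch_kernel(inputs, weights, eps):
    # Stand-in for the GroupRMSNormKernel custom op.
    return [_rms_norm(x, w, eps) for x, w in zip(inputs, weights)]

def _large_batch_kernel(inputs, weights, eps):
    # Stand-in for the GroupRMSNormKernelLargeBatch custom op.
    return [_rms_norm(x, w, eps) for x, w in zip(inputs, weights)]

LARGE_BATCH_THRESHOLD = 1024  # assumed cutoff; the real heuristic is not defined here

def group_rms_norm(inputs, weights, eps=1e-6):
    """Select a kernel variant based on the shared batch dimension."""
    batch_size = inputs[0].shape[0]
    if batch_size >= LARGE_BATCH_THRESHOLD:
        return _large_batch_kernel(inputs, weights, eps)
    return _small_batch_kernel(inputs, weights, eps)

# Usage: both inputs share batch size 16 but have different hidden sizes.
hidden = [torch.randn(16, 2048), torch.randn(16, 512)]
gammas = [torch.ones(2048), torch.ones(512)]
out_a, out_b = group_rms_norm(hidden, gammas)
```
The unified entry point keeps callers independent of which kernel variant is chosen, so the selection heuristic can be tuned later without API changes.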

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| attention_backend | feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) | 2025-05-02 13:25:30 +08:00 |
| auto_deploy | feat: [AutoDeploy] unfusing attention for native support (#3668) | 2025-05-02 09:06:49 +08:00 |
| compilation | Unify two versions of AllReduce custom op (#3032) | 2025-04-22 21:58:42 +08:00 |
| custom_ops | feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) | 2025-05-02 13:25:30 +08:00 |
| distributed | Fallback to NCCL for various patterns when input size is large. (#4009) | 2025-05-01 15:17:16 -07:00 |
| models | feat: LogitsProcessor in PyTorch backend (#3145) | 2025-05-01 14:15:30 -07:00 |
| modules | feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) | 2025-05-02 13:25:30 +08:00 |
| peft | add passing E2E LoRA flow (#3788) | 2025-04-23 18:38:06 +03:00 |
| pyexecutor | feat: LogitsProcessor in PyTorch backend (#3145) | 2025-05-01 14:15:30 -07:00 |
| speculative | feat: add relaxed acceptance for DS (#3865) | 2025-05-01 21:50:36 +08:00 |
| __init__.py | Update TensorRT-LLM (#2755) | 2025-02-11 03:01:00 +00:00 |
| autotuner.py | feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) | 2025-04-08 14:28:36 +08:00 |
| llm.py | test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references (#3069) | 2025-03-26 18:14:35 +08:00 |
| metadata.py | feat: no-cache attention in PyTorch workflow (#3085) | 2025-04-05 01:54:32 +08:00 |
| model_config.py | Fix fp8 kvcache (#3877) | 2025-04-29 10:31:10 +08:00 |
| pipeline_interface.py | chore: bump version to 0.19.0 (#3598) (#3841) | 2025-04-29 16:57:22 +08:00 |
| utils.py | refactor: (part1) Add contraints doc for fusedMoe module. (#3882) | 2025-04-29 22:23:02 +08:00 |