TensorRT-LLMs/tensorrt_llm
Simeng Liu 873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work dynamically, assigning each input a number of warp groups proportional to its last dimension, so one launch handles every input and launch overhead is reduced.
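The behavior described above can be illustrated with a minimal pure-Python sketch. It is not the CUDA kernel from this MR. The names `group_rms_norm_reference` and `assign_warps` are hypothetical, the learned RMSNorm weight is left out, and the assumptions are that each input is normalized independently over its last dimension and that warps are split with a largest-remainder rule. The real kernel may allocate warp groups differently.

```python
import math


def rms_norm(rows, eps=1e-6):
    """Divide each row by its root mean square over the last dimension."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
        out.append([x / rms for x in row])
    return out


def group_rms_norm_reference(inputs, eps=1e-6):
    """Hypothetical reference: normalize every input independently.

    `inputs` is a list of 2-D nested lists shaped [batch, last_dim];
    last_dim may differ per input, batch may not.
    """
    batch = len(inputs[0])
    if any(len(t) != batch for t in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    return [rms_norm(t, eps) for t in inputs]


def assign_warps(last_dims, total_warps):
    """Split total_warps across inputs in proportion to last_dims.

    Assumption of this sketch: largest-remainder rounding, and every
    input must end up with at least one warp.
    """
    total = sum(last_dims)
    warps = [total_warps * d // total for d in last_dims]
    remainders = [total_warps * d % total for d in last_dims]
    leftover = total_warps - sum(warps)
    # Give the leftover warps to the inputs with the largest remainders.
    order = sorted(range(len(last_dims)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        warps[i] += 1
    if min(warps) < 1:
        raise ValueError("not enough warps to give every input at least one")
    return warps
```

For example, `assign_warps([512, 1536], 8)` returns `[2, 6]`: the input with three times the hidden size receives three times as many warps.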

This MR provides two implementations:
- GroupRMSNormKernel: optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
_torch feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) 2025-05-02 13:25:30 +08:00
auto_parallel chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
bench [feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. (#3776) 2025-05-01 09:42:46 +08:00
commands Add smart router for moe (#3641) 2025-04-23 12:21:59 +08:00
evaluate [TRTLLM-4763][test] Accuracy test improvement (Part 3.6): Deprecate mmlu_llmapi.py (#3802) 2025-04-23 23:05:13 +08:00
executor feat: LogitsProcessor in PyTorch backend (#3145) 2025-05-01 14:15:30 -07:00
inputs feat: llama4 input processor (#3383) 2025-04-25 16:47:14 -07:00
layers feat: Add FP8 support for SM 120 (#3248) 2025-04-14 16:05:41 -07:00
llmapi feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388) 2025-05-01 12:47:14 -04:00
models chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
plugin chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
quantization chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
runtime feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380) 2025-04-21 14:31:01 +08:00
scaffolding feat: fix erros on scaffolding README (#3899) 2025-04-29 10:15:06 +08:00
serve feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388) 2025-05-01 12:47:14 -04:00
tools test: Fix breaking Phi3 multimodal tests (#3544) 2025-04-15 08:02:34 +08:00
__init__.py fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928) 2025-04-29 11:26:13 +08:00
_common.py Update (#2978) 2025-03-23 16:39:35 +08:00
_dlpack_utils.py feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
_ipc_utils.py fix: Proper error bubbling for PyExecutor (#3321) 2025-04-15 14:49:46 +08:00
_mnnvl_utils.py feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
_utils.py chore: Remove duplicated get_sm_version. (#3935) 2025-04-30 11:43:53 +08:00
builder.py chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
disaggregated_params.py Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
functional.py Unify two versions of AllReduce custom op (#3032) 2025-04-22 21:58:42 +08:00
graph_rewriting.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
logger.py Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
lora_manager.py add passing E2E LoRA flow (#3788) 2025-04-23 18:38:06 +03:00
mapping.py Add smart router for moe (#3641) 2025-04-23 12:21:59 +08:00
module.py Update (#2978) 2025-03-23 16:39:35 +08:00
network.py chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
parameter.py Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
profiler.py test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483) 2025-04-22 07:38:16 +08:00
prompt_adapter_manager.py Update TensorRT-LLM (#2333) 2024-10-15 15:28:40 +08:00
python_plugin.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
sampling_params.py feat: LogitsProcessor in PyTorch backend (#3145) 2025-05-01 14:15:30 -07:00
top_model_mixin.py Update TensorRT-LLM (#2053) 2024-07-30 21:25:01 +08:00
version.py chore: bump version to 0.20.0rc2 (#3949) 2025-04-30 11:44:43 +08:00