TensorRT-LLM/cpp/tensorrt_llm/common
Simeng Liu 873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups in proportion to each input's last dimension, improving launch efficiency and reducing overhead.
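The grouped semantics described above can be sketched in plain Python. This is a hypothetical reference implementation for illustration only, not the CUDA kernel: each input is normalized independently over its last dimension by the root-mean-square of its elements, and all inputs must share the batch dimension.

```python
import math

def rms_norm(row, eps=1e-6):
    """Normalize one row by its root-mean-square over the last dimension."""
    rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x / rms for x in row]

def group_rms_norm(tensors, eps=1e-6):
    """Reference semantics of the fused kernel: normalize each input
    tensor independently; all inputs share the same batch dimension."""
    batch = len(tensors[0])
    assert all(len(t) == batch for t in tensors), "batch dims must match"
    return [[rms_norm(row, eps) for row in t] for t in tensors]

# Two "tensors" with batch size 2 and different last dimensions.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, -1.0], [2.0, -2.0]]
out_a, out_b = group_rms_norm([a, b])
```

The fused kernel computes the same per-tensor normalization in one launch, rather than one launch per input.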

This MR provides two implementations:
- GroupRMSNormKernel: optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: adds further optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve review comments and fix a typo in IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
..
assert.cpp Update TensorRT-LLM (#1725) 2024-06-04 20:26:32 +08:00
attentionOp.cpp fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. (#3862) 2025-04-30 14:27:38 +08:00
attentionOp.h fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. (#3862) 2025-04-30 14:27:38 +08:00
CMakeLists.txt Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
cublasMMWrapper.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
cublasMMWrapper.h Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
cublasVersionCheck.h Initial commit 2023-09-20 00:29:41 -07:00
cudaBf16Fallbacks.cuh Update TensorRT-LLM (20240116) (#891) 2024-01-16 20:03:11 +08:00
cudaBufferUtils.cuh Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
cudaDriverWrapper.cpp feat: add CGA reduction fmha kernels on Blackwell. (#3763) 2025-04-29 10:43:54 +08:00
cudaDriverWrapper.h feat: add CGA reduction fmha kernels on Blackwell. (#3763) 2025-04-29 10:43:54 +08:00
cudaFp8Utils.cu Add Llama 4 (#3302) 2025-04-09 03:35:21 +08:00
cudaProfilerUtils.cpp Update TensorRT-LLM (#1954) 2024-07-16 15:30:25 +08:00
cudaTypeUtils.cuh Update TensorRT-LLM (#2008) 2024-07-23 23:05:09 +08:00
customAllReduceUtils.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
envUtils.cpp cacheTransceiver buffer manager (#3798) 2025-04-27 11:48:15 +08:00
envUtils.h cacheTransceiver buffer manager (#3798) 2025-04-27 11:48:15 +08:00
jsonSerializeOptional.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
logger.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
mathUtils.h Update TensorRT-LLM (#2094) 2024-08-07 16:44:43 +08:00
memoryUtils.cu feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) 2025-05-02 13:25:30 +08:00
memoryUtils.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
nvtxUtils.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
opUtils.cpp Fix: Revert commit 25f9669 (#3832) 2025-04-24 14:03:20 +08:00
opUtils.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
quantTypeUtils.cuh Update TensorRT-LLM (#2008) 2024-07-23 23:05:09 +08:00
reduceKernelUtils.cuh Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
safetensors.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
safetensors.h Update TensorRT-LLM (#2110) 2024-08-13 22:34:33 +08:00
stlUtils.h Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
stringUtils.cpp chore: Stabilize ABI boundary for internal kernel library (#3117) 2025-04-11 15:07:50 +08:00
timestampUtils.cpp Update TensorRT-LLM (#1954) 2024-07-16 15:30:25 +08:00
timestampUtils.h Update TensorRT-LLM (#1954) 2024-07-16 15:30:25 +08:00
tllmException.cpp chore: Stabilize ABI boundary for internal kernel library (#3117) 2025-04-11 15:07:50 +08:00
workspace.h Update TensorRT-LLM (#2184) 2024-09-03 12:14:23 +02:00