TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00


Simeng Liu 873c7532fd feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 )	2025-05-02 13:25:30 +08:00

* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

  Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:

  ```python
  input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
  ```

  All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

  This MR provides two implementations:

  - GroupRMSNormKernel: Optimized for small-to-medium batch sizes
  - GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

  Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.

  Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

  Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
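
The commit message above describes normalizing several tensors in one call, with every input sharing the same batch dimension. As a rough illustration of those semantics only, here is a minimal pure-Python sketch. It assumes each input is RMS-normalized independently over its own last dimension; the names `group_rms_norm_reference` and `rms_norm_row`, the optional `weights` argument, and the `eps` default are hypothetical and are not taken from the TensorRT-LLM source. It does not model the CUDA kernels or their warp-group partitioning.

```python
import math


def rms_norm_row(row, weight=None, eps=1e-6):
    """RMS-normalize one row (a list of floats) over its full length."""
    mean_sq = sum(x * x for x in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        return [x * scale for x in row]
    # Optional per-element gain, as in a typical RMSNorm layer (assumed here).
    return [x * scale * w for x, w in zip(row, weight)]


def group_rms_norm_reference(inputs, weights=None, eps=1e-6):
    """Normalize several 2-D inputs (lists of rows) in a single call.

    Hypothetical reference: each input is normalized on its own last
    dimension; all inputs must have the same number of rows (batch size).
    """
    if not inputs:
        return []
    batch = len(inputs[0])
    if any(len(tensor) != batch for tensor in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    if weights is None:
        weights = [None] * len(inputs)
    return [
        [rms_norm_row(row, weight, eps) for row in tensor]
        for tensor, weight in zip(inputs, weights)
    ]


# Two inputs with batch size 1 but different last dimensions (2 and 3).
input_a = [[3.0, 4.0]]
input_b = [[1.0, 1.0, 1.0]]
out_a, out_b = group_rms_norm_reference([input_a, input_b])
```

The last dimensions may differ between inputs, which matches the commit's note that work is divided in proportion to each input's last dimension; only the batch size has to agree.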
_torch	feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 )	2025-05-02 13:25:30 +08:00
auto_parallel	chore: remove usernames from comments (#3291 )	2025-04-05 13:44:28 +08:00
bench	[feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. (#3776 )	2025-05-01 09:42:46 +08:00
commands	Add smart router for moe (#3641 )	2025-04-23 12:21:59 +08:00
evaluate	[TRTLLM-4763][test] Accuracy test improvement (Part 3.6): Deprecate mmlu_llmapi.py (#3802 )	2025-04-23 23:05:13 +08:00
executor	feat: LogitsProcessor in PyTorch backend (#3145 )	2025-05-01 14:15:30 -07:00
inputs	feat: llama4 input processor (#3383 )	2025-04-25 16:47:14 -07:00
layers	feat: Add FP8 support for SM 120 (#3248 )	2025-04-14 16:05:41 -07:00
llmapi	feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388 )	2025-05-01 12:47:14 -04:00
models	chore: bump version to 0.19.0 (#3598 ) (#3841 )	2025-04-29 16:57:22 +08:00
plugin	chore: bump version to 0.19.0 (#3598 ) (#3841 )	2025-04-29 16:57:22 +08:00
quantization	chore: bump version to 0.19.0 (#3598 ) (#3841 )	2025-04-29 16:57:22 +08:00
runtime	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380 )	2025-04-21 14:31:01 +08:00
scaffolding	feat: fix erros on scaffolding README (#3899 )	2025-04-29 10:15:06 +08:00
serve	feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388 )	2025-05-01 12:47:14 -04:00
tools	test: Fix breaking Phi3 multimodal tests (#3544 )	2025-04-15 08:02:34 +08:00
__init__.py	fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928 )	2025-04-29 11:26:13 +08:00
_common.py	Update (#2978 )	2025-03-23 16:39:35 +08:00
_dlpack_utils.py	feat: Add MNNVL MoE A2A support (#3504 )	2025-04-25 17:29:08 +08:00
_ipc_utils.py	fix: Proper error bubbling for PyExecutor (#3321 )	2025-04-15 14:49:46 +08:00
_mnnvl_utils.py	feat: Add MNNVL MoE A2A support (#3504 )	2025-04-25 17:29:08 +08:00
_utils.py	chore: Remove duplicated get_sm_version. (#3935 )	2025-04-30 11:43:53 +08:00
builder.py	chore: remove usernames from comments (#3291 )	2025-04-05 13:44:28 +08:00
disaggregated_params.py	Update TensorRT-LLM (#2936 )	2025-03-18 21:25:19 +08:00
functional.py	Unify two versions of AllReduce custom op (#3032 )	2025-04-22 21:58:42 +08:00
graph_rewriting.py	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
logger.py	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
lora_manager.py	add passing E2E LoRA flow (#3788 )	2025-04-23 18:38:06 +03:00
mapping.py	Add smart router for moe (#3641 )	2025-04-23 12:21:59 +08:00
module.py	Update (#2978 )	2025-03-23 16:39:35 +08:00
network.py	chore: remove usernames from comments (#3291 )	2025-04-05 13:44:28 +08:00
parameter.py	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
profiler.py	test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483 )	2025-04-22 07:38:16 +08:00
prompt_adapter_manager.py	Update TensorRT-LLM (#2333 )	2024-10-15 15:28:40 +08:00
python_plugin.py	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
sampling_params.py	feat: LogitsProcessor in PyTorch backend (#3145 )	2025-05-01 14:15:30 -07:00
top_model_mixin.py	Update TensorRT-LLM (#2053 )	2024-07-30 21:25:01 +08:00
version.py	chore: bump version to 0.20.0rc2 (#3949 )	2025-04-30 11:44:43 +08:00