* feat: Add heuristic for GroupRMSNorm kernel selection.
Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
(better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
(better block scheduling in large batch scenarios)
Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.
The python operator group_rms_norm will use the heuristic by default.
Users can also explicitly select the base or large batch kernel.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Address the comments.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
- GroupRMSNormKernel: Optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
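For illustration, a pure-Python reference of the semantics described above, plus one possible way to split warps in proportion to each input's last dimension. The helper names, the "at least one warp per input" rule and the leftover handling are assumptions made for this sketch; the CUDA kernels may partition work differently:

```python
import math


def rms_norm_row(row, weight=None, eps=1e-6):
    """Normalize one row by its root mean square, with an optional per-element weight."""
    scale = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
    if weight is None:
        return [v * scale for v in row]
    return [v * scale * w for v, w in zip(row, weight)]


def group_rms_norm_reference(inputs, weights=None, eps=1e-6):
    """inputs: list of 2D lists [batch][hidden_i]; all must share the batch size."""
    batch = len(inputs[0])
    if any(len(x) != batch for x in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    if weights is None:
        weights = [None] * len(inputs)
    return [[rms_norm_row(row, w, eps) for row in x]
            for x, w in zip(inputs, weights)]


def split_warps(last_dims, total_warps):
    """Give every input one warp, then share the rest proportionally to its width."""
    n = len(last_dims)
    if total_warps < n:
        raise ValueError("need at least one warp per input")
    remaining = total_warps - n
    total = sum(last_dims)
    shares = [1 + d * remaining // total for d in last_dims]
    # Integer division may leave a few warps unassigned; widest inputs get them.
    leftover = total_warps - sum(shares)
    widest_first = sorted(range(n), key=lambda i: last_dims[i], reverse=True)
    for k in range(leftover):
        shares[widest_first[k % n]] += 1
    return shares
```

With inputs of width 512 and 1536 and 8 warps per block, `split_warps([512, 1536], 8)` assigns 2 and 6 warps, matching the 1:3 ratio of the last dimensions.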
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* add qwen3 dense model pytorch backend support, initial commit
solve the results error issue
add qwen3 moe model pytorch backend support
reformat the code
* perf - use flash_infer rmsnorm for qwen3
* feat - support qwen3 moe rmsnorm
* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B.
* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B. -- Forgot to update all modifications.
* fix bugs of running qwen3 public models and fp8 models
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bugs due to rebase
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bugs captured by pre-commit
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bug of attention
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Keddy Jin <jin.gq@aliyun.com>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Co-authored-by: shao <shao@nvidia.com>
* add parallel_q_b_proj_and_concat
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
* code cleanup
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
* one gemm/concat and then split the latent_cache and pass them separately to context/gen
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
---------
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
* fix bug where a CUDA stream created as a default parameter value is initialized at import time
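The underlying Python behavior can be reproduced without a GPU. In this sketch `FakeStream` is a hypothetical stand-in for an expensive object such as `torch.cuda.Stream`, and `run_eager` / `run_lazy` are illustrative names, not functions from this change:

```python
class FakeStream:
    """Stand-in for a resource that should not be created at import time."""
    created = 0

    def __init__(self):
        FakeStream.created += 1


# Problematic: the default value is evaluated once, when the `def` statement
# runs (that is, when the module is imported), and is shared by all calls.
def run_eager(stream=FakeStream()):
    return stream


# Safer: use None as a sentinel and create the object only when it is needed.
def run_lazy(stream=None):
    if stream is None:
        stream = FakeStream()
    return stream
```

Defining `run_eager` already constructs one `FakeStream`, even if the function is never called, which is the import-time initialization the fix avoids.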
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* add torch.cuda.Stream() for the leader node
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix pre-commit issue
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* add MNNVL memory mapping support
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add more MPI environment for trtllm-llmapi-launch
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MoE communication and prepare kernels
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MNNVL AlltoAll support for DeepSeekV3
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add output dump for throughput benchmark
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* support dynamic kernel launch grid
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments #2
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add passing E2E LoRA flow
Signed-off-by: Shahar Mor <smor@nvidia.com>
* add experimental feature
Signed-off-by: Shahar Mor <smor@nvidia.com>
* fix llma_args definition
Signed-off-by: Shahar Mor <smor@nvidia.com>
* manually decreased the max LoRAs size to address OOM
Signed-off-by: Shahar Mor <smor@nvidia.com>
---------
Signed-off-by: Shahar Mor <smor@nvidia.com>
* added files for nemotron-h
Signed-off-by: Luis Vega <lvega@nvidia.com>
* use try/except to import RMSNorm
Signed-off-by: Luis Vega <lvega@nvidia.com>
---------
Signed-off-by: Luis Vega <lvega@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* apply a tentative fix to moe bypass kernel update
* Pass none to disable final stage in moe
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>
---------
Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
* One of the tactics is not supported during dispatch.
* final_hidden_states should be unpacked if it is not min_latency_mode.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
are now managed by a global allocator. The allocator dynamically
assigns a free UB buffer or allocates a new buffer for a torch tensor, which
makes UserBuffers easier to use.
* In the common use case, the UserBuffers are allocated correctly during
the warm-up stage, so there is no dynamic allocation during inference.
* The UB fusion pattern is rewritten using the new UB allocator. It contains
the following passes:
1. Fuse Quant with allreduce, replace with the UB impl, and insert a
copy_to_userbuffers. The normal allreduce currently does not
support FP8 quant, so this needs to be done in the UB pass.
2. Convert all supported allreduce ops to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so the op
writes directly to the userbuffer.
4. Remove the userbuffers finalize if the output is connected to another UB
allreduce.
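The allocator behavior described above (hand out a free buffer when one fits, otherwise allocate a new one) can be sketched with a toy pool. `BufferPool` and its methods are hypothetical and use `bytearray` in place of real UB memory; this is not the UB allocator's actual interface:

```python
class BufferPool:
    """Toy pool: reuse a free buffer that is large enough, otherwise allocate."""

    def __init__(self):
        self._free = []        # buffers currently available for reuse
        self.allocations = 0   # number of fresh allocations performed

    def acquire(self, nbytes):
        for i, buf in enumerate(self._free):
            if len(buf) >= nbytes:
                return self._free.pop(i)
        self.allocations += 1
        return bytearray(nbytes)

    def release(self, buf):
        self._free.append(buf)


pool = BufferPool()

# Warm-up: the first request has to allocate.
warm = pool.acquire(1024)
pool.release(warm)

# Steady state: a request that fits is served from the pool, no new allocation.
reused = pool.acquire(512)
```

After the warm-up request, `pool.allocations` stays at 1 for any later request that fits into the released buffer, which is the "no dynamic allocation during inference" property in miniature.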
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
* Several optimizations and fixes on the Autotuner.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on current linear for nvFP4 data type.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use the do_preparation keyword to select which part should be executed before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relevant tests.
* Predefine the bucketizing strategy for fused_moe
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Fix and revise according to reviewers' suggestions.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Use lru_cache for inference perf optimization.
* Revert gen_custom_cache_key feature
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Replace runner with runner id to achieve a serializable cache.
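A minimal sketch of why keying the cache by a runner id rather than a runner object helps serialization: ids are plain values that survive a JSON round trip, while arbitrary Python objects do not. `TacticCache`, its key layout and the example names are hypothetical and do not reflect the AutoTuner's real cache format:

```python
import json


class TacticCache:
    """Toy tactic cache keyed by (op name, runner id, shape bucket)."""

    def __init__(self):
        self._best = {}

    @staticmethod
    def _key(op_name, runner_id, shape_bucket):
        # Only plain values go into the key, so it can be written to disk.
        return json.dumps([op_name, runner_id, list(shape_bucket)])

    def put(self, op_name, runner_id, shape_bucket, tactic):
        self._best[self._key(op_name, runner_id, shape_bucket)] = tactic

    def get(self, op_name, runner_id, shape_bucket, default=None):
        return self._best.get(self._key(op_name, runner_id, shape_bucket), default)

    def dumps(self):
        return json.dumps(self._best, sort_keys=True)

    @classmethod
    def loads(cls, text):
        cache = cls()
        cache._best = json.loads(text)
        return cache


cache = TacticCache()
cache.put("fused_moe", runner_id=0, shape_bucket=(128, 4096), tactic=3)
restored = TacticCache.loads(cache.dumps())
```

Here `restored.get("fused_moe", 0, (128, 4096))` returns the tactic stored before serialization, and unknown keys fall back to the given default.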
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Code cleanup and minor fixes.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Move all tunable runners and custom ops into torch_custom_ops.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to suit it.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* fp8 kv + bf16 ctx MLA + fp8 gen MLA
Use BF16 for context MLA.
mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together.
Allow mSM==90 for mFP8GenerationMLA==true.
For FMHA, dataTypeKv should be FP8.
For FP8 MLA generation, the output is still in BF16.
Refine debug info for FMHA kernel metadata.
Use inputType, outputType, SM together to hash kernel list.
Add FP8 MLA generation FMHA kernel.
Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel.
Separate the implementation of fused_multihead_attention_v2.h to CPP and print some debug info if checkIfKernelExist fails.
Refine debug info in fused_multihead_attention_v2.cpp
Correct FP8 MLA metadata.
New kernel provided by Yuxin, which outputs BF16.
smem size is not set correctly, which will lead to illegal mem access.
Yuxin fixed the error in the FMHA MLA kernel: previously the BF16 output wasn't written correctly; some parts were written repeatedly while others were left untouched.
There are two bmm1 scales that should be set correctly.
New kernel generated by Yuxin.
Modifications to common/attentionOp for FP8 MLA on Hopper using FMHA.
Not necessary. If mFP8GenerationMLA, is_fp8_out is false, so mFP8ContextFMHA is false.
Skip a check in fmhaDispatcher.
Modifications in fmhaRunner:
- Debug dump.
- if (!isFP8GenerationMLA) skips a lot of flag setting.
- TMA descriptor modification for qo (by Yuxin).
Cleanup debug output.
Clean up o tma descriptor modifications.
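Two of the notes above lend themselves to a small illustration: looking up kernels by input type, output type and SM together, and rejecting the flag combination that should not be enabled at the same time. The registry contents, kernel names and function names below are hypothetical, written in Python for brevity rather than taken from the C++ sources:

```python
# Hypothetical registry keyed by (input_type, output_type, sm).
KERNELS = {
    ("fp8", "bf16", 90): "mla_generation_fp8_in_bf16_out_sm90",
    ("bf16", "bf16", 90): "mla_context_bf16_sm90",
}


def find_kernel(input_type, output_type, sm):
    """Return the kernel for this combination, with debug info on a miss."""
    key = (input_type, output_type, sm)
    kernel = KERNELS.get(key)
    if kernel is None:
        known = ", ".join(str(k) for k in sorted(KERNELS))
        raise LookupError(f"no kernel for {key}; registered keys: {known}")
    return kernel


def check_flags(fp8_generation_mla, fp8_context_fmha):
    """FP8 generation MLA and FP8 context FMHA must not be enabled together."""
    if fp8_generation_mla and fp8_context_fmha:
        raise ValueError("FP8 generation MLA and FP8 context FMHA are mutually exclusive")
```

Including the output type in the key is what lets an FP8-input kernel that writes BF16 coexist with other kernels for the same input type and SM.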
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Resolve conflicts.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Apply the patch of FP8 FlashMLA and resolve conflicts.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compilation error.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compile error.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* pick blackwell support
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* Add copyright notice to fused_multihead_attention_v2.cpp.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Add license.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Add missing license.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Exclude building flashMLA kernels under sm90.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Revert "Exclude building flashMLA kernels under sm90."
This reverts commit f0c859d459.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Use macro to skip compiling FlashMLA for non sm90 targets.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
---------
Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: Dylan Chen <ziqingc@nvidia.com>
Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>