* Replace the sanity test for Nemotron-H with a correctness test
* Add prefill+decode reference logprobs from the initial implementation, plus a batched forward test
* Add a test that decode matches prefill: compare decode outputs against prefilling all the decoded tokens in one pass (see the sketch below)
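A minimal sketch of the decode-vs-prefill check; `forward_logprobs` and `decode_logprobs` are hypothetical stand-ins for the actual test helpers:
```python
import torch

def assert_decode_matches_prefill(model, prompt_ids, decoded_ids, atol=1e-3):
    # Prefill path: a single forward pass over prompt + decoded tokens;
    # the logprob for decoded token i lives at position prompt_len + i - 1.
    full = torch.cat([prompt_ids, decoded_ids])
    prefill_lp = model.forward_logprobs(full)[len(prompt_ids) - 1 : -1]
    # Decode path: feed the decoded tokens one at a time through the
    # KV-cached decode loop.
    decode_lp = model.decode_logprobs(prompt_ids, decoded_ids)
    torch.testing.assert_close(decode_lp, prefill_lp, atol=atol, rtol=0)
```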
* initial commit of the C++ MoE load-balance code
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add Python bindings for MoE load balance
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add Python wrapper, unit tests, and bug fixes
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add binding for layerId and update binding test
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add host tensor sharing and unit tests
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* [AutoDeploy] HF factory improvements
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* improve monkey-patches and add unit tests
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce targets communication on PCIe-based GPUs and uses low-precision quantization to accelerate the PCIe allreduce process.
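As a rough illustration of the general idea only (not the PR's kernel; the int8 scheme and names here are assumptions), low-precision allreduce trades reduced wire traffic for a quantize/dequantize step:
```python
import torch
import torch.distributed as dist

def low_precision_allreduce(x: torch.Tensor) -> torch.Tensor:
    # Assumes torch.distributed is initialized with the NCCL backend.
    world = dist.get_world_size()
    # Symmetric per-rank int8 quantization.
    scale = (x.abs().max() / 127.0).clamp(min=1e-8).reshape(1)
    q = torch.round(x / scale).to(torch.int8)
    # Int8 values from different ranks cannot simply be summed, so gather
    # the quantized tensors and scales, then dequantize and reduce locally.
    q_all = [torch.empty_like(q) for _ in range(world)]
    s_all = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(q_all, q)
    dist.all_gather(s_all, scale)
    return sum(qi.to(x.dtype) * si for qi, si in zip(q_all, s_all))
```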
Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
* add docstring to summarize current rope support
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: replace call_method, adjust the insertion order of the cos_sin_cache calculation node
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add unit test for triton rope and ds rope
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* update rope matcher to match DS RoPE, add custom op for reference, add unit test case
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* cache cos[pos_idx].unsqueeze and sin[pos_idxs].unsqueeze
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor doc update
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate pattern matching and optimization for explicit and complex rope + minor updates
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* clean rope impl in repo
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* replace fused_flattened_mla_with_cache's rope impl with torch_apply_rope_with_qk_interleaving, update unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate layout inference and transpose into a new transformation
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* update rope_with_explicit_freqs and rope_with_input_interleaved to expose unsqueeze_dim and support match_rope_layout, add unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* resolve merge conflict in transform.py; still need to fix optimize_rope with CUDA graph capture
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor clean up after rebase
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
* fix pre-commit
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* support mapping to the BNSD layout and inferring unsqueeze_dim from the op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix cos/sin not being identical across prompts in the same batch when mapping to the flashinfer op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix for unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix custom op input/output node ordering issue for DeepSeek V3 rope
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* clean code
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move flattening of cos_sin_cache to the graph, update flashinfer op docstring and test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* debug transform unit test failure
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
---------
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
- Adds BatchedGemm cubins and the corresponding call interface from the TensorRT-LLM Generator.
- Refactors the TRT-LLM Gen MoE runner to call the BMM interface.
- Accuracy is verified for DeepSeek-R1 FP4.
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Support DeepSeek-R1 W4A8 on Hopper
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
* [TRTLLM-4374] Upgrade to TRT 10.10.0 GA, CUDA 12.9 GA, and DLFW 25.04
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix review
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* update images
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* Update jenkins/L0_Test.groovy
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* update image name
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
* feat: Add heuristic for GroupRMSNorm kernel selection.
Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
(better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
(better block scheduling in large batch scenarios)
The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for compute capability
9.x and 10.x are trained on nsys kernel runtime data.
The default kernel selection is the base kernel.
The Python operator group_rms_norm uses the heuristic by default; users can
also explicitly select the base or large-batch kernel, as sketched below.
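A toy sketch of the selection mechanism; the feature set, coefficients, and bias below are placeholders, not the trained models:
```python
import math

# Placeholder coefficients for illustration; the real models are trained
# per compute capability on nsys kernel runtime data.
COEFFS = {"batch_size": 0.8, "allocated_warps": -0.3, "sched_efficiency": 1.2}
BIAS = -2.0

def use_large_batch_kernel(batch_size, allocated_warps, sched_efficiency):
    # Logistic regression: weighted feature sum through a sigmoid,
    # thresholded at 0.5 to pick between the two kernels.
    z = (COEFFS["batch_size"] * batch_size
         + COEFFS["allocated_warps"] * allocated_warps
         + COEFFS["sched_efficiency"] * sched_efficiency
         + BIAS)
    return 1.0 / (1.0 + math.exp(-z)) > 0.5
```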
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Address the comments.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* fix relaxed acceptance so that this feature can be enabled in the context phase.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
* fix sample_and_accept_draft_tokens unit test.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
* disable overlap in encoder
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: invokeGatherBatch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: overlap same batch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: add enableTrtOverlap to ExecutorConfig
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* disable overlap for beam search and spec decode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* skip overlap tests with beam search or speculative decoding
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* enable overlap in GptChunkedLongContextTests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Enable overlap in gptManagerBenchmark
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Improve early exit
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use OptionalRef for newOutputTokens tensor
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add overlap scheduling support to TRTLLMDecoder
- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted sequence-length handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: allNewTokens in PP
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add rename_weights_with_regex function for dynamic weight key renaming
Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability.
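A minimal sketch of what such a helper might look like; the signature and the example mapping are illustrative assumptions, not the exact implementation:
```python
import re

def rename_weights_with_regex(weights, pattern_map):
    # Apply each regex substitution in turn to every weight key.
    renamed = {}
    for key, value in weights.items():
        new_key = key
        for pattern, replacement in pattern_map.items():
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = value
    return renamed

# Illustrative mapping only: rename HF-style attention keys.
weights = {"model.layers.0.self_attn.q_proj.weight": 0}
print(rename_weights_with_regex(weights, {r"self_attn": "attention"}))
# -> {'model.layers.0.attention.q_proj.weight': 0}
```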
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Implement SiglipVisionModel and related components
Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder.
Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs.
Currently, SiglipVisionModel supports batch sizes larger than one, and its input and output shapes match the HF model's for compatibility.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Add CLIPVisionModel and associated components
Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests
Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests
Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes
Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading
Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components
Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Consolidate weight loading logic into a shared implementation
Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter
Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization
Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
---------
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
- GroupRMSNormKernel: optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
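A toy illustration of the partitioning idea, assuming a fixed warp budget (the numbers and helper are made up, not the kernel's actual scheduling code):
```python
def assign_warp_groups(last_dims, total_warps=32):
    # Give each input at least one warp; otherwise allocate warps
    # proportional to its last dimension relative to the total.
    total = sum(last_dims)
    return [max(1, round(total_warps * d / total)) for d in last_dims]

print(assign_warp_groups([1024, 4096, 512]))  # -> [6, 23, 3]
```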
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
When the input size is larger than the max workspace size, we fall back to NCCL plus the corresponding pre/post functions to ensure AllReduce functionality.
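A sketch of the fallback logic under stated assumptions; the threshold constant and the custom-kernel entry point are placeholders:
```python
import torch
import torch.distributed as dist

MAX_WORKSPACE_BYTES = 8 * 1024 * 1024  # placeholder, not the real limit

def custom_allreduce(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the workspace-based custom allreduce kernel.
    dist.all_reduce(x)
    return x

def allreduce_with_fallback(x: torch.Tensor) -> torch.Tensor:
    if x.numel() * x.element_size() > MAX_WORKSPACE_BYTES:
        # Input exceeds the custom-allreduce workspace: fall back to NCCL,
        # wrapped by the corresponding pre/post processing (omitted here).
        dist.all_reduce(x)
        return x
    return custom_allreduce(x)
```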
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>