Commit Graph

94 Commits

Author SHA1 Message Date
Yiqing Yan
fda8b0277a
[Infra][TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04 (#4049)
* [TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* fix review

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* update images

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* Update jenkins/L0_Test.groovy

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* update image name

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

---------

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-13 14:59:12 +08:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained base on nsys kernel runtime data.
The default kernel selection is the base kernel.

The python operator group_rms_norm will use the heuristic by default.
User can pick to use the base or large batch kernels as well.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
Fanrong Li
77f8e43592
[fix] Fix relaxed acceptance to support enabling it in context phase (#4126)
* fix relaxed acceptance to support enable this feature in context phase.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix sample_and_accept_draft_tokens unit test.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 14:11:14 +08:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
Add Piecewise CUDA Graph Support

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00
Yukun He
5b61486d87
chore: Clean up the legacy DeepseekAllreudceFusionOp. (#4081)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 10:20:41 +08:00
Mike Iovine
9afe510367
[fix] Fix llama4 + eagle3 (#3998)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-08 19:20:27 -04:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main (#4086)
* feat: TRT-LLM Gen FP8 MoE Llama4

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* feat: TRT-LLM Gen llama4 MoE Top1 routing

Signed-off-by: Jiqun Tu <jtu@nvidia.com>

* feat: add per tensor FP8 TRT-LLM Gen GEMMs

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add guard for routingIndicesClusterKernel

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
shaharmor98
7d94c9561f
feat: support multi lora adapters and TP (#3885)
* support multi lora, tp

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-08 23:45:45 +08:00
Yuan Tong
5b93273156
feat: adopt new logprob definition in PyTorch flow (#4057)
feat: align logprob definition of PyTorch flow

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
2025-05-08 20:16:40 +08:00
Yan Chunwei
0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
* beam_width and max_new_token

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove beam_width

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove min_length

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove return_num_sequences

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
Enwei Zhu
c28b90984f
[TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API (#3985)
* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-07 11:06:23 +08:00
Erin
cba1793cda
cleanup logprob params (#4039)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-07 00:50:16 +08:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
* disable overlap in encoder

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: overlap same batch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Improve early exit

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
Jinyang Yuan
eeb6c0c83f
[fix] Loosen the thresholds of test_attention_mla (#4074)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-06 11:31:09 +08:00
Suyog Gupta
ac2ab9ba36
[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy (#4024)
* reuse batch_indices, positions across layers

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* fix flashinfer unit tests

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* simplify call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* fix call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

---------

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-05-06 10:46:36 +08:00
qixiang-99
bf4f7ad744
feat: add Pytorch support of Vision Encoder for multimodal models (#3791)
* feat: Add rename_weights_with_regex function for dynamic weight key renaming

Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Implement SiglipVisionModel and related components

Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder.
Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs.

Currently SiglipVisionModel support batch size larger than one. Also, inputs and outputs shape are same with the HF for compatibility.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Add CLIPVisionModel and associated components

Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests

Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests

Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes

Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading

Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components

Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Consolidate weight loading logic into a shared implementation

Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter

Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization

Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

---------

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-03 05:13:47 +08:00
Mike Iovine
906cddffb0
[infra] Improve llama4 parallelism test coverage (#3821)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-02 16:15:04 -04:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
Lucas Liebenwein
be916b19e0
feat: [AutoDeploy] unfusing attention for native support (#3668)
* [AutoDeploy] unfused streamlined attention + caching

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* improved unit testing

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* reviewer feedback

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* some updates to attn_mask handling

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* updated manual benchmarking and cudagraph capture

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-02 09:06:49 +08:00
Yukun He
a1645c922b
Fallback to NCCL for various patterns when input size is large. (#4009)
When input size is larger than the max workspace size, we shall fallback to NCCL + corresponding pre/post function to ensure the functionality of AllReduce.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 15:17:16 -07:00
Yukun He
9cc5922a0b
Clean up allreduce op in Deepseek V3 model. (#3829)
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 07:56:36 +08:00
xiweny
68a19a33d4
TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770)
* upgrade cutlass to 3.9

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* update internal cutlass kernels

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* fix file

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* remove unnecessary change

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* update hash

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

---------

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-29 11:19:11 -04:00
qixiang-99
f370dd0e32
refactor(test): remove random context sequence lengths and set seed for reproducibility in attention tests (#3919)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-04-29 10:08:04 +08:00
Jinyang Yuan
dafc28fb85
fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863) 2025-04-29 09:09:43 +08:00
Erin
0577ea0155
waive test_attention_no_cache (#3921)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-04-28 13:57:01 -07:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues (#3686)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00
milesial
362a8272f8
feat: llama4 input processor (#3383)
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-25 16:47:14 -07:00
qixiang-99
ecd621fb0a
feat: Add head size 72 support for QKV Preprocessing kernel (#3743)
* refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow

- Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels.
- Added support for head size 72 in unfused attention kernels(QKVPreprocessing).
- Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72).

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* update: Waive head_dim=72 test cases and enhance test representation

- Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues.
- Introduced a custom __repr__ method in the Scenario class for pytest substring match.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

---------

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-04-25 11:07:40 -07:00
dongxuy04
16535991b2
feat: Add MNNVL MoE A2A support (#3504)
* add MNNVL memory mapping support

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add more MPI environment for trtllm-llmapi-launch

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MoE communication and prepare kernels

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MNNVL AlltoAll support for DeepSeekV3

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add output dump for throughput benchmark

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* support dynamic kernel launch grid

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments #2

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-04-25 17:29:08 +08:00
Yuan Tong
57944206ba
feat: return logits in PyTorch flow (#3221)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:56:03 -07:00
QI JUN
635dcdcb9e
chore: reorganize some unit tests of PyTorch (#3780)
* reorganize some unit tests of PyTorch

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-23 11:19:10 -07:00
Daniel Cámpora
1299f27c74
fix: Fix C++ decoder synchronization in PyTorch (#3106)
* Use updateDecoderBuffers in python decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix synchronize in trtllm decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Enable by default.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use guided_decoder to setup seqslots and free them.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use always decode_async and update_requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update decoder buffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix speculative decoding tests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Send new_tensors_host instead of assuming dict.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make default False in enable_trtllm_decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Partially fix mtp, partially fix py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update request states before sending disagg ctx cache.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix disagg test for torch decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add disagg serving case to guided decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Get overlap scheduling to work.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update cutlass to main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update after rebasing.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update to use decode async and update requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Properly pass information to update_requests

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make disaggregated serving a step closer to working.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase and format.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Copy new device tokens more pythonic.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Restore MTP add dummy reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add ordereddict import to py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added seq slot manager. Add test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use transmission for single tensor except when list of tensors is received.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add TRTLLMDecoder allocation to estimate max kv cache tokens.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add stream synchronization

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Format

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add decoder creation to estimate max kv.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update submodule UCXX inline with main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-04-23 23:55:27 +08:00
Mike Iovine
0bc520f15e
fix: Limit llama4 context length to 8k (#3778)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-23 08:55:10 -07:00
shaharmor98
49262a62a5
add passing E2E LoRA flow (#3788)
add passing E2E LoRA flow (#3788)

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-23 18:38:06 +03:00
shaharmor98
5fff8f0935
Add running E2E LoRA flow (#3648)
* add passing E2E LoRA flow

Signed-off-by: Shahar Mor <smor@nvidia.com>

* add experimental feature

Signed-off-by: Shahar Mor <smor@nvidia.com>

* fix llma_args definition

Signed-off-by: Shahar Mor <smor@nvidia.com>

* decreased manually size of max loras to address OOM

Signed-off-by: Shahar Mor <smor@nvidia.com>

---------

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-23 11:19:41 +08:00
Lucas Liebenwein
06b914e0f9
feat: [AutoDeploy] generalizing cudagraph to multiple dynamic inputs (#3589)
* generalizing cudagraph to multiple dynamic inputs

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* fix for failing test

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-04-23 03:38:51 +08:00
Yukun He
0ae7017342
Unify two versions of AllReduce custom op (#3032)
* Rewrite unit test for unified allreduce op. Removing the legacy unit test.
* Revise formats, fusion_op bindings. Put all tensors as optional inputs.
* Move the MoeAllreduceOp to a separate custom op.
* Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions.
* Add more TODOs, fixing minor bugs, and remove legacy code.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-22 21:58:42 +08:00
Zhenhuan Chen
2672f13d77
test: fix cublas_scaled_mm with aligned workspace size (#3600)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-04-21 14:51:42 +08:00
yuxianq
faef37782a
fix: Remove ParallelConfig. (#3678)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-21 14:14:08 +08:00
liji-nv
a51f7559a3
fix: update test_user_buffers_mm_add_prologue atol (#3711) (#3713)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-21 11:24:20 +08:00
hlu1
31624b079a
feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387)
* Add TRT-LLM Gen MOE to Deepseek

fix fused moe rebase bug.

Fix atol in test_fp4_gemm_quantize.py

fix fused moe rebase bug.

Fix FusedMoe.

Disable 2nd routing kernel preexit

Bump routing reduction to fp32

Disable PDL for fc1

[DEBUG] Lift token limit to 16k

[Bugfix] Token limit to 16k + fp32 routing + tanh

Make fp8 tileN 8

Fix FP8 MoE + Remove redundent temp output for FP4

[FP8-only] Avoid wasting CTAs for activation kernel

fix: unblock FP8 weightloading with trtllm-gen

Remove max_token limit for trtllm-gen path

perf: avoid type-conversion and fill_ from aten

Minor fix

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix rebase issues

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix compile issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* CI clean

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Hao Lu <haolu@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-21 10:01:33 +08:00
hlu1
c861b6cf17
Clean up modeling_deepseek.py (#3640)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-18 17:54:33 -07:00
HuiGao-NV
d3608d6818
Remove dummy forward path (#3669)
Remove dummy forward path
2025-04-18 16:17:50 +08:00
dongfengy
b71a0f76b4
test: Add llama 4 to ci (#3520)
* Add llama 4 to ci

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

* Only test trtllm

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

* Disable marverick

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

---------

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-04-18 11:25:52 +08:00
danielafrimi
0f084d9566
added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455)
Loraop integration into torch modules

Signed-off-by: Ubuntu <dafrimi@nvidia.com>
2025-04-17 12:48:27 +08:00
QI JUN
57cafe7f9b
waive test_fp8_scaled_mm (#3637)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-16 15:07:30 -07:00
Luis Vega
0bda1f9780
feat: Nemotron-H model support (#3430)
* added files for nemotron-h

Signed-off-by: Luis Vega <lvega@nvidia.com>

* use try/except to import RMSNorm

Signed-off-by: Luis Vega <lvega@nvidia.com>

---------

Signed-off-by: Luis Vega <lvega@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-16 14:05:56 -07:00
Mike Iovine
41a6c98544
Support CUDA graphs for EAGLE3 (#3176)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-17 04:53:50 +08:00
Olya Kozlova
b3e6723dbc
feat: Adding FP8 BMM from Codegen (#3541)
* Adding FP8 BMM from Codegen

Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com>

* Fixed licenses

Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com>

---------

Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com>
Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com>
Co-authored-by: Olya Kozlova <okozlova@6u1g-0018.nvidia.com>
Co-authored-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com>
2025-04-16 10:37:15 +02:00
yuxianq
fd8ded2b2b
feat: Support cos_sin_cache in all cases. (#3517)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-16 13:48:44 +08:00