Commit Graph

221 Commits

Author SHA1 Message Date
Grzegorz Kwasniewski
ea380ff45c
[TRTLLM-9767][feat] Fixed recursive node traversals (#10379)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2026-01-05 18:42:06 +02:00
Eran Geva
3749a2ce1c
[#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition (#10366)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2026-01-05 13:33:01 +02:00
Yukun He
d272f1a9bc
[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-05 15:44:37 +08:00
Grzegorz Kwasniewski
0d1f5ad7a2
[TRTLLM-10358][feat] Added proper rescaling of FP4 weights (#10378)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2026-01-03 16:26:16 -05:00
Gal Hubara-Agam
f3dd6da080
[#10056][chore] AutoDeploy: Enable Nemo SuperV3 accuracy test (#10308)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2026-01-02 11:20:19 +02:00
Gal Hubara-Agam
5845951538
[#10056][fix] AutoDeploy: Handle deletion of nested params in sharding (#10376)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2026-01-01 08:11:11 -05:00
tcherckez-nvidia
4868772ad7
[None][feat] Add export data to build and run script for AD (#10299)
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
2026-01-01 04:54:47 -05:00
Lucas Liebenwein
1bbe71b3ed
[#10244][feat] AutoDeploy: separate prefill/decode in flashinfer (#10252)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-31 17:01:24 -05:00
tcherckez-nvidia
464847c6be
[#9717][chore] Standardize MoE weights interface (#10295)
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
2025-12-31 07:37:18 -05:00
Eran Geva
74832a1895
[https://nvbugs/5766986][fix] fixed the shard_all_unprocessed default value to align with the default.yml (#10271)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-12-30 08:54:13 -05:00
Neta Zmora
966231d29c
[#9626][feat] Add an auto-deploy transform for using cutlass FP4 MoE kernels (#10304)
Add a transform to relace torch.ops.auto_deploy.torch_quant_nvfp4_moe
with the optimized torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused.

Currently generates the wrong results when the number of rows in MoE FC1 weights is not divisible by 128,
so torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused is not set as the default FP4 MoE implementation (i.e. the transform is disabled).

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-29 23:18:15 +02:00
Neta Zmora
f3f02315df
[None][chore]: small refactoring to auto-deploy MoE operator (#10300)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-25 12:27:11 -05:00
gramnarayan
a9eb5afc9f
[#9241][feat] AutoDeploy: Support Eagle3 Speculative Decoding (#9869)
Support two model flow with no overlap scheduler or chain drafter. Drafting model is in PyTorch backend.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2025-12-24 23:30:42 -05:00
Neta Zmora
c4b36d31ff
[#10137][feat] AutoDeploy FP8 MoE refactor (#10138)
The trtllm (cutlass) fp8 moe operator performs W3+W1 fusion (concat) during inference and we want to move this fusion to the model optimization time.

The Cutlass MoE kernel is used thru a trtllm torch operator.
Its implementation uses two FC operations (fc1 and fc2) while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3) so when we switch from the torch.moe op to the trtllm.moe op we also change terminology from w1, w2, w3 to fc1, fc2.

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-24 18:58:10 +02:00
Suyog Gupta
e2891a6c77
[#10052][feat] AutoDeploy enable cudagraphs for flashinfer BatchDecode (#10193)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-12-24 05:55:09 -08:00
Grzegorz Kwasniewski
06900a7f19
[TRTLLM-9565][fix] Fix deepseek sharding (#9984)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-12-23 10:28:14 -05:00
Grzegorz Kwasniewski
ccc64da287
[TRTLLM-9847][fix] WAR fix hanging fused allreduce. (#10087)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-12-23 00:03:32 +01:00
tcherckez-nvidia
12e1cb8d7e
[#9717][chore] Refactor MoE code to use enums (#9910)
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
2025-12-22 15:14:56 -05:00
Ziyi Xiong
70b4d282c6
[TRTLLM-7736][feat] Incrementally update the inputs of target and draft models (#9708)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-12-19 15:11:25 +08:00
William Zhang
478b6b20a1
[#9230][refactor] Replace nemotron patches with custom model implementation (#9751)
[#9230][refactor] Replace nemotron patches with custom model implementation

* Why?

Patching for nemotron H models was growing out of hand, and made certain
optimizations more complex than they needed to be.

* What?

This commit finally gets rid of them, and replaces them with the custom
model implementation in `modeling_nemotron_h.py`.

Closes #9230
Closes NvBug 5747867

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-12-18 19:36:27 -08:00
Lucas Liebenwein
76ec820465
[#7532][feat] AutoDeploy: gather logits before lm head (#9962)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-12-17 19:50:13 -08:00
Grzegorz Kwasniewski
83885c69e7
[TRTLLM-9136][feat] 2D parallel EP TP support (#9459)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-12-15 09:52:29 +01:00
Lucas Liebenwein
e767fc649a
[None][feat] AutoDeploy: prepare_metadata revisited (#9764)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-12 20:14:14 +08:00
Chenghao Zhang
75f5446d67
[#9753][feat] AutoDeploy: Implement add rms_norm fusion (#9754)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-12-08 14:24:27 -08:00
Eran Geva
98db262a67
[None][fix] Switch AutoDeploy's default allreduce strategy to NCCL (#9666)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-12-08 03:26:21 -08:00
Chenghao Zhang
d6f95a4363
[None][feat] AutoDeploy: Perf optimization for Attention and rmsnorm (#9719)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-12-05 12:59:04 -08:00
gramnarayan
74df9b180b
[#9602][feat] AutoDeploy: Support TRTLLM Sampler (#9641)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2025-12-04 19:24:11 -08:00
tcherckez-nvidia
f9aa86dbdd
[#8733][feat] Add Llama4 MoE handling to AutoDeploy (#9556)
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
Signed-off-by: tcherckez-nvidia <127761168+tcherckez-nvidia@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
2025-12-04 08:03:33 +02:00
gramnarayan
098b9ff226
[#9147][feat] AutoDeploy: Draft Target Speculative Decoding (#9275)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2025-12-04 05:13:49 +08:00
Lucas Liebenwein
a1964bcbbc
[#9643][fix] AutoDeploy: fix nano sharding config (#9668)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-04 03:10:25 +08:00
Suyog Gupta
93871d52b2
[None][chore] AutoDeploy update cuda stream manager for multi-device (#9575)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-12-02 20:43:14 -08:00
Grzegorz Kwasniewski
0a7a88e74e
[TRTLLM-8946][feat] Improved heuristics to detect shardable regions (#9200)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-02 22:08:19 +01:00
Neta Zmora
a560ba5546
[#9550][feat] AutoDeploy: Add NVFP4 Cutlass MoE kernels (#9551)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-03 01:39:38 +08:00
Lucas Liebenwein
e72ce98c0f
[#9150][feat] AutoDeploy: reviewer comments for #9150 (#9527)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-12-02 12:09:10 -05:00
William Zhang
2dd3ebf037
[#9150][feat] Add code for nano v3 to custom implementation in AD (#9465)
* Why?

We would like to show an alternative to monkey-patching in AutoDeploy.

* What?

This commit builds on the existing custom model implementation for
NemotronH and adds the bits relevant for MoE layers.

Part of #9150.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-12-02 08:56:44 -08:00
Eran Geva
c9771ebb99
[#9198][feat] Refactor dist ops in AutoDeploy (#9301)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-12-02 02:36:32 +08:00
Chenghao Zhang
0a2104dce9
[None][feat] AutoDeploy: Use the router gemm op for nemotron MOE (#9500)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-12-01 10:24:31 -08:00
Stefan Niebler
f155812eb0
[TRTLLM-6756][feat] Add Beam Search to TorchSampler (#8509)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-12-01 18:48:04 +01:00
Neta Zmora
bc25fff039
[#9496][fix] AutoDeploy: remove auto-tuner from nvfp4_gemm forward (#9497)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-01 10:04:39 +02:00
Grzegorz Kwasniewski
cff54fcae3
[#8948][feat] Support custom sharding config (#9143)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-29 05:28:05 +08:00
Lucas Liebenwein
2f8bd6fb36
[#9150][feat] AutoDeploy Nemotron-Flash support (#9504)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-27 18:03:57 +01:00
Chenghao Zhang
18fbda5cdb
[None][feat] AutoDeploy: Add A_log fusion for Mamba layers (#9422)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-26 14:39:20 -08:00
Chenghao Zhang
bc7b60e016
[None][feat] AutoDeploy: Remove redundant copies in mamba layers (#9461)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-26 14:38:33 -08:00
Suyog Gupta
e484bec82f
[None][chore] AutoDeploy add multi stream moe pass to default.yaml (#9430)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-25 14:16:13 -08:00
Eran Geva
afc52d7b93
[https://nvbugs/5647400] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. (#9145)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-25 10:56:07 -08:00
William Zhang
a4049fc557
[#9413][fix] Minor fixes to nemotron H and custom models in AD (#9416)
* Why?

There were a couple of issues with the recently merged custom model
injection for AutoDeploy + the reference implementation of nemotron
H:
- `d_mlp` was left in despite being mathematically always null (could
  lead to runtime issues during sharding).
- the custom model mapping was inherited by children factories.

* What?

This commit fixes these issues, and refactors the key of the custom
implementation to be based on the name of the configuration class as
well.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-24 20:17:33 -08:00
Suyog Gupta
efd503751f
[#9271][perf] Enable multi-stream MOE optimization in AutoDeploy (#9322)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-24 19:50:10 -08:00
William Zhang
11a0b276fb
[#9230][feat] Slimmed down implementation of nemotron H (#9235)
* Why?

The reference nemotron H code on HuggingFace is out of date,
and therefore bugged, and has several untested code paths.
This makes an already hairy patching system even hairier.

The proposal is to do away with those patches, and replace the
original implementation with one that is heavily slimmed down.

* What?

This PR sets the basis for an alternative path with such a 
slimmed down implementation that:
- fixes bugs in the current HF implementation
- adds no new dependencies to TensorRT-LLM
- does away with unnecessary features for TensorRT-LLM/
  AutoDeploy:
- no training related code (dropout, gradient checkpointing, etc.)
- no caching logic (we want to replace it with our own anyway)
- no attention masking where possible
- reuses existing AD custom ops for mamba SSM update /
   causal conv1d / attention

In order for the above to be usable in the AD apparatus,
`AutoModelForCausalLMFactory` is extended to allow registrations
of custom model implementations.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-23 03:13:32 -08:00
Neta Zmora
3952a61681
[#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation (#9339)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-21 17:05:03 -08:00
Chenghao Zhang
564989865c
[TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT (#9106)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-21 16:05:48 -08:00