Commit Graph

1210 Commits

Author SHA1 Message Date
Grzegorz Kwasniewski
cff54fcae3
[#8948][feat] Support custom sharding config (#9143)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-29 05:28:05 +08:00
Matthias Jouanneaux
f8dd494536
[None][perf] Helix: improve all-to-all perf for large CP size (#9494)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
Co-authored-by: Zheyu Fu <zheyuf@nvidia.com>
2025-11-28 07:24:55 -08:00
mpikulski
e5f39ec7cf
[TRTLLM-9488][feat] add 'disable_flashinfer_sampling' config option (#9454)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-28 13:00:39 +01:00
Robin Kobus
5eae3650c3
[None][fix] Pass checkpoint_format to create_input_processor (#9521)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-28 10:32:29 +01:00
Yukun He
60c43a200a
[None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner. (#9211)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-28 13:32:21 +08:00
Lucas Liebenwein
2f8bd6fb36
[#9150][feat] AutoDeploy Nemotron-Flash support (#9504)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-11-27 18:03:57 +01:00
Bo Li
62b771877c
[TRTLLM-9389][chore] Refactor AlltoallMethodType. (#9388)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-27 21:09:29 +08:00
Fanrong Li
2d5eadf65f
[None][fix] fix TP support for DeepSeek-V3.2 on hopper (#9484)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-27 21:02:25 +08:00
Ziyi Xiong
1dd55d8507
[https://nvbugs/5698581][fix] Init draft tokens for CUDA graph dummy request (#9505)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-27 13:05:37 +08:00
Jiagan Cheng
14762e0287
[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning (#9294)
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-11-27 12:22:01 +08:00
Chenghao Zhang
18fbda5cdb
[None][feat] AutoDeploy: Add A_log fusion for Mamba layers (#9422)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-26 14:39:20 -08:00
Chenghao Zhang
bc7b60e016
[None][feat] AutoDeploy: Remove redundant copies in mamba layers (#9461)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-26 14:38:33 -08:00
Aurelien Chartier
ef7ee6a940
[None][feat] Add environment variable to force spec-dec number of accepted tokens (#9371)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-11-26 07:22:16 -08:00
Chang Liu
b10137fdd5
[None][feat] Support MLA chunked prefill for DeepSeek V3.2 model (#9376)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-26 16:38:25 +08:00
Enwei Zhu
1bf2d750a2
[None][chore] Upgrade CuteDSL to 4.3.0 (#9444)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-26 14:53:09 +08:00
JunyiXu-nv
b7308a4000
[https://nvbugs/5580099][fix] Cherry pick IMA issue fix from release/1.1 (#9032)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-11-26 13:09:06 +08:00
shuyixiong
d8acea1db3
[TRTLLM-9293][feat] Enable partial weight loading to support streaming update weights (#9224)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2025-11-26 10:59:06 +08:00
Chuang Zhu
0e9c7f8c07
[https://nvbugs/5685143][fix] avoid cudaFree overlap with cuda graph (#9438)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-11-25 16:20:29 -08:00
Suyog Gupta
e484bec82f
[None][chore] AutoDeploy add multi stream moe pass to default.yaml (#9430)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-25 14:16:13 -08:00
Robin Kobus
32f53910ef
[TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode (#9308)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-25 22:11:51 +01:00
Eran Geva
afc52d7b93
[https://nvbugs/5647400] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. (#9145)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-25 10:56:07 -08:00
mpikulski
899fda9e47
[TRTLLM-9490][feat] use FlashInfer's top_k_sampling_from_probs (#9457)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-25 18:53:53 +01:00
mpikulski
c5f52ab304
[TRTLLM-8376][feat] top-p optimization (removes redundant softmax) (#9411)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-25 18:46:48 +01:00
YueWeng
cc336c4abd
[TRTLLM-8160][feat] Add draft token tree runtime on CDL (#8586)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-11-25 09:40:55 -05:00
Yueh-Ting (eop) Chen
a38d91aae2
[https://nvbugs/5537996][fix] Let KV cache manager block initialization be aware whether it is doing a dry run or not (#9093)
Before this commit, the KV cache manager initialized its blocks the same way whether or not it was doing a dry run. This miscalculated the free memory available to allocate for the KV cache manager and caused a crash.

This commit fixes the problem by making KV cache manager initialization aware of whether it is doing a dry run. If it is a dry run, it uses the max_tokens setting that is already pre-calculated and filled into kv_cache_config.max_tokens.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-11-25 17:27:11 +08:00
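Note on a38d91aae2: the dry-run branch described in that message can be sketched as below. This is an illustration under stated assumptions, not the TensorRT-LLM code. `KvCacheConfig`, `resolve_max_tokens`, `free_bytes`, and `bytes_per_token` are hypothetical names, and the non-dry-run formula is a placeholder. Only the `max_tokens` field and the dry-run distinction come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KvCacheConfig:
    # Hypothetical stand-in; only the max_tokens field is named in the commit message.
    max_tokens: Optional[int] = None


def resolve_max_tokens(config: KvCacheConfig, is_dry_run: bool,
                       free_bytes: int, bytes_per_token: int) -> int:
    """Pick the token budget used to size the KV cache blocks."""
    if is_dry_run:
        # Dry run: reuse the pre-calculated value instead of re-deriving it
        # from the currently free memory.
        if config.max_tokens is None:
            raise ValueError("dry run requires a pre-calculated max_tokens")
        return config.max_tokens
    # Real run: derive the budget from free memory (placeholder formula, an assumption).
    return free_bytes // bytes_per_token
```

The point of the branch is that the two paths no longer share one memory estimate, which is what the message identifies as the source of the miscalculation.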
Yukun He
e580da4155
[TRTLLM-7963][feat] Cold L2 cache when doing autotune benchmarking. (#8779)
The measured performance of some kernels is easily affected by whether the L2 cache is warm or cold. To get more precise profiling results during autotuning, the L2 cache is cleared for every execution using the circular buffer method.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-25 15:06:22 +08:00
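Note on e580da4155: one common reading of a "circular buffer method" is to rotate through enough distinct copies of the inputs that a copy has been pushed out of L2 by the time it is reused. The sketch below assumes that reading; the commit message does not spell out the mechanism. The names `copies_needed` and `rotating_inputs` are hypothetical, and host `bytearray` objects stand in for device buffers.

```python
import math
from itertools import cycle
from typing import Iterator, List


def copies_needed(input_bytes: int, l2_bytes: int) -> int:
    """Copies required so the other buffers in a rotation cover at least l2_bytes."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return math.ceil(l2_bytes / input_bytes) + 1


def rotating_inputs(template: bytes, l2_bytes: int) -> Iterator[bytearray]:
    """Yield distinct copies of the input in round-robin order, forever."""
    n = copies_needed(len(template), l2_bytes)
    buffers: List[bytearray] = [bytearray(template) for _ in range(n)]
    return cycle(buffers)
```

Each timed run would take the next buffer from the iterator. Between two uses of the same buffer, the other buffers touched add up to at least `l2_bytes`, so the reused buffer is unlikely to still be cache-resident.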
William Zhang
a4049fc557
[#9413][fix] Minor fixes to nemotron H and custom models in AD (#9416)
* Why?

There were a couple of issues with the recently merged custom model
injection for AutoDeploy + the reference implementation of nemotron
H:
- `d_mlp` was left in despite being mathematically always null (could
  lead to runtime issues during sharding).
- the custom model mapping was inherited by child factories.

* What?

This commit fixes these issues, and refactors the key of the custom
implementation to be based on the name of the configuration class as
well.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-24 20:17:33 -08:00
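Note on a4049fc557: a minimal sketch of a registry whose mapping is not inherited by child factories and whose lookup key uses the configuration class name. The class and method names here (`ModelFactory`, `register_custom_model`, `lookup`) are hypothetical and are not the AutoDeploy API. The message says the key is based on the configuration class name "as well", so the real key may have additional parts.

```python
from typing import Callable, Dict, Optional


class ModelFactory:
    """Hypothetical factory base; every subclass gets its own registry."""

    _custom_models: Dict[str, Callable] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Fresh dict per subclass, so a parent's registrations do not leak into children.
        cls._custom_models = {}

    @classmethod
    def register_custom_model(cls, config_cls_name: str, model_cls: Callable) -> None:
        cls._custom_models[config_cls_name] = model_cls

    @classmethod
    def lookup(cls, config: object) -> Optional[Callable]:
        # Key on the name of the configuration class.
        return cls._custom_models.get(type(config).__name__)
```

With a shared class-level dict and no `__init_subclass__` hook, a child factory would see every model registered on its parent, which is the inheritance behavior the commit message lists as an issue.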
Suyog Gupta
efd503751f
[#9271][perf] Enable multi-stream MOE optimization in AutoDeploy (#9322)
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-24 19:50:10 -08:00
Yuxian Qiu
8a0295015f
[None][chore] Reduce nested nvtx ranges. (#9347)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-11-25 09:58:41 +08:00
bhsueh_NV
1a93583438
[None][feat] Support Yarn on QwQ-32B model (#9059)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: NVJiangShao <91270701+StudyingShao@users.noreply.github.com>
2025-11-25 07:27:28 +08:00
Yibin Li
1ce483c999
[TRTLLM-7967][feat] Adding Starcoder2 PyTorch Backend Support (#8923)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-11-24 11:23:22 -08:00
Yukun He
960851f419
[None][chore] Remove unnecessary log in the short tuning profile (#9387)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-24 12:31:26 +08:00
Yukun He
39076410a8
[https://nvbugs/5676748][fix] Fix mismatched nvfp4 gemm sf shape. (#9336)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-24 12:16:32 +08:00
brb-nv
c045e359a7
[https://nvbugs/5637012][fix] Fix helix unit tests (#9369)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-11-23 19:34:22 -08:00
Yukun He
c3acf965a6
[TRTLLM-7963][fix] Several improvements of autotuning quality (#9348)
* Skip shape profile generation in tuning mode if the profile has already been found in the cache. This is a prerequisite for nested autotuning, because host overhead might be included during profiling of the high-level op.
* Make profiling with CUDA graph the default profiling method.
* Apply a heuristic that cuts off the number of profiling repetitions based on a timing measurement over a few runs.
2025-11-24 10:38:45 +08:00
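Note on c3acf965a6: the third bullet can be illustrated with a small helper that caps the repetition count from a few measured runs. This is a sketch only. The function name `choose_repeats` and the idea of a fixed time budget are assumptions, since the commit message does not state which quantity drives the cutoff.

```python
from typing import Sequence


def choose_repeats(measured_s: Sequence[float], max_repeats: int,
                   time_budget_s: float) -> int:
    """Cap profiling repetitions using the mean time of a few measured runs."""
    if not measured_s:
        raise ValueError("need at least one measured run")
    mean_s = sum(measured_s) / len(measured_s)
    if mean_s <= 0:
        return max_repeats
    affordable = int(time_budget_s / mean_s)
    # Always run at least once, never more than the configured maximum.
    return max(1, min(max_repeats, affordable))
```

Slow kernels then get fewer repetitions and fast kernels keep the full count, which bounds total tuning time without starving cheap measurements of samples.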
Bo Li
fcfec93cad
[TRTLLM-9389][chore] Rename AlltoAll backend names (#9329)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-23 13:52:57 -08:00
William Zhang
11a0b276fb
[#9230][feat] Slimmed down implementation of nemotron H (#9235)
* Why?

The reference nemotron H code on HuggingFace is out of date,
and therefore bugged, and has several untested code paths.
This makes an already hairy patching system even hairier.

The proposal is to do away with those patches, and replace the
original implementation with one that is heavily slimmed down.

* What?

This PR sets the basis for an alternative path with such a 
slimmed down implementation that:
- fixes bugs in the current HF implementation
- adds no new dependencies to TensorRT-LLM
- does away with unnecessary features for TensorRT-LLM/
  AutoDeploy:
  - no training related code (dropout, gradient checkpointing, etc.)
  - no caching logic (we want to replace it with our own anyway)
  - no attention masking where possible
- reuses existing AD custom ops for mamba SSM update /
  causal conv1d / attention
In order for the above to be usable in the AD apparatus,
`AutoModelForCausalLMFactory` is extended to allow registrations
of custom model implementations.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-23 03:13:32 -08:00
Neta Zmora
3952a61681
[#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation (#9339)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-21 17:05:03 -08:00
Chenghao Zhang
564989865c
[TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT (#9106)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-21 16:05:48 -08:00
Izzy Putterman
eb7792e875
[None][feat] Eagle: PostNorm and multilayer options (#9233)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-11-21 17:39:00 -05:00
Enwei Zhu
13fbd4366a
[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) (#9288)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-21 14:03:38 -08:00
Ziyi Xiong
5df907b388
[https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler (#9321)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-21 10:19:59 -05:00
HuiGao-NV
6dd2fcd7b3
[https://nvbugs/5629833][fix] Don't fill tensors with 0 (#9296)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-11-21 20:50:05 +08:00
mpikulski
095b6864a8
[TRTLLM-8650][fix] beam search request validation (#8433) (#9228)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-21 04:08:45 -08:00
xxi
cc0dc7c124
[TRTLLM-8957][feat] create communication related classes (#8968)
2025-11-20 22:32:42 -08:00
Yukun He
9a79f32f7a
[https://nvbugs/5608489][fix] Fix output unpack issues for Llama3/4 NVFP4 models. (#8679)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Lizhi Zhou
33b0b945c7
[https://nvbugs/5582277][fix] rework DisaggPPTerminationHandler to fix hang issue (#8519)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Jin Li
3454eacd74
[https://nvbugs/5546510][fix] Move torch.cuda.Stream out of torch com… (#8494)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
JunyiXu-nv
ee6944bfa2
[https://nvbugs/5569713][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 (#8429)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Liao Lanyu
04ad9f96fa
[https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound (#9300)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-11-20 00:41:00 -08:00