Bo Li
fcfec93cad
[TRTLLM-9389][chore] Rename AlltoAll backend names ( #9329 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-23 13:52:57 -08:00
William Zhang
11a0b276fb
[ #9230 ][feat] Slimmed down implementation of nemotron H ( #9235 )
...
* Why?
The reference nemotron H code on HuggingFace is out of date,
and therefore bugged, and has several untested code paths.
This makes an already hairy patching system even hairier.
The proposal is to do away with those patches, and replace the
original implementation with one that is heavily slimmed down.
* What?
This PR sets the basis for an alternative path with such a
slimmed down implementation that:
- fixes bugs in the current HF implementation
- adds no new dependencies to TensorRT-LLM
- does away with unnecessary features for TensorRT-LLM/
AutoDeploy:
- no training related code (dropout, gradient checkpointing, etc.)
- no caching logic (we want to replace it with our own anyway)
- no attention masking where possible
- reuses existing AD custom ops for mamba SSM update /
causal conv1d / attention
In order for the above to be usable in the AD apparatus,
`AutoModelForCausalLMFactory` is extended to allow registrations
of custom model implementations.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-11-23 03:13:32 -08:00
Neta Zmora
3952a61681
[ #9388 ][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation ( #9339 )
...
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-21 17:05:03 -08:00
Chenghao Zhang
564989865c
[TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT ( #9106 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-21 16:05:48 -08:00
Izzy Putterman
eb7792e875
[None][feat] Eagle: PostNorm and multilayer options ( #9233 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-11-21 17:39:00 -05:00
Enwei Zhu
13fbd4366a
[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) ( #9288 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-21 14:03:38 -08:00
Ziyi Xiong
5df907b388
[ https://nvbugs/5590408 ][fix] Fallback to greedy sampling in two-model overlap scheduler ( #9321 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-21 10:19:59 -05:00
HuiGao-NV
6dd2fcd7b3
[ https://nvbugs/5629833 ][fix] Don't fill tensors with 0 ( #9296 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-11-21 20:50:05 +08:00
mpikulski
095b6864a8
[TRTLLM-8650][fix] beam search request validation ( #8433 ) ( #9228 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-21 04:08:45 -08:00
xxi
cc0dc7c124
[TRTLLM-8957][feat] create communication related classes ( #8968 )
2025-11-20 22:32:42 -08:00
Yukun He
9a79f32f7a
[ https://nvbugs/5608489 ][fix] Fix output unpack issues for Llama3/4 NVFP4 models. ( #8679 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Lizhi Zhou
33b0b945c7
[ https://nvbugs/5582277 ][fix] rework DisaggPPTerminationHandler to fix hang issue ( #8519 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Jin Li
3454eacd74
[ https://nvbugs/5546510 ][fix] Move torch.cuda.Stream out of torch com… ( #8494 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
JunyiXu-nv
ee6944bfa2
[ https://nvbugs/5569713 ][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 ( #8429 )
...
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Liao Lanyu
04ad9f96fa
[ https://nvbugs/5667687 ][fix] Set correct lm_head_tp_size_upper_bound ( #9300 )
...
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-11-20 00:41:00 -08:00
Neta Zmora
1d6fbbf45d
[ #9236 ][feature] Make sharing of activation_type across SW layers more robust ( #9238 )
...
C++, Python and Python MoE layer all share the definition of ActivationType.
Currently this is done thru redefinition which is fragile and can break when adding new activation function types.
tensorrt_llm/_torch/utils.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h
=>
tensorrt_llm/layers/moe.py
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-11-20 16:06:58 +08:00
Yechan Kim
d5622b2689
[None][fix] Multimodal InputProcessor dummy builder fix ( #8916 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-11-19 22:32:21 -08:00
Chang Liu
79a6c9742b
[None][fix] Use fp32 for indexer weight_proj GEMM ( #9243 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-19 21:52:38 -08:00
Neta Zmora
028fc877a5
[ #9096 ][feature] Auto Deploy: configurable fused MoE backend ( #9194 )
...
Allow configuring Auto Deploy's MoE/FP8-MoE backend from external yaml config file.
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-19 21:50:22 -08:00
Yukun He
b6bced83c0
[TRTLLM-7963][feat] Use CUDAGraph to improve the tuning accuracy for AutoTuner. ( #9089 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-11-20 08:54:29 +08:00
Fanrong Li
d4abb86f3e
[None][fix] fix EPLB for DeepSeek-V3.2-Exp ( #9245 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-11-19 13:45:54 -08:00
Faraz
49c45ebef1
[None][fix] change logging for weight loading on unified memory ( #9177 )
...
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
2025-11-19 14:31:19 -05:00
NVShreyas
1eae941d77
[ #9237 ][feat] enable iter stats in autodeploy ( #9278 )
...
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-11-19 19:29:29 +01:00
NVShreyas
a7c0b54ce7
[None][feat] add specdec to nemotron nas ( #8985 )
...
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-11-19 19:28:35 +01:00
Bo Li
d8b05894ee
[None][perf] Adjust select_alltoall_method_type. ( #8950 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-19 07:43:55 -08:00
mpikulski
46dd9886bb
[ https://nvbugs/5661877 ][fix] fix test regression in TestBatchedSampling::test_samples ( #9215 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-11-19 01:44:44 -08:00
CarstyYou
ee941ac779
[ https://nvbugs/5456493 ][feat] add fp8 dense for sm120 ( #9174 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-19 14:40:34 +08:00
ChristinaZ
941a54c66a
[None][feat] Update the indexer topK ( #9255 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-19 11:49:00 +08:00
jellysnack
99ba723e20
[None][fix] logits device and shape issues in dynamic draft path ( #9079 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
2025-11-18 19:22:47 -08:00
Grzegorz Kwasniewski
7905d6c0da
[ #9098 ][feat] Simple sharding latent experts ( #9099 )
...
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-18 21:14:22 -05:00
Grzegorz Kwasniewski
92f86a50d4
[ #9137 ][feat] Factory sharding as default ( #9144 )
...
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-11-18 21:12:03 -05:00
Patrice Castonguay
9b0f45298f
[None][feat] Have ability to cancel disagg request if KV cache resource are exhausted ( #9155 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-18 20:59:17 -05:00
Enwei Zhu
7c4777a571
[TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM ( #8880 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-18 17:40:12 -08:00
Ziyi Xiong
7c4344b92e
[ https://nvbugs/5590408 ][fix] Exclude num of draft tokens from mMaxSeqLenKv ( #9210 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-18 15:41:56 -05:00
Eran Geva
3ac11a6180
[ #9152 ][fix] AutoDeploy fused_allreduce_residual_rmsnorm to support demollm mode ( #9197 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-11-18 22:15:29 +02:00
Chenghao Zhang
f0b68e4c66
[None][feat] AutoDeploy: Perf improvement for small batch size ( #9163 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-11-18 12:11:12 -08:00
Zheyu Fu
c4e02d7f04
[TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). ( #8194 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-11-18 11:13:39 -05:00
Robin Kobus
9913dc25ae
[None][refactor] decoding inputs, part 2 ( #5799 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-18 14:38:51 +01:00
Chang Liu
8e001dd195
[None][fix] DeepSeek V3.2 indexer RoPE fix ( #9232 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-18 20:35:27 +08:00
Lizhi Zhou
07343bb11c
[None][chore] fix a deepseekv3 error when debug mode is on ( #9217 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-11-18 01:14:32 -08:00
ruodil
82480346aa
[ https://nvbugs/5652552 ][fix] add printing for llm args ( #9205 )
...
Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
2025-11-17 23:58:36 -08:00
Tri Dao
fc088e642c
[None][feat] Support Glm4MoeForCausalLM ( #8256 )
...
Signed-off-by: Tri Dao <daominhtri0503@gmail.com>
Co-authored-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-11-18 09:43:21 +08:00
Robin Kobus
df41f220a2
[TRTLLM-8831][feat] Enable early exit with overlap scheduler ( #8587 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-17 18:07:13 +01:00
Mike Iovine
6151a4c9d6
[None][feat] Add simple optimizations for MTP 2-model ( #9176 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-17 10:05:39 -05:00
Kaiyu Xie
04be5a704e
[None] [fix] Fix missing ActivationType issue ( #9171 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-17 10:43:25 +08:00
Anthony Chang
86cfb3ea7e
[None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound ( #9025 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-11-17 10:04:29 +08:00
Jinyang Yuan
6dc70aa0e5
[ https://nvbugs/5613089 ][fix] Fix the rank to access all_rank_chunk_size_list when chunked MoE is used ( #8723 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-11-17 10:01:08 +08:00
sunnyqgg
7862b15a65
[TRTLLM-8778][feat] Add tree attention support for blackwell arch ( #8975 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-11-17 09:01:53 +08:00
Guoming Zhang
e0f69657c7
[None][fix] Update the attention layers counting for Qwen3-next. ( #9072 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-11-16 11:52:56 -08:00
JadoTu
3cde84581d
[None][fix] Make the sliced nvfp4 output contiguous ( #9123 )
...
Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>
2025-11-15 20:00:54 +08:00