TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-23 12:12:39 +08:00

Author	SHA1	Message	Date
Bo Li	fcfec93cad	[TRTLLM-9389][chore] Rename AlltoAll backend names (#9329 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-11-23 13:52:57 -08:00
William Zhang	11a0b276fb	[#9230 ][feat] Slimmed down implementation of nemotron H (#9235 ) * Why? The reference nemotron H code on HuggingFace is out of date, and therefore bugged, and has several untested code paths. This makes an already hairy patching system even hairier. The proposal is to do away with those patches, and replace the original implementation with one that is heavily slimmed down. * What? This PR sets the basis for an alternative path with such a slimmed down implementation that: - fixes bugs in the current HF implementation - adds no new dependencies to TensorRT-LLM - does away with unnecessary features for TensorRT-LLM/ AutoDeploy: - no training related code (dropout, gradient checkpointing, etc.) - no caching logic (we want to replace it with our own anyway) - no attention masking where possible - reuses existing AD custom ops for mamba SSM update / causal conv1d / attention In order for the above to be usable in the AD apparatus, `AutoModelForCausalLMFactory` is extended to allow registrations of custom model implementations. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-11-23 03:13:32 -08:00
Neta Zmora	3952a61681	[#9388 ][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation (#9339 ) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-11-21 17:05:03 -08:00
Chenghao Zhang	564989865c	[TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT (#9106 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-11-21 16:05:48 -08:00
Izzy Putterman	eb7792e875	[None][feat] Eagle: PostNorm and multilayer options (#9233 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2025-11-21 17:39:00 -05:00
Enwei Zhu	13fbd4366a	[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) (#9288 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-11-21 14:03:38 -08:00
Ziyi Xiong	5df907b388	[https://nvbugs/5590408 ][fix] Fallback to greedy sampling in two-model overlap scheduler (#9321 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-11-21 10:19:59 -05:00
HuiGao-NV	6dd2fcd7b3	[https://nvbugs/5629833 ][fix] Don't fill tensors with 0 (#9296 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-11-21 20:50:05 +08:00
mpikulski	095b6864a8	[TRTLLM-8650][fix] beam search request validation (#8433 ) (#9228 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-11-21 04:08:45 -08:00
xxi	cc0dc7c124	[TRTLLM-8957][feat] create communication related classes (#8968 )	2025-11-20 22:32:42 -08:00
Yukun He	9a79f32f7a	[https://nvbugs/5608489 ][fix] Fix output unpack issues for Llama3/4 NVFP4 models. (#8679 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Lizhi Zhou	33b0b945c7	[https://nvbugs/5582277 ][fix] rework DisaggPPTerminationHandler to fix hang issue (#8519 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Jin Li	3454eacd74	[https://nvbugs/5546510 ][fix] Move torch.cuda.Stream out of torch com… (#8494 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
JunyiXu-nv	ee6944bfa2	[https://nvbugs/5569713 ][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 (#8429 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Liao Lanyu	04ad9f96fa	[https://nvbugs/5667687 ][fix] Set correct lm_head_tp_size_upper_bound (#9300 ) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>	2025-11-20 00:41:00 -08:00
Neta Zmora	1d6fbbf45d	[#9236 ][feature] Make sharing of activation_type across SW layers more robust (#9238 ) C++, Python and Python MoE layer all share the definition of ActivationType. Currently this is done thru redefinition which is fragile and can break when adding new activation function types. tensorrt_llm/_torch/utils.py cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h => tensorrt_llm/layers/moe.py cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-11-20 16:06:58 +08:00
Yechan Kim	d5622b2689	[None][fix] Multimodal InputProcessor dummy builder fix (#8916 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-11-19 22:32:21 -08:00
Chang Liu	79a6c9742b	[None][fix] Use fp32 for indexer weight_proj GEMM (#9243 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-11-19 21:52:38 -08:00
Neta Zmora	028fc877a5	[#9096 ][feature] Auto Deploy: configurable fused MoE backend (#9194 ) Allow configuring Auto Deploy's MoE/FP8-MoE backend from external yaml config file. Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-11-19 21:50:22 -08:00
Yukun He	b6bced83c0	[TRTLLM-7963][feat] Use CUDAGraph to improve the tuning accuracy for AutoTuner. (#9089 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-11-20 08:54:29 +08:00
Fanrong Li	d4abb86f3e	[None][fix] fix EPLB for DeepSeek-V3.2-Exp (#9245 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-11-19 13:45:54 -08:00
Faraz	49c45ebef1	[None][fix] change logging for weight loading on unified memory (#9177 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com> Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>	2025-11-19 14:31:19 -05:00
NVShreyas	1eae941d77	[#9237 ][feat] enable iter stats in autodeploy (#9278 ) Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>	2025-11-19 19:29:29 +01:00
NVShreyas	a7c0b54ce7	[None][feat] add specdec to nemotron nas (#8985 ) Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>	2025-11-19 19:28:35 +01:00
Bo Li	d8b05894ee	[None][perf] Adjust select_alltoall_method_type. (#8950 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-11-19 07:43:55 -08:00
mpikulski	46dd9886bb	[https://nvbugs/5661877 ][fix] fix test regression in TestBatchedSampling::test_samples (#9215 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-11-19 01:44:44 -08:00
CarstyYou	ee941ac779	[https://nvbugs/5456493 ][feat] add fp8 dense for sm120 (#9174 ) Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>	2025-11-19 14:40:34 +08:00
ChristinaZ	941a54c66a	[None][feat] Update the indexer topK (#9255 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-11-19 11:49:00 +08:00
jellysnack	99ba723e20	[None][fix] logits device and shape issues in dynamic draft path (#9079 ) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>	2025-11-18 19:22:47 -08:00
Grzegorz Kwasniewski	7905d6c0da	[#9098 ][feat] Simple sharding latent experts (#9099 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-11-18 21:14:22 -05:00
Grzegorz Kwasniewski	92f86a50d4	[#9137 ][feat] Factory sharding as default (#9144 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-11-18 21:12:03 -05:00
Patrice Castonguay	9b0f45298f	[None][feat] Have ability to cancel disagg request if KV cache resource are exhausted (#9155 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-11-18 20:59:17 -05:00
Enwei Zhu	7c4777a571	[TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM (#8880 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-11-18 17:40:12 -08:00
Ziyi Xiong	7c4344b92e	[https://nvbugs/5590408 ][fix] Exclude num of draft tokens from mMaxSeqLenKv (#9210 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-11-18 15:41:56 -05:00
Eran Geva	3ac11a6180	[#9152 ][fix] AutoDeploy fused_allreduce_residual_rmsnorm to support demollm mode (#9197 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-11-18 22:15:29 +02:00
Chenghao Zhang	f0b68e4c66	[None][feat] AutoDeploy: Perf improvement for small batch size (#9163 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-11-18 12:11:12 -08:00
Zheyu Fu	c4e02d7f04	[TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). (#8194 ) Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>	2025-11-18 11:13:39 -05:00
Robin Kobus	9913dc25ae	[None][refactor] decoding inputs, part 2 (#5799 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-18 14:38:51 +01:00
Chang Liu	8e001dd195	[None][fix] DeepSeek V3.2 indexer RoPE fix (#9232 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-11-18 20:35:27 +08:00
Lizhi Zhou	07343bb11c	[None][chore] fix a deepseekv3 error when debug mode is on (#9217 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2025-11-18 01:14:32 -08:00
ruodil	82480346aa	[https://nvbugs/5652552 ][fix] add printing for llm args (#9205 ) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>	2025-11-17 23:58:36 -08:00
Tri Dao	fc088e642c	[None][feat] Support Glm4MoeForCausalLM (#8256 ) Signed-off-by: Tri Dao <daominhtri0503@gmail.com> Co-authored-by: Xuanyu Chen <xuanyuc@nvidia.com>	2025-11-18 09:43:21 +08:00
Robin Kobus	df41f220a2	[TRTLLM-8831][feat] Enable early exit with overlap scheduler (#8587 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-17 18:07:13 +01:00
Mike Iovine	6151a4c9d6	[None][feat] Add simple optimizations for MTP 2-model (#9176 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-11-17 10:05:39 -05:00
Kaiyu Xie	04be5a704e	[None] [fix] Fix missing ActivationType issue (#9171 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-11-17 10:43:25 +08:00
Anthony Chang	86cfb3ea7e	[None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound (#9025 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-11-17 10:04:29 +08:00
Jinyang Yuan	6dc70aa0e5	[https://nvbugs/5613089 ][fix] Fix the rank to access all_rank_chunk_size_list when chunked MoE is used (#8723 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-11-17 10:01:08 +08:00
sunnyqgg	7862b15a65	[TRTLLM-8778][feat] Add tree attention support for blackwell arch (#8975 ) Signed-off-by: qgai <qgai@nvidia.com>	2025-11-17 09:01:53 +08:00
Guoming Zhang	e0f69657c7	[None][fix] Update the attention layers counting for Qwen3-next. (#9072 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-11-16 11:52:56 -08:00
JadoTu	3cde84581d	[None][fix] Make the sliced nvfp4 output contiguous (#9123 ) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>	2025-11-15 20:00:54 +08:00


1175 Commits