TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Neta Zmora	966231d29c	[#9626 ][feat] Add an auto-deploy transform for using cutlass FP4 MoE kernels (#10304 ) Add a transform to relace torch.ops.auto_deploy.torch_quant_nvfp4_moe with the optimized torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused. Currently generates the wrong results when the number of rows in MoE FC1 weights is not divisible by 128, so torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused is not set as the default FP4 MoE implementation (i.e. the transform is disabled). Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-29 23:18:15 +02:00
Ziyi Xiong	c59aa8bec5	[TRTLLM-9962][feat] Some optimizations for two-model spec dec (#10208 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-28 12:52:04 +08:00
JunyiXu-nv	55bc6a5ff8	[https://nvbugs/5753250 ][fix] Fix undefined local variable in responses utils (#10154 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-28 06:59:32 +08:00
shivghai	ee07a7c55e	[None][fix] [Gemma3] Fix RoPE for local attention for Gemma3 (#9961 ) Signed-off-by: Shiv Ghai <8965168+shivghai@users.noreply.github.com>	2025-12-27 11:50:59 -08:00
Guoming Zhang	1865020b6f	[TRTLLM-8577][feat] Clean the Qwen3-next code by removing Qwen3NextCo… (#10228 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-12-27 22:49:55 +08:00
Olya Kozlova	55f3cda66d	[None][fix] Fix request_id for best_of/n case (#8368 ) Signed-off-by: Olya Kozlova <okozlova@nvidia.com>	2025-12-26 22:20:24 +01:00
Jin Li	c04563657e	[TRTLLM-7735][feat] Attention NVFP4 out support for torch compile (#9740 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-27 00:07:20 +08:00
Pengyun Lin	c5b0f9e436	[https://nvbugs/5633700 ][fix] Cache tiktoken vocab for gpt-oss (#10219 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-26 18:39:03 +08:00
Wanli Jiang	14554ab3f3	[None][feat] Support multi-gpu running for nemotron-v3-nano and super (#10118 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-12-26 11:23:14 +08:00
Enwei Zhu	13ffe52ad0	[None][fix] Allow YAML config overwriting CLI args for trtllm-eval (#10296 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 15:08:15 -05:00
Neta Zmora	f3f02315df	[None][chore]: small refactoring to auto-deploy MoE operator (#10300 ) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-25 12:27:11 -05:00
bhsueh_NV	db3430f589	[None][feat] Support VLM part for Mistral Large 3 (#10188 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-12-25 11:20:58 -05:00
Jin Li	7e4cef9def	[None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 (#9446 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-25 10:23:04 -05:00
Ziyi Xiong	d8b5aeb061	[https://nvbugs/5652062 ][fix] Rewind kv_cache and reset draft tokens (#10160 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-25 09:13:51 -05:00
ZhichenJiang	46e4af5688	[TRTLLM-9831][perf] Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201 ) Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 09:04:20 -05:00
Zhenhuan Chen	8462cf6c96	[TRTLLM-9578][feat] make PDL enabled by default (#9695 ) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>	2025-12-25 07:15:24 -05:00
Xianjie Qiao	53b81783b1	[None][fix] Fix pageable H2D memcopy issue on GB200 (#10289 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-12-25 18:15:57 +08:00
gramnarayan	a9eb5afc9f	[#9241 ][feat] AutoDeploy: Support Eagle3 Speculative Decoding (#9869 ) Support two model flow with no overlap scheduler or chain drafter. Drafting model is in PyTorch backend. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>	2025-12-24 23:30:42 -05:00
Ziyi Xiong	43178590d1	[TRTLLM-10143][feat] Reuse previous draft requests if possible (#10263 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-24 17:48:38 -08:00
Neta Zmora	c4b36d31ff	[#10137 ][feat] AutoDeploy FP8 MoE refactor (#10138 ) The trtllm (cutlass) fp8 moe operator performs W3+W1 fusion (concat) during inference and we want to move this fusion to the model optimization time. The Cutlass MoE kernel is used thru a trtllm torch operator. Its implementation uses two FC operations (fc1 and fc2) while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3) so when we switch from the torch.moe op to the trtllm.moe op we also change terminology from w1, w2, w3 to fc1, fc2. Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-24 18:58:10 +02:00
Necofish	8614cd3439	[None][fix] fix: resolve GPU memory imbalance in concurrent weight loading (#6472 ) Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn> Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn> Signed-off-by: Jie Li <lijie@nvidia.com> Co-authored-by: Jie Li <lijie@nvidia.com>	2025-12-24 09:43:09 -05:00
Suyog Gupta	e2891a6c77	[#10052 ][feat] AutoDeploy enable cudagraphs for flashinfer BatchDecode (#10193 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-12-24 05:55:09 -08:00
shuyixiong	f4f0fe85e9	[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939 ) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-24 15:27:01 +08:00
Yukun He	595daa5089	[TRTLLM-9615][feat] Support synchronization through PP ranks in the distributed tuning system (#10011 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-24 15:03:10 +08:00
Fanrong Li	156f6453dc	[TRTLLM-9798][feat] Change to use new DeepGEMM MQA sm100 kernel for MTP-3 (#10226 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-24 14:39:12 +08:00
Balaram Buddharaju	8c1cfc872b	[TRTLLM-9493][feat] Custom AllToAll for helix parallelism (#9986 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-23 18:14:30 -08:00
Grzegorz Kwasniewski	06900a7f19	[TRTLLM-9565][fix] Fix deepseek sharding (#9984 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-12-23 10:28:14 -05:00
Xianjie Qiao	871c6b435c	[None] [feat] skip batch_tokenize_prompts in CustomDataset (#10214 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-12-23 17:40:57 +08:00
Yiqing Yan	59b05dc0a8	[None][chore] Bump version to 1.2.0rc7 (#10216 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-23 15:07:47 +08:00
Harshini Komali	d691371eaf	[TRTLLM-9091] [feat] Replace GenAI-Perf with AIPerf (#9310 ) Signed-off-by: lkomali <lkomali@nvidia.com> Signed-off-by: Harshini Komali <157742537+lkomali@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-12-23 13:25:55 +08:00
Li Min	1e82ff7a0c	[TRTLLM-9989][fix] Fix tvm_ffi aaarch64 issue. (#10199 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-12-23 10:20:40 +08:00
Yuxian Qiu	696f754ef4	[None][fix] avoid implicit cudaStreamSynchronize in sample_async. (#10120 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-23 10:15:40 +08:00
Tailing Yuan	648196f8ae	[TRTLLM-9432][feat] Reduce synchronization and recompilation for qwen3-next (#9691 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-12-23 10:14:29 +08:00
Faraz	f05af48bca	[https://nvbugs/5747674 ][fix] Add contiguous() before view() in load_expert_w3_w1_weight and load (#10136 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-12-22 21:03:34 -05:00
Fanrong Li	0d2500c631	[TRTLLM-9677][feat] Support DeepSeek-V3.2 tool parser (#10126 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-23 08:46:47 +08:00
Grzegorz Kwasniewski	ccc64da287	[TRTLLM-9847][fix] WAR fix hanging fused allreduce. (#10087 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-12-23 00:03:32 +01:00
tcherckez-nvidia	12e1cb8d7e	[#9717 ][chore] Refactor MoE code to use enums (#9910 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-22 15:14:56 -05:00
JunyiXu-nv	aaa87abf41	[TRTLLM-7906][feat] Support multiple post process for Responses API (#9908 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-22 11:33:34 -05:00
William Zhang	a6a88985cf	[TRTLLM-9409][feat] Pass MRoPE tensors for EPD disagg (#9758 ) * Why? Certain VLMs like the Qwen family need more than just the multimodal embeddings in the language model, and need MRoPE position IDs and deltas. Prior to this commit, only the embeddings could be communicated from the encoder worker to the prefill worker. * What? This commit extends the `DisaggregatedParams` to include the MRoPE information. It also adjusts several pieces of code required to communicate that between E, P and D workers. Closes TRTLLM-9409. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-12-22 06:32:49 -05:00
Yan Chunwei	ea6cd76c55	[None][refactor] simplify get_stats and get_kvcache_events with rpc (#9980 ) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-12-22 18:23:43 +08:00
JadoTu	7421224d69	[None][fix] NVFP4 linear method's weight and weight_scale padding (#10148 ) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>	2025-12-22 15:00:31 +08:00
Fanrong Li	f0bd60a395	[https://nvbugs/5684820 ][fix] fix the detokenizer issue for DeepSeek-v3.2 (#10106 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-22 10:56:33 +08:00
Balaram Buddharaju	5266475014	[None][feat] Cudagraph updates for helix parallelism (#10141 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-21 15:21:52 -05:00
shuyixiong	4fc6036276	[https://nvbugs/5702793 ][fix] Fix view operation on uncontiguous tensor (#10147 ) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-21 11:47:20 -05:00
bhsueh_NV	cd4b4f43fa	[None][feat] Support Eagle3 on Mistral Large3 (#9971 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-12-21 10:25:45 -05:00
xxi	5ae154022a	[TRTLLM-9872][fix] clear the failed test at CI when enalbe_configurab… (#10067 ) Signed-off-by: xxi <xxi@nvidia.com>	2025-12-21 08:14:50 -05:00
Bo Li	a66eeab537	[TRTLLM-9805][feat] Skip Softmax Attention. (#9821 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-12-21 02:52:42 -05:00
Enwei Zhu	21a93fbf9d	[TRTLLM-9992][perf] Enable PDL for CuteDSL kernels and overlap MoeOutputMemset (#10043 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-20 03:12:41 -05:00
Yuxian Qiu	e75331480f	[None][fix] fix draft_lengths for CUDA graph capture. (#10004 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-20 09:04:48 +08:00
Pengyun Lin	ac03915dc3	[TRTLLM-9604][feat] DS R1 & V3.1 tool parser (#10010 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-19 17:20:03 +08:00

1 2 3 4 5 ...

1909 Commits