TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-05 18:51:38 +08:00

Author	SHA1	Message	Date
Pengyun Lin	c04cf4334e	[TRTLLM-8242][feat] Add stability tags for serve subcommand (#10012 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2026-01-05 14:16:15 +08:00
Yukun He	0937df2c68	[TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-05 13:44:09 +08:00
Tailing Yuan	a7fe043b13	[None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2026-01-05 11:23:04 +08:00
Cheng Hang	656c705ff1	[None][feat] sm100 weight-only kernel (#10190 ) Signed-off-by: Cheng Hang <chang@nvidia.com>	2026-01-05 09:44:36 +08:00
Fanrong Li	b5a1e10bc0	[https://nvbugs/5779534 ][fix] fix buffer reuse for CUDA graph attention metadata (#10393 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2026-01-05 09:43:44 +08:00
Wanli Jiang	da0830670a	[TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus (#10234 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2026-01-05 09:41:49 +08:00
Lizhi Zhou	82c1ba84a7	[https://nvbugs/5649010 ][fix] use 0 port as arbitrary port when disagg service discovery is enabled (#10383 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2026-01-05 09:40:40 +08:00
bhsueh_NV	0517b62789	[https://nvbugs/5772363 ][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2026-01-05 09:04:13 +08:00
Faraz	8e2065b4d9	[https://nvbugs/5670469 ][fix] Filter 0s and choose min of kv_head for Nemotron model (#10206 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2026-01-05 08:42:53 +08:00
dongfengy	afc533193d	[None][feat] Support nvfp4 for gptoss (#8956 ) Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>	2026-01-04 08:57:44 -05:00
Jaedeok Kim	a4dcc6a711	[TRTLLM-10171][fix] Correct attention handling in ModelConfig and KVCacheManager (#10330 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2026-01-04 06:07:30 -05:00
Grzegorz Kwasniewski	0d1f5ad7a2	[TRTLLM-10358][feat] Added proper rescaling of FP4 weights (#10378 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2026-01-03 16:26:16 -05:00
Izzy Putterman	bdf6953ddc	[None][feat] Eagle: MLA Based Eagle (#9677 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2026-01-02 13:45:07 -05:00
Gal Hubara-Agam	f3dd6da080	[#10056 ][chore] AutoDeploy: Enable Nemo SuperV3 accuracy test (#10308 ) Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>	2026-01-02 11:20:19 +02:00
Balaram Buddharaju	4a1b742aa0	[TRTLLM-9467][fix] Fix PP+CP combination with helix parallelism (#10312 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-01 13:42:53 -05:00
Gal Hubara-Agam	5845951538	[#10056 ][fix] AutoDeploy: Handle deletion of nested params in sharding (#10376 ) Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>	2026-01-01 08:11:11 -05:00
tcherckez-nvidia	4868772ad7	[None][feat] Add export data to build and run script for AD (#10299 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2026-01-01 04:54:47 -05:00
Lucas Liebenwein	1bbe71b3ed	[#10244 ][feat] AutoDeploy: separate prefill/decode in flashinfer (#10252 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-12-31 17:01:24 -05:00
Mike Iovine	9085021aa4	[None][feat] Implement sampling for MTP 1-model (#10019 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-12-31 13:48:34 -05:00
Simeng Liu	84d107b2f0	[https://nvbugs/5717993 ][fix] Add execution_stream across PyExecutor, KVCacheManager, PeftCacheManager to ensure proper CUDA stream synchronization between KV cache transfer operations and model forward kernels. (#10060 ) Signed-off-by: SimengLiu-nv <simengl@nvidia.com>	2025-12-31 09:22:54 -08:00
tcherckez-nvidia	464847c6be	[#9717 ][chore] Standardize MoE weights interface (#10295 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-31 07:37:18 -05:00
Jin Li	ef1d4a40b5	[https://nvbugs/5727475 ][fix] Avoid use property with setter in nn.Mo… (#10212 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-31 06:21:36 -05:00
Necofish	73870ae4ad	[None][feat] support Qwen3-VL dense model in pytorch backend (#9060 ) Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>	2025-12-31 17:54:26 +09:00
Pengyun Lin	fad000589d	[None][chore] Unify DS tool parser names (#10239 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-31 14:40:07 +08:00
Jin Li	34c2fd50a9	[https://nvbugs/5707359 ][fix] Unwaive OOM case that should be fixed by #9446 (#10334 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-31 10:41:39 +08:00
Yuxian Qiu	1f3afb8e6f	[None][feat] Implement send_object for TorchDist. (#10213 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-31 10:40:52 +08:00
Eran Geva	74832a1895	[https://nvbugs/5766986 ][fix] fixed the shard_all_unprocessed default value to align with the default.yml (#10271 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-12-30 08:54:13 -05:00
Bo Li	1f0365da36	[None][infra] Add LongBenchV1 to trtllm-eval. (#10265 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-12-30 21:39:34 +08:00
binghanc	692d8f2023	[TRTLLM-9455][feat] support for new checkpoint (#10082 ) Signed-off-by: binghanc <176802681+binghanc@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-12-30 14:46:39 +08:00
Neta Zmora	966231d29c	[#9626 ][feat] Add an auto-deploy transform for using cutlass FP4 MoE kernels (#10304 ) Add a transform to replace torch.ops.auto_deploy.torch_quant_nvfp4_moe with the optimized torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused. Currently generates the wrong results when the number of rows in MoE FC1 weights is not divisible by 128, so torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused is not set as the default FP4 MoE implementation (i.e. the transform is disabled). Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-29 23:18:15 +02:00
Ziyi Xiong	c59aa8bec5	[TRTLLM-9962][feat] Some optimizations for two-model spec dec (#10208 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-28 12:52:04 +08:00
JunyiXu-nv	55bc6a5ff8	[https://nvbugs/5753250 ][fix] Fix undefined local variable in responses utils (#10154 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-28 06:59:32 +08:00
shivghai	ee07a7c55e	[None][fix] [Gemma3] Fix RoPE for local attention for Gemma3 (#9961 ) Signed-off-by: Shiv Ghai <8965168+shivghai@users.noreply.github.com>	2025-12-27 11:50:59 -08:00
Guoming Zhang	1865020b6f	[TRTLLM-8577][feat] Clean the Qwen3-next code by removing Qwen3NextCo… (#10228 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-12-27 22:49:55 +08:00
Olya Kozlova	55f3cda66d	[None][fix] Fix request_id for best_of/n case (#8368 ) Signed-off-by: Olya Kozlova <okozlova@nvidia.com>	2025-12-26 22:20:24 +01:00
Jin Li	c04563657e	[TRTLLM-7735][feat] Attention NVFP4 out support for torch compile (#9740 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-27 00:07:20 +08:00
Pengyun Lin	c5b0f9e436	[https://nvbugs/5633700 ][fix] Cache tiktoken vocab for gpt-oss (#10219 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-26 18:39:03 +08:00
Wanli Jiang	14554ab3f3	[None][feat] Support multi-gpu running for nemotron-v3-nano and super (#10118 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-12-26 11:23:14 +08:00
Enwei Zhu	13ffe52ad0	[None][fix] Allow YAML config overwriting CLI args for trtllm-eval (#10296 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 15:08:15 -05:00
Neta Zmora	f3f02315df	[None][chore]: small refactoring to auto-deploy MoE operator (#10300 ) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-25 12:27:11 -05:00
bhsueh_NV	db3430f589	[None][feat] Support VLM part for Mistral Large 3 (#10188 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-12-25 11:20:58 -05:00
Jin Li	7e4cef9def	[None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 (#9446 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-25 10:23:04 -05:00
Ziyi Xiong	d8b5aeb061	[https://nvbugs/5652062 ][fix] Rewind kv_cache and reset draft tokens (#10160 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-25 09:13:51 -05:00
ZhichenJiang	46e4af5688	[TRTLLM-9831][perf] Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201 ) Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 09:04:20 -05:00
Zhenhuan Chen	8462cf6c96	[TRTLLM-9578][feat] make PDL enabled by default (#9695 ) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>	2025-12-25 07:15:24 -05:00
Xianjie Qiao	53b81783b1	[None][fix] Fix pageable H2D memcopy issue on GB200 (#10289 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-12-25 18:15:57 +08:00
gramnarayan	a9eb5afc9f	[#9241 ][feat] AutoDeploy: Support Eagle3 Speculative Decoding (#9869 ) Support two model flow with no overlap scheduler or chain drafter. Drafting model is in PyTorch backend. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>	2025-12-24 23:30:42 -05:00
Ziyi Xiong	43178590d1	[TRTLLM-10143][feat] Reuse previous draft requests if possible (#10263 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-24 17:48:38 -08:00
Neta Zmora	c4b36d31ff	[#10137 ][feat] AutoDeploy FP8 MoE refactor (#10138 ) The trtllm (cutlass) fp8 moe operator performs W3+W1 fusion (concat) during inference and we want to move this fusion to the model optimization time. The Cutlass MoE kernel is used thru a trtllm torch operator. Its implementation uses two FC operations (fc1 and fc2) while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3) so when we switch from the torch.moe op to the trtllm.moe op we also change terminology from w1, w2, w3 to fc1, fc2. Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-24 18:58:10 +02:00
Necofish	8614cd3439	[None][fix] fix: resolve GPU memory imbalance in concurrent weight loading (#6472 ) Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn> Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn> Signed-off-by: Jie Li <lijie@nvidia.com> Co-authored-by: Jie Li <lijie@nvidia.com>	2025-12-24 09:43:09 -05:00
Suyog Gupta	e2891a6c77	[#10052 ][feat] AutoDeploy enable cudagraphs for flashinfer BatchDecode (#10193 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-12-24 05:55:09 -08:00
shuyixiong	f4f0fe85e9	[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939 ) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-24 15:27:01 +08:00
Yukun He	595daa5089	[TRTLLM-9615][feat] Support synchronization through PP ranks in the distributed tuning system (#10011 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-24 15:03:10 +08:00
Fanrong Li	156f6453dc	[TRTLLM-9798][feat] Change to use new DeepGEMM MQA sm100 kernel for MTP-3 (#10226 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-24 14:39:12 +08:00
Balaram Buddharaju	8c1cfc872b	[TRTLLM-9493][feat] Custom AllToAll for helix parallelism (#9986 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-23 18:14:30 -08:00
Grzegorz Kwasniewski	06900a7f19	[TRTLLM-9565][fix] Fix deepseek sharding (#9984 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-12-23 10:28:14 -05:00
Xianjie Qiao	871c6b435c	[None] [feat] skip batch_tokenize_prompts in CustomDataset (#10214 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-12-23 17:40:57 +08:00
Yiqing Yan	59b05dc0a8	[None][chore] Bump version to 1.2.0rc7 (#10216 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-23 15:07:47 +08:00
Harshini Komali	d691371eaf	[TRTLLM-9091] [feat] Replace GenAI-Perf with AIPerf (#9310 ) Signed-off-by: lkomali <lkomali@nvidia.com> Signed-off-by: Harshini Komali <157742537+lkomali@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-12-23 13:25:55 +08:00
Li Min	1e82ff7a0c	[TRTLLM-9989][fix] Fix tvm_ffi aarch64 issue. (#10199 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-12-23 10:20:40 +08:00
Yuxian Qiu	696f754ef4	[None][fix] avoid implicit cudaStreamSynchronize in sample_async. (#10120 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-23 10:15:40 +08:00
Tailing Yuan	648196f8ae	[TRTLLM-9432][feat] Reduce synchronization and recompilation for qwen3-next (#9691 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-12-23 10:14:29 +08:00
Faraz	f05af48bca	[https://nvbugs/5747674 ][fix] Add contiguous() before view() in load_expert_w3_w1_weight and load (#10136 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-12-22 21:03:34 -05:00
Fanrong Li	0d2500c631	[TRTLLM-9677][feat] Support DeepSeek-V3.2 tool parser (#10126 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-23 08:46:47 +08:00
Grzegorz Kwasniewski	ccc64da287	[TRTLLM-9847][fix] WAR fix hanging fused allreduce. (#10087 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-12-23 00:03:32 +01:00
tcherckez-nvidia	12e1cb8d7e	[#9717 ][chore] Refactor MoE code to use enums (#9910 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-22 15:14:56 -05:00
JunyiXu-nv	aaa87abf41	[TRTLLM-7906][feat] Support multiple post process for Responses API (#9908 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-22 11:33:34 -05:00
William Zhang	a6a88985cf	[TRTLLM-9409][feat] Pass MRoPE tensors for EPD disagg (#9758 ) * Why? Certain VLMs like the Qwen family need more than just the multimodal embeddings in the language model, and need MRoPE position IDs and deltas. Prior to this commit, only the embeddings could be communicated from the encoder worker to the prefill worker. * What? This commit extends the `DisaggregatedParams` to include the MRoPE information. It also adjusts several pieces of code required to communicate that between E, P and D workers. Closes TRTLLM-9409. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-12-22 06:32:49 -05:00
Yan Chunwei	ea6cd76c55	[None][refactor] simplify get_stats and get_kvcache_events with rpc (#9980 ) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-12-22 18:23:43 +08:00
JadoTu	7421224d69	[None][fix] NVFP4 linear method's weight and weight_scale padding (#10148 ) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>	2025-12-22 15:00:31 +08:00
Fanrong Li	f0bd60a395	[https://nvbugs/5684820 ][fix] fix the detokenizer issue for DeepSeek-v3.2 (#10106 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-22 10:56:33 +08:00
Balaram Buddharaju	5266475014	[None][feat] Cudagraph updates for helix parallelism (#10141 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-21 15:21:52 -05:00
shuyixiong	4fc6036276	[https://nvbugs/5702793 ][fix] Fix view operation on uncontiguous tensor (#10147 ) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-21 11:47:20 -05:00
bhsueh_NV	cd4b4f43fa	[None][feat] Support Eagle3 on Mistral Large3 (#9971 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-12-21 10:25:45 -05:00
xxi	5ae154022a	[TRTLLM-9872][fix] clear the failed test at CI when enalbe_configurab… (#10067 ) Signed-off-by: xxi <xxi@nvidia.com>	2025-12-21 08:14:50 -05:00
Bo Li	a66eeab537	[TRTLLM-9805][feat] Skip Softmax Attention. (#9821 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-12-21 02:52:42 -05:00
Enwei Zhu	21a93fbf9d	[TRTLLM-9992][perf] Enable PDL for CuteDSL kernels and overlap MoeOutputMemset (#10043 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-20 03:12:41 -05:00
Yuxian Qiu	e75331480f	[None][fix] fix draft_lengths for CUDA graph capture. (#10004 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-20 09:04:48 +08:00
Pengyun Lin	ac03915dc3	[TRTLLM-9604][feat] DS R1 & V3.1 tool parser (#10010 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-19 17:20:03 +08:00
Chang Liu	31bc14b350	[TRTLLM-9654][feat] Support DeepSeek-V32 chat template (#9814 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-12-19 17:05:38 +08:00
Ziyi Xiong	70b4d282c6	[TRTLLM-7736][feat] Incrementally update the inputs of target and draft models (#9708 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-19 15:11:25 +08:00
William Zhang	478b6b20a1	[#9230 ][refactor] Replace nemotron patches with custom model implementation (#9751 ) * Why? Patching for nemotron H models was growing out of hand, and made certain optimizations more complex than they needed to be. * What? This commit finally gets rid of them, and replaces them with the custom model implementation in `modeling_nemotron_h.py`. Closes #9230 Closes NvBug 5747867 Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-12-18 19:36:27 -08:00
Wangjue Yao	9f283f330b	[None][feat] Support Mooncake transfer engine as a cache transceiver backend (#8309 ) Signed-off-by: wjueyao <wyao123@terpmail.umd.edu> Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-12-19 10:09:51 +08:00
Lizhi Zhou	f02782a6f2	[https://nvbugs/5726066 ][fix] fix auto-scaling related failures (#9845 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Co-authored-by: Emma Qiao <qqiao@nvidia.com>	2025-12-18 16:37:48 -05:00
Enwei Zhu	6fe89ea00f	[TRTLLM-9819][perf] Reuse alltoall workspace for CuteDSL MoE output (#9840 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-18 10:36:38 -08:00
CarstyYou	0b279f4ad4	[https://nvbugs/5456493 ][feat] Add fp8 bmm on sm120 (#9687 ) Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>	2025-12-18 22:57:20 +08:00
ZhichenJiang	4e55b83101	[None][perf] Add more optimization options for MOE CuteDSL finalized kernel (#10042 ) Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com>	2025-12-18 22:49:28 +08:00
Lucas Liebenwein	76ec820465	[#7532 ][feat] AutoDeploy: gather logits before lm head (#9962 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-12-17 19:50:13 -08:00
Yuan Tong	f7e245668b	[TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performance (Core fix has been merged via #9353 ) (#9655 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-12-17 17:56:01 +08:00
Yukun He	00c0564334	[None][chore] Remove unnecessary warning log for tuning. (#10077 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-17 01:51:17 -08:00
Yukun He	18b335d584	[TRTLLM-9989][fix] Disable tvm_ffi for CuteDSL nvFP4 dense GEMM. (#10040 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-17 00:41:26 -08:00
Yukun He	2fd1a23e4c	[TRTLLM-9998][fix] Change trtllm-gen MoE distributed tuning strategy back to INDEPENDENT (#10036 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-17 00:35:22 -08:00
Void	47404196fa	[None][fix] Enabled simultaneous support for low-precision combine and MTP. (#9091 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-12-17 13:37:08 +08:00
Aurelien Chartier	7175d89b48	[None][fix] Fix iteration stats for spec-dec (#9855 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-12-16 14:11:38 -08:00
ruodil	07f307d131	[https://nvbugs/5652552 ][fix] cherry-pick add printing for llm args (#9206 ) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-12-16 13:33:20 -05:00
Lizhi Zhou	bd13957e70	[TRTLLM-9181][feat] improve disagg-server prometheus metrics; synchronize workers' clocks when workers are dynamic (#9726 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2025-12-16 05:16:32 -08:00
Enwei Zhu	609d1d0383	[None][fix] Fix Illegal Memory Access for CuteDSL Grouped GEMM (#10008 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-16 04:06:49 -08:00
Wanli Jiang	8af51211c1	[FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass (#9358 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-12-16 12:41:17 +08:00
Yechan Kim	8ba8699f66	[TRTLLM-8310][feat] Add Qwen3-VL-MoE (#9689 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-12-15 20:05:20 -08:00
ChristinaZ	dff77efa2a	[None][feat] Add routing support for the new model for both cutlass and trtllm moe backend (#9792 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-12-15 19:59:08 -08:00

1988 Commits