TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-04 18:21:52 +08:00

Author	SHA1	Message	Date
Frida Hou	bf16fbd86c	[#9283 ][feat] AutoDeploy: separate rms pattern detection from fusion (#9969 ) Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>	2026-01-13 14:57:27 -05:00
benzh-2025	6df2c8a074	[None][feat] add fp4 gemm + allreduce (#9729 ) Signed-off-by: benzh Signed-off-by: benzh-2025	2026-01-13 21:11:13 +08:00
mpikulski	bf7998f1b8	[TRTLLM-9522][test] cover LLM API `multi_modal_embeddings` (#9963 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2026-01-12 11:38:22 +01:00
Yechan Kim	8e0d20d901	[TRTLLM-10195][feat] K-EXAONE support (#10355 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Jaedeok Kim <jaedeokk@nvidia.com>	2026-01-12 00:29:51 +09:00
Chenghao Zhang	38f249b479	[https://nvbugs/5548861 ][fix] AutoDeploy: Fix the test (#10521 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2026-01-09 13:30:24 -08:00
Yechan Kim	7295af68ba	[None][fix] Enable AttentionDP on Qwen3-VL and fix test (#10435 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2026-01-10 00:13:26 +09:00
William Zhang	c0ae6bbdbe	[None][feat] EPD for Qwen3 VL (#10470 ) * Why? We would like to support EPD disaggregated serving for Qwen3 VL. * What? This commit adds such support, and extends existing unit tests for correctness checks. Some minor (protected) interface changes had to be made to the weight mapper as a side-effect. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2026-01-08 06:45:54 -05:00
Lucas Liebenwein	30f8455d29	[https://nvbugs/5747878 ][fix] unwaive llama4 scout tests (#10468 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-01-07 23:33:45 -05:00
Yuxian Qiu	b85c447ceb	[https://nvbugs/5784543 ][fix] Setup dist before using autotuner. (#10491 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2026-01-08 10:32:50 +08:00
Lucas Liebenwein	d736c7f290	[https://nvbugs/5761665 ][fix] AutoDeploy: handle bugs for 25.12 dlfw upgrade (#10511 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-01-07 20:16:53 -05:00
Lucas Liebenwein	6095c80e56	[https://nvbugs/5721907 ][fix] AutoDeploy: improve numerical stability of flashinfer attention test (#10467 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-01-06 21:11:06 -05:00
Lucas Liebenwein	bb6a3973aa	[https://nvbugs/5732942 ][fix] AutoDeploy: handle transformers 4.57.1 upgrade fixes (#10466 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-01-06 19:55:49 -05:00
Mike Iovine	77be1b7572	[https://nvbugs/5749988 ][fix] Remove redundant qwen3 spec dec test (#10387 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2026-01-06 11:46:34 -05:00
alel	6b8ae6fa81	[None][feat] CuteDSL MOE FC1 Enhancement (#10088 ) Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com>	2026-01-06 09:30:43 +08:00
Anthony Chang	225d3a9001	[None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2026-01-05 17:16:12 +01:00
Yukun He	d272f1a9bc	[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-05 15:44:37 +08:00
Yukun He	0937df2c68	[TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-05 13:44:09 +08:00
dongfengy	afc533193d	[None][feat] Support nvfp4 for gptoss (#8956 ) Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>	2026-01-04 08:57:44 -05:00
Jaedeok Kim	a4dcc6a711	[TRTLLM-10171][fix] Correct attention handling in ModelConfig and KVCacheManager (#10330 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2026-01-04 06:07:30 -05:00
Izzy Putterman	bdf6953ddc	[None][feat] Eagle: MLA Based Eagle (#9677 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2026-01-02 13:45:07 -05:00
Balaram Buddharaju	4a1b742aa0	[TRTLLM-9467][fix] Fix PP+CP combination with helix parallelism (#10312 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-01 13:42:53 -05:00
Lucas Liebenwein	1bbe71b3ed	[#10244 ][feat] AutoDeploy: separate prefill/decode in flashinfer (#10252 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-12-31 17:01:24 -05:00
Simeng Liu	84d107b2f0	[https://nvbugs/5717993 ][fix] Add execution_stream across PyExecutor, KVCacheManager, PeftCacheManager to ensure proper CUDA stream synchronization between KV cache transfer operations and model forward kernels. (#10060 ) Signed-off-by: SimengLiu-nv <simengl@nvidia.com>	2025-12-31 09:22:54 -08:00
tcherckez-nvidia	464847c6be	[#9717 ][chore] Standardize MoE weights interface (#10295 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-31 07:37:18 -05:00
Necofish	73870ae4ad	[None][feat] support Qwen3-VL dense model in pytorch backend (#9060 ) Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>	2025-12-31 17:54:26 +09:00
Neta Zmora	966231d29c	[#9626 ][feat] Add an auto-deploy transform for using cutlass FP4 MoE kernels (#10304 ) Add a transform to replace torch.ops.auto_deploy.torch_quant_nvfp4_moe with the optimized torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused. Currently generates the wrong results when the number of rows in MoE FC1 weights is not divisible by 128, so torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused is not set as the default FP4 MoE implementation (i.e. the transform is disabled). Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-29 23:18:15 +02:00
Guoming Zhang	93ac0bc1dc	[TRTLLM-10126][feat] Increase topk upper limit to 22 for NVLinkOneSid… (#10229 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-12-27 22:48:10 +08:00
Neta Zmora	f3f02315df	[None][chore]: small refactoring to auto-deploy MoE operator (#10300 ) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-25 12:27:11 -05:00
Ziyi Xiong	d8b5aeb061	[https://nvbugs/5652062 ][fix] Rewind kv_cache and reset draft tokens (#10160 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-25 09:13:51 -05:00
ZhichenJiang	46e4af5688	[TRTLLM-9831][perf] Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201 ) Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 09:04:20 -05:00
Gabriel Wu	1d01214ff0	[None][feat] Drop non-deepgemm fp8 block scale gemm (#10256 ) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>	2025-12-25 14:52:52 +08:00
Neta Zmora	c4b36d31ff	[#10137 ][feat] AutoDeploy FP8 MoE refactor (#10138 ) The trtllm (cutlass) fp8 moe operator performs W3+W1 fusion (concat) during inference and we want to move this fusion to the model optimization time. The Cutlass MoE kernel is used thru a trtllm torch operator. Its implementation uses two FC operations (fc1 and fc2) while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3) so when we switch from the torch.moe op to the trtllm.moe op we also change terminology from w1, w2, w3 to fc1, fc2. Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-24 18:58:10 +02:00
shuyixiong	f4f0fe85e9	[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939 ) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-24 15:27:01 +08:00
Fanrong Li	156f6453dc	[TRTLLM-9798][feat] Change to use new DeepGEMM MQA sm100 kernel for MTP-3 (#10226 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-24 14:39:12 +08:00
Balaram Buddharaju	8c1cfc872b	[TRTLLM-9493][feat] Custom AllToAll for helix parallelism (#9986 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-23 18:14:30 -08:00
Shiyu Li	3ddc9d2b48	[https://nvbugs/5729697 ][fix] MNNVL Allreduce: use CUDA runtime instead of Macro to get SM version. (#10062 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-12-23 16:07:07 +08:00
Bo Li	cc1323be24	[None][fix] Fix the bug for top_k=10 in NVLinkOneSided AlltoAll. (#10197 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-12-23 02:13:37 -05:00
Yuxian Qiu	696f754ef4	[None][fix] avoid implicit cudaStreamSynchronize in sample_async. (#10120 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-23 10:15:40 +08:00
tcherckez-nvidia	12e1cb8d7e	[#9717 ][chore] Refactor MoE code to use enums (#9910 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-22 15:14:56 -05:00
William Zhang	a6a88985cf	[TRTLLM-9409][feat] Pass MRoPE tensors for EPD disagg (#9758 ) * Why? Certain VLMs like the Qwen family need more than just the multimodal embeddings in the language model, and need MRoPE position IDs and deltas. Prior to this commit, only the embeddings could be communicated from the encoder worker to the prefill worker. * What? This commit extends the `DisaggregatedParams` to include the MRoPE information. It also adjusts several pieces of code required to communicate that between E, P and D workers. Closes TRTLLM-9409. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-12-22 06:32:49 -05:00
Bo Li	472fe497dc	[None][chore] NVLinkOneSided AlltoAll Support zero local_num_tokens. (#9822 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-12-22 05:57:12 -05:00
Balaram Buddharaju	5266475014	[None][feat] Cudagraph updates for helix parallelism (#10141 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-21 15:21:52 -05:00
xxi	5ae154022a	[TRTLLM-9872][fix] clear the failed test at CI when enable_configurab… (#10067 ) Signed-off-by: xxi <xxi@nvidia.com>	2025-12-21 08:14:50 -05:00
longcheng-nv	b882393d69	[https://nvbugs/5720357 ][fix] Fix indice offset overflow in custom Top-K kernel and corresponding UT case (#10027 ) Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com> Co-authored-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-12-19 14:58:01 -05:00
William Zhang	478b6b20a1	[#9230 ][refactor] Replace nemotron patches with custom model implementation (#9751 ) * Why? Patching for nemotron H models was growing out of hand, and made certain optimizations more complex than they needed to be. * What? This commit finally gets rid of them, and replaces them with the custom model implementation in `modeling_nemotron_h.py`. Closes #9230 Closes NvBug 5747867 Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-12-18 19:36:27 -08:00
CarstyYou	0b279f4ad4	[https://nvbugs/5456493 ][feat] Add fp8 bmm on sm120 (#9687 ) Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>	2025-12-18 22:57:20 +08:00
ZhichenJiang	4e55b83101	[None][perf] Add more optimization options for MOE CuteDSL finalized kernel (#10042 ) Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com>	2025-12-18 22:49:28 +08:00
Yuxian Qiu	bec864a78c	[None][fix] avoid ID conversion for non enable_configurable_moe cases. (#10003 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-18 13:29:52 +08:00
Wanli Jiang	601c29ca73	[https://nvbugs/5721644 ][fix] Update tests for nemotron_h (#9993 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-12-18 12:38:02 +08:00
Lucas Liebenwein	76ec820465	[#7532 ][feat] AutoDeploy: gather logits before lm head (#9962 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-12-17 19:50:13 -08:00

720 Commits