TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-09 04:31:49 +08:00

Author	SHA1	Message	Date
Chang Liu	78bb245554	[https://nvbugs/5787453 ][fix] Better align MLA chunking with indexer chunking when chunked prefill enabled for DSV32 (#10552 )	2026-01-09 00:49:39 -08:00
JadoTu	4c498bfe58	[TRTLLM-9676][fix] Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873 ) Signed-off-by: JadoTu <107457950+JadoTu@users.noreply.github.com>	2026-01-09 14:50:16 +08:00
Yuxian Qiu	afa55c12b6	[None][fix] revert https://github.com/NVIDIA/TensorRT-LLM/pull/10445 . (#10547 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2026-01-08 21:50:04 -05:00
Mike Iovine	4092a87b6f	[https://nvbugs/5740075 ][fix] Fix sm120 speculation (#10049 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2026-01-08 19:55:43 -05:00
Eran Geva	489dd60312	[#10513 ][fix] AutoDeploy: removed self.mlp_type leftovers from last moe refactor (#10512 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2026-01-08 14:49:40 -05:00
mpikulski	e0331297a6	[TRTLLM-9522][fix] broken cast (#9975 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2026-01-08 06:47:39 -05:00
William Zhang	c0ae6bbdbe	[None][feat] EPD for Qwen3 VL (#10470 ) * Why? We would like to support EPD disaggregated serving for Qwen3 VL. * What? This commit adds such support, and extends existing unit tests for correctness checks. Some minor (protected) interface changes had to be made to the weight mapper as a side-effect. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2026-01-08 06:45:54 -05:00
Eran Geva	6511dbaea0	[#10417 ][fix] AutoDeploy - Reverted to direct computation of minusA (#10509 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2026-01-08 13:43:41 +02:00
bhsueh_NV	bea61bb17d	[None][fix] Mistral large 3 few code refine (#10405 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2026-01-08 06:38:49 -05:00
Yiqing Yan	dc6b743fb6	[None][chore] Bump version to 1.2.0rc8 (#10542 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2026-01-08 04:51:44 -05:00
Yukun He	09d9878385	[TRTLLM-9661][chore] Further reduce tuning time for cuteDSL nvFP4 dense gemm. (#10339 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-08 10:21:02 +08:00
Ziyi Xiong	7187afe7b9	[https://nvbugs/5781589 ][fix] Skip spec dec for non-last rank (#10445 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2026-01-07 13:55:45 -05:00
tcherckez-nvidia	7e88212d24	[None][bug] fix export for microsoft/Phi-3-medium-128k-instruct (#10455 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2026-01-07 10:30:24 +02:00
Kanghwan	dc32bac9fc	[#4745 ][fix] Pass lora_params through Qwen2/3 model forward (#10174 ) Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>	2026-01-07 15:30:17 +08:00
Fanrong Li	a34aa63685	[https://nvbugs/5767223 ][feat] add pp support for DeepSeek-v3.2 (#10449 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2026-01-07 12:29:51 +08:00
Zongfei Jing	bb2f883296	[None] [feat] Add test script and raster M for gather fc1 kernel (#10429 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2026-01-07 09:31:49 +08:00
Lucas Liebenwein	bb6a3973aa	[https://nvbugs/5732942 ][fix] AutoDeploy: handle transformers 4.57.1 upgrade fixes (#10466 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-01-06 19:55:49 -05:00
Lizhi Zhou	6a4bebcd01	[None][chore] remove redundant retries while binding to arbitrary port (#10452 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2026-01-06 10:39:15 -05:00
Kaiyu Xie	2eaabd7461	[None] [fix] Fix undefined tokens_per_block (#10438 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2026-01-06 02:42:37 -05:00
Karthik	617f728903	[#8460 ][feat] Revive and simplify Model Explorer visualization integration (#10150 ) Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>	2026-01-05 22:15:25 -05:00
Xiao Xuan	46f035befe	[#2511 ][fix] eagle: qwen2 capture hidden states (#10091 ) Signed-off-by: SpicyNoodle <522169030@qq.com>	2026-01-05 21:46:41 -05:00
alel	6b8ae6fa81	[None][feat] CuteDSL MOE FC1 Enhancement (#10088 ) Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com>	2026-01-06 09:30:43 +08:00
JadoTu	82aaf98070	[None][feat] add the eos tokens in generation config to stop words in the sampler (#10389 ) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>	2026-01-06 09:24:03 +08:00
Karthik	4e50cb5708	[#10170 ][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169 ) Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>	2026-01-05 16:13:45 -05:00
Grzegorz Kwasniewski	ea380ff45c	[TRTLLM-9767][feat] Fixed recursive node traversals (#10379 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2026-01-05 18:42:06 +02:00
Mike Iovine	db2614ef10	[https://nvbugs/5772414 ][fix] Fix draft token tree depth=1 corner case (#10385 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2026-01-05 17:20:14 +01:00
Mike Iovine	bedfff4f00	[https://nvbugs/5772521 ][fix] Fix draft token tree chain crash (#10386 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2026-01-05 17:18:44 +01:00
Anthony Chang	225d3a9001	[None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2026-01-05 17:16:12 +01:00
Balaram Buddharaju	a792c23dcf	[TRTLLM-9465][fix] Swap TP-CP grouping order (#10350 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-05 20:08:03 +08:00
Eran Geva	3749a2ce1c	[#10374 ][fix] fixed race condition in AutoDeploy's mp tests port acquisition (#10366 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2026-01-05 13:33:01 +02:00
Fanrong Li	4931c5eb3a	[None][feat] update deepgemm to the DeepGEMM/nv_dev branch (#9898 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2026-01-05 16:43:42 +08:00
Yukun He	d272f1a9bc	[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-05 15:44:37 +08:00
HuiGao-NV	2f768b76f8	[https://nvbugs/5715568 ][fix] Force release torch memory when LLM is destroyed (#10314 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2026-01-05 15:30:18 +08:00
Pengyun Lin	c04cf4334e	[TRTLLM-8242][feat] Add stability tags for serve subcommand (#10012 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2026-01-05 14:16:15 +08:00
Yukun He	0937df2c68	[TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-05 13:44:09 +08:00
Tailing Yuan	a7fe043b13	[None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2026-01-05 11:23:04 +08:00
Cheng Hang	656c705ff1	[None][feat] sm100 weight-only kernel (#10190 ) Signed-off-by: Cheng Hang <chang@nvidia.com>	2026-01-05 09:44:36 +08:00
Fanrong Li	b5a1e10bc0	[https://nvbugs/5779534 ][fix] fix buffer reuse for CUDA graph attention metadata (#10393 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2026-01-05 09:43:44 +08:00
Wanli Jiang	da0830670a	[TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus (#10234 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2026-01-05 09:41:49 +08:00
Lizhi Zhou	82c1ba84a7	[https://nvbugs/5649010 ][fix] use 0 port as arbitrary port when disagg service discovery is enabled (#10383 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2026-01-05 09:40:40 +08:00
bhsueh_NV	0517b62789	[https://nvbugs/5772363 ][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2026-01-05 09:04:13 +08:00
Faraz	8e2065b4d9	[https://nvbugs/5670469 ][fix] Filter 0s and choose min of kv_head for Nemotron model (#10206 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2026-01-05 08:42:53 +08:00
dongfengy	afc533193d	[None][feat] Support nvfp4 for gptoss (#8956 ) Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>	2026-01-04 08:57:44 -05:00
Jaedeok Kim	a4dcc6a711	[TRTLLM-10171][fix] Correct attention handling in ModelConfig and KVCacheManager (#10330 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2026-01-04 06:07:30 -05:00
Grzegorz Kwasniewski	0d1f5ad7a2	[TRTLLM-10358][feat] Added proper rescaling of FP4 weights (#10378 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2026-01-03 16:26:16 -05:00
Izzy Putterman	bdf6953ddc	[None][feat] Eagle: MLA Based Eagle (#9677 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2026-01-02 13:45:07 -05:00
Gal Hubara-Agam	f3dd6da080	[#10056 ][chore] AutoDeploy: Enable Nemo SuperV3 accuracy test (#10308 ) Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>	2026-01-02 11:20:19 +02:00
Balaram Buddharaju	4a1b742aa0	[TRTLLM-9467][fix] Fix PP+CP combination with helix parallelism (#10312 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-01 13:42:53 -05:00
Gal Hubara-Agam	5845951538	[#10056 ][fix] AutoDeploy: Handle deletion of nested params in sharding (#10376 ) Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>	2026-01-01 08:11:11 -05:00
tcherckez-nvidia	4868772ad7	[None][feat] Add export data to build and run script for AD (#10299 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2026-01-01 04:54:47 -05:00
Lucas Liebenwein	1bbe71b3ed	[#10244 ][feat] AutoDeploy: separate prefill/decode in flashinfer (#10252 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-12-31 17:01:24 -05:00
Mike Iovine	9085021aa4	[None][feat] Implement sampling for MTP 1-model (#10019 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-12-31 13:48:34 -05:00
Simeng Liu	84d107b2f0	[https://nvbugs/5717993 ][fix] Add execution_stream across PyExecutor, KVCacheManager, PeftCacheManager to ensure proper CUDA stream synchronization between KV cache transfer operations and model forward kernels. (#10060 ) Signed-off-by: SimengLiu-nv <simengl@nvidia.com>	2025-12-31 09:22:54 -08:00
tcherckez-nvidia	464847c6be	[#9717 ][chore] Standardize MoE weights interface (#10295 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-31 07:37:18 -05:00
Jin Li	ef1d4a40b5	[https://nvbugs/5727475 ][fix] Avoid use property with setter in nn.Mo… (#10212 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-31 06:21:36 -05:00
Necofish	73870ae4ad	[None][feat] support Qwen3-VL dense model in pytorch backend (#9060 ) Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>	2025-12-31 17:54:26 +09:00
Pengyun Lin	fad000589d	[None][chore] Unify DS tool parser names (#10239 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-31 14:40:07 +08:00
Jin Li	34c2fd50a9	[https://nvbugs/5707359 ][fix] Unwaive OOM case that should be fixed by #9446 (#10334 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-31 10:41:39 +08:00
Yuxian Qiu	1f3afb8e6f	[None][feat] Implement send_object for TorchDist. (#10213 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-31 10:40:52 +08:00
Eran Geva	74832a1895	[https://nvbugs/5766986 ][fix] fixed the shard_all_unprocessed default value to align with the default.yml (#10271 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-12-30 08:54:13 -05:00
Bo Li	1f0365da36	[None][infra] Add LongBenchV1 to trtllm-eval. (#10265 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-12-30 21:39:34 +08:00
binghanc	692d8f2023	[TRTLLM-9455][feat] support for new checkpoint (#10082 ) Signed-off-by: binghanc <176802681+binghanc@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-12-30 14:46:39 +08:00
Neta Zmora	966231d29c	[#9626 ][feat] Add an auto-deploy transform for using cutlass FP4 MoE kernels (#10304 ) Add a transform to replace torch.ops.auto_deploy.torch_quant_nvfp4_moe with the optimized torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused. Currently generates the wrong results when the number of rows in MoE FC1 weights is not divisible by 128, so torch.ops.auto_deploy.trtllm_quant_nvfp4_moe_fused is not set as the default FP4 MoE implementation (i.e. the transform is disabled). Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-29 23:18:15 +02:00
Ziyi Xiong	c59aa8bec5	[TRTLLM-9962][feat] Some optimizations for two-model spec dec (#10208 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-28 12:52:04 +08:00
JunyiXu-nv	55bc6a5ff8	[https://nvbugs/5753250 ][fix] Fix undefined local variable in responses utils (#10154 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-28 06:59:32 +08:00
shivghai	ee07a7c55e	[None][fix] [Gemma3] Fix RoPE for local attention for Gemma3 (#9961 ) Signed-off-by: Shiv Ghai <8965168+shivghai@users.noreply.github.com>	2025-12-27 11:50:59 -08:00
Guoming Zhang	1865020b6f	[TRTLLM-8577][feat] Clean the Qwen3-next code by removing Qwen3NextCo… (#10228 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-12-27 22:49:55 +08:00
Olya Kozlova	55f3cda66d	[None][fix] Fix request_id for best_of/n case (#8368 ) Signed-off-by: Olya Kozlova <okozlova@nvidia.com>	2025-12-26 22:20:24 +01:00
Jin Li	c04563657e	[TRTLLM-7735][feat] Attention NVFP4 out support for torch compile (#9740 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-27 00:07:20 +08:00
Pengyun Lin	c5b0f9e436	[https://nvbugs/5633700 ][fix] Cache tiktoken vocab for gpt-oss (#10219 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-12-26 18:39:03 +08:00
Wanli Jiang	14554ab3f3	[None][feat] Support multi-gpu running for nemotron-v3-nano and super (#10118 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-12-26 11:23:14 +08:00
Enwei Zhu	13ffe52ad0	[None][fix] Allow YAML config overwriting CLI args for trtllm-eval (#10296 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 15:08:15 -05:00
Neta Zmora	f3f02315df	[None][chore]: small refactoring to auto-deploy MoE operator (#10300 ) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-25 12:27:11 -05:00
bhsueh_NV	db3430f589	[None][feat] Support VLM part for Mistral Large 3 (#10188 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-12-25 11:20:58 -05:00
Jin Li	7e4cef9def	[None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 (#9446 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-25 10:23:04 -05:00
Ziyi Xiong	d8b5aeb061	[https://nvbugs/5652062 ][fix] Rewind kv_cache and reset draft tokens (#10160 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-25 09:13:51 -05:00
ZhichenJiang	46e4af5688	[TRTLLM-9831][perf] Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201 ) Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-25 09:04:20 -05:00
Zhenhuan Chen	8462cf6c96	[TRTLLM-9578][feat] make PDL enabled by default (#9695 ) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>	2025-12-25 07:15:24 -05:00
Xianjie Qiao	53b81783b1	[None][fix] Fix pageable H2D memcopy issue on GB200 (#10289 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-12-25 18:15:57 +08:00
gramnarayan	a9eb5afc9f	[#9241 ][feat] AutoDeploy: Support Eagle3 Speculative Decoding (#9869 ) Support two model flow with no overlap scheduler or chain drafter. Drafting model is in PyTorch backend. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>	2025-12-24 23:30:42 -05:00
Ziyi Xiong	43178590d1	[TRTLLM-10143][feat] Reuse previous draft requests if possible (#10263 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-24 17:48:38 -08:00
Neta Zmora	c4b36d31ff	[#10137 ][feat] AutoDeploy FP8 MoE refactor (#10138 ) The trtllm (cutlass) fp8 moe operator performs W3+W1 fusion (concat) during inference and we want to move this fusion to the model optimization time. The Cutlass MoE kernel is used thru a trtllm torch operator. Its implementation uses two FC operations (fc1 and fc2) while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3) so when we switch from the torch.moe op to the trtllm.moe op we also change terminology from w1, w2, w3 to fc1, fc2. Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-12-24 18:58:10 +02:00
Necofish	8614cd3439	[None][fix] fix: resolve GPU memory imbalance in concurrent weight loading (#6472 ) Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn> Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn> Signed-off-by: Jie Li <lijie@nvidia.com> Co-authored-by: Jie Li <lijie@nvidia.com>	2025-12-24 09:43:09 -05:00
Suyog Gupta	e2891a6c77	[#10052 ][feat] AutoDeploy enable cudagraphs for flashinfer BatchDecode (#10193 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-12-24 05:55:09 -08:00
shuyixiong	f4f0fe85e9	[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939 ) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-24 15:27:01 +08:00
Yukun He	595daa5089	[TRTLLM-9615][feat] Support synchronization through PP ranks in the distributed tuning system (#10011 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-24 15:03:10 +08:00
Fanrong Li	156f6453dc	[TRTLLM-9798][feat] Change to use new DeepGEMM MQA sm100 kernel for MTP-3 (#10226 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-24 14:39:12 +08:00
Balaram Buddharaju	8c1cfc872b	[TRTLLM-9493][feat] Custom AllToAll for helix parallelism (#9986 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-23 18:14:30 -08:00
Grzegorz Kwasniewski	06900a7f19	[TRTLLM-9565][fix] Fix deepseek sharding (#9984 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-12-23 10:28:14 -05:00
Xianjie Qiao	871c6b435c	[None] [feat] skip batch_tokenize_prompts in CustomDataset (#10214 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-12-23 17:40:57 +08:00
Yiqing Yan	59b05dc0a8	[None][chore] Bump version to 1.2.0rc7 (#10216 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-23 15:07:47 +08:00
Harshini Komali	d691371eaf	[TRTLLM-9091] [feat] Replace GenAI-Perf with AIPerf (#9310 ) Signed-off-by: lkomali <lkomali@nvidia.com> Signed-off-by: Harshini Komali <157742537+lkomali@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-12-23 13:25:55 +08:00
Li Min	1e82ff7a0c	[TRTLLM-9989][fix] Fix tvm_ffi aarch64 issue. (#10199 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-12-23 10:20:40 +08:00
Yuxian Qiu	696f754ef4	[None][fix] avoid implicit cudaStreamSynchronize in sample_async. (#10120 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-23 10:15:40 +08:00
Tailing Yuan	648196f8ae	[TRTLLM-9432][feat] Reduce synchronization and recompilation for qwen3-next (#9691 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-12-23 10:14:29 +08:00
Faraz	f05af48bca	[https://nvbugs/5747674 ][fix] Add contiguous() before view() in load_expert_w3_w1_weight and load (#10136 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-12-22 21:03:34 -05:00
Fanrong Li	0d2500c631	[TRTLLM-9677][feat] Support DeepSeek-V3.2 tool parser (#10126 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-23 08:46:47 +08:00
Grzegorz Kwasniewski	ccc64da287	[TRTLLM-9847][fix] WAR fix hanging fused allreduce. (#10087 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-12-23 00:03:32 +01:00
tcherckez-nvidia	12e1cb8d7e	[#9717 ][chore] Refactor MoE code to use enums (#9910 ) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>	2025-12-22 15:14:56 -05:00
JunyiXu-nv	aaa87abf41	[TRTLLM-7906][feat] Support multiple post process for Responses API (#9908 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-22 11:33:34 -05:00


2021 Commits