TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Wanli Jiang	3789ba1d37	feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-07-01 20:12:55 +08:00
danielafrimi	7a617ad1fe	feat: W4A16 GEMM (#4232 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-07-01 10:36:05 +03:00
Netanel Haber	6ee94c7ac8	Reintroduce with perf fixes: feature: unify new_tokens format sample state to trtllm samper tokens format (#5513 ) `58a8a8f` - these changes were previously merged to main here. `6aef149` - the changes were temporarily reverted in main, due to a significant perf regression in models using the TorchSampler (observed by @byshiue). This PR is meant to re-merge these changes along with a fix to prevent the regression. The first commit of this PR is actually just the reverted revert - filter it out of the changes to see previously unmerged changes. Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-06-30 11:58:59 -07:00
Wei-Ming Chen	f28cd3056e	feat: AutoDeploy fp8 quantization support for bmm (#3849 ) Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>	2025-06-30 12:36:34 -04:00
nv-guomingz	6e48ac25a6	chore: remove cuda_graph_ prefix from cuda_graph_config filed members. (#5585 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-30 12:23:14 -04:00
Li Min	16fc99391f	refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code (#5557 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-30 08:48:04 -07:00
Robin Kobus	9bdc5951f8	refactor: decoder state setup (#5093 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-30 11:09:43 +02:00
Fanrong Li	6cbc9a5297	[nvbug/5354946][fix] Fix mtp vanilla draft inputs (#5568 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-30 15:59:12 +08:00
WeiHaocheng	42a9385d02	[TRTLLM-5331] perf: Replace allgaher with AllToAllPrepare (#5570 ) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-06-30 13:06:09 +08:00
dongjiyingdjy	852b79053d	feat : support duplicate_kv_weight for qwen3 blockwise scale (#5459 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-06-30 11:49:22 +08:00
nv-guomingz	578430e64c	[TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) (#5014 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-30 11:05:40 +08:00
Bo Li	6000380a0c	perf: Avoid reswizzle_sf after allgather. (#5504 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-29 21:25:50 +08:00
amirkl94	de9779900c	feat: Add support for YARN in NemotronNAS models (#4906 ) Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>	2025-06-29 09:45:49 +03:00
Lucas Liebenwein	619709fc33	[AutoDeploy] merge feat/ad-2025-06-13 (#5556 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-06-29 03:52:14 +08:00
Li Min	6021a439ab	Make moe permute and final as custom op (#5412 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-27 15:48:33 -07:00
Robin Kobus	a8141a4513	refactor: Speculative decoding buffers part 2 (#5316 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-27 17:41:48 +02:00
Aurelien Chartier	833c0dea4a	[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-27 17:03:05 +02:00
wili	56cdfe5c6c	[TRTLLM-5000][feat] NGrams V2 (#4569 ) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-06-27 23:00:17 +08:00
peaceh-nv	cb58073ab7	Fix : fix build for sm120 (#5265 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-06-27 20:42:47 +08:00
Daniel Cámpora	73b8a95049	feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538 )	2025-06-27 18:40:53 +08:00
Daniel Stokes	83a1f60556	feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-27 12:29:34 +08:00
Yuxian Qiu	dc36228f52	fix: Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-27 11:03:38 +08:00
jmydurant	8836990bde	[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 22:18:08 +08:00
Robin Kobus	8dfa31c71d	refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 19:45:52 +08:00
Bo Li	1bab9000a6	perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-26 14:03:56 +08:00
dongxuy04	490d2e5819	feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 22:25:13 -07:00
Yukun He	9ee33605bb	[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. (#5394 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-26 13:12:03 +08:00
Netanel Haber	6aef14943c	Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401 )" (#5474 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-25 20:56:04 -07:00
jmydurant	578dbc8d9a	feat: chunked prefill for MLA (Blackwell) (#4651 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 09:01:00 +08:00
Yukun He	3fc57543e2	[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485 ) The seq_len of 4096 will cause some unknown CUDA illegal memory access issue if run with some other tests consecutively. Put a saturated upper bound for any sequence length larger than it.	2025-06-26 08:38:35 +08:00
Xianjie Qiao	1e4fa13d33	Add sleep function for disagg gen-only benchmarking (#5398 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-06-26 07:32:16 +08:00
Mike Iovine	5bc8c894f7	[chore] Disable block reuse when draft model speculation is being used (#5448 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-26 03:51:20 +08:00
Daniel Cámpora	205c97a4ae	[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-25 17:41:36 +02:00
HuiGao-NV	b3a4c1f404	feat: Remove not used padding_idx in models (#5385 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-25 17:19:59 +08:00
Enwei Zhu	fc7a81ceb0	test: Add LLGuidance test and refine guided decoding (#5348 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 14:12:56 +08:00
Enwei Zhu	76da7fed86	fix (NvBug 5354925): Fix static EPLB (#5411 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 13:14:40 +08:00
Lucas Liebenwein	5cffb7e0ec	[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454 ) Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com> Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com> Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-06-25 09:30:13 +08:00
bhsueh_NV	73ba4fc320	fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion (#5369 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-06-25 09:20:23 +08:00
dongxuy04	699520082b	Add MTP support for Online EPLB (#5213 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 07:58:13 +08:00
QI JUN	d93a5e04b5	Chore: remove unused variables (#5314 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-24 22:27:32 +08:00
HuiGao-NV	35a92f6bab	Add debug hook to support dump tensor data and add new debug functions easily (#5182 ) Signed-off-by: Hui Gao	2025-06-24 17:45:28 +08:00
Luis Vega	d26040e5d9	chore: delete mamba hybrid, since it is now called NemotronH (#5409 ) Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>	2025-06-24 16:27:31 +08:00
Robin Kobus	e2a8cbc80b	refactor: manage cache indirection in decoder state (#5315 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-24 09:15:59 +02:00
HuiGao-NV	e16c1bef6e	[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-24 11:43:43 +08:00
Netanel Haber	58a8a8fd37	feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-23 10:38:37 -07:00
dongxuy04	4f0f17ac8a	feat: Misc Opt for large scale EP (#5374 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-20 13:11:31 +08:00
Fanrong Li	5d4ab47d5b	fix: refactor and fix mtp vanilla (#4762 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-20 05:23:39 +08:00
Yan Chunwei	9bd42ecf9b	[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-20 03:01:10 +08:00
Kaiyu Xie	7246fd75d1	feat: Support stream_interval (#5284 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-19 21:57:10 +08:00
Fanrong Li	c7af650d5a	Fix: fix the deterministic issue in the MTP Eagle path (#5285 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-19 18:08:40 +08:00

444 Commits