TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-30 15:43:19 +08:00

Author	SHA1	Message	Date
Bo Li	6000380a0c	perf: Avoid reswizzle_sf after allgather. (#5504 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-29 21:25:50 +08:00
amirkl94	de9779900c	feat: Add support for YARN in NemotronNAS models (#4906 ) Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>	2025-06-29 09:45:49 +03:00
Lucas Liebenwein	619709fc33	[AutoDeploy] merge feat/ad-2025-06-13 (#5556 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-06-29 03:52:14 +08:00
Li Min	6021a439ab	Make moe permute and final as custom op (#5412 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-27 15:48:33 -07:00
Robin Kobus	a8141a4513	refactor: Speculative decoding buffers part 2 (#5316 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-27 17:41:48 +02:00
Aurelien Chartier	833c0dea4a	[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-27 17:03:05 +02:00
wili	56cdfe5c6c	[TRTLLM-5000][feat] NGrams V2 (#4569 ) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-06-27 23:00:17 +08:00
peaceh-nv	cb58073ab7	Fix : fix build for sm120 (#5265 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-06-27 20:42:47 +08:00
Daniel Cámpora	73b8a95049	feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538 )	2025-06-27 18:40:53 +08:00
Daniel Stokes	83a1f60556	feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-27 12:29:34 +08:00
Yuxian Qiu	dc36228f52	fix: Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-27 11:03:38 +08:00
jmydurant	8836990bde	[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 22:18:08 +08:00
Robin Kobus	8dfa31c71d	refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 19:45:52 +08:00
Bo Li	1bab9000a6	perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-26 14:03:56 +08:00
dongxuy04	490d2e5819	feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 22:25:13 -07:00
Yukun He	9ee33605bb	[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. (#5394 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-26 13:12:03 +08:00
Netanel Haber	6aef14943c	Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401 )" (#5474 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-25 20:56:04 -07:00
jmydurant	578dbc8d9a	feat: chunked prefill for MLA (Blackwell) (#4651 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 09:01:00 +08:00
Yukun He	3fc57543e2	[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485 ) The seq_len of 4096 will cause some unknown CUDA illegal memory access issue if run with some other tests consecutively. Put a saturated upper bound for any sequence length larger than it.	2025-06-26 08:38:35 +08:00
Xianjie Qiao	1e4fa13d33	Add sleep function for disagg gen-only benchmarking (#5398 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-06-26 07:32:16 +08:00
Mike Iovine	5bc8c894f7	[chore] Disable block reuse when draft model speculation is being used (#5448 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-26 03:51:20 +08:00
Daniel Cámpora	205c97a4ae	[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-25 17:41:36 +02:00
HuiGao-NV	b3a4c1f404	feat: Remove not used padding_idx in models (#5385 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-25 17:19:59 +08:00
Enwei Zhu	fc7a81ceb0	test: Add LLGuidance test and refine guided decoding (#5348 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 14:12:56 +08:00
Enwei Zhu	76da7fed86	fix (NvBug 5354925): Fix static EPLB (#5411 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 13:14:40 +08:00
Lucas Liebenwein	5cffb7e0ec	[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454 ) Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com> Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com> Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-06-25 09:30:13 +08:00
bhsueh_NV	73ba4fc320	fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion (#5369 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-06-25 09:20:23 +08:00
dongxuy04	699520082b	Add MTP support for Online EPLB (#5213 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 07:58:13 +08:00
QI JUN	d93a5e04b5	Chore: remove unused variables (#5314 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-24 22:27:32 +08:00
HuiGao-NV	35a92f6bab	Add debug hook to support dump tensor data and add new debug functions easily (#5182 ) Signed-off-by: Hui Gao	2025-06-24 17:45:28 +08:00
Luis Vega	d26040e5d9	chore: delete mamba hybrid, since it is now called NemotronH (#5409 ) Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>	2025-06-24 16:27:31 +08:00
Robin Kobus	e2a8cbc80b	refactor: manage cache indirection in decoder state (#5315 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-24 09:15:59 +02:00
HuiGao-NV	e16c1bef6e	[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-24 11:43:43 +08:00
Netanel Haber	58a8a8fd37	feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-23 10:38:37 -07:00
dongxuy04	4f0f17ac8a	feat: Misc Opt for large scale EP (#5374 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-20 13:11:31 +08:00
Fanrong Li	5d4ab47d5b	fix: refactor and fix mtp vanilla (#4762 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-20 05:23:39 +08:00
Yan Chunwei	9bd42ecf9b	[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-20 03:01:10 +08:00
Kaiyu Xie	7246fd75d1	feat: Support stream_interval (#5284 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-19 21:57:10 +08:00
Fanrong Li	c7af650d5a	Fix: fix the deterministic issue in the MTP Eagle path (#5285 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-19 18:08:40 +08:00
hlu1	b558232ce1	Refactor CutlassFusedMoE (#5344 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-06-19 00:04:07 -07:00
amitz-nv	1753202b61	[TRTLLM-5825][fix] Fix torch LoRA TP (#5338 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-06-19 09:12:00 +03:00
Zongfei Jing	2b23cd56ce	[feat] Fusion finalize and allreduce for qwenmoe model (#5223 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-19 08:03:58 +08:00
jellysnack	0623ffe3bc	feat: Add LLGuidance Support for PyTorch Backend (#5214 ) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com> Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-18 19:33:34 +08:00
Robin Kobus	38547b92f3	refactor: Introduce ResourceManagerType enum for resource management (#5246 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 09:55:59 +02:00
Yukun He	6711ad9cf3	[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-18 14:33:46 +08:00
Yan Chunwei	724e495254	chore: partition LLM class into TorchLLM and TrtLLM (#4900 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-18 14:01:25 +08:00
QI JUN	855036d8ee	update LlmRequest.is_dummy property (#5283 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-18 10:52:13 +08:00
Robin Kobus	627062c265	refactor: Update decoder buffer and logits management (#4450 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 08:10:32 +08:00
Mike Iovine	9bf69c9fdb	[chore] Remove BaseDraftTokenManager (#5251 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-17 11:57:52 -04:00
QI JUN	f899c4d294	Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 21:28:09 +08:00

433 Commits