TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Frank	92d3a2d0e0	[https://nvbugspro.nvidia.com/bug/5351333 ][fix] Update to chunking calculation. (#5625 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-07-02 17:48:02 +08:00
Anurag Mukkara	c2799d0465	[nvbug/5354825] Fix nougat test image url (#5496 ) Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>	2025-06-26 10:10:18 +08:00
Wanli Jiang	af5839303d	feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-06-25 14:10:50 +08:00
brb-nv	32f50ded17	nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 (#5453 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-06-25 11:45:14 +08:00
Yiqing Yan	decfe2fdb3	chore: bump version to 0.21.0 (#5325 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-06-19 12:58:44 +08:00
nv-guomingz	6a388b105a	chore: remove torch_compile prefix for TorchCompileConfig field members (#5261 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-19 09:21:51 +08:00
Zongfei Jing	2b23cd56ce	[feat] Fusion finalize and allreduce for qwenmoe model (#5223 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-19 08:03:58 +08:00
Yan Chunwei	3946e798db	fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances (#4727 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-19 06:13:53 +08:00
jellysnack	0623ffe3bc	feat: Add LLGuidance Support for PyTorch Backend (#5214 ) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com> Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-18 19:33:34 +08:00
Zhanrui Sun	516bd4dc05	chore: bump version to 0.21.0rc3 (#5309 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-06-18 15:59:53 +08:00
Robin Kobus	38547b92f3	refactor: Introduce ResourceManagerType enum for resource management (#5246 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 09:55:59 +02:00
Yukun He	6711ad9cf3	[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-18 14:33:46 +08:00
Yan Chunwei	724e495254	chore: partition LLM class into TorchLLM and TrtLLM (#4900 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-18 14:01:25 +08:00
Yi Zhang	e44f7687af	feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-06-18 13:37:31 +08:00
QI JUN	855036d8ee	update LlmRequest.is_dummy property (#5283 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-18 10:52:13 +08:00
Robin Kobus	627062c265	refactor: Update decoder buffer and logits management (#4450 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 08:10:32 +08:00
Mike Iovine	9bf69c9fdb	[chore] Remove BaseDraftTokenManager (#5251 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-17 11:57:52 -04:00
QI JUN	f899c4d294	Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 21:28:09 +08:00
Dom Brown	44fb3c1673	[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207 ) - Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning. - Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution. - Extends the unit test to run both autotuned and non-autotuned code paths. Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-17 21:01:56 +08:00
amirkl94	8451a87742	chore: Mass integration of release/0.20 (#5082 ) Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>	2025-06-17 14:32:02 +03:00
liji-nv	13eef642e6	[feat] Piecewise cuda graph support for MLA (#4467 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-17 18:58:38 +08:00
Yilin Fan	498fadceb4	[feat] Add EAGLE3 support for Qwen3 (#5206 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-17 17:07:06 +08:00
Enwei Zhu	4b82b8b4c7	[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-17 15:23:24 +08:00
Izzy Putterman	e607768e45	Speculation: Draft Target in new FW (#4558 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2025-06-17 02:26:08 +08:00
tomeras91	cea5dd1e38	[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-16 16:29:17 +03:00
Yilin Fan	dd29063538	[feat] Add llm args to tune python gc threshold (#5141 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-16 17:45:22 +08:00
Robin Kobus	b6ca677741	refactor: remove decoder request from decoder interface (#5129 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 09:12:30 +02:00
Robin Kobus	dda64166cd	refactor: Scheduling based on KV cache state (#4865 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 08:14:58 +02:00
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
Enwei Zhu	babdd9ce06	test: Add json_mode_eval for guided decoding evaluation (#5179 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-16 10:03:55 +08:00
Yilin Fan	7a5e0fd300	[fix] Fix Llama4 min-latency import error (#5209 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-16 10:03:07 +08:00
Yan Chunwei	c84e41fd9d	fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-15 17:51:56 -07:00
amitz-nv	109c426077	Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130 )	2025-06-15 18:54:04 +03:00
Fanrong Li	39bba63758	[TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-15 23:09:16 +08:00
Fanrong Li	159ffc584e	fix: fix cuda graph max batch size for spec decoding cases. (#5076 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-15 14:57:28 +08:00
Kaiyu Xie	dce1dcc4f9	feat: Support post_proc for bench (#5122 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-15 13:02:38 +08:00
Enwei Zhu	63bc62ddf4	feat: Enable EPLB to existing MoE models (#5203 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-15 11:48:06 +08:00
Yuan Tong	6bce7337a9	perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-06-15 07:45:02 +08:00
ixlmar	e055af1bc9	chore: improve disagg test failure detection (#4738 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-06-15 01:28:26 +08:00
Aurelien Chartier	1389f5a4d3	feat: Add support for fp8 rowwise quantization (#4876 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>	2025-06-14 06:37:48 -07:00
2ez4bz	dc52b67492	linting(python): Enable ruff on more files (wave 1/N) (#5140 ) Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-06-14 19:19:34 +08:00
Tailing Yuan	0b60da2c45	feat: large-scale EP(part 7: DeepEP integration) (#4792 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-14 19:12:38 +08:00
yunruis	b99c5ce8c1	Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560 ) Signed-off-by: yunruis <yunruis@nvidia.com> Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com> Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com> Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-14 17:36:22 +08:00
nv-guomingz	3b7b5a5ad5	refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3(torch_compile_config) (#5032 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-14 14:23:13 +08:00
Yilin Fan	06342ffb4d	[feat] Implement model-agnostic one-engine eagle3 (#4778 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-13 08:11:41 -07:00
Mike Iovine	25aa3881d7	[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-13 11:06:36 -04:00
brb-nv	089be8912a	feat: Basic skeleton for Gemma3 VLM (#5108 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-06-13 17:27:04 +08:00
nv-guomingz	b959618579	refactor [BREAKING CHANGE]:: remove the redundant use_kv_cache field from PytorchConfig (#5031 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-13 16:34:24 +08:00
yunruis	30c5b4183a	refactoring: port customized kernels with public cutlass version (#5027 ) Signed-off-by: yunruis Merge this to unblock others since the full CI has been run through	2025-06-13 16:19:31 +08:00
Zheng Duan	4d0a5ad384	chore: gracefully exit disagg process in tests; better startup and logging (#5109 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-13 14:03:55 +08:00

1 2 3 4 5 ...

648 Commits