Iman Tabrizian
2e7da20934
[fix] Release slots with spec decode + disagg (#5975)
...
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-07-14 16:15:03 -07:00
Yi Zhang
332a65b837
[nvbugs/5368410][fix] Disable moe allreduce for multi node (#5918)
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 10:06:29 +08:00
Fanrong Li
bed78a2575
fix: fix index out of bounds error in spec decoding (#5954)
2025-07-14 09:41:27 +08:00
Fanrong Li
4905cac8fd
[nvbugs/5333742] fix MTP illegal memory access in cuda graph warmup (#5947)
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-07-12 21:55:44 +08:00
Zheng Duan
e831673f80
fix: timeout and broken pipe in disagg and worker tests (#5827)
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-11 12:42:47 +08:00
Zhenhuan Chen
d9e265d5e7
[https://nvbugs/5355316] fix: update torch.compile option to fix triton store_cubin error (#5865)
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-10 12:16:57 +09:00
Robin Kobus
fd94d3cbf5
[nvbugs/5345391] fix: chunked prefill + overlap scheduling (#5761)
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-09 17:59:45 +02:00
QI JUN
f8b4077654
[nvbugs/5326453] Avoid nesting NCCL grouping in allgather OP (#5789)
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-08 15:39:27 +09:00
Pengyun Lin
0a0ac7b5dc
[nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-07 19:26:13 +08:00
QI JUN
4fa9284612
[nvbug/5302638][nvbugs/5310314] fix _handle_cancelled_requests (#5532)
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-07 16:51:24 +08:00
QI JUN
3a58db88c8
fix _pad_attention_dp_dummy_request (#5583)
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-07 14:13:54 +08:00
Pengyun Lin
7524c77e1e
[nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens (#5201)
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-07 15:06:49 +09:00
brb-nv
9106b5d9a5
fix: Skip rope scaling for local layers in Gemma3 VLM (#5773)
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-07 13:36:23 +08:00
Iman Tabrizian
518915b5c6
[nvbug/5337601][fix] Fix disagg + speculative decoding (#5558)
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-04 12:52:35 -04:00
Yi Zhang
5ac92bb8ff
[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation (#5463)
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-04 23:23:41 +09:00
Dom Brown
2aacdba1e4
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access (#5676)
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-04 10:38:08 +08:00
Faraz
8a8d2e9901
[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 (#5651)
...
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
2025-07-03 22:08:15 +09:00
brb-nv
a3c0cf02ce
fix: Investigate Gemma3 1B decoder output discrepancy (#5564)
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-03 09:55:25 +08:00
Frank
92d3a2d0e0
[https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. (#5625)
...
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-02 17:48:02 +08:00
Anurag Mukkara
c2799d0465
[nvbug/5354825] Fix nougat test image url (#5496)
...
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-06-26 10:10:18 +08:00
Wanli Jiang
af5839303d
feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364)
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-06-25 14:10:50 +08:00
brb-nv
32f50ded17
nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 (#5453)
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-06-25 11:45:14 +08:00
Yiqing Yan
decfe2fdb3
chore: bump version to 0.21.0 (#5325)
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-19 12:58:44 +08:00
nv-guomingz
6a388b105a
chore: remove torch_compile prefix for TorchCompileConfig field members (#5261)
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-19 09:21:51 +08:00
Zongfei Jing
2b23cd56ce
[feat] Fusion finalize and allreduce for qwenmoe model (#5223)
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-19 08:03:58 +08:00
Yan Chunwei
3946e798db
fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances (#4727)
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-19 06:13:53 +08:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend (#5214)
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Zhanrui Sun
516bd4dc05
chore: bump version to 0.21.0rc3 (#5309)
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-18 15:59:53 +08:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management (#5246)
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
Yukun He
6711ad9cf3
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139)
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-18 14:33:46 +08:00
Yan Chunwei
724e495254
chore: partition LLM class into TorchLLM and TrtLLM (#4900)
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-18 14:01:25 +08:00
Yi Zhang
e44f7687af
feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-18 13:37:31 +08:00
QI JUN
855036d8ee
update LlmRequest.is_dummy property (#5283)
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-18 10:52:13 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management (#4450)
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Mike Iovine
9bf69c9fdb
[chore] Remove BaseDraftTokenManager (#5251)
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-17 11:57:52 -04:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
amirkl94
8451a87742
chore: Mass integration of release/0.20 (#5082)
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 14:32:02 +03:00
liji-nv
13eef642e6
[feat] Piecewise cuda graph support for MLA (#4467)
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-17 18:58:38 +08:00
Yilin Fan
498fadceb4
[feat] Add EAGLE3 support for Qwen3 (#5206)
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-17 17:07:06 +08:00
Enwei Zhu
4b82b8b4c7
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215)
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-17 15:23:24 +08:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW (#4558)
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
tomeras91
cea5dd1e38
[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128)
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-16 16:29:17 +03:00
Yilin Fan
dd29063538
[feat] Add llm args to tune python gc threshold (#5141)
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 17:45:22 +08:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface (#5129)
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state (#4865)
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
Enwei Zhu
babdd9ce06
test: Add json_mode_eval for guided decoding evaluation (#5179)
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-16 10:03:55 +08:00
Yilin Fan
7a5e0fd300
[fix] Fix Llama4 min-latency import error (#5209)
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 10:03:07 +08:00
Yan Chunwei
c84e41fd9d
fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972)
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-15 17:51:56 -07:00