Commit Graph

908 Commits

Author | SHA1 | Message | Date
Lucas Liebenwein
752cc3a8cb
[https://nvbugs/5606166][fix] AutoDeploy: use tuples for cudagraph shape lookup (#8772)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-31 13:59:48 +01:00
Yukun He
a1d912688c
[https://nvbugs/5623960][fix] Compress the warning log of AutoTuner when encountering tactic failures. (#8795)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-31 12:55:56 +08:00
Jin Li
0dac57f2bc
[https://nvbugs/5569534][fix] Warm up with different sizes for more s… (#8515)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-29 22:29:06 -07:00
JunyiXu-nv
6adccd758d
[https://nvbugs/5606268][fix] Separate cuda graph workspace to prevent IMA (#8685)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-10-29 09:43:30 +01:00
sunnyqgg
e9aa8b222f
[https://nvbugs/5556020][fix] cherry-pick fix test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch (#8644)
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-29 15:44:25 +08:00
Yukun He
e04354bc09
[https://nvbugs/5608489][fix] Fix output unpack issues for Llama3/4 NVFP4 models. (#8679)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-28 14:21:47 +08:00
Chang Liu
f4e1cc7b39
[https://nvbugs/5549081][fix] Fix device id assignment for some visio… (#8552)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-10-23 14:06:13 +08:00
Lizhi Zhou
3f82cdbdad
[https://nvbugs/5582277][fix] rework DisaggPPTerminationHandler to fix hang issue (#8519)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-23 09:43:59 +08:00
Kaiyu Xie
c7b06b1b0a
[https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend (cherry-pick #8141) (#8566)
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
Co-authored-by: Sergey Klevtsov <141879860+sklevtsov-nvidia@users.noreply.github.com>
2025-10-22 21:46:59 +08:00
Jin Li
6631791c60
[https://nvbugs/5546510][fix] Move torch.cuda.Stream out of torch com… (#8494)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-22 11:21:58 +08:00
JunyiXu-nv
0acdecb2c3
[https://nvbugs/5569713][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 (#8429)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-10-21 12:37:56 -04:00
mpikulski
f256eb9063
[TRTLLM-8650][fix] beam search request validation (#8433)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-21 10:50:27 +02:00
danielafrimi
a0b7fe9e36
[https://nvbugs/5524714][fix] Fix TP sharding of fused-QKV weight scales in W4A16 AWQ (#8432)
Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com>
2025-10-19 15:27:23 +03:00
Ziyi Xiong
4ad7ef1497
[https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture (#8… (#8344)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-16 10:27:19 +08:00
Patrice Castonguay
7862372ee2
[https://nvbugs/5552889][fix] fix: Prevent empty batch when using attention DP with disagg (#8372)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-10-16 09:11:04 +08:00
amitz-nv
27c6c8466b
[https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 (#8313)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-15 08:24:02 -07:00
amitz-nv
e5476a6b2a
[https://nvbugs/5521949][fix] Update FP8 model with BF16 LoRA test, fix test_bielik_11b_v2_2_instruct_multi_lora (#8324)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-15 05:48:38 -07:00
xiweny
d5b79268e7
[https://nvbugs/5565565] [fix] fp8 wideep support sm103 (#8228)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-15 10:17:08 +08:00
Jin Li
4bac6b337e
[https://nvbugs/5537348][fix] Use device tensor index for MTP (#8062)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-14 05:51:45 -07:00
Ziyi Xiong
9ecc6db5b4
[https://nvbugs/5537878][fix] Reserve an extra slot for padded batch … (#8231)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-13 23:34:22 -07:00
Yechan Kim
3d3d49434a
[https://nvbugs/5547434][fix] Fix Qwen2.5-VL device_path error (#8057)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-13 14:12:27 +08:00
Yukun He
1ca84e1a25
[https://nvbugs/5536131][fix] Fix illegal access issue when scale is not provided in Llama3/4. (#7960)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-07 23:47:00 -07:00
Jin Li
b4e6a1648b
[https://nvbugs/5451280][fix] Reduce memory fraction problem by warmu… (#7999)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-03 18:14:13 -07:00
Enwei Zhu
a64d9b69e5
[None][fix] Fix chunked prefill state of draft request (#8067)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-30 09:51:21 +08:00
sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference (#7363)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
dongfengy
1eb653146a
[https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS (#7911)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-25 12:54:18 -07:00
QI JUN
1529a6f22d
[None][chore] extract weights loading related logic to model loader (#7579)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-25 10:19:22 -07:00
xxi
57ff5f4c0d
[None][fix] fix a bug in wideEP when using DeepEP with num_chunks > 1 (#7954)
Signed-off-by: xxi <xxi@nvidia.com>
2025-09-25 07:53:42 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op (#6815)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Guoming Zhang
202bed4574
[None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd
[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (#7717)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Tao Li @ NVIDIA
44d7c3b245
[https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files (#7813)
Signed-off-by: Tao Li
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Wanli Jiang
22b45ff9c7
[TRTLLM-7758][feat] Phi4-mm image modality inference optimization (#7918)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-25 15:58:29 +08:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine (#7927)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
Leslie Fang
342014069e
[None][chore] Validate features combination (#7630)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-25 08:01:13 +08:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" (#7969)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels (#7821)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Mike Iovine
42c2ec3239
[https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 (#7220)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. (#7874)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Eran Geva
603517f72a
[#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
Necofish
cfbcf9b9e8
[None][feat] Support Seed-OSS model in pytorch backend (#7496)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-24 03:57:12 -07:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend (#7756)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request attempts to support more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) only holds a "window size" number of blocks
and reuses them in a cyclic manner. With that design we cannot utilize
additional GPU memory, which limits the achievable batch size and
throughput, and we cannot support KV cache reuse.

In this MR, we change that behavior so the manager writes blocks in a
linear manner. With linear block writing, out-of-window (OOW) blocks are
detached as the attention window moves forward. For now, to get the
feature correct first, we directly offload each OOW block from the
primary block pool (GPU memory) to the secondary block pool (host
memory). We will improve this in the future by delegating the block
movement to the eviction policy.

KV cache reuse for SWA is not implemented in this merge request and will
be added in a follow-up merge request.

With linear block writing, the maximum number of blocks allocated for a
sequence (`GenerationRequest`) is determined by the specified "max
sequence length". The `GenerationRequest`, which stores the cache block
bookkeeping structure, now keeps blocks for "max sequence length" tokens.

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; this concept
  originally guarded block reuse in the KV cache manager.
- Add a detach mechanism under `KVCacheManager::addToken` (sketched
  below). Note that detach is still disabled for SWA when reuse is
  enabled; a follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter of the
  `KVCacheManager`/`BlockManager`.
- Give every window-size resource pool an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.
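
Below is a minimal sketch of the linear block writing and OOW detach
behavior described above. It assumes hypothetical `BlockPool` and
`SlidingWindowSequence` classes for illustration only; it is not the
actual `KVCacheManager`/`BlockManager` API.

```python
# Hypothetical sketch of linear block writing with out-of-window (OOW)
# detach; class and method names are illustrative, not the real API.
from itertools import count


class BlockPool:
    """Stand-in for a block pool; the real pools manage GPU or host memory."""

    def __init__(self, name):
        self.name = name
        self._ids = count()
        self.blocks = set()

    def allocate(self):
        block = next(self._ids)
        self.blocks.add(block)
        return block

    def take(self, block, source):
        """Move a block from another pool into this one (offload)."""
        source.blocks.discard(block)
        self.blocks.add(block)


class SlidingWindowSequence:
    """Per-sequence bookkeeping: blocks are written linearly, not cyclically."""

    def __init__(self, window_blocks, primary, secondary):
        self.window_blocks = window_blocks  # blocks covered by the attention window
        self.primary = primary              # primary pool (GPU memory) stand-in
        self.secondary = secondary          # secondary pool (host memory) stand-in
        self.blocks = []                    # bookkeeping keeps every block written

    def add_block(self):
        """Append a new block linearly and detach the block that left the window."""
        self.blocks.append(self.primary.allocate())
        oow_index = len(self.blocks) - self.window_blocks - 1
        if oow_index >= 0:
            # Correctness-first: offload GPU -> host directly; later this
            # movement could be delegated to the eviction policy.
            self.secondary.take(self.blocks[oow_index], source=self.primary)


# Usage: with a window of 2 blocks, after writing 5 blocks linearly,
# 2 blocks remain in GPU memory and 3 have been offloaded to host memory.
gpu_pool, host_pool = BlockPool("gpu"), BlockPool("host")
seq = SlidingWindowSequence(window_blocks=2, primary=gpu_pool, secondary=host_pool)
for _ in range(5):
    seq.add_block()
assert len(gpu_pool.blocks) == 2 and len(host_pool.blocks) == 3
```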

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Ziyi Xiong
31ef03fd82
[https://nvbugs/5528405][fix] Set up draft_tokens before scheduling (#7903)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-24 09:56:17 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Daniel Cámpora
9f1d9b7b18
[None][feat] Use list instead of torch tensor for new tokens in update requests (#7730)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 10:40:08 -04:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ChristinaZ
dd5fb2857a
[None][fix] Re-add the import for allgather that was mistakenly removed. (#7920)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 03:09:48 -07:00