Commit Graph

1302 Commits

Author SHA1 Message Date
Yechan Kim
3d3d49434a
[https://nvbugs/5547434][fix] Fix Qwen2.5-VL device_path error (#8057)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-10-13 14:12:27 +08:00
Yukun He
1ca84e1a25
[https://nvbugs/5536131][fix] Fix illegal access issue when scale is not provided in Llama3/4. (#7960)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-10-07 23:47:00 -07:00
Jin Li
b4e6a1648b
[https://nvbugs/5451280][fix] Reduce memory fraction problem by warmu… (#7999)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-03 18:14:13 -07:00
Enwei Zhu
a64d9b69e5
[None][fix] Fix chunked prefill state of draft request (#8067)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-30 09:51:21 +08:00
Yiqing Yan
4d5465a575
[None][chore] Bump version to 1.1.0 (#7942)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-26 13:17:36 +08:00
sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference (#7363)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
Yuan Tong
fae83c387b
[#6102][fix] support non-system python installation (#7763)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-26 10:16:15 +08:00
Yanchao Lu
7e2521a7f0
[None][chore] Some clean-ups for CUDA 13.0 dependencies (#7979)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-26 08:46:11 +08:00
dongfengy
1eb653146a
[https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS (#7911)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-25 12:54:18 -07:00
QI JUN
1529a6f22d
[None][chore] extract weights loading related logic to model loader (#7579)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-25 10:19:22 -07:00
xxi
57ff5f4c0d
[None][fix] fix a bug when wideEp uses DeepEP with num_chunks > 1 (#7954)
Signed-off-by: xxi <xxi@nvidia.com>
2025-09-25 07:53:42 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op (#6815)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Yueh-Ting (eop) Chen
c5012423f5
[None][chore] Remove developer name in comment (#7981)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-09-25 06:43:38 -07:00
Guoming Zhang
202bed4574
[None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
QI JUN
961418908c
[https://nvbugs/5531963][fix] cherry pick #7725 (#7907)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
cb466a846d
[None][fix] api stability bug in status label (#7861)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
9d48898def
[None][doc] add stable label to all the un-labelled arguments in LLM class (#7863)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd
[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (#7717)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Tao Li @ NVIDIA
44d7c3b245
[https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files (#7813)
Signed-off-by: Tao Li
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Wanli Jiang
22b45ff9c7
[TRTLLM-7758][feat] Phi4-mm image modality inference optimization (#7918)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-25 15:58:29 +08:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine (#7927)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
Leslie Fang
342014069e
[None][chore] Validate features combination (#7630)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-25 08:01:13 +08:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" (#7969)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels (#7821)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Mike Iovine
42c2ec3239
[https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 (#7220)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. (#7874)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Macrocell
6e5e8b8a3b
[None][fix] fix get_iteration_stats IndexError (#7216)
Signed-off-by: yuhongwei <yumiao.yhw@antgroup.com>
Co-authored-by: yuhongwei <yumiao.yhw@antgroup.com>
2025-09-24 22:43:03 +08:00
Eran Geva
603517f72a
[#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
Necofish
cfbcf9b9e8
[None][feat] Support Seed-OSS model in pytorch backend (#7496)
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-24 03:57:12 -07:00
Enwei Zhu
a1a57e83b8
[TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve (#7925)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-24 18:30:23 +08:00
JunyiXu-nv
6654b78c94
[https://nvbugs/5521799][fix] Trim incorrectly generated harmony messages (#7849)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-09-24 16:38:43 +08:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend (#7756)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request adds more SWA KV cache functionality to the KV cache manager. Previously, the KV cache for sliding window attention (SWA) held only a "window size" number of blocks and reused them in a cyclic manner. With that design we cannot utilize additional GPU memory, which limits the maximum batch size and throughput, and we cannot support KV cache reuse.

In this MR, we change this behavior so that the manager writes blocks linearly. With linear block writing, blocks that fall out of the attention window (out-of-window, OOW) are detached as the window advances. For now, to get a correct feature first, we directly offload detached OOW blocks from the primary block pool (GPU memory) to the secondary block pool (host memory); we will improve this in the future by delegating block movement to the eviction policy.

KV cache reuse for SWA is not developed in this merge request and will be added in a follow-up merge request.

Because blocks are written linearly, the maximum number of blocks allocated for a sequence (`GenerationRequest`) is determined by the specified "max sequence length". The `GenerationRequest`, which stores the cache block bookkeeping structure, now keeps blocks for "max sequence length" tokens.

Given the above, the main changes are (more context in the MR; a simplified sketch of the new behavior follows this list):
- Remove the "cyclic" concept from the KV cache manager; this concept originally guarded block reuse in the KV cache manager.
- Add a detach mechanism and invoke it from `KVCacheManager::addToken`. Note that detach is still disabled for SWA when reuse is enabled; a follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter of `KVCacheManager`/`BlockManager`.
- Give all window-size resource pools an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.
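Below is a minimal illustrative sketch of the linear-write-and-detach behavior described above. It assumes simplified, hypothetical class and method names and is not the actual TensorRT-LLM `KVCacheManager` API; it only mirrors the idea of appending blocks linearly and offloading out-of-window blocks to a host-memory pool.

```python
# Toy model of the linear-write/detach scheme; all names here are hypothetical
# simplifications, not the real KVCacheManager/BlockManager interfaces.
from collections import deque


class SwaBlockManagerSketch:
    def __init__(self, window_size: int, tokens_per_block: int, max_seq_len: int):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        # With linear writing, a sequence may hold up to
        # ceil(max_seq_len / tokens_per_block) blocks in total, which is why
        # "max sequence length" becomes a mandatory parameter.
        self.max_blocks = -(-max_seq_len // tokens_per_block)
        self.primary_blocks = deque()   # in-window blocks (GPU memory)
        self.secondary_blocks = []      # detached out-of-window blocks (host memory)
        self.num_tokens = 0

    def add_token(self) -> None:
        """Roughly what happens on addToken in this design: allocate linearly, then detach."""
        self.num_tokens += 1
        # Allocate a new block whenever the previous one is full.
        if (self.num_tokens - 1) % self.tokens_per_block == 0:
            total = len(self.primary_blocks) + len(self.secondary_blocks)
            assert total < self.max_blocks, "sequence exceeded max sequence length"
            self.primary_blocks.append(f"block_{total}")
        # Blocks that fell entirely out of the attention window are detached
        # and, for now, offloaded directly to the secondary (host) pool.
        in_window = -(-self.window_size // self.tokens_per_block) + 1
        while len(self.primary_blocks) > in_window:
            self.secondary_blocks.append(self.primary_blocks.popleft())


if __name__ == "__main__":
    mgr = SwaBlockManagerSketch(window_size=8, tokens_per_block=4, max_seq_len=32)
    for _ in range(20):
        mgr.add_token()
    print("primary (GPU):   ", list(mgr.primary_blocks))
    print("secondary (host):", mgr.secondary_blocks)
```

In this toy model, blocks are appended linearly as tokens arrive, and any block that falls entirely outside the attention window is moved from the primary (GPU) pool to the secondary (host) pool, mirroring the direct offload described above before block movement is delegated to the eviction policy.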

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Daniel Cámpora
5ccb2dea33
[None][chore] Make sampler type beta. (#7934)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 20:51:39 -07:00
Yuan Tong
70c3b100eb
[#7692][fix] recognize RequestError as per-request error in background handler (#7726)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 11:11:17 +08:00
Yuan Tong
f050b8d871
[None][fix] refine backend option handling for commands (#7829)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 10:54:33 +08:00
Ziyi Xiong
31ef03fd82
[https://nvbugs/5528405][fix] Set up draft_tokens before scheduling (#7903)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-24 09:56:17 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
Lizhi Zhou
7550251988
[TRTLLM-7182][test] add multi-nodes test for disagg-serving (#7470)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 08:31:56 +08:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Yilin Fan
7d4d6cc9e0
[TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve (cherry-pick) (#7776)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-09-23 09:39:47 -07:00
Daniel Cámpora
9f1d9b7b18
[None][feat] Use list instead of torch tensor for new tokens in update requests (#7730)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 10:40:08 -04:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ChristinaZ
dd5fb2857a
[None][fix] Re-add the import for allgather that was mistakenly removed. (#7920)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 03:09:48 -07:00
Yan Chunwei
3ba19b6ff1
[https://nvbugs/5532023][fix] executor with-statement bug (#7895)
Signed-off-by: chunweiy <chunweiy@nvidia.com>
2025-09-23 02:05:39 -07:00
Enwei Zhu
f882fb86db
[https://nvbugs/5367180][fix] Fix xgrammar import before loading tensorrt_llm binary (#7906)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 00:29:57 -07:00
Yan Chunwei
40820e6711
[None][fix] CHERRY-PICK trtllm-serve yaml loading (#7551) (#7897)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-23 14:56:52 +08:00
Pengbo Wang
5792464d37
[None][fix] Read eos_token_id from generation_config for kimi_k2 (#7120)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 10:47:03 +08:00
yunruis
126cd707e3
[None][opt] Add batch waiting when scheduling (#7416)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-09-23 10:27:37 +08:00