TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Guoming Zhang	b4be0d2e4c	[None][chore] Refine qwen3-next implementation. (#8064 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-09-30 15:05:13 -04:00
Yechan Kim	948b8b9569	[None][fix] Fix CUDA graph for Qwen2.5-VL (#8047 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-09-30 14:40:03 +08:00
Cao Dong	62010c0ab7	[None][feat] Return topk logprobs in torch backend (#7976 ) Signed-off-by: Cao Dong <87467313+dcaox@users.noreply.github.com>	2025-09-30 09:32:37 +08:00
bhsueh_NV	38d6e4e60b	[None][feat] Support Qwen3 next (#7892 ) Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-09-29 21:16:07 +08:00
mpikulski	a0d489a8d5	[TRTLLM-7728][perf] improve batched sampling perf for contiguous batches (#7908 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-09-29 13:32:50 +01:00
Zongfei Jing	e9f26feeb6	[None][chore] Cherry-pick from (#7598 ) Make low_precision_combine as a llm arg (#7898 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-09-28 22:32:33 -04:00
Yukun He	28b9a81c58	[TRTLLM-4500][feat] Add serialization/deserialization options for AutoTuner profiling cache (#7738 ) To achieve determinism for the AutoTuner profiling cache, serialization and deserialization are introduced to store the cache on disk in JSON format. Use TLLM_AUTOTUNER_CACHE_PATH to indicate the path where the cache file should be stored. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-09-29 07:40:51 +08:00
Xianjie Qiao	c8f98b3065	[None] [feat] Update disagg gen-only benchmark. (#7917 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-09-28 09:56:56 +08:00
Iman Tabrizian	33282351a2	[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-09-27 19:29:30 -04:00
YueWeng	a4243f0da5	[TRTLLM-6393][feat] add static tree sampling and verification (#7161 ) Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-09-26 13:16:16 -04:00
HuiGao-NV	f4d3be4bbc	[None][feat] Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow (#7669 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-09-26 07:28:06 -07:00
HuiGao-NV	a9965d84e0	[None][chore] Report NCCL error message but not OOM when NCCL error happens (#8009 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-09-25 23:07:32 -07:00
sunnyqgg	2e5850c28a	[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference (#7363 ) Signed-off-by: qgai <qgai@nvidia.com>	2025-09-26 11:28:05 +08:00
QI JUN	1529a6f22d	[None][chore] extract weights loading related logic to model loader (#7579 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-09-25 10:19:22 -07:00
Guoming Zhang	202bed4574	[None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-25 21:02:35 +08:00
Leslie Fang	342014069e	[None][chore] Validate features combination (#7630 ) Signed-off-by: leslie-fang25 <leslief@nvidia.com>	2025-09-25 08:01:13 +08:00
Iman Tabrizian	da30d496b0	[None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756 )" (#7969 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-09-24 15:36:38 -07:00
Yuxian Qiu	48fda86c56	[None][fix] Fix dummy load format for DeepSeek. (#7874 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-09-24 23:03:16 +08:00
Cao Dong	2f8dc6feb0	[None][feat] Return topk logprobs in torch backend (#7756 ) Signed-off-by: Dong Cao <docao@nvidia.com>	2025-09-24 15:30:39 +08:00
Yueh-Ting (eop) Chen	cf100933cc	[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768 ) This merge request attempts to support more SWA KV cache functionality inside the KV cache manager. Before this merge request, the KV cache for sliding window attention (SWA) only holds "window size" number of blocks and reuses them in a cyclic manner. This design cannot utilize more GPU memory, which limits max batch size and throughput, and it cannot support KV cache reuse. In this MR, we change this behavior to let the manager write blocks in a linear manner. With linear block writing, as the attention window moves on, the out-of-window (OOW) blocks are detached. For now, to get a correct feature first, we directly offload the OOW blocks from the primary block pool (GPU memory) to the secondary block pool (host memory). We will improve this in the future by delegating the block movement to the eviction policy. KV cache reuse for SWA is not developed in this merge request and will be added in a follow-up merge request. Because blocks are written linearly, the maximum number of blocks allocated for a sequence (`GenerationRequest`) is determined by the "max sequence length" specified. The `GenerationRequest` that stores the cache block bookkeeping structure will now keep "max sequence length" tokens of blocks. Given the above, the main changes are (more context in the MR): - Remove the "cyclic" concept under the KV cache manager; this concept originally guarded block reuse under the KV cache manager. - Add a detach mechanism under `KVCacheManager::addToken`. Please note that detach is still guarded off for SWA when reuse is enabled. A follow-up merge request will improve this. - Enforce "max sequence length" to be a non-optional parameter to the `KVCacheManager`/`BlockManager` - Let all window-size resource pools get an identical proportion of memory - Fix free memory calculation under `resource_manager.py` Signed-off-by: eopXD <yuehtingc@nvidia.com> Co-authored-by: Tomer Asida <tasida@nvidia.com>	2025-09-24 14:28:24 +08:00
Ziyi Xiong	31ef03fd82	[https://nvbugs/5528405 ][fix] Set up draft_tokens before scheduling (#7903 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-09-24 09:56:17 +08:00
Venky	6ff0fad75e	[TRTLLM-7015] [feat] Enable `prompt_logprobs` in pytorch backend (#7580 ) Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>	2025-09-23 18:48:10 -07:00
mpikulski	9970345919	[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-09-23 16:05:05 -07:00
Daniel Cámpora	9f1d9b7b18	[None][feat] Use list instead of torch tensor for new tokens in update requests (#7730 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-09-23 10:40:08 -04:00
Zheyu Fu	34963ec39c	[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511 ) Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>	2025-09-23 06:54:18 -07:00
yunruis	126cd707e3	[None][opt] Add batch waiting when scheduling (#7416 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-09-23 10:27:37 +08:00
Enwei Zhu	8330d5363a	[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-09-23 09:10:09 +08:00
Yechan Kim	f77aca9f2c	[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-09-22 03:40:02 -07:00
HuiGao-NV	0dac1ddb74	[https://nvbugs/5525849 ][fix] Cherry-pick to fix mismatch of max seq len between kv cache manager and dummy requests (#7855 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-09-22 18:07:47 +08:00
HuiGao-NV	af34c9713a	[https://nvbugs/5474169 ][fix] seq_len mismatch between kv cache manager and graph attn metadata (#7606 ) Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-22 14:28:38 +08:00
HuiGao-NV	123f5cbbf0	[https://nvbugs/5474169 ][fix]Adjust max seq len for kvcache for memory estimation (#7391 ) Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-22 14:28:38 +08:00
Stefan Niebler	8aead224fb	[https://nvbugs/5513423 ][fix] Correctly respect min_tokens in PyTorch Workflow (#7808 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>	2025-09-21 22:15:18 -07:00
Enwei Zhu	639d4109a7	[None][fix] Disable torch.compile for CapturableGuidedDecoder (#7871 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-09-22 10:04:30 +08:00
Ziyi Xiong	897c4dd23b	[https://nvbugs/5517404 ][fix] Use the correct cuda graph for dynamic spec dec (#7728 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-09-21 08:20:48 +08:00
Yuxian Qiu	d6ebcf7c4a	[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) (#7610 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-09-19 09:40:49 +08:00
Ziyi Xiong	420f0fbcf5	[https://nvbugs/5522851 ][fix] Correct the logic to update kv_lens_cuda (#7790 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-09-19 08:11:29 +08:00
sunnyqgg	80dd8fe197	[TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle (#7001 ) Signed-off-by: qgai <qgai@nvidia.com>	2025-09-18 12:05:36 -04:00
Leslie Fang	870cfcf9a0	[None][chore] Remove executor config in create_py_executor (#7599 ) Signed-off-by: leslie-fang25 <leslief@nvidia.com>	2025-09-18 14:24:58 +08:00
Yukun He	cd80e0a7f1	[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. (#7529 ) tile_tokens_dim directly depends on num_tokens, which is a dynamic shape during tuning and inference. When AutoTuner prepares dummy tensors with different num_tokens, it does not update the value of tile_tokens_dim automatically. Therefore, the value stored in the AutoTuner cache is misaligned, which introduces many cache misses during inference and hurts performance significantly. To avoid this issue, we move the calculation of tile_tokens_dim right before kernel launching, so that the value of tile_tokens_dim is always up to date with the num_tokens of the current input tensor used for the kernel runner. Also, tile_tokens_dim is calculated based on the number of tokens of a tuned bucket instead of the original token number, because we only tune the value for the buckets, not for the raw input token number; this avoids unexpected misalignment between tile_tokens_dim and the token number. This PR also removes the warmup requests with the extra input shapes, which are triggered in the CUDA graph warmup phase. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-09-18 10:58:52 +08:00
Netanel Haber	a5cfc8368f	[https://nvbugs/5508536 ][fix] Revert #7041 : Move stop_criteria to sample_async (#7041 ) (#7796 ) Signed-off-by: Netanel Haber <nhaber@nvidia.com> Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Co-authored-by: Mike Iovine <miovine@nvidia.com>	2025-09-17 21:27:01 -04:00
HuiGao-NV	a49cfb3e68	[https://nvbugs/5516666 ][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding (#7737 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com> Co-authored-by: Signed-off-by: Hui Gao <huig@nvidia.com>	2025-09-17 06:24:20 +08:00
xiweny	c076a02b38	[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Signed-off-by: Daniel Stokes <dastokes@nvidia.com> Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> Signed-off-by: Xiwen Yu <xiweny@nvidia.com> Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Co-authored-by: Daniel Stokes <dastokes@nvidia.com> Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com> Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-09-16 09:56:18 +08:00
Ziyi Xiong	536e8776cd	[TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding (#7651 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-09-16 07:33:44 +08:00
jmydurant	7deefb3d2b	[TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-09-15 21:43:49 +08:00
Zheng Duan	24fc1f9acf	[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-09-15 07:26:01 -04:00
DylanChen-NV	d5df0af017	[https://nvbugs/5467981 ][fix] Fix Qwen2.5-VL fails with cuda graph padding (#7122 ) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-09-15 15:02:34 +08:00
Chang Liu	47e37755a3	[TRTLLM-6903][feat] Support chunked prefill for multimodal models (#6843 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-09-14 20:10:10 -07:00
Leslie Fang	d219a4f225	[None][chore] remove executor config in kv cache creator (#7526 ) Signed-off-by: leslie-fang25 <leslief@nvidia.com>	2025-09-10 21:14:44 +08:00
Zheyu Fu	c353ff342e	[None][feat] Make the should_use_spec_decode logic a bit smarter (#7112 ) Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>	2025-09-10 12:53:59 +08:00
Richard Huo	dcd110cfac	[None][chore] add TorchLlmArgs to the connector api (#7493 ) Signed-off-by: richardhuo-nv <rihuo@nvidia.com>	2025-09-09 09:05:59 -04:00
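
Note on cf100933cc (#6768): the linear block writing and out-of-window detach described in that commit message can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the code in `KVCacheManager`/`BlockManager`: the class `SequenceBlocks`, its fields, and the rule that a block is detached once all of its tokens fall before the window start are hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceBlocks:
    """Hypothetical per-sequence bookkeeping; not the real GenerationRequest."""

    window_size: int        # attention window, in tokens
    tokens_per_block: int   # tokens stored in one KV cache block
    max_seq_len: int        # upper bound on tokens, required up front
    num_tokens: int = 0
    next_block_id: int = 0
    primary: list = field(default_factory=list)    # block ids in GPU memory
    secondary: list = field(default_factory=list)  # block ids offloaded to host

    def add_token(self) -> None:
        if self.num_tokens >= self.max_seq_len:
            raise ValueError("sequence exceeds max_seq_len")
        # Linear writing: allocate a fresh block whenever the last one is full,
        # instead of cycling back to the first block of the window.
        if self.num_tokens % self.tokens_per_block == 0:
            self.primary.append(self.next_block_id)
            self.next_block_id += 1
        self.num_tokens += 1
        self._detach_out_of_window()

    def _detach_out_of_window(self) -> None:
        # The window covers tokens [window_start, num_tokens).
        window_start = max(0, self.num_tokens - self.window_size)
        # Block i holds tokens [i * B, (i + 1) * B). Assumed rule: it is out
        # of window once every token it holds precedes window_start.
        while (
            self.primary
            and (self.primary[0] + 1) * self.tokens_per_block <= window_start
        ):
            # Offload from the primary pool to the secondary pool.
            self.secondary.append(self.primary.pop(0))
```

For example, with `window_size=4`, `tokens_per_block=2`, and `max_seq_len=10`, calling `add_token()` ten times allocates five blocks in order; the oldest blocks move to `secondary` as the window advances, and an eleventh call raises `ValueError` because the sequence would exceed `max_seq_len`.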


444 Commits