TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Shunkangz	bddf183e15	[None][feat] Add Request specific exception (#6931 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-09-04 18:43:42 -04:00
Tian Zheng	e257cb3533	[None][feat] Support NVFP4 KV Cache (#6244 ) Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-09-01 09:24:52 +08:00
Richard Huo	ce580ce4f5	[None][feat] KV Cache Connector API (#7228 ) Signed-off-by: jthomson04 <jwillthomson19@gmail.com> Signed-off-by: richardhuo-nv <rihuo@nvidia.com> Co-authored-by: jthomson04 <jwillthomson19@gmail.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-08-28 23:09:27 -04:00
qixiang-99	b165f8bc97	fix/improve kvcache allocation in PyTorch runtime (#5933 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-08-26 12:40:22 +08:00
Robin Kobus	37543a9ad7	[None][refactor] Simplify decoder state initialization for speculative decoding (#6869 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-22 18:44:17 +02:00
brb-nv	9a2b44d0f2	[None][chore] No-op changes to support context parallelism in disaggregated serving later (#7063 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-08-21 08:21:27 -07:00
Yueh-Ting (eop) Chen	020fed97b6	[TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse (#6767 ) This MR is a preliminary MR for implementing the SWA reuse mechanism for the kv cache manager. Please be aware that no functional change is intended in this merge request. The purpose of the clean-up is to decouple and remove existing functions for the up-coming SWA KV cache reuse change to be more natural and easier to review. Right now, (1) streamLLM, and (2) beam search with SWA, are broken. We do not want to complicate the code base by stacking more features upon something that does not work. This MR prunes out the logic and add assertions so we can come back and re-support the broken feature and remove the assertion. Since streamLLM (sink attention) is broken now, assertion is added under `KVCacheManager` ctor to guard for the value of `mSinkBlockTokenLength` and `mSinkBubbleLength`. Compute logics relate to it are pruned. The beam search with SWA will still be broke when introducing the SWA KV cache reuse. We will revisit this problem in the future. On top of this, we should make an effort to update the [supporting matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md) of the kv cache manager after merging the support of SWA KV cache reuse. Changes are listed as following: - Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken` and `KVCacheManager::removeToken`. The functionality should be decoupled. - Push utility `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from `KVCacheManager` down to `WindowBlockManager`. `KVCacheManager`-exposed functions should be real utilities that users of the structure can leverage. Implementation-detailed function calls should not exist at this level. - Simplify "is shared last context block" logic under `KVCacheManager::addSequence`. Since no functional change is intended in this merge request, no test case is added. Several comments are added for future test coverage reminder. For `LlmRequestTest.ParamTest`, `streaming=True` is commented out because we guard sink attention with assertion now. In `capacitySchedulerTest`, `addToken` action to `crossKVCacheManager` is removed because in encoder-decoder model, generation tokens are added only to the decoder and not to the encoder. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-08-20 13:57:57 +08:00
Zero Zeng	953f4fd69e	[None][fix] acceptance rate calculation fix in benchmark_serving (#6746 ) Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>	2025-08-19 17:29:36 +08:00
yifeizhang-c	4127d77678	[https://nvbugs/5394392 ][fix] Enlarge scheduler capacity under disagg bs == 1 (#6537 ) Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>	2025-08-15 09:52:06 -07:00
Robin Kobus	45c7518032	[None][refactor] Simplify decoder state initialization (#6559 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-12 21:44:41 +02:00
bhsueh_NV	83dbc6c75d	[TRTLLM-5532][feat] store the block of context request into kv cache (#6683 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-08-11 16:14:52 +08:00
Chuang Zhu	c566a8d2a2	[None][fix] fix same pp disagg (#6730 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-10 22:45:15 -04:00
Yueh-Ting (eop) Chen	199f306984	[None][chore][kv cache manager] Dead code elimination, we no longer record/fetch through WindowBlockManager:: mContextBlocksByHash (#6249 ) No functional change is intended in this MR. `WindowBlockManager::mCachedBlocksRoot` is now who is responsible for the bookkeeping of the `KVCacheBlock`, and the `mNextBlocks` is now the actual hash map that fetches the block. The `mEnableHashKey` knob and related hashing is removed. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-08-10 09:10:10 -04:00
Ziyi Xiong	de472828b9	[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-08-09 23:15:04 +08:00
Chuang Zhu	e251f7c00b	[None][fix]revert kvcache transfer (#6709 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-08 07:18:53 -04:00
Zheng Duan	ebdc43e69d	[None][feat] move kv cache measure into transfer session (#6633 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-08-08 17:49:22 +08:00
Daniel Cámpora	efca359b66	[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-08-07 22:19:37 -04:00
pcastonguay	453a06e6ab	[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-08-07 14:17:07 +02:00
amitz-nv	85af62184b	[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-08-07 09:05:36 +03:00
Chuang Zhu	ee471df07c	[None][chore] optimize kv cache transfer for context TEP and gen DEP (#6657 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-07 11:36:05 +08:00
ixlmar	1ebceb790d	[TRTLLM-5508][feat] check input tokens + improve error handling (#5170 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-08-05 18:27:43 +01:00
Chuang Zhu	4d040b50b7	[None][chore] ucx establish connection with zmq (#6090 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-05 02:50:45 -04:00
Chuang Zhu	542f552d0b	use cudaSetDevice to create context ,fix nvbug 5394497 (#6403 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-03 13:32:55 -04:00
Robin Kobus	918fedf952	[None][refactor] Simplify finish reasons handling in DecoderState (#6524 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-02 07:17:43 +02:00
Jaedeok Kim	fbee279909	fix: remove duplicate layer multiplication in KV cache size calculation (#6481 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2025-07-31 22:34:34 -04:00
Enwei Zhu	4b299cb77e	feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-31 09:53:52 +08:00
pcastonguay	e7ae5e2824	feat: Add support for disaggregation with pp with pytorch backend (#6369 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: raayandhar <rdhar@nvidia.com> Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: raayandhar <rdhar@nvidia.com> Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-07-30 09:42:13 -04:00
Zheng Duan	c9ed1ab436	[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure (#6135 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-07-30 10:39:40 +08:00
Michal Guzek	08d57123f9	[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974 ) Signed-off-by: moraxu <mguzek@nvidia.com>	2025-07-25 18:10:40 -04:00
Chang Liu	7381f1dba7	[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models (#5444 ) Only supports qwen in this PR	2025-07-21 16:11:58 -07:00
amitz-nv	98428f330e	[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-07-20 08:00:14 +03:00
Bo Deng	0388ff9083	[https://nvbugs/5393961 ][fix] record kv-cache size in MLACacheFormatter (#6181 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-07-19 05:06:45 +08:00
Stefan Niebler	6d7874a467	[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-19 01:40:46 +08:00
Robin Kobus	ec2b953e7e	refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-18 12:12:08 +02:00
Iman Tabrizian	b75e53ab69	Revert "feat: nanobind bindings (#5961 )" (#6160 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-18 10:12:54 +08:00
Linda	5bff317abf	feat: nanobind bindings (#5961 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-17 22:42:52 +08:00
Chuang Zhu	44c70c88f9	chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-07-17 17:42:07 +08:00
Zheng Duan	38db4bc7fb	feat: use session abstraction in data transceiver and cache formatter (#5611 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-07-16 13:52:44 +08:00
Robin Kobus	6d4b045d1f	refactor: Remove enforced sorted order of batch slots (#3502 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-14 17:23:02 +02:00
narutolhy	41ef1ade19	feat:enable kvcache to be reused during request generation (#4028 ) Signed-off-by: narutolhy <582909902@qq.com>	2025-07-10 22:18:01 +09:00
xiweny	eaf8bec88b	fix: Disaggregate serving with attention DP (#4993 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-07-08 16:15:03 +08:00
Robin Kobus	ae27261094	refactor: decoding inputs (#5679 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-06 08:21:02 +02:00
jthomson04	1b588f8390	feat: KV events for sliding window attention (#5580 ) Signed-off-by: jthomson04 <jwillthomson19@gmail.com>	2025-07-05 06:05:20 +08:00
Chuang Zhu	ffc0b8f5da	Cache transceiver support VSWA (#5505 ) Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-07-05 01:18:42 +09:00
Robin Kobus	07f9cf1519	fix: Improve chunking test and skip empty kernel calls (#5710 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-04 09:08:15 +02:00
Netanel Haber	134b2383ff	[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window (#5720 ) Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-07-04 08:16:25 +02:00
Robin Kobus	4cd8543d8c	[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest (#5489 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-02 10:13:31 +02:00
Robin Kobus	9bdc5951f8	refactor: decoder state setup (#5093 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-30 11:09:43 +02:00
Robin Kobus	a8141a4513	refactor: Speculative decoding buffers part 2 (#5316 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-27 17:41:48 +02:00
Aurelien Chartier	833c0dea4a	[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-27 17:03:05 +02:00

1 2 3 4 5

212 Commits