TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
rongwei	526b3ca235	fix mtp rewind corner case	2026-01-08 10:25:20 +08:00
Balaram Buddharaju	a792c23dcf	[TRTLLM-9465][fix] Swap TP-CP grouping order (#10350 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-05 20:08:03 +08:00
Yueh-Ting (eop) Chen	9cee32ab39	[https://nvbugs/5625990 ][fix] Respect VSWA scheme when doing block store for reuse and load block for reuse in KV cache manager (#10183 ) Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-12-29 14:29:14 +08:00
Roey Azran	8408c40d8b	[https://nvbugs/5702786 ][fix] Fix race conditions in KV cache communication during unexpected termination (#10076 ) Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>	2025-12-23 14:09:51 +02:00
Wangjue Yao	9f283f330b	[None][feat] Support Mooncake transfer engine as a cache transceiver backend (#8309 ) Signed-off-by: wjueyao <wyao123@terpmail.umd.edu> Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-12-19 10:09:51 +08:00
Simeng Liu	f21e2b3329	[TRTLLM-9601][feat] Expose mmKeys for multimodal to integrate with dynamo. (#9604 ) Signed-off-by: SimengLiu-nv <simengl@nvidia.com>	2025-12-15 08:42:30 +08:00
Chuang Zhu	4cc4cbe926	[https://nvbugs/5716787 ][fix] terminate nixl running when exiting (#9785 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-12-12 11:15:02 -05:00
Jiagan Cheng	4a3a66b124	[https://nvbugs/5677746 ][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659 ) Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>	2025-12-08 18:43:52 -08:00
Perkz Zheng	992781dc7b	[None][feat] update trtllm-gen nvfp4 kernels with better performance (#9510 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-12-03 21:35:49 +08:00
Thor Johnsen	95049eea86	[https://nvbugs/5627710 ][fix] Fix synchronization bugs in KvCacheTransferManager that can cause corrupted blocks (#9056 ) Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com> Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-12-02 09:10:21 -06:00
Iman Tabrizian	356a52edf5	[None][feat] Add support for KVCache reuse for DSv32 (#9383 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-12-02 11:14:30 +08:00
Patrice Castonguay	1b2da426cd	[https://nvbugs/5680310 ][fix] Fix ctx only timed out test (#9410 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-11-27 11:21:21 +08:00
Robin Kobus	32f53910ef	[TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode (#9308 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-25 22:11:51 +01:00
Chuang Zhu	f95edb53e1	[None][fix] enhance warning in cacheTransBuffer (#9390 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-11-24 02:17:54 -08:00
cheshirekow	1379cfac3a	[TRTLLM-9197][infra] Move thirdparty stuff to its own listfile (#8986 ) Signed-off-by: Josh Bialkowski <1309820+cheshirekow@users.noreply.github.com> Co-authored-by: Josh Bialkowski <1309820+cheshirekow@users.noreply.github.com>	2025-11-20 16:44:23 -08:00
Chuang Zhu	8846dac9b4	[https://nvbugs/5578175 ][fix] Fix block range index (#8470 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Kanghwan	41e5870a70	[#8476 ][chore] Update license (#8807 ) Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>	2025-11-19 15:05:25 -08:00
Patrice Castonguay	9b0f45298f	[None][feat] Have ability to cancel disagg request if KV cache resource are exhausted (#9155 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-11-18 20:59:17 -05:00
Robin Kobus	9913dc25ae	[None][refactor] decoding inputs, part 2 (#5799 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-18 14:38:51 +01:00
Robin Kobus	df41f220a2	[TRTLLM-8831][feat] Enable early exit with overlap scheduler (#8587 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-17 18:07:13 +01:00
Iman Tabrizian	cdde15b275	[TRTLLM-8540][feat] Add support for disagg in DSv3.2 (#8735 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-11-12 08:21:11 -08:00
brb-nv	d798d66976	[TRTLLM-7731][feat] Avoid over-allocation of KV cache for transmission in disagg with CP (#8145 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-10-31 17:32:39 -07:00
Iman Tabrizian	ae6875fe10	[TRTLLM-8976][feat] Move indexer-k-cache to KVCacheManager (#8699 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-10-29 08:04:26 -07:00
Zheng Duan	fea5bfbda7	[None][feat] add detailed KV cache transfer time breakdown (#8521 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-10-29 10:11:09 +08:00
Chuang Zhu	2420918e5b	[TRTLLM-7078][chore] optimal kvcache transfer for VWSA (#7952 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-10-24 08:58:16 -04:00
Aurelien Chartier	32e1ad68e1	[None][chore] Cleanup GDS code (#8475 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-10-23 12:36:31 -07:00
Shi Xiaowei	a0024f4d34	[None][doc] Facilitates the integration of the transfer agent (#7867 ) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-10-21 20:06:24 +08:00
Yueh-Ting (eop) Chen	128a351bdc	[None][fix] Avoid overwrite of `kv_cache_config.max_tokens` for VSWA scheme for the KVCacheManager (#8219 ) For the VSWA scheme, we do not want `kv_cache_config.max_tokens` to control and cap the maximum memory of a block pool, because block pool sizes are not identical among different window sizes. This MR omits the effect of `kv_cache_config.max_tokens` under `kvCacheManager.cpp` so that the block pool size relies on the window size to share ratio and the total GPU memory analyzed and fed to the KV cache manager. The skip applies only to the VSWA scheme; no extra coverage was added. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-10-20 10:48:40 +09:00
Bo Deng	dd25595ae8	[TRTLLM-7964][infra] Set nixl to default cache transceiver backend (#7926 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-10-19 19:24:43 +08:00
jthomson04	852316886e	[None][fix] Fix KV event consumption (#6346 ) Signed-off-by: jthomson04 <jwillthomson19@gmail.com>	2025-10-18 15:41:26 -07:00
Chuang Zhu	40d129a415	[None][fix] Fix cache buffer size for window (#8320 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-10-16 09:01:11 +08:00
Chuang Zhu	8733e830fc	[None][fix] Add lock for request_to_session in sendReadySingal (#8310 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-10-14 04:32:37 -07:00
Yueh-Ting (eop) Chen	4882815fa1	[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922 ) This MR is a continuation of #6768. In the previous merge request, OOW (out-of-window) blocks are only detached when reuse is not enabled; that is, the block movement behavior is identical between SWA and full attention when reuse is enabled. This merge request enables OOW block detach when reuse is enabled. The required changes are: - Let the KV cache manager keep track of which block is used by which sequence - Remove the restriction that prevents the eviction policy from releasing a non-leaf block. Along the way, bugs inside freeChildren and in the offload mechanism under getFreeBlock were resolved because they would affect the functionality this merge request is trying to achieve. When a block goes OOW, it is released from the sequence and becomes available to be reclaimed; the block is held by the eviction policy for another sequence to acquire. On the other hand, we may want to store the sequence for reuse. To achieve this safely, block ownership is recorded under WindowBlockManager::getFreeBlock. If the acquired block was originally owned by another sequence that is still live inside the manager, then that sequence is invalidated for store for reuse. At the end of a sequence (when removeSequence is called on it), the KV cache manager checks whether any of the sequence's blocks were reclaimed by another sequence. If none were, the sequence is safe to store for reuse and the store-for-reuse action is performed. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-10-13 09:18:12 -07:00
Chuang Zhu	85f157f389	[None][fix] Add Lock to protect mReqeustToSession (#8085 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>	2025-10-10 21:51:50 +08:00
Jonas Yang CN	88ea2c4ee9	[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-10-04 08:12:24 +08:00
Robin Kobus	e2f69c5c23	[None] [refactor] Minor cleanup and improvements (#7619 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-10-03 11:40:06 +02:00
Yilin Fan	01423ac183	[None][feat] perf_metrics endpoint functionality improvement (#8005 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com> Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>	2025-10-02 17:43:25 -07:00
Patrice Castonguay	fefa7d8fa3	[None][feat] Support for cancelling requests with disaggregation (#8114 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-10-02 11:04:26 -07:00
Iman Tabrizian	33282351a2	[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-09-27 19:29:30 -04:00
Yueh-Ting (eop) Chen	2db22fb4e5	[None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager (#7923 ) By default, we allocate equal shares of memory to all window sizes (see the else case). With TRTLLM_WINDOW_SIZE_SHARES, we can override this behavior to adjust the memory share of each window size. For example, if we have window sizes of [512, 32768], then setting TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 allocates 40% of the memory to window size 512 and 60% of the memory to window size 32768. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-09-26 14:09:01 +08:00
Chuang Zhu	f98fa0cf8b	[None][feat] Optimize kv cache transfer TEP (#7613 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-09-25 20:20:04 -07:00
Yueh-Ting (eop) Chen	c5012423f5	[None][chore] Remove developer name in comment (#7981 ) Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-09-25 06:43:38 -07:00
Guoming Zhang	202bed4574	[None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-25 21:02:35 +08:00
Yueh-Ting (eop) Chen	cf100933cc	[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768 ) This merge request attempts to support more SWA KV cache functionality inside the KV cache manager. Before this merge request, the KV cache for sliding window attention (SWA) only holds "window size" number of blocks and reuses them in a cyclic manner. With this design we cannot utilize more GPU memory, which limits the max batch size and throughput. Additionally, we cannot support KV cache reuse with this design. In this MR, we change this behavior to let the manager write blocks in a linear manner. With linear block writing, as the attention window moves on, the out-of-window (OOW) blocks are detached. For now, to get a correct feature first, we directly offload the OOW block from the primary block pool (GPU memory) to the secondary block pool (host memory). We will improve this in the future by delegating the block movement to the eviction policy. KV cache reuse for SWA is not developed in this merge request and will be added in a follow-up merge request. When writing the blocks linearly, the maximum number of blocks allocated for a sequence (`GenerationRequest`) is determined by the "max sequence length" specified. The `GenerationRequest` that stores the cache block bookkeeping structure will now keep "max sequence length" tokens of blocks. Given the above, the main changes are (more context in the MR): - Remove the "cyclic" concept under the kv cache manager; this concept originally guarded block reuse under the kv cache manager. - Add a detach mechanism and place it under `KVCacheManager::addToken`. Please note that detach is still guarded off for SWA when reuse is enabled. A follow-up merge request will improve this. 
- Enforce "max sequence length" to be a non-optional parameter to the `KVCacheManager`/`BlockManager` - Let all window size resource pools get an identical proportion of memory - Fix free memory calculation under `resource_manager.py` Signed-off-by: eopXD <yuehtingc@nvidia.com> Co-authored-by: Tomer Asida <tasida@nvidia.com>	2025-09-24 14:28:24 +08:00
Zheng Duan	e3c1a9409f	[TRTLLM-6549][fix] add kv cache time output back (#7798 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-09-23 14:12:42 -04:00
Enwei Zhu	8330d5363a	[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-09-23 09:10:09 +08:00
brb-nv	8879ec4d35	[https://nvbugs/5501557 ][fix] Fix out-of-bounds vector access for model with multiple layer types (#7636 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-22 14:28:38 +08:00
brb-nv	e10a027a03	[TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side (#7624 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-09-20 06:15:26 -07:00
Iman Tabrizian	6ce0624208	[TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver (#7659 )	2025-09-16 08:43:56 -04:00
Tomer Shmilovich	ecc0e687c6	[None][feat] Nixl support for GDS (#5488 ) Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com> Signed-off-by: Guy Lev <glev@nvidia.com> Co-authored-by: Guy Lev <glev@nvidia.com>	2025-09-09 13:00:38 +08:00

266 Commits