Nikita Korobov
9b3d7cc3e6
[None][feat] Update TRT-LLM Gen MoE kernels ( #7970 )
...
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-10-03 09:22:45 +08:00
Yilin Fan
01423ac183
[None][feat] perf_metrics endpoint functionality improvement ( #8005 )
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-02 17:43:25 -07:00
Patrice Castonguay
fefa7d8fa3
[None][feat] Support for cancelling requests with disaggregation ( #8114 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-10-02 11:04:26 -07:00
dongfengy
6568e565db
[TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss ( #7916 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-02 10:47:04 -07:00
yifeizhang-c
34d158b6da
[TRTLLM-6589][feat] Support CUDA graph for DeepEP ( #7514 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-10-02 10:13:24 -07:00
bhsueh_NV
38d6e4e60b
[None][feat] Support Qwen3 next ( #7892 )
...
Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-29 21:16:07 +08:00
xiweny
48e779ae8c
[ https://nvbugs/5541494 ] [fix] add back missing sm100f bmm kernels ( #8051 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-29 05:35:44 -04:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path ( #6348 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00
Jhao-Ting Chen
c33f43e13a
[ https://nvbugs/5518713 ][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) ( #7856 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-09-26 14:29:32 -07:00
Yueh-Ting (eop) Chen
2db22fb4e5
[None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager ( #7923 )
...
By default, we allocate an equal share of memory to every
window size (see the else case). With TRTLLM_WINDOW_SIZE_SHARES,
we can override this behavior and adjust the memory share of
each window size. For example, with window sizes of [512, 32768],
setting TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 allocates 40% of the
memory to window size 512 and 60% of the memory to window
size 32768.
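To make the override concrete, here is a minimal Python sketch of the allocation logic described above. The helper name and the sum-to-1.0 validation are illustrative assumptions, not the actual KV cache manager code:

```python
import os

def split_memory_by_window(window_sizes, total_bytes):
    """Split KV cache memory across attention window sizes.

    Honors TRTLLM_WINDOW_SIZE_SHARES when set (comma-separated
    fractions, one per window size); otherwise falls back to
    equal shares for all window sizes, as described above.
    """
    env = os.environ.get("TRTLLM_WINDOW_SIZE_SHARES")
    if env:
        shares = [float(s) for s in env.split(",")]
        if len(shares) != len(window_sizes):
            raise ValueError("need one share per window size")
        if abs(sum(shares) - 1.0) > 1e-6:  # sketch assumes shares sum to 1
            raise ValueError("shares must sum to 1.0")
    else:  # the default "else case": equal proportions
        shares = [1.0 / len(window_sizes)] * len(window_sizes)
    return {w: int(total_bytes * s) for w, s in zip(window_sizes, shares)}

# TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 with window sizes [512, 32768]
# yields 40% of memory for window 512 and 60% for window 32768.
```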
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-09-26 14:09:01 +08:00
Chuang Zhu
f98fa0cf8b
[None][feat] Optimize kv cache transfer TEP ( #7613 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-25 20:20:04 -07:00
Yuan Tong
fae83c387b
[ #6102 ][fix] support non-system python installation ( #7763 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-26 10:16:15 +08:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op ( #6815 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Yueh-Ting (eop) Chen
c5012423f5
[None][chore] Remove developer name in comment ( #7981 )
...
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-09-25 06:43:38 -07:00
Guoming Zhang
202bed4574
[None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. ( #7851 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … ( #7850 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Jinyang Yuan
b622cde5d5
[None][perf] Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices ( #7419 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-09-25 10:27:57 +02:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine ( #7927 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels ( #7821 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Perkz Zheng
60101eb8a5
[None][fix] trtllm-gen cubins compiled with wrong arch. ( #7953 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-09-24 04:13:36 -07:00
HuiGao-NV
29e63d3bc2
[ https://nvbugs/5532248 ][fix] Fix fused_moe OOM ( #7931 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 02:22:38 -07:00
qsang-nv
929ef4c474
[None][chore] remove cubins for ci cases ( #7902 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-09-24 14:56:31 +08:00
xiweny
276d83c898
[ https://nvbugs/5532225 ] [fix] MoE use stream-dependent workspace ( #7940 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-24 14:44:27 +08:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse ( #6768 )
...
This merge request supports more SWA KV cache functionality
inside the KV cache manager. Before this change, the KV cache for
sliding window attention (SWA) held only a "window size" number of
blocks and reused them cyclically. That design cannot utilize
additional GPU memory, which limits the maximum batch size and
throughput, and it cannot support KV cache reuse.
In this MR, we change that behavior so the manager writes blocks
linearly. With linear block writing, out-of-window (OOW) blocks are
detached as the attention window moves on. For now, prioritizing a
correct feature first, we directly offload each OOW block from the
primary block pool (GPU memory) to the secondary block pool (host
memory). We will improve this in the future by delegating the block
movement to the eviction policy.
KV cache reuse for SWA is not implemented in this merge request and
will be added in a follow-up merge request.
With linear block writing, the maximum number of blocks allocated
for a sequence (`GenerationRequest`) is bounded by the specified
"max sequence length". The `GenerationRequest`, which stores the
cache block bookkeeping structure, now keeps blocks for "max
sequence length" tokens.
Given the above, the main changes are (more context in the MR; a
minimal sketch follows the list):
- Remove the "cyclic" concept from the kv cache manager; this
  concept originally guarded block reuse there.
- Add a detach mechanism under `KVCacheManager::addToken`. Note
  that detach is still disabled for SWA when reuse is enabled; a
  follow-up merge request will improve this.
- Enforce "max sequence length" as a non-optional parameter to
  `KVCacheManager`/`BlockManager`.
- Give every window size's resource pool an identical proportion
  of memory.
- Fix the free memory calculation in `resource_manager.py`.
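Below is a minimal, hypothetical Python sketch of the linear-writing and detach behavior described above. The class and field names are illustrative, not the actual `KVCacheManager` API:

```python
from collections import deque

BLOCK_TOKENS = 32  # tokens per KV cache block (illustrative)

class SwaBlockSequence:
    """Linear block bookkeeping for one sequence (GenerationRequest).

    Blocks are written linearly up to "max sequence length" tokens.
    As the attention window slides, out-of-window (OOW) blocks are
    detached from the primary (GPU) pool and offloaded to the
    secondary (host) pool, rather than reused cyclically.
    """

    def __init__(self, window_size: int, max_seq_len: int):
        self.window_blocks = window_size // BLOCK_TOKENS
        self.max_blocks = max_seq_len // BLOCK_TOKENS  # linear upper bound
        self.tokens = 0
        self.primary = deque()   # in-window blocks, GPU pool
        self.secondary = []      # detached OOW blocks, host pool

    def add_token(self) -> None:
        self.tokens += 1
        total = len(self.primary) + len(self.secondary)
        # Allocate a fresh block at each block boundary (linear writing).
        if self.tokens > total * BLOCK_TOKENS and total < self.max_blocks:
            self.primary.append(f"block-{total}")
        # Detach blocks that slid out of the attention window.
        while len(self.primary) > self.window_blocks:
            self.secondary.append(self.primary.popleft())
```

With window_size=512 and max_seq_len=32768, a sequence eventually holds 16 in-window GPU blocks while older blocks accumulate in host memory instead of being overwritten in place.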
Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Jhao-Ting Chen
220dc01372
[None][feat] support JIT mha.cu for SPEC_DEC in runtime ( #6078 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-09-23 14:56:17 -07:00
Zheng Duan
e3c1a9409f
[TRTLLM-6549][fix] add kv cache time output back ( #7798 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-23 14:12:42 -04:00
Perkz Zheng
bb64e7462c
[None][fix] fix a bug with trtllm-gen kernels + attention sinks ( #7919 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-09-23 00:32:04 -07:00
Pengbo Wang
a4b4ed4535
[None][fix] Fix and add test for TRTLLM MoE backend ( #7755 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 11:26:25 +08:00
Pengbo Wang
08cc7a041f
[ https://nvbugs/5355128 ][fix] Add missing wgmma intrinsic for starcoder ( #7643 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 10:38:58 +08:00
Enwei Zhu
8330d5363a
[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) ( #7893 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 09:10:09 +08:00
ChristinaZ
be576a3152
[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend ( #6794 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 08:24:21 +08:00
Bo Deng
8cf95681e6
[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package ( #7766 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-09-22 16:43:35 +08:00
brb-nv
8879ec4d35
[ https://nvbugs/5501557 ][fix] Fix out-of-bounds vector access for models with multiple layer types ( #7636 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
xiweny
822cb0115b
[TRTLLM-6286] [perf] Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm ( #7757 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Co-authored-by: djns99 <40156487+djns99@users.noreply.github.com>
2025-09-21 11:38:17 +08:00
brb-nv
e10a027a03
[TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side ( #7624 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-09-20 06:15:26 -07:00
Mike Iovine
8030b540ac
[ https://nvbugs/5522462 ][fix] Fix FP8 scout illegal memory access ( #7845 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-19 10:30:37 -04:00
Matthias Jouanneaux
1be7faef37
[TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels ( #6904 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-19 20:55:32 +08:00
Chuang Zhu
c98b9468af
[None][fix] get local IP by connecting to a remote host ( #7719 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-19 10:01:03 +08:00
xiweny
423e5f6a3c
[TRTLLM-6286] [feat] Update CUTLASS to 4.2 and enable SM103 group gemm ( #7832 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-19 09:50:54 +08:00
Yuxian Qiu
d6ebcf7c4a
[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) ( #7610 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 09:40:49 +08:00
QI JUN
7f87b278bc
[None][chore] remove generated fmha_cubin.h from source tree ( #7836 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-18 20:10:04 +08:00
Wanli Jiang
a7ca0fff54
[TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend ( #7207 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 16:26:20 +08:00
Matthias Jouanneaux
022d77807d
[TRTLLM-5966][feat] Helix: make softmax stats pointer available to attention gen ( #6865 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-18 05:01:24 +08:00
Shiyu Li
8bdbb48264
[ https://nvbugs/5489015 ][fix] Support communicator split in MNNVL allreduce and fix the binding issues. ( #7387 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-09-17 07:43:20 +08:00
Iman Tabrizian
6ce0624208
[TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver ( #7659 )
2025-09-16 08:43:56 -04:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices ( #7568 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
jmydurant
7deefb3d2b
[TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill ( #7477 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-09-15 21:43:49 +08:00
Zheng Duan
24fc1f9acf
[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow ( #7553 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-15 07:26:01 -04:00
Perkz Zheng
1b29c2e731
[None][feat] support gpt-oss with fp8 kv cache ( #7612 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-09-15 02:17:37 +08:00
Fan - Yunfan
e3117731b3
[None][fix] Fix the incorrect header file import in dataType.h ( #7133 )
...
Signed-off-by: fanyunfan <2569548856@qq.com>
Co-authored-by: fanyunfan <2569658856@qq.com>
Co-authored-by: Yunfan Fan <46273019+fyf2016@users.noreply.github.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
2025-09-11 08:59:04 +08:00