Commit Graph

836 Commits

Author SHA1 Message Date
Kanghwan
41e5870a70
[#8476][chore] Update license (#8807)
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
2025-11-19 15:05:25 -08:00
Bo Li
d8b05894ee
[None][perf] Adjust select_alltoall_method_type. (#8950)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-19 07:43:55 -08:00
CarstyYou
ee941ac779
[https://nvbugs/5456493][feat] add fp8 dense for sm120 (#9174)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-19 14:40:34 +08:00
ChristinaZ
941a54c66a
[None][feat] Update the indexer topK (#9255)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-19 11:49:00 +08:00
ChristinaZ
fbf6c16cd2
[None][fix] Update the default invalid value for deepseek mode of routing (#9222)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-19 10:14:06 +08:00
Patrice Castonguay
9b0f45298f
[None][feat] Have ability to cancel disagg request if KV cache resource are exhausted (#9155)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-18 20:59:17 -05:00
Enwei Zhu
7c4777a571
[TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM (#8880)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-18 17:40:12 -08:00
Nikita Korobov
fe569f0594
[None][feat] bias for FP4 TRT-LLM Gen MoE (#9220)
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-11-18 09:59:47 -08:00
Robin Kobus
9913dc25ae
[None][refactor] decoding inputs, part 2 (#5799)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-18 14:38:51 +01:00
Gal Hubara-Agam
5e5300898b
[#8732][feat] Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2025-11-17 20:30:00 -08:00
zackyoray
e3c9a97075
[None][feat] Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
2025-11-17 15:39:55 -08:00
Robin Kobus
df41f220a2
[TRTLLM-8831][feat] Enable early exit with overlap scheduler (#8587)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-17 18:07:13 +01:00
Kaiyu Xie
04be5a704e
[None] [fix] Fix missing ActivationType issue (#9171)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-17 10:43:25 +08:00
Anthony Chang
86cfb3ea7e
[None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound (#9025)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-11-17 10:04:29 +08:00
sunnyqgg
7862b15a65
[TRTLLM-8778][feat] Add tree attention support for blackwell arch (#8975)
Signed-off-by: qgai <qgai@nvidia.com>
2025-11-17 09:01:53 +08:00
heyuhhh
f07e9977c6
[None] [feat] Use triton kernels for RocketKV prediction module (#8682)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-11-13 18:51:09 -08:00
Neta Zmora
34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires the ReLU2 (i.e., squared ReLU) MoE activation function.
This PR adds that support and, more generally, an API for setting the activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.

The PR also updates the Auto Deploy MoE backend for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
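ReLU2 (squared ReLU) from the commit above is simply relu(x) squared. A minimal PyTorch sketch of a reference implementation, assuming only standard torch ops (this is not the Cutlass kernel code, and the function name is hypothetical):

```python
import torch

def relu2_reference(x: torch.Tensor) -> torch.Tensor:
    # ReLU2 / squared ReLU: max(x, 0) ** 2
    return torch.square(torch.relu(x))

# Sanity check: negatives clamp to zero, positives are squared.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu2_reference(x))  # tensor([0.0000, 0.0000, 0.0000, 2.2500, 9.0000])
```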
dongxuy04
a370643b26
[None][fix] support topk autotuner input for expert slot per group larger than 32 (#9087)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-11-14 08:37:20 +08:00
Iman Tabrizian
9ef7eb70e0
[None][fix] Fix KV cache manager test warnings (#9103)
2025-11-13 07:23:04 -08:00
Perkz Zheng
22c1748b80
[TRTLLM-8816][feat] add optimized trtllm-gen attention kernels on sm103 (#9081)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-11-13 12:41:07 +08:00
Iman Tabrizian
cdde15b275
[TRTLLM-8540][feat] Add support for disagg in DSv3.2 (#8735)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-11-12 08:21:11 -08:00
Jiagan Cheng
1a56722697
[None][fix] Remove unnecessary attention workspace memory check (#9064)
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-11-12 11:18:50 +08:00
xiweny
50c486367a
[https://nvbugs/5619396][fix] Add sm103 to CutlassFP8RowwiseGemm (#9042)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-11-10 08:12:14 -08:00
ChristinaZ
2e7769d1e8
[None][feat] Add customized topk and related unit tests for DSA (#8882)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-10 03:35:35 -08:00
bhsueh_NV
e8d4a56dd0
[None][fix] fix eagle3 accuracy issue on sm120 (#8944)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-11-10 14:02:03 +08:00
Chang Liu
7081f254cf
[None][perf] Add custom indexer k cache scatter op (#8960)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-07 11:24:26 -08:00
DylanChen-NV
b275635a9a
[https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill (#8910)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-11-06 07:41:21 -08:00
yunruis
51545560da
[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation (#8495)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-11-06 17:39:57 +08:00
Perkz Zheng
222bc911cd
[None][feat] add swapsMmaAb sparseMla kernels (#8913)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-11-05 09:32:34 -08:00
Shiyu Li
eeb56c2848
[None][feat] MNNVLAllreduce Kernel Refactor (#8018)
Signed-off-by: Shiyu Li <timlee0212@outlook.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-11-05 08:49:47 +08:00
shuyixiong
70e4d72ffa
[TRTLLM-8511][feat] Add update_weights and sleep_wakeup support for rl integration (#8302)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
Co-authored-by: Liwei Ma <liweim@nvidia.com>
Co-authored-by: Jonas Yang CN <joyang@nvidia.com>
2025-11-04 10:19:24 -08:00
Bo Li
e4bf29bc66
[None][feat] Integrate MnnvlThroughput into TRTLLM MoE. (#8728)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-04 21:36:29 +08:00
CarstyYou
4296c9553d
[TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 (#8844)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-04 18:10:36 +08:00
Yukun He
2225745782
[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870)
We have seen perf regressions on A100/H100 from using the one-shot kernel instead of NCCL, so it is beneficial to have a solid benchmark of the allreduce op and to analyze the data collected from it.

Implemented a new AllreduceOp heuristic:
- Added a linear-programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code-generation script.

Minor AllreduceOp fixes:
- Fixed an issue in AllreduceOp where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up the dispatching code in AllreduceOp.

This PR will fix the perf gaps reported in:
https://nvbugspro.nvidia.com/bug/5517023

For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
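The commit above adds a LUT-based heuristic for picking an allreduce strategy. A hedged sketch of how a lookup from (world size, message size) to ONESHOT/TWOSHOT/NCCL could work; the enum, table contents, and thresholds below are illustrative assumptions, not the shipped heuristic:

```python
from enum import Enum

class AllReduceStrategy(Enum):
    ONESHOT = "oneshot"
    TWOSHOT = "twoshot"
    NCCL = "nccl"

# Illustrative LUT: for each world size, (max_message_bytes, strategy) entries
# sorted by message size. All thresholds here are made up for the sketch.
_LUT = {
    2: [(1 << 20, AllReduceStrategy.ONESHOT), (float("inf"), AllReduceStrategy.NCCL)],
    4: [(512 << 10, AllReduceStrategy.ONESHOT), (4 << 20, AllReduceStrategy.TWOSHOT),
        (float("inf"), AllReduceStrategy.NCCL)],
    8: [(256 << 10, AllReduceStrategy.ONESHOT), (2 << 20, AllReduceStrategy.TWOSHOT),
        (float("inf"), AllReduceStrategy.NCCL)],
}

def select_strategy(world_size: int, message_bytes: int) -> AllReduceStrategy:
    # Walk the size buckets for this world size; fall back to NCCL if unknown.
    for max_bytes, strategy in _LUT.get(world_size, []):
        if message_bytes <= max_bytes:
            return strategy
    return AllReduceStrategy.NCCL

print(select_strategy(8, 128 << 10))  # AllReduceStrategy.ONESHOT
```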
Zhenhuan Chen
34fbc7052c
[https://nvbugs/5545522][fix] move PREEXIT in UB kernels to fix accuracy issue (#8318)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Matthias Jouanneaux
d0f107e4dd
[TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-11-04 09:06:58 +08:00
Perkz Zheng
497a07021d
[None][update] optimized sparse mla kernels && fix unspecified cuda launch (#8866)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-11-02 22:26:59 -08:00
qsang-nv
0f42a24f45
[None][feat] Fix attention sink load in xqa (#8836)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-11-03 09:39:45 +08:00
Bo Li
4c5a8f4ec6
[None][fix] Rename: slot_count -> invalid_expert_id (#8783)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-01 21:36:59 +08:00
brb-nv
d798d66976
[TRTLLM-7731][feat] Avoid over-allocation of KV cache for transmission in disagg with CP (#8145)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-31 17:32:39 -07:00
Fanrong Li
f0dc746738
[TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-31 14:38:31 -07:00
Zhenhuan Chen
603ec03fb1
[https://nvbugs/5575687][fix] fix moe_gemm's preexit position that cause illegal memory access (#8786)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2025-10-31 09:08:23 +08:00
Anthony Chang
f666ad2f6b
[None][feat] Autotuner can iterate through all tactics for test purposes (#8663)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-30 13:11:25 +01:00
ChristinaZ
13cfd70f57
[None][feat] Add unit tests and revision in block_level kernel for invalid input (#8718)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-30 16:42:18 +08:00
Iman Tabrizian
ae6875fe10
[TRTLLM-8976][feat] Move indexer-k-cache to KVCacheManager (#8699)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-10-29 08:04:26 -07:00
dongxuy04
00eaf5f883
[None][feat] add flag for EPLB to force using GDRCopy (#8650)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-29 13:33:26 +08:00
Chang Liu
5f737b8dbe
[None][perf] Use fp8 quant kernel in DS3.2 indexer module (#8701)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-10-29 12:45:09 +08:00
Cheng Hang
15c293a90b
[None][feat] Enable nvfp4 cuda core for sm120 (#8620)
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-10-29 12:39:03 +08:00
Zheng Duan
fea5bfbda7
[None][feat] add detailed KV cache transfer time breakdown (#8521)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-10-29 10:11:09 +08:00
Chuang Zhu
b828b6445b
[https://nvbugs/5612529][fix] Fix transferAgent_test (#8710)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-29 09:14:34 +08:00
dongxuy04
b37a8a9a74
[None][fix] fix EPLB init hang (#8649)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-28 05:22:49 -04:00
Aurelien Chartier
1401a3c09c
[None][feat] Add FP8 rowwise GEMMs for B200 (#8332)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-27 16:33:14 -04:00
Bo Li
9c4432f8a4
[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. (#7499)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-27 13:23:06 -04:00
nvxuanyuc
d1398c05e6
[None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127)
Signed-off-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-10-27 13:12:31 -04:00
Jinyang Yuan
0a0f93d4a8
[None][fix] Fix the performance issue of FP8 blockwise grouped GEMM when using attention DP (#8501)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-10-27 10:18:19 +08:00
Chang Liu
e47c787dd7
[TRTLLM-8535][feat] Support DeepSeek V3.2 with FP8 + BF16 KV cache/NVFP4 + BF16 KV cache (#8405)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-24 13:40:41 -04:00
Chuang Zhu
2420918e5b
[TRTLLM-7078][chore] optimal kvcache transfer for VWSA (#7952)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-24 08:58:16 -04:00
Aurelien Chartier
32e1ad68e1
[None][chore] Cleanup GDS code (#8475)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-23 12:36:31 -07:00
Shijie
928247a3f9
[https://nvbugs/5451205][feat] Add cuBLASLt NVFP4 GEMM backend support (#7943)
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
2025-10-23 15:55:10 +08:00
dongxuy04
a7c2c8c212
[None][fix] Allow multi-threaded copy for GDRCopy wrapper (#8535)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-23 10:25:04 +08:00
Anthony Chang
8a3b870e09
[None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN (#8156)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-23 09:14:18 +08:00
Anish Shanbhag
15de45d782
[TRTLLM-8682][chore] Remove auto_parallel module (#8329)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2025-10-22 20:53:08 -04:00
dongxuy04
df689f8fed
[None][fix] Fix EPLB CPU thread NUMA binding (#8579)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-22 10:52:09 -04:00
Patrice Castonguay
879039f6d5
[https://nvbugs/5429636][feat] Kv transfer timeout (#8459)
Signed-off-by: raayandhar <raayan.dhar@gmail.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <raayan.dhar@gmail.com>
2025-10-22 09:29:02 -04:00
qsang-nv
07edac2818
[None][feat] Add vLLM KV Pool support for XQA mla kernel (#8560)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-10-22 14:12:57 +08:00
Shi Xiaowei
a0024f4d34
[None][doc] Facilitates the integration of the transfer agent (#7867)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-10-21 20:06:24 +08:00
Yuxian Qiu
ec32711b1e
[https://nvbugs/5542862][fix] Upgrade fmha_v2. (#8364)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-20 10:20:23 +08:00
ChristinaZ
c8b9998acb
[TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-20 10:08:31 +08:00
Yueh-Ting (eop) Chen
128a351bdc
[None][fix] Avoid overwrite of kv_cache_config.max_tokens for VSWA scheme for the KVCacheManager (#8219)
For the VSWA scheme, we do not want `kv_cache_config.max_tokens` to control
and cap the maximum memory of a block pool, because block pool sizes are not
identical across different window sizes. This MR ignores
`kv_cache_config.max_tokens` in `kvCacheManager.cpp` so that block pool sizes
are determined by the window-size share ratio and the total GPU memory
analyzed and fed to the KV cache manager.

The skip applies only to the VSWA scheme; no extra test coverage was added.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-20 10:48:40 +09:00
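To make the sizing rule in the commit above concrete, here is a hedged sketch of splitting a KV cache memory budget across attention-window sizes by a share ratio instead of capping each pool with `max_tokens`; the function, the proportional-to-window-size ratio, and all numbers are illustrative assumptions, not the `kvCacheManager.cpp` logic:

```python
def pool_blocks_by_window_share(free_memory_bytes: int,
                                window_sizes: list[int],
                                bytes_per_block: int) -> dict[int, int]:
    # Assumed share ratio: each window gets memory proportional to its size,
    # with no per-pool max_tokens cap (mirroring the VSWA skip described above).
    total = sum(window_sizes)
    return {w: int(free_memory_bytes * w / total) // bytes_per_block
            for w in window_sizes}

# Example: a 1 GiB budget split across 4K and 32K windows with 2 MiB blocks.
print(pool_blocks_by_window_share(1 << 30, [4096, 32768], 2 << 20))
# {4096: 56, 32768: 455}
```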
Bo Deng
dd25595ae8
[TRTLLM-7964][infra] Set nixl to default cache transceiver backend (#7926)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-10-19 19:24:43 +08:00
jthomson04
852316886e
[None][fix] Fix KV event consumption (#6346)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-10-18 15:41:26 -07:00
Wanli Jiang
56f697be2e
[None][feat] Add fmha_v2 kernel for head_dim=80 and sm=100 to support VLM (#8392)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-10-17 19:42:47 +08:00
Perkz Zheng
0722717ec0
[None][fix] trtllm-gen regression in PR 8301 (#8426)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-10-17 03:21:31 -07:00
Iman Tabrizian
22eb1633ae
[None][bug] Set NCCL_GRAPH_REGISTER to false to avoid hang (#8413)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-10-16 18:59:18 +02:00
Patrice Castonguay
b7602f7bd4
[https://nvbugs/5534837][fix] Fix KV cache split on long context (#8247)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Min Yu
0a0159fdd8
[https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend (#7286)
Signed-off-by: Min Yu <171526537+yumin066@users.noreply.github.com>
2025-10-16 11:07:48 +08:00
ChristinaZ
db1c271bc6
[None][feat] Revise the calculation related to TileN in routing of MOE TRTLLM backend (#8148)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-16 09:15:46 +08:00
Chuang Zhu
40d129a415
[None][fix] Fix cache buffer size for window (#8320)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-16 09:01:11 +08:00
Fanrong Li
0d20a8fd61
[TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
Co-authored-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-10-14 08:23:16 -07:00
Aurelien Chartier
7291cdc422
[https://nvbugs/5404000][fix] Ensure consistency between firstTokenTime and lastTokenTime (#8294)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-14 08:15:08 -04:00
Chuang Zhu
8733e830fc
[None][fix] Add lock for request_to_session in sendReadySingal (#8310)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-14 04:32:37 -07:00
Yueh-Ting (eop) Chen
4882815fa1
[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922)
This MR is a continuation of #6768. In the previous merge request,
OOW (out-of-window) blocks were only detached when reuse was not enabled;
that is, block movement behavior was identical between SWA and full
attention when reuse was enabled.

This merge request enables OOW block detach when reuse is
enabled. The required changes are:

- Let the KV cache manager keep track of which block is used by which
  sequence
- Remove the restriction so that the eviction policy can release a
  non-leaf block

Along with this development, bugs in freeChildren and in the offload
mechanism under getFreeBlock are fixed, since they affect the
functionality this merge request enables.

When a block goes OOW, it is released from its sequence and handed to the
eviction policy, where it becomes available for another sequence to
reclaim. At the same time, we may still want to store the sequence for
reuse. To make this safe, block ownership is recorded in
WindowBlockManager::getFreeBlock: if an acquired block was originally
owned by another sequence that is still live inside the manager, that
sequence is invalidated for store-for-reuse.

At the end of a sequence (when removeSequence is called on it),
the KV cache manager checks whether any of the sequence's blocks have
been reclaimed by another sequence. If none have, the sequence is safe to
store for reuse, and the store-for-reuse action is performed.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-13 09:18:12 -07:00
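The commit above hinges on recording which sequence owns each block when blocks are handed out and invalidating store-for-reuse when a live sequence's block is reclaimed. A hedged Python sketch of that bookkeeping, assuming illustrative class and method names (the real logic lives in the C++ WindowBlockManager and eviction policy):

```python
class BlockOwnershipTracker:
    """Illustrative bookkeeping for OOW block detach with reuse enabled."""

    def __init__(self):
        self._owner = {}      # block_id -> sequence_id that last held the block
        self._reuse_ok = {}   # sequence_id -> still eligible for store-for-reuse
        self._live = set()    # sequences currently active in the manager

    def add_sequence(self, seq_id):
        self._live.add(seq_id)
        self._reuse_ok[seq_id] = True

    def release_oow_block(self, block_id, seq_id):
        # The block goes out-of-window: it remains recorded as owned by seq_id
        # but is now reclaimable (conceptually, handed to the eviction policy).
        self._owner[block_id] = seq_id

    def acquire_free_block(self, block_id, new_seq_id):
        # Reclaiming a block still owned by a live sequence invalidates that
        # sequence for store-for-reuse (mirroring WindowBlockManager::getFreeBlock).
        prev = self._owner.get(block_id)
        if prev is not None and prev != new_seq_id and prev in self._live:
            self._reuse_ok[prev] = False
        self._owner[block_id] = new_seq_id

    def remove_sequence(self, seq_id):
        # Store for reuse only if none of the sequence's blocks were reclaimed.
        self._live.discard(seq_id)
        return self._reuse_ok.pop(seq_id, False)
```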
Fanrong Li
1e0fbb776d
[TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention (#8301)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-10-13 05:54:48 -07:00
xiweny
5ce9719759
[https://nvbugs/5503138] [fix] Remove compile warnings (#8167)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-13 13:24:23 +08:00
Zhenhuan Chen
84d2f12818
[TRTLLM-6748][feat] add PDL support for more kernels (#7977)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-10-11 08:32:05 +08:00
Chuang Zhu
85f157f389
[None][fix] Add Lock to protect mReqeustToSession (#8085)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
2025-10-10 21:51:50 +08:00
Jonas Li
76a47c7bef
[None][fix] Enable FP8 ContextMLA on GB300 (#8080)
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
2025-10-10 10:20:46 +08:00
dongfengy
9f2a3ae88c
[None][fix] Restrict tinygemm use to certain SMs (#8182)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
2025-10-08 17:55:57 -07:00
xiweny
9298f1bdcc
[None] [test] Add B300 cases to CI (#8056)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-06 19:23:31 -07:00
Faraz
27a5091fcb
[None][feat] GPT-OSS Sm120/Sm121 Support (#7937)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Vincent Huang <vincenth@nvidia.com>
2025-10-06 16:59:06 -04:00
Jonas Yang CN
88ea2c4ee9
[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-04 08:12:24 +08:00
Robin Kobus
e2f69c5c23
[None] [refactor] Minor cleanup and improvements (#7619)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-03 11:40:06 +02:00
Nikita Korobov
9b3d7cc3e6
[None][feat] Update TRT-LLM Gen MoE kernels (#7970)
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-10-03 09:22:45 +08:00
Yilin Fan
01423ac183
[None][feat] perf_metrics endpoint functionality improvement (#8005)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-02 17:43:25 -07:00
Patrice Castonguay
fefa7d8fa3
[None][feat] Support for cancelling requests with disaggregation (#8114)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-10-02 11:04:26 -07:00
dongfengy
6568e565db
[TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss (#7916)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-02 10:47:04 -07:00
yifeizhang-c
34d158b6da
[TRTLLM-6589][feat] Support CUDA graph for DeepEP (#7514)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-10-02 10:13:24 -07:00
bhsueh_NV
38d6e4e60b
[None][feat] Support Qwen3 next (#7892)
Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-29 21:16:07 +08:00
xiweny
48e779ae8c
[https://nvbugs/5541494] [fix] add back missing sm100f bmm kernels (#8051)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-29 05:35:44 -04:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00