Commit Graph

836 Commits

Author SHA1 Message Date
Kanghwan
41e5870a70
[#8476][chore] Update license (#8807)
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
2025-11-19 15:05:25 -08:00
Bo Li
d8b05894ee
[None][perf] Adjust select_alltoall_method_type. (#8950)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-19 07:43:55 -08:00
CarstyYou
ee941ac779
[https://nvbugs/5456493][feat] add fp8 dense for sm120 (#9174)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-19 14:40:34 +08:00
ChristinaZ
941a54c66a
[None][feat] Update the indexer topK (#9255)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-19 11:49:00 +08:00
ChristinaZ
fbf6c16cd2
[None][fix] Update the default invalid value for deepseek mode of routing (#9222)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-19 10:14:06 +08:00
Patrice Castonguay
9b0f45298f
[None][feat] Have ability to cancel disagg request if KV cache resource are exhausted (#9155)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-11-18 20:59:17 -05:00
Enwei Zhu
7c4777a571
[TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM (#8880)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-11-18 17:40:12 -08:00
Nikita Korobov
fe569f0594
[None][feat] bias for FP4 TRT-LLM Gen MoE (#9220)
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-11-18 09:59:47 -08:00
Robin Kobus
9913dc25ae
[None][refactor] decoding inputs, part 2 (#5799)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-18 14:38:51 +01:00
Gal Hubara-Agam
5e5300898b
[#8732][feat] Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2025-11-17 20:30:00 -08:00
zackyoray
e3c9a97075
[None][feat] Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
2025-11-17 15:39:55 -08:00
Robin Kobus
df41f220a2
[TRTLLM-8831][feat] Enable early exit with overlap scheduler (#8587)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-11-17 18:07:13 +01:00
Kaiyu Xie
04be5a704e
[None] [fix] Fix missing ActivationType issue (#9171)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-17 10:43:25 +08:00
Anthony Chang
86cfb3ea7e
[None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound (#9025)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-11-17 10:04:29 +08:00
sunnyqgg
7862b15a65
[TRTLLM-8778][feat] Add tree attention support for blackwell arch (#8975)
Signed-off-by: qgai <qgai@nvidia.com>
2025-11-17 09:01:53 +08:00
heyuhhh
f07e9977c6
[None] [feat] Use triton kernels for RocketKV prediction module (#8682)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-11-13 18:51:09 -08:00
Neta Zmora
34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires the ReLU2 (i.e., squared ReLU) MoE activation function.
This PR adds that support and, more generally, an API for setting the activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.

The PR also updates the Auto Deploy MoE backend for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
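ReLU2 (squared ReLU) from the commit above is simply relu(x) squared. A minimal PyTorch sketch of a reference implementation, assuming only standard torch ops (this is not the Cutlass kernel code, and the function name is hypothetical):

```python
import torch

def relu2_reference(x: torch.Tensor) -> torch.Tensor:
    # ReLU2 / squared ReLU: max(x, 0) ** 2
    return torch.square(torch.relu(x))

# Sanity check: negatives clamp to zero, positives are squared.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu2_reference(x))  # tensor([0.0000, 0.0000, 0.0000, 2.2500, 9.0000])
```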
dongxuy04
a370643b26
[None][fix] support topk autotuner input for expert slot per group larger than 32 (#9087)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-11-14 08:37:20 +08:00
Iman Tabrizian
9ef7eb70e0
[None][fix] Fix KV cache manager test warnings (#9103)
2025-11-13 07:23:04 -08:00
Perkz Zheng
22c1748b80
[TRTLLM-8816][feat] add optimized trtllm-gen attention kernels on sm103 (#9081)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-11-13 12:41:07 +08:00
Iman Tabrizian
cdde15b275
[TRTLLM-8540][feat] Add support for disagg in DSv3.2 (#8735)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-11-12 08:21:11 -08:00
Jiagan Cheng
1a56722697
[None][fix] Remove unnecessary attention workspace memory check (#9064)
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-11-12 11:18:50 +08:00
xiweny
50c486367a
[https://nvbugs/5619396][fix] Add sm103 to CutlassFP8RowwiseGemm (#9042)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-11-10 08:12:14 -08:00
ChristinaZ
2e7769d1e8
[None][feat] Add customized topk and related unit tests for DSA (#8882)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-10 03:35:35 -08:00
bhsueh_NV
e8d4a56dd0
[None][fix] fix eagle3 accuracy issue on sm120 (#8944)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-11-10 14:02:03 +08:00
Chang Liu
7081f254cf
[None][perf] Add custom indexer k cache scatter op (#8960)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-07 11:24:26 -08:00
DylanChen-NV
b275635a9a
[https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill (#8910)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-11-06 07:41:21 -08:00
yunruis
51545560da
[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation (#8495)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-11-06 17:39:57 +08:00
Perkz Zheng
222bc911cd
[None][feat] add swapsMmaAb sparseMla kernels (#8913)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-11-05 09:32:34 -08:00
Shiyu Li
eeb56c2848
[None][feat] MNNVLAllreduce Kernel Refactor (#8018)
Signed-off-by: Shiyu Li <timlee0212@outlook.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-11-05 08:49:47 +08:00
shuyixiong
70e4d72ffa
[TRTLLM-8511][feat] Add update_weights and sleep_wakeup support for rl integration (#8302)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
Co-authored-by: Liwei Ma <liweim@nvidia.com>
Co-authored-by: Jonas Yang CN <joyang@nvidia.com>
2025-11-04 10:19:24 -08:00
Bo Li
e4bf29bc66
[None][feat] Integrate MnnvlThroughput into TRTLLM MoE. (#8728)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-04 21:36:29 +08:00
CarstyYou
4296c9553d
[TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 (#8844)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-04 18:10:36 +08:00
Yukun He
2225745782
[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870)
We have seen perf regressions on A100/H100 from using the one-shot kernel instead of NCCL, so it is beneficial to have a solid benchmark of the allreduce op and to analyze the data collected from it.

Implemented a new AllreduceOp heuristic:
- Added a linear-programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code-generation script.

Minor AllreduceOp fixes:
- Fixed an issue in AllreduceOp where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up the dispatching code in AllreduceOp.

This PR will fix the perf gaps reported in:
https://nvbugspro.nvidia.com/bug/5517023

For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
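The commit above adds a LUT-based heuristic for picking an allreduce strategy. A hedged sketch of how a lookup from (world size, message size) to ONESHOT/TWOSHOT/NCCL could work; the enum, table contents, and thresholds below are illustrative assumptions, not the shipped heuristic:

```python
from enum import Enum

class AllReduceStrategy(Enum):
    ONESHOT = "oneshot"
    TWOSHOT = "twoshot"
    NCCL = "nccl"

# Illustrative LUT: for each world size, (max_message_bytes, strategy) entries
# sorted by message size. All thresholds here are made up for the sketch.
_LUT = {
    2: [(1 << 20, AllReduceStrategy.ONESHOT), (float("inf"), AllReduceStrategy.NCCL)],
    4: [(512 << 10, AllReduceStrategy.ONESHOT), (4 << 20, AllReduceStrategy.TWOSHOT),
        (float("inf"), AllReduceStrategy.NCCL)],
    8: [(256 << 10, AllReduceStrategy.ONESHOT), (2 << 20, AllReduceStrategy.TWOSHOT),
        (float("inf"), AllReduceStrategy.NCCL)],
}

def select_strategy(world_size: int, message_bytes: int) -> AllReduceStrategy:
    # Walk the size buckets for this world size; fall back to NCCL if unknown.
    for max_bytes, strategy in _LUT.get(world_size, []):
        if message_bytes <= max_bytes:
            return strategy
    return AllReduceStrategy.NCCL

print(select_strategy(8, 128 << 10))  # AllReduceStrategy.ONESHOT
```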
Zhenhuan Chen
34fbc7052c
[https://nvbugs/5545522][fix] move PREEXIT in UB kernels to fix accuracy issue (#8318)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Matthias Jouanneaux
d0f107e4dd
[TRTLLM-5966][feat] Helix: add full MLA support for Helix (#8104)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-11-04 09:06:58 +08:00
Perkz Zheng
497a07021d
[None][update] optimized sparse mla kernels && fix unspecified cuda launch (#8866)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-11-02 22:26:59 -08:00
qsang-nv
0f42a24f45
[None][feat] Fix attention sink load in xqa (#8836)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-11-03 09:39:45 +08:00
Bo Li
4c5a8f4ec6
[None][fix] Rename: slot_count -> invalid_expert_id (#8783)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-01 21:36:59 +08:00
brb-nv
d798d66976
[TRTLLM-7731][feat] Avoid over-allocation of KV cache for transmission in disagg with CP (#8145)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-31 17:32:39 -07:00
Fanrong Li
f0dc746738
[TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache (#8692)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-31 14:38:31 -07:00
Zhenhuan Chen
603ec03fb1
[https://nvbugs/5575687][fix] fix moe_gemm's preexit position that cause illegal memory access (#8786)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2025-10-31 09:08:23 +08:00
Anthony Chang
f666ad2f6b
[None][feat] Autotuner can iterate through all tactics for test purposes (#8663)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-30 13:11:25 +01:00
ChristinaZ
13cfd70f57
[None][feat] Add unit tests and revision in block_level kernel for invalid input (#8718)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-30 16:42:18 +08:00
Iman Tabrizian
ae6875fe10
[TRTLLM-8976][feat] Move indexer-k-cache to KVCacheManager (#8699)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-10-29 08:04:26 -07:00
dongxuy04
00eaf5f883
[None][feat] add flag for EPLB to force using GDRCopy (#8650)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-29 13:33:26 +08:00
Chang Liu
5f737b8dbe
[None][perf] Use fp8 quant kernel in DS3.2 indexer module (#8701)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-10-29 12:45:09 +08:00
Cheng Hang
15c293a90b
[None][feat] Enable nvfp4 cuda core for sm120 (#8620)
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-10-29 12:39:03 +08:00
Zheng Duan
fea5bfbda7
[None][feat] add detailed KV cache transfer time breakdown (#8521)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-10-29 10:11:09 +08:00
Chuang Zhu
b828b6445b
[https://nvbugs/5612529][fix] Fix transferAgent_test (#8710)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-29 09:14:34 +08:00
dongxuy04
b37a8a9a74
[None][fix] fix EPLB init hang (#8649)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-28 05:22:49 -04:00
Aurelien Chartier
1401a3c09c
[None][feat] Add FP8 rowwise GEMMs for B200 (#8332)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-27 16:33:14 -04:00
Bo Li
9c4432f8a4
[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. (#7499)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-27 13:23:06 -04:00
nvxuanyuc
d1398c05e6
[None][feat] Support ignored prompt length for penalties via new sampling config parameter (#8127)
Signed-off-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-10-27 13:12:31 -04:00
Jinyang Yuan
0a0f93d4a8
[None][fix] Fix the performance issue of FP8 blockwise grouped GEMM when using attention DP (#8501)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-10-27 10:18:19 +08:00
Chang Liu
e47c787dd7
[TRTLLM-8535][feat] Support DeepSeek V3.2 with FP8 + BF16 KV cache/NVFP4 + BF16 KV cache (#8405)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-24 13:40:41 -04:00
Chuang Zhu
2420918e5b
[TRTLLM-7078][chore] optimal kvcache transfer for VWSA (#7952)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-24 08:58:16 -04:00
Aurelien Chartier
32e1ad68e1
[None][chore] Cleanup GDS code (#8475)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-23 12:36:31 -07:00
Shijie
928247a3f9
[https://nvbugs/5451205][feat] Add cuBLASLt NVFP4 GEMM backend support (#7943)
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
2025-10-23 15:55:10 +08:00
dongxuy04
a7c2c8c212
[None][fix] Allow multi-threaded copy for GDRCopy wrapper (#8535)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-23 10:25:04 +08:00
Anthony Chang
8a3b870e09
[None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN (#8156)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-23 09:14:18 +08:00
Anish Shanbhag
15de45d782
[TRTLLM-8682][chore] Remove auto_parallel module (#8329)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2025-10-22 20:53:08 -04:00
dongxuy04
df689f8fed
[None][fix] Fix EPLB CPU thread NUMA binding (#8579)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-10-22 10:52:09 -04:00
Patrice Castonguay
879039f6d5
[https://nvbugs/5429636][feat] Kv transfer timeout (#8459)
Signed-off-by: raayandhar <raayan.dhar@gmail.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <raayan.dhar@gmail.com>
2025-10-22 09:29:02 -04:00
qsang-nv
07edac2818
[None][feat] Add vLLM KV Pool support for XQA mla kernel (#8560)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-10-22 14:12:57 +08:00
Shi Xiaowei
a0024f4d34
[None][doc] Facilitates the integration of the transfer agent (#7867)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-10-21 20:06:24 +08:00
Yuxian Qiu
ec32711b1e
[https://nvbugs/5542862][fix] Upgrade fmha_v2. (#8364)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-20 10:20:23 +08:00
ChristinaZ
c8b9998acb
[TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (#7761)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-20 10:08:31 +08:00
Yueh-Ting (eop) Chen
128a351bdc
[None][fix] Avoid overwrite of kv_cache_config.max_tokens for VSWA scheme for the KVCacheManager (#8219)
For the VSWA scheme, we do not want `kv_cache_config.max_tokens` to control
and cap the maximum memory of a block pool, because block pool sizes are not
identical across different window sizes. This MR ignores
`kv_cache_config.max_tokens` in `kvCacheManager.cpp` so that block pool sizes
are determined by the window-size share ratio and the total GPU memory
analyzed and fed to the KV cache manager.

The skip applies only to the VSWA scheme; no extra test coverage was added.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-20 10:48:40 +09:00
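To make the sizing rule in the commit above concrete, here is a hedged sketch of splitting a KV cache memory budget across attention-window sizes by a share ratio instead of capping each pool with `max_tokens`; the function, the proportional-to-window-size ratio, and all numbers are illustrative assumptions, not the `kvCacheManager.cpp` logic:

```python
def pool_blocks_by_window_share(free_memory_bytes: int,
                                window_sizes: list[int],
                                bytes_per_block: int) -> dict[int, int]:
    # Assumed share ratio: each window gets memory proportional to its size,
    # with no per-pool max_tokens cap (mirroring the VSWA skip described above).
    total = sum(window_sizes)
    return {w: int(free_memory_bytes * w / total) // bytes_per_block
            for w in window_sizes}

# Example: a 1 GiB budget split across 4K and 32K windows with 2 MiB blocks.
print(pool_blocks_by_window_share(1 << 30, [4096, 32768], 2 << 20))
# {4096: 56, 32768: 455}
```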
Bo Deng
dd25595ae8
[TRTLLM-7964][infra] Set nixl to default cache transceiver backend (#7926)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-10-19 19:24:43 +08:00
jthomson04
852316886e
[None][fix] Fix KV event consumption (#6346)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-10-18 15:41:26 -07:00
Wanli Jiang
56f697be2e
[None][feat] Add fmha_v2 kernel for head_dim=80 and sm=100 to support VLM (#8392)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-10-17 19:42:47 +08:00
Perkz Zheng
0722717ec0
[None][fix] trtllm-gen regression in PR 8301 (#8426)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-10-17 03:21:31 -07:00
Iman Tabrizian
22eb1633ae
[None][bug] Set NCCL_GRAPH_REGISTER to false to avoid hang (#8413)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-10-16 18:59:18 +02:00
Patrice Castonguay
b7602f7bd4
[https://nvbugs/5534837][fix] Fix KV cache split on long context (#8247)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Min Yu
0a0159fdd8
[https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend (#7286)
Signed-off-by: Min Yu <171526537+yumin066@users.noreply.github.com>
2025-10-16 11:07:48 +08:00
ChristinaZ
db1c271bc6
[None][feat] Revise the calculation related to TileN in routing of MOE TRTLLM backend (#8148)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-16 09:15:46 +08:00
Chuang Zhu
40d129a415
[None][fix] Fix cache buffer size for window (#8320)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-16 09:01:11 +08:00
Fanrong Li
0d20a8fd61
[TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support (#8086)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
Co-authored-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-10-14 08:23:16 -07:00
Aurelien Chartier
7291cdc422
[https://nvbugs/5404000][fix] Ensure consistency between firstTokenTime and lastTokenTime (#8294)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-14 08:15:08 -04:00
Chuang Zhu
8733e830fc
[None][fix] Add lock for request_to_session in sendReadySingal (#8310)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-14 04:32:37 -07:00
Yueh-Ting (eop) Chen
4882815fa1
[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922)
This MR is a continuation of #6768. In the previous merge request,
OOW (out-of-window) blocks were only detached when reuse was not enabled;
that is, block movement behavior was identical between SWA and full
attention when reuse was enabled.

This merge request enables OOW block detach when reuse is
enabled. The required changes are:

- Let the KV cache manager keep track of which block is used by which
  sequence
- Remove the restriction so that the eviction policy can release a
  non-leaf block

Along with this development, bugs in freeChildren and in the offload
mechanism under getFreeBlock are fixed, since they affect the
functionality this merge request enables.

When a block goes OOW, it is released from its sequence and handed to the
eviction policy, where it becomes available for another sequence to
reclaim. At the same time, we may still want to store the sequence for
reuse. To make this safe, block ownership is recorded in
WindowBlockManager::getFreeBlock: if an acquired block was originally
owned by another sequence that is still live inside the manager, that
sequence is invalidated for store-for-reuse.

At the end of a sequence (when removeSequence is called on it),
the KV cache manager checks whether any of the sequence's blocks have
been reclaimed by another sequence. If none have, the sequence is safe to
store for reuse, and the store-for-reuse action is performed.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-13 09:18:12 -07:00
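The commit above hinges on recording which sequence owns each block when blocks are handed out and invalidating store-for-reuse when a live sequence's block is reclaimed. A hedged Python sketch of that bookkeeping, assuming illustrative class and method names (the real logic lives in the C++ WindowBlockManager and eviction policy):

```python
class BlockOwnershipTracker:
    """Illustrative bookkeeping for OOW block detach with reuse enabled."""

    def __init__(self):
        self._owner = {}      # block_id -> sequence_id that last held the block
        self._reuse_ok = {}   # sequence_id -> still eligible for store-for-reuse
        self._live = set()    # sequences currently active in the manager

    def add_sequence(self, seq_id):
        self._live.add(seq_id)
        self._reuse_ok[seq_id] = True

    def release_oow_block(self, block_id, seq_id):
        # The block goes out-of-window: it remains recorded as owned by seq_id
        # but is now reclaimable (conceptually, handed to the eviction policy).
        self._owner[block_id] = seq_id

    def acquire_free_block(self, block_id, new_seq_id):
        # Reclaiming a block still owned by a live sequence invalidates that
        # sequence for store-for-reuse (mirroring WindowBlockManager::getFreeBlock).
        prev = self._owner.get(block_id)
        if prev is not None and prev != new_seq_id and prev in self._live:
            self._reuse_ok[prev] = False
        self._owner[block_id] = new_seq_id

    def remove_sequence(self, seq_id):
        # Store for reuse only if none of the sequence's blocks were reclaimed.
        self._live.discard(seq_id)
        return self._reuse_ok.pop(seq_id, False)
```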
Fanrong Li
1e0fbb776d
[TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention (#8301)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-10-13 05:54:48 -07:00
xiweny
5ce9719759
[https://nvbugs/5503138] [fix] Remove compile warnings (#8167)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-13 13:24:23 +08:00
Zhenhuan Chen
84d2f12818
[TRTLLM-6748][feat] add PDL support for more kernels (#7977)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-10-11 08:32:05 +08:00
Chuang Zhu
85f157f389
[None][fix] Add Lock to protect mReqeustToSession (#8085)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
2025-10-10 21:51:50 +08:00
Jonas Li
76a47c7bef
[None][fix] Enable FP8 ContextMLA on GB300 (#8080)
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
2025-10-10 10:20:46 +08:00
dongfengy
9f2a3ae88c
[None][fix] Restrict tinygemm use to certain SMs (#8182)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
2025-10-08 17:55:57 -07:00
xiweny
9298f1bdcc
[None] [test] Add B300 cases to CI (#8056)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-06 19:23:31 -07:00
Faraz
27a5091fcb
[None][feat] GPT-OSS Sm120/Sm121 Support (#7937)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Vincent Huang <vincenth@nvidia.com>
2025-10-06 16:59:06 -04:00
Jonas Yang CN
88ea2c4ee9
[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-04 08:12:24 +08:00
Robin Kobus
e2f69c5c23
[None] [refactor] Minor cleanup and improvements (#7619)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-03 11:40:06 +02:00
Nikita Korobov
9b3d7cc3e6
[None][feat] Update TRT-LLM Gen MoE kernels (#7970)
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-10-03 09:22:45 +08:00
Yilin Fan
01423ac183
[None][feat] perf_metrics endpoint functionality improvement (#8005)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-02 17:43:25 -07:00
Patrice Castonguay
fefa7d8fa3
[None][feat] Support for cancelling requests with disaggregation (#8114)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-10-02 11:04:26 -07:00
dongfengy
6568e565db
[TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss (#7916)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-02 10:47:04 -07:00
yifeizhang-c
34d158b6da
[TRTLLM-6589][feat] Support CUDA graph for DeepEP (#7514)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-10-02 10:13:24 -07:00
bhsueh_NV
38d6e4e60b
[None][feat] Support Qwen3 next (#7892)
Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-29 21:16:07 +08:00
xiweny
48e779ae8c
[https://nvbugs/5541494] [fix] add back missing sm100f bmm kernels (#8051)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-29 05:35:44 -04:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00