Commit Graph

198 Commits

Author SHA1 Message Date
Bo Deng
ff72ca90de
Improve TransferAgentTest.SyncMessage (#6250)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-24 23:41:36 +08:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models (#5444)
Only supports qwen in this PR
2025-07-21 16:11:58 -07:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
Tomer Shmilovich
0552a02943
BlockManager copy constructor fix (#5982)
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
2025-07-16 17:33:17 +08:00
Bo Deng
ec3ebae43e
[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 (#5991)
Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-16 13:54:42 +08:00
Zheng Duan
38db4bc7fb
feat: use session abstraction in data transceiver and cache formatter (#5611)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-16 13:52:44 +08:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots (#3502)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
ChristinaZ
c5fb692a7d
Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
narutolhy
41ef1ade19
feat:enable kvcache to be reused during request generation (#4028)
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell (#5563)
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
ChristinaZ
12d8c7d129
Refactor the topk parallelization part for the routing kernels (#5567)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-07 15:53:25 +08:00
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch (#5535)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
jthomson04
1b588f8390
feat: KV events for sliding window attention (#5580)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-07-05 06:05:20 +08:00
Chuang Zhu
ffc0b8f5da
Cache transceiver support VSWA (#5505)
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-05 01:18:42 +09:00
Robin Kobus
07f9cf1519
fix: Improve chunking test and skip empty kernel calls (#5710)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-04 09:08:15 +02:00
Netanel Haber
134b2383ff
[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window (#5720)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-04 08:16:25 +02:00
Robin Kobus
4cd8543d8c
[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest (#5489)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-02 10:13:31 +02:00
Robin Kobus
d68fa728d8
refactor: Clean up DecodingInput and DecodingOutput (#5617)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 14:31:42 +02:00
Robin Kobus
5f77d212ef
test: Reduce number of C++ test cases (#5437)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 09:40:49 +02:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup (#5093)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Cheng Hang
64db7d27f6
[feat] Optimizations on weight-only batched gemv kernel (#5420)
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-06-30 10:20:16 +08:00
Enwei Zhu
b4dab23e7b
[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP (#5435)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-30 01:02:07 +08:00
Daniel Stokes
5773cfdcf2
feat: Add support for per expert activation scaling factors (#5013)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-28 09:10:35 +12:00
Darragh Hanley
5437075def
ReDrafter support for Qwen (#4875)
Signed-off-by: darraghdog <darragh.hanley@gmail.com>
Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
2025-06-28 02:33:10 +08:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
ChristinaZ
a608b00d38
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-27 20:17:40 +08:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
dongxuy04
490d2e5819
feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation (#5222)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
ChristinaZ
d135f5993d
Add unit test for routing kernels (#5405)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-26 09:49:11 +08:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) (#4651)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams (#5165)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
dongxuy04
4f0f17ac8a
feat: Misc Opt for large scale EP (#5374)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Kaiyu Xie
113f6fbadd
Fix: missing clientId when serialize and deserialize response (#5231)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 23:05:11 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management (#4450)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Robin Kobus
dc3861b4aa
refactor: Unify decoder test with e2e worklfow (#5239)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-17 12:04:58 +02:00
Enwei Zhu
4b82b8b4c7
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-17 15:23:24 +08:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface (#5129)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
yunruis
30c5b4183a
refactoring: port customized kernels with public cutlass version (#5027)
Signed-off-by: yunruis 

Merge this to unblock others since the full CI has been run through
2025-06-13 16:19:31 +08:00
zhhuang-nv
a891013e3c
[feat] Optimize KV Cache Reuse for MLA (#4869)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-06-13 11:03:05 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
Zheng Duan
ee44fa00f8
chore: rename IOFormatter to BaseCacheFormatter (#5068)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-12 10:50:14 +08:00
Tracin
6c91f1c7ac
Mxfp8xmxfp4 quant mode(#4978)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Chuang Zhu
9a874760c1
Kv cache transfer support duplicate heads (#4929)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-09 14:11:19 +08:00
Daniel Stokes
3a4851b7c3
feat: Add Mixture of Experts FP8xMXFP4 support (#4750)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-09 13:25:04 +08:00