Zheng Duan
|
ebdc43e69d
|
[None][feat] move kv cache measure into transfer session (#6633)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
|
2025-08-08 17:49:22 +08:00 |
|
Daniel Cámpora
|
efca359b66
|
[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
|
2025-08-07 22:19:37 -04:00 |
|
pcastonguay
|
453a06e6ab
|
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
|
2025-08-07 14:17:07 +02:00 |
|
amitz-nv
|
85af62184b
|
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
|
2025-08-07 09:05:36 +03:00 |
|
Chuang Zhu
|
ee471df07c
|
[None][chore] optimize kv cache transfer for context TEP and gen DEP (#6657)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
|
2025-08-07 11:36:05 +08:00 |
|
ixlmar
|
1ebceb790d
|
[TRTLLM-5508][feat] check input tokens + improve error handling (#5170)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
|
2025-08-05 18:27:43 +01:00 |
|
Chuang Zhu
|
4d040b50b7
|
[None][chore] ucx establish connection with zmq (#6090)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
|
2025-08-05 02:50:45 -04:00 |
|
Chuang Zhu
|
542f552d0b
|
use cudaSetDevice to create context, fix nvbug 5394497 (#6403)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
|
2025-08-03 13:32:55 -04:00 |
|
Robin Kobus
|
918fedf952
|
[None][refactor] Simplify finish reasons handling in DecoderState (#6524)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-08-02 07:17:43 +02:00 |
|
Jaedeok Kim
|
fbee279909
|
fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
|
2025-07-31 22:34:34 -04:00 |
|
Enwei Zhu
|
4b299cb77e
|
feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
|
2025-07-31 09:53:52 +08:00 |
|
pcastonguay
|
e7ae5e2824
|
feat: Add support for disaggregation with pp with pytorch backend (#6369)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
2025-07-30 09:42:13 -04:00 |
|
Zheng Duan
|
c9ed1ab436
|
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure (#6135)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
|
2025-07-30 10:39:40 +08:00 |
|
Michal Guzek
|
08d57123f9
|
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
Signed-off-by: moraxu <mguzek@nvidia.com>
|
2025-07-25 18:10:40 -04:00 |
|
Chang Liu
|
7381f1dba7
|
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models (#5444)
Only supports qwen in this PR
|
2025-07-21 16:11:58 -07:00 |
|
amitz-nv
|
98428f330e
|
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
|
2025-07-20 08:00:14 +03:00 |
|
Bo Deng
|
0388ff9083
|
[https://nvbugs/5393961][fix] record kv-cache size in MLACacheFormatter (#6181)
Signed-off-by: Bo Deng <deemod@nvidia.com>
|
2025-07-19 05:06:45 +08:00 |
|
Stefan Niebler
|
6d7874a467
|
[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
|
2025-07-19 01:40:46 +08:00 |
|
Robin Kobus
|
ec2b953e7e
|
refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-07-18 12:12:08 +02:00 |
|
Iman Tabrizian
|
b75e53ab69
|
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
|
2025-07-18 10:12:54 +08:00 |
|
Linda
|
5bff317abf
|
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
|
2025-07-17 22:42:52 +08:00 |
|
Chuang Zhu
|
44c70c88f9
|
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
|
2025-07-17 17:42:07 +08:00 |
|
Zheng Duan
|
38db4bc7fb
|
feat: use session abstraction in data transceiver and cache formatter (#5611)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
|
2025-07-16 13:52:44 +08:00 |
|
Robin Kobus
|
6d4b045d1f
|
refactor: Remove enforced sorted order of batch slots (#3502)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-07-14 17:23:02 +02:00 |
|
narutolhy
|
41ef1ade19
|
feat:enable kvcache to be reused during request generation (#4028)
Signed-off-by: narutolhy <582909902@qq.com>
|
2025-07-10 22:18:01 +09:00 |
|
xiweny
|
eaf8bec88b
|
fix: Disaggregate serving with attention DP (#4993)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
|
2025-07-08 16:15:03 +08:00 |
|
Robin Kobus
|
ae27261094
|
refactor: decoding inputs (#5679)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-07-06 08:21:02 +02:00 |
|
jthomson04
|
1b588f8390
|
feat: KV events for sliding window attention (#5580)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
|
2025-07-05 06:05:20 +08:00 |
|
Chuang Zhu
|
ffc0b8f5da
|
Cache transceiver support VSWA (#5505)
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
|
2025-07-05 01:18:42 +09:00 |
|
Robin Kobus
|
07f9cf1519
|
fix: Improve chunking test and skip empty kernel calls (#5710)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-07-04 09:08:15 +02:00 |
|
Netanel Haber
|
134b2383ff
|
[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window (#5720)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
|
2025-07-04 08:16:25 +02:00 |
|
Robin Kobus
|
4cd8543d8c
|
[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest (#5489)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-07-02 10:13:31 +02:00 |
|
Robin Kobus
|
9bdc5951f8
|
refactor: decoder state setup (#5093)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-30 11:09:43 +02:00 |
|
Robin Kobus
|
a8141a4513
|
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-27 17:41:48 +02:00 |
|
Aurelien Chartier
|
833c0dea4a
|
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
|
2025-06-27 17:03:05 +02:00 |
|
Robin Kobus
|
8dfa31c71d
|
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-26 19:45:52 +08:00 |
|
Enwei Zhu
|
fc7a81ceb0
|
test: Add LLGuidance test and refine guided decoding (#5348)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
|
2025-06-25 14:12:56 +08:00 |
|
Robin Kobus
|
e2a8cbc80b
|
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-24 09:15:59 +02:00 |
|
Robin Kobus
|
b3045c44b9
|
refactor: remove TrtGptModelOptionalParams (#5165)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-20 10:31:40 +02:00 |
|
Robin Kobus
|
627062c265
|
refactor: Update decoder buffer and logits management (#4450)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-18 08:10:32 +08:00 |
|
QI JUN
|
f899c4d294
|
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
|
2025-06-17 21:28:09 +08:00 |
|
Robin Kobus
|
dc3861b4aa
|
refactor: Unify decoder test with e2e workflow (#5239)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-17 12:04:58 +02:00 |
|
Robin Kobus
|
b6ca677741
|
refactor: remove decoder request from decoder interface (#5129)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-16 09:12:30 +02:00 |
|
Robin Kobus
|
dda64166cd
|
refactor: Scheduling based on KV cache state (#4865)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-16 08:14:58 +02:00 |
|
Robin Kobus
|
443b2eb51f
|
refactor: Speculative decoding buffers (#5091)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
|
2025-06-14 11:39:32 +02:00 |
|
liji-nv
|
10ab9791ec
|
[fix] Do not reuse dummy request KVCache (#4804)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
|
2025-06-12 15:24:50 +08:00 |
|
Netanel Haber
|
e692779ead
|
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2025-06-12 12:12:46 +08:00 |
|
Zheng Duan
|
ee44fa00f8
|
chore: rename IOFormatter to BaseCacheFormatter (#5068)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
|
2025-06-12 10:50:14 +08:00 |
|
Aurelien Chartier
|
dcf72c6ad3
|
chore: cleanup GDS Cmake interface (#4928)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
|
2025-06-10 17:25:43 +08:00 |
|
pcastonguay
|
87c56ab024
|
perf: Removing initializing ptuning buffers to zero (#4915)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
|
2025-06-09 21:57:21 -04:00 |
|