Zero Zeng
953f4fd69e
[None][fix] acceptance rate calculation fix in benchmark_serving ( #6746 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-19 17:29:36 +08:00
yifeizhang-c
4127d77678
[https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 ( #6537 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-08-15 09:52:06 -07:00
Robin Kobus
45c7518032
[None][refactor] Simplify decoder state initialization ( #6559 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:44:41 +02:00
bhsueh_NV
83dbc6c75d
[TRTLLM-5532][feat] store the block of context request into kv cache ( #6683 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-11 16:14:52 +08:00
Chuang Zhu
c566a8d2a2
[None][fix] fix same pp disagg ( #6730 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-10 22:45:15 -04:00
Yueh-Ting (eop) Chen
199f306984
[None][chore][kv cache manager] Dead code elimination, we no longer record/fetch through WindowBlockManager::mContextBlocksByHash ( #6249 )
...
No functional change is intended in this MR.
`WindowBlockManager::mCachedBlocksRoot` is now responsible
for the bookkeeping of the `KVCacheBlock`, and `mNextBlocks` is
now the actual hash map that fetches the block.
The `mEnableHashKey` knob and the related hashing are removed.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-08-10 09:10:10 -04:00
Ziyi Xiong
de472828b9
[TRTLLM-6637][feat] Resolve KV cache divergence issue ( #6628 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-08-09 23:15:04 +08:00
Chuang Zhu
e251f7c00b
[None][fix] revert kvcache transfer ( #6709 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-08 07:18:53 -04:00
Zheng Duan
ebdc43e69d
[None][feat] move kv cache measure into transfer session ( #6633 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-08-08 17:49:22 +08:00
Daniel Cámpora
efca359b66
[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default ( #6216 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-07 22:19:37 -04:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events ( #6563 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter ( #6510 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
Chuang Zhu
ee471df07c
[None][chore] optimize kv cache transfer for context TEP and gen DEP ( #6657 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-07 11:36:05 +08:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling ( #5170 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
Chuang Zhu
4d040b50b7
[None][chore] ucx establish connection with zmq ( #6090 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-05 02:50:45 -04:00
Chuang Zhu
542f552d0b
use cudaSetDevice to create context, fix nvbug 5394497 ( #6403 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-03 13:32:55 -04:00
Robin Kobus
918fedf952
[None][refactor] Simplify finish reasons handling in DecoderState ( #6524 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-02 07:17:43 +02:00
Jaedeok Kim
fbee279909
fix: remove duplicate layer multiplication in KV cache size calculation ( #6481 )
...
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-31 22:34:34 -04:00
Enwei Zhu
4b299cb77e
feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 ( #6408 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-31 09:53:52 +08:00
pcastonguay
e7ae5e2824
feat: Add support for disaggregation with pp with pytorch backend ( #6369 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-07-30 09:42:13 -04:00
Zheng Duan
c9ed1ab436
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure ( #6135 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-30 10:39:40 +08:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache ( #5974 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models ( #5444 )
...
Only supports qwen in this PR
2025-07-21 16:11:58 -07:00
amitz-nv
98428f330e
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction ( #5616 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-20 08:00:14 +03:00
Bo Deng
0388ff9083
[https://nvbugs/5393961][fix] record kv-cache size in MLACacheFormatter ( #6181 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-19 05:06:45 +08:00
Stefan Niebler
6d7874a467
[nvbugs/5369799] fix: Update disaggregation handling in sampler ( #5762 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:40:46 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager ( #6055 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Chuang Zhu
44c70c88f9
chore: [BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
Zheng Duan
38db4bc7fb
feat: use session abstraction in data transceiver and cache formatter ( #5611 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-16 13:52:44 +08:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
narutolhy
41ef1ade19
feat: enable kvcache to be reused during request generation ( #4028 )
...
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
xiweny
eaf8bec88b
fix: Disaggregate serving with attention DP ( #4993 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-07-08 16:15:03 +08:00
Robin Kobus
ae27261094
refactor: decoding inputs ( #5679 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-06 08:21:02 +02:00
jthomson04
1b588f8390
feat: KV events for sliding window attention ( #5580 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-07-05 06:05:20 +08:00
Chuang Zhu
ffc0b8f5da
Cache transceiver support VSWA ( #5505 )
...
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-05 01:18:42 +09:00
Robin Kobus
07f9cf1519
fix: Improve chunking test and skip empty kernel calls ( #5710 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-04 09:08:15 +02:00
Netanel Haber
134b2383ff
[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window ( #5720 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-04 08:16:25 +02:00
Robin Kobus
4cd8543d8c
[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest ( #5489 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-02 10:13:31 +02:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup ( #5093 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 ( #5316 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding ( #5348 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state ( #5315 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams ( #5165 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management ( #4450 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Robin Kobus
dc3861b4aa
refactor: Unify decoder test with e2e workflow ( #5239 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-17 12:04:58 +02:00