Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory ( #5034 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Robin Kobus
918fedf952
[None][refactor] Simplify finish reasons handling in DecoderState ( #6524 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-02 07:17:43 +02:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler ( #6484 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
Yuan Tong
413a83ff80
fix: compatibility with CUDA < 12.9 on __CUDA_ARCH_SPECIFIC__ macro ( #5917 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-28 16:02:26 +08:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache ( #5974 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models ( #5444 )
...
Only supports qwen in this PR
2025-07-21 16:11:58 -07:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support ( #6133 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager ( #6055 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations ( #5868 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
Tomer Shmilovich
0552a02943
BlockManager copy constructor fix ( #5982 )
...
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
2025-07-16 17:33:17 +08:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
narutolhy
41ef1ade19
feat:enable kvcache to be reused during request generation ( #4028 )
...
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat]Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Initial CI run failed a single step A30-CPP-3 due to timeout. Rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
Robin Kobus
ae27261094
refactor: decoding inputs ( #5679 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-06 08:21:02 +02:00
jthomson04
1b588f8390
feat: KV events for sliding window attention ( #5580 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-07-05 06:05:20 +08:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation ( #5476 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Robin Kobus
1a3bd140ed
chore: Remove unused isFullContextRequest method ( #5666 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-03 15:08:09 +02:00
Robin Kobus
d68fa728d8
refactor: Clean up DecodingInput and DecodingOutput ( #5617 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 14:31:42 +02:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup ( #5093 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 ( #5316 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation ( #5222 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state ( #5315 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams ( #5165 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend ( #5214 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management ( #4450 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Robin Kobus
dc3861b4aa
refactor: Unify decoder test with e2e workflow ( #5239 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-17 12:04:58 +02:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface ( #5129 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state ( #4865 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Robin Kobus
443b2eb51f
refactor: Speculative decoding buffers ( #5091 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-14 11:39:32 +02:00
yunruis
e5be3a95b3
fix: fix license bug ( #5200 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-06-13 18:58:15 +08:00
yunruis
30c5b4183a
refactoring: port customized kernels with public cutlass version ( #5027 )
...
Signed-off-by: yunruis
Merge this to unblock others since the full CI has been run through
2025-06-13 16:19:31 +08:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache ( #4804 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA ( #4667 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
Tracin
6c91f1c7ac
Mxfp8xmxfp4 quant mode ( #4978 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) ( #4145 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
liji-nv
1d4f748773
[fix] Fix illegal mem access and possible accuracy loss. Cherry-pick … ( #5017 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-09 17:50:57 +08:00
Zheng Duan
dd2191c5b3
fix: correct the order of llm request state ( #4781 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 14:45:13 +08:00
Zheng Duan
ded694b1aa
feat: cache reuse support (selective cache transfer) in mla cache formatter ( #4749 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 09:56:31 +08:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched ( #4700 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling ( #4532 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
Dom Brown
338d6e9f95
[nvbug 5305210] fix: Resolve nvbug 5305210 ( #4759 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-31 19:21:06 +08:00
Thor Johnsen
55d56f8155
[JIRA-5226219][fix] Fix Bug in KV cache manager ( #4596 )
...
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
2025-05-29 22:03:20 -07:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE ( #4303 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00