Xiwen Yu
1b846046dd
fix kernel select code to recognize sm103/sm100f
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-06 14:24:43 +08:00
Xiwen Yu
5c09dc8304
CUDA13 breaking changes: c++ compile successful
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-06 14:24:40 +08:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling ( #5170 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow ( #5997 )
...
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory ( #5034 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Robin Kobus
918fedf952
[None][refactor] Simplify finish reasons handling in DecoderState ( #6524 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-02 07:17:43 +02:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler ( #6484 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
Yuan Tong
413a83ff80
fix: compatibility with CUDA < 12.9 on __CUDA_ARCH_SPECIFIC__ macro ( #5917 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-28 16:02:26 +08:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache ( #5974 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models ( #5444 )
...
Only supports qwen in this PR
2025-07-21 16:11:58 -07:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support ( #6133 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager ( #6055 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations ( #5868 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
Tomer Shmilovich
0552a02943
BlockManager copy constructor fix ( #5982 )
...
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
2025-07-16 17:33:17 +08:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
narutolhy
41ef1ade19
feat:enable kvcache to be reused during request generation ( #4028 )
...
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat]Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Initial CI run failed a single step A30-CPP-3 due to timeout. Rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
Robin Kobus
ae27261094
refactor: decoding inputs ( #5679 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-06 08:21:02 +02:00
jthomson04
1b588f8390
feat: KV events for sliding window attention ( #5580 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-07-05 06:05:20 +08:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation ( #5476 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Robin Kobus
1a3bd140ed
chore: Remove unused isFullContextRequest method ( #5666 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-03 15:08:09 +02:00
Robin Kobus
d68fa728d8
refactor: Clean up DecodingInput and DecodingOutput ( #5617 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 14:31:42 +02:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup ( #5093 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 ( #5316 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation ( #5222 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state ( #5315 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams ( #5165 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend ( #5214 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management ( #4450 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Robin Kobus
dc3861b4aa
refactor: Unify decoder test with e2e worklfow ( #5239 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-17 12:04:58 +02:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface ( #5129 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state ( #4865 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Robin Kobus
443b2eb51f
refactor: Speculative decoding buffers ( #5091 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-14 11:39:32 +02:00
yunruis
e5be3a95b3
fix: fix license bug ( #5200 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-06-13 18:58:15 +08:00
yunruis
30c5b4183a
refactoring: port customized kernels with public cutlass version ( #5027 )
...
Signed-off-by: yunruis
Merge this to unblock others since the full CI has been run through
2025-06-13 16:19:31 +08:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache ( #4804 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA ( #4667 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
Tracin
6c91f1c7ac
Mxfp8xmxfp4 quant mode( #4978 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) ( #4145 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
liji-nv
1d4f748773
[fix] Fix illegal mem access and possible accuracy lose. Cherry-pick … ( #5017 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-09 17:50:57 +08:00
Zheng Duan
dd2191c5b3
fix: correct the order of llm request state ( #4781 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 14:45:13 +08:00
Zheng Duan
ded694b1aa
feat: cache reuse support (selective cache transfer) in mla cache formatter ( #4749 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 09:56:31 +08:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched ( #4700 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00