Jhao-Ting Chen
77082cde38
[ https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op ( #5146 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
qixiang-99
ca7b6ec8d8
Feat/pytorch vswa kvcachemanager ( #5151 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Aurelien Chartier
fa95e402a5
feat: add LLmArgs option to force using dynamic quantization ( #5346 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-01 12:16:09 -07:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp ( #5086 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
Wanli Jiang
3789ba1d37
feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 ( #5364 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Netanel Haber
6ee94c7ac8
Reintroduce with perf fixes: feature: unify new_tokens format sample state to trtllm sampler tokens format ( #5513 )
...
58a8a8f - these changes were previously merged to main.
6aef149 - the changes were then temporarily reverted in main because of a significant perf regression in models using the TorchSampler (observed by @byshiue).
This PR re-merges these changes along with a fix that prevents the regression.
The first commit of this PR is just the revert of that revert; filter it out of the diff to see only the previously unmerged changes.
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-06-30 11:58:59 -07:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup ( #5093 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 ( #5316 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
Daniel Cámpora
73b8a95049
feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler ( #5538 )
2025-06-27 18:40:53 +08:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) ( #5475 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Netanel Haber
6aef14943c
Revert "feature: unify new_tokens format sample state to trtllm sampler new_tokens format ( #4401 )" ( #5474 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-25 20:56:04 -07:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) ( #4651 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Xianjie Qiao
1e4fa13d33
Add sleep function for disagg gen-only benchmarking ( #5398 )
...
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-06-26 07:32:16 +08:00
Mike Iovine
5bc8c894f7
[chore] Disable block reuse when draft model speculation is being used ( #5448 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-26 03:51:20 +08:00
Daniel Cámpora
205c97a4ae
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler ( #5328 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-25 17:41:36 +02:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding ( #5348 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
QI JUN
d93a5e04b5
Chore: remove unused variables ( #5314 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-24 22:27:32 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state ( #5315 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
HuiGao-NV
e16c1bef6e
[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation ( #5343 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 11:43:43 +08:00
Netanel Haber
58a8a8fd37
feature: unify new_tokens format sample state to trtllm sampler new_tokens format ( #4401 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-23 10:38:37 -07:00
Kaiyu Xie
7246fd75d1
feat: Support stream_interval ( #5284 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 21:57:10 +08:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend ( #5214 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management ( #5246 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
QI JUN
855036d8ee
update LlmRequest.is_dummy property ( #5283 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-18 10:52:13 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management ( #4450 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Mike Iovine
9bf69c9fdb
[chore] Remove BaseDraftTokenManager ( #5251 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-17 11:57:52 -04:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
liji-nv
13eef642e6
[feat] Piecewise cuda graph support for MLA ( #4467 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-17 18:58:38 +08:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW ( #4558 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
tomeras91
cea5dd1e38
[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill ( #5128 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-16 16:29:17 +03:00
Yilin Fan
dd29063538
[feat] Add llm args to tune python gc threshold ( #5141 )
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 17:45:22 +08:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface ( #5129 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state ( #4865 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Yan Chunwei
c84e41fd9d
fix: build_config in TorchLlmArgs and avoid arbitrary args ( #4972 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-15 17:51:56 -07:00
Fanrong Li
39bba63758
[TRTLLM-4983] feat: enable overlap scheduler between draft forwards ( #4802 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 23:09:16 +08:00
Fanrong Li
159ffc584e
fix: fix cuda graph max batch size for spec decoding cases. ( #5076 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 14:57:28 +08:00
Yuan Tong
6bce7337a9
perf: avoid dynamic import overhead in is_llm_response with duck typing ( #5110 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-06-15 07:45:02 +08:00
Tailing Yuan
0b60da2c45
feat: large-scale EP(part 7: DeepEP integration) ( #4792 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-14 19:12:38 +08:00
Mike Iovine
25aa3881d7
[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len ( #4879 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-13 11:06:36 -04:00
nv-guomingz
b959618579
refactor [BREAKING CHANGE]:: remove the redundant use_kv_cache field from PytorchConfig ( #5031 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-13 16:34:24 +08:00
Fanrong Li
38a907aaca
[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance ( #5119 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-13 08:58:44 +08:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache ( #4804 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Daniel Cámpora
e46267765f
Fix logprobs issues. ( #5136 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-12 15:07:01 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA ( #4667 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
HuiGao-NV
43192379af
Use backend to replace macro to control enablement of MNNVL all reduce ( #4635 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
Zheng Duan
c592798f64
fix: limit process pool size when prefetching ( #5088 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-12 10:52:52 +08:00
Daniel Cámpora
fdf1c47d1d
[TRTLLM-4995][feat] TRTLLM Sampler log probs support ( #4836 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-11 08:18:13 +02:00