Daniel Cámpora · 205c97a4ae · 2025-06-25 17:41:36 +02:00
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Enwei Zhu · fc7a81ceb0 · 2025-06-25 14:12:56 +08:00
test: Add LLGuidance test and refine guided decoding (#5348)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

QI JUN · d93a5e04b5 · 2025-06-24 22:27:32 +08:00
Chore: remove unused variables (#5314)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

Robin Kobus · e2a8cbc80b · 2025-06-24 09:15:59 +02:00
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

HuiGao-NV · e16c1bef6e · 2025-06-24 11:43:43 +08:00
[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343)
Signed-off-by: Hui Gao <huig@nvidia.com>
Netanel Haber · 58a8a8fd37 · 2025-06-23 10:38:37 -07:00
feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

Kaiyu Xie · 7246fd75d1 · 2025-06-19 21:57:10 +08:00
feat: Support stream_interval (#5284)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

jellysnack · 0623ffe3bc · 2025-06-18 19:33:34 +08:00
feat: Add LLGuidance Support for PyTorch Backend (#5214)
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

Robin Kobus · 38547b92f3 · 2025-06-18 09:55:59 +02:00
refactor: Introduce ResourceManagerType enum for resource management (#5246)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

QI JUN · 855036d8ee · 2025-06-18 10:52:13 +08:00
update LlmRequest.is_dummy property (#5283)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Robin Kobus · 627062c265 · 2025-06-18 08:10:32 +08:00
refactor: Update decoder buffer and logits management (#4450)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Mike Iovine · 9bf69c9fdb · 2025-06-17 11:57:52 -04:00
[chore] Remove BaseDraftTokenManager (#5251)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

QI JUN · f899c4d294 · 2025-06-17 21:28:09 +08:00
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

liji-nv · 13eef642e6 · 2025-06-17 18:58:38 +08:00
[feat] Piecewise cuda graph support for MLA (#4467)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Izzy Putterman · e607768e45 · 2025-06-17 02:26:08 +08:00
Speculation: Draft Target in new FW (#4558)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
tomeras91 · cea5dd1e38 · 2025-06-16 16:29:17 +03:00
[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

Yilin Fan · dd29063538 · 2025-06-16 17:45:22 +08:00
[feat] Add llm args to tune python gc threshold (#5141)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>

Robin Kobus · b6ca677741 · 2025-06-16 09:12:30 +02:00
refactor: remove decoder request from decoder interface (#5129)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Robin Kobus · dda64166cd · 2025-06-16 08:14:58 +02:00
refactor: Scheduling based on KV cache state (#4865)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Yan Chunwei · c84e41fd9d · 2025-06-15 17:51:56 -07:00
fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Fanrong Li · 39bba63758 · 2025-06-15 23:09:16 +08:00
[TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

Fanrong Li · 159ffc584e · 2025-06-15 14:57:28 +08:00
fix: fix cuda graph max batch size for spec decoding cases. (#5076)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

Yuan Tong · 6bce7337a9 · 2025-06-15 07:45:02 +08:00
perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Tailing Yuan · 0b60da2c45 · 2025-06-14 19:12:38 +08:00
feat: large-scale EP(part 7: DeepEP integration) (#4792)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

Mike Iovine · 25aa3881d7 · 2025-06-13 11:06:36 -04:00
[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
nv-guomingz · b959618579 · 2025-06-13 16:34:24 +08:00
refactor [BREAKING CHANGE]:: remove the redundant use_kv_cache field from PytorchConfig (#5031)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

Fanrong Li · 38a907aaca · 2025-06-13 08:58:44 +08:00
[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance (#5119)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

liji-nv · 10ab9791ec · 2025-06-12 15:24:50 +08:00
[fix] Do not reuse dummy request KVCache (#4804)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Daniel Cámpora · e46267765f · 2025-06-12 15:07:01 +08:00
Fix logprobs issues. (#5136)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

Netanel Haber · e692779ead · 2025-06-12 12:12:46 +08:00
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
HuiGao-NV · 43192379af · 2025-06-12 11:22:49 +08:00
Use backend to replace macro to control enablement of MNNVL all reduce (#4635)
Signed-off-by: Hui Gao <huig@nvidia.com>

Zheng Duan · c592798f64 · 2025-06-12 10:52:52 +08:00
fix: limit process pool size when prefetching (#5088)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

Daniel Cámpora · fdf1c47d1d · 2025-06-11 08:18:13 +02:00
[TRTLLM-4995][feat] TRTLLM Sampler log probs support (#4836)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

tomeras91 · f121f13ddf · 2025-06-10 11:09:37 +03:00
[nvbug 5325284][fix] Increase Nemotron-H warmup request robustness (#4954)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

Daniel Cámpora · d68b8180d3 · 2025-06-10 07:28:34 +08:00
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Chang Liu · f70815c945 · 2025-06-10 01:59:56 +08:00
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>

Yuxian Qiu · e79527d195 · 2025-06-09 21:24:16 +08:00
chore: Refine weight prefetching. (#4893)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

Mike Iovine · f4d9c87c51 · 2025-06-09 08:31:35 -04:00
[nvbug/5314469][feat] Include the executor's max batch size in CUDA g… (#4843)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

Yukun He · 137fe35539 · 2025-06-09 19:19:16 +08:00
fix: Fix warmup phase batch size out of range. (#4986)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>

amitz-nv · 77e8d739f1 · 2025-06-09 06:30:01 +03:00
[TRTLLM-4987][feat] Support generation logits in TRTLLMSampler (#4819)
Yechan Kim · 8b4104d34a · 2025-06-09 11:04:04 +08:00
feat: add HyperCLOVAX-SEED-Vision support in refactored way (#4799)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Omer Ullman Argov · 8731f5f14f · 2025-06-08 23:26:26 +08:00
chore: Mass integration of release/0.20 (#4898)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Mike Iovine · ec0d984656 · 2025-06-08 07:40:02 -04:00
[nvbug/5280806][fix] Fix 2 model spec decode flow (#4807)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

dongxuy04 · 1e369658f1 · 2025-06-08 10:25:18 +08:00
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

QI JUN · 5ee0de7f2a · 2025-06-08 04:42:15 +08:00
Resubmit #4894 (#4969)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

nv-guomingz · 0c7dd660d8 · 2025-06-07 04:14:07 +08:00
fix:https://nvbugs/5324248 (#4973)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

Fanrong Li · 75d020cf07 · 2025-06-06 22:21:42 +08:00
fix: fix cuda graph padding for spec decoding (#4853)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

QI JUN · ec50684d80 · 2025-06-06 08:54:45 +08:00
Revert "fix a bug of global cuda graph dummy request" (#4970)

QI JUN · 154f7cc40a · 2025-06-05 19:47:40 +08:00
fix a bug of global cuda graph dummy request (#4894)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

QI JUN · b8c5e3892b · 2025-06-05 17:43:30 +08:00
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
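
A listing like the above can be regenerated from any local checkout with `git log` pretty-format placeholders. This is a minimal sketch: the throwaway repository it builds is only there so the command is runnable anywhere; inside a real clone (e.g. of TensorRT-LLM) you would skip the setup and just run the final `git log`.

```shell
# Setup: create a throwaway repo with one commit so the log command
# below has something to print (skip this inside a real checkout).
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.name='Example Author' -c user.email='author@example.com' \
    commit -q --allow-empty \
    -m 'feat: example commit (#0000)' \
    -m 'Signed-off-by: Example Author <author@example.com>'

# Author · short hash · ISO date, then the full message body with its
# trailers -- the same fields carried by each entry in the listing.
git log --date=iso-strict --pretty=format:'%an · %h · %ad%n%B'
```

`%an`, `%h`, `%ad`, and `%B` are standard `git log` format placeholders (author name, abbreviated hash, author date, raw body); `--date=iso-strict` selects the ISO 8601 date style used in the entries above.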