Commit Graph

375 Commits

Author SHA1 Message Date
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface (#5129)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state (#4865)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
Yilin Fan
7a5e0fd300
[fix] Fix Llama4 min-latency import error (#5209)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 10:03:07 +08:00
Yan Chunwei
c84e41fd9d
fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-15 17:51:56 -07:00
Fanrong Li
39bba63758
[TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 23:09:16 +08:00
Fanrong Li
159ffc584e
fix: fix cuda graph max batch size for spec decoding cases. (#5076)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 14:57:28 +08:00
Enwei Zhu
63bc62ddf4
feat: Enable EPLB to existing MoE models (#5203)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-15 11:48:06 +08:00
Yuan Tong
6bce7337a9
perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-06-15 07:45:02 +08:00
Tailing Yuan
0b60da2c45
feat: large-scale EP(part 7: DeepEP integration) (#4792)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-14 19:12:38 +08:00
yunruis
b99c5ce8c1
Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560)
Signed-off-by: yunruis <yunruis@nvidia.com>
Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-14 17:36:22 +08:00
Yilin Fan
06342ffb4d
[feat] Implement model-agnostic one-engine eagle3 (#4778)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-13 08:11:41 -07:00
Mike Iovine
25aa3881d7
[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-13 11:06:36 -04:00
brb-nv
089be8912a
feat: Basic skeleton for Gemma3 VLM (#5108)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-06-13 17:27:04 +08:00
nv-guomingz
b959618579
refactor [BREAKING CHANGE]:: remove the redundant use_kv_cache field from PytorchConfig (#5031)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-13 16:34:24 +08:00
zhhuang-nv
a891013e3c
[feat] Optimize KV Cache Reuse for MLA (#4869)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-06-13 11:03:05 +08:00
Fanrong Li
38a907aaca
[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance (#5119)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-13 08:58:44 +08:00
Mike Iovine
690873ba1a
[nvbug/5334370][fix] Fix one model EAGLE3 (#5134)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-12 10:28:14 -04:00
HuiGao-NV
dfeeaf6746
Move allreduce_strategy from committed api to reference (#5147)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 21:00:20 +08:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache (#4804)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Daniel Cámpora
e46267765f
Fix logprobs issues. (#5136)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-12 15:07:01 +08:00
Lucas Liebenwein
49d7268acc
[nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-12 13:07:27 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
HuiGao-NV
43192379af
Use backend to replace macro to control enablement of MNNVL all reduce (#4635)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
Zheng Duan
c592798f64
fix: limit process pool size when prefetching (#5088)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-12 10:52:52 +08:00
liji-nv
8282d6c1a7
[fix] Fix llama4 min latency (#5117)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-11 15:44:38 +08:00
Daniel Cámpora
fdf1c47d1d
[TRTLLM-4995][feat] TRTLLM Sampler log probs support (#4836)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-11 08:18:13 +02:00
nvpohanh
7b210ae9c3
test: add unit tests for Llama4 min_latency code (#4980)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-06-10 12:10:26 -07:00
Lucas Liebenwein
7ddc4d6282
[AutoDeploy] Merge Feature Branch Week 3 (#5054)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-11 00:20:43 +08:00
Zongfei Jing
6d1f2d0fd7
[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-06-10 19:55:16 +08:00
tomeras91
f121f13ddf
[nvbug 5325284][fix] Increase Nemotron-H warmup request robustness (#4954)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-10 11:09:37 +03:00
Xiaowei Wang
ec6b1821c7
[fix] Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod (#5026)
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2025-06-10 15:09:06 +08:00
Daniel Cámpora
d68b8180d3
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-10 07:28:34 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Yuxian Qiu
e79527d195
chore: Refine weight prefetching. (#4893)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 21:24:16 +08:00
Mike Iovine
f4d9c87c51
[nvbug/5314469][feat] Include the executor's max batch size in CUDA g… (#4843)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-09 08:31:35 -04:00
Yukun He
137fe35539
fix: Fix warmup phase batch size out of range. (#4986)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-09 19:19:16 +08:00
Yuxian Qiu
88480197da
ci: [nvbugs/5280806] Unwaive unittests/_torch. (#4951)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 19:04:11 +08:00
Dom Brown
9c012d5bf8
[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-09 11:02:48 +01:00
ChristinaZ
f45aff2b7d
Add customized renormalized moe routing kernel for moe cutlass backend (#4955)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-09 17:38:50 +08:00
Bo Li
c104388d37
chore: Refactor apply_rope. (#4918)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-09 16:51:59 +08:00
Daniel Stokes
3a4851b7c3
feat: Add Mixture of Experts FP8xMXFP4 support (#4750)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-09 13:25:04 +08:00
amitz-nv
77e8d739f1
[TRTLLM-4987][feat] Support generation logits in TRTLLMSampler (#4819) 2025-06-09 06:30:01 +03:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way (#4799)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
Omer Ullman Argov
8731f5f14f
chore: Mass integration of release/0.20 (#4898)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-08 23:26:26 +08:00
Mike Iovine
ec0d984656
[nvbug/5280806][fix] Fix 2 model spec decode flow (#4807)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-08 07:40:02 -04:00
dongxuy04
1e369658f1
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
QI JUN
5ee0de7f2a
Resubmit #4894 (#4969)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-08 04:42:15 +08:00
Bo Li
f414a079ad
chore: Change the type annotations of input_ids and position_ids to int32. (#4632)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-07 16:10:47 +08:00
nv-guomingz
0c7dd660d8
fix:https://nvbugs/5324248 (#4973)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-07 04:14:07 +08:00