Commit Graph

1510 Commits

Author SHA1 Message Date
dongxuy04
490d2e5819
feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
amitz-nv
e0bb123ae7
[TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request (#5080)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-26 08:15:06 +03:00
Yukun He
9ee33605bb
[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. (#5394)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-26 13:12:03 +08:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation (#5222)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
qsang-nv
e9cd810071
keep sm90 headsize 128 cubins (#5320)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-26 12:14:01 +08:00
Netanel Haber
6aef14943c
Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" (#5474)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-25 20:56:04 -07:00
Emma Qiao
32d1573c43
[Infra] - Add timeout setting for long tests found in post-merge (#5501)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-26 11:31:39 +08:00
Venky
d9b75f83fd
[CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] (#5494)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-25 20:17:12 -07:00
ChristinaZ
d135f5993d
Add unit test for routing kernels (#5405)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-26 09:49:11 +08:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) (#4651)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Yukun He
3fc57543e2
[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
A seq_len of 4096 causes an unexplained CUDA illegal memory access when run consecutively with some other tests.
Apply a saturated upper bound to any sequence length larger than it.
2025-06-26 08:38:35 +08:00
HuiGao-NV
74ae15a26b
CI: enable test cases on single device type (#5484)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-26 08:03:44 +08:00
Xianjie Qiao
1e4fa13d33
Add sleep function for disagg gen-only benchmarking (#5398)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-06-26 07:32:16 +08:00
QI JUN
feaf789342
CI: reduce BF16 test cases in B200 (#5482)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 07:18:20 +08:00
Omer Ullman Argov
bdc8dfebc3
[fix][ci] dont build wheel for cpp tests (#5443)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 00:13:47 +03:00
Omer Ullman Argov
61bb71fd1b
[fix][test] remove test in global scope (#5470)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-25 23:42:26 +03:00
QI JUN
3a2c4ca77b
chore: split _build_model method for TorchLlm and TrtLlm (#5418)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 04:32:46 +08:00
Mike Iovine
5bc8c894f7
[chore] Disable block reuse when draft model speculation is being used (#5448)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-26 03:51:20 +08:00
Daniel Cámpora
205c97a4ae
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-25 17:41:36 +02:00
Kaiyu Xie
c5ae3272b9
feat: Make benchmark_serving part of the library (#5428)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-25 23:13:56 +08:00
HuiGao-NV
314f15f0a7
Fix: fix nvbug 5356427 (#5464)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 22:24:26 +08:00
HuiGao-NV
cc3c2b3be2
Move 3 disaggregated cases from 4 GPUs devices to 1 GPU device (#5457)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 21:38:14 +08:00
Kaiyu Xie
d6ada5ffce
[nvbug/5354956] fix: unexpected keyword argument 'streaming' (#5436)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-25 20:37:24 +08:00
HuiGao-NV
b3a4c1f404
feat: Remove not used padding_idx in models (#5385)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 17:19:59 +08:00
QI JUN
2901c5a5bc
CI: waive test_ad_build_small_multi (#5471)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 16:44:42 +08:00
Perkz Zheng
1f292ff2a0
[https://jirasw.nvidia.com/browse/TRTLLM-4645] support mutliCtasKvMode for high-throughput MLA kernels (#5426)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-25 16:31:10 +08:00
Yiqing Yan
f3cfe86dd1
chore: bump version to 1.0.0rc1 (#5460)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-25 16:21:34 +08:00
Netanel Haber
3ca2f6ac51
start OAIServer with max_beam_width=1 for TorchSampler (#5427)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-06-25 15:52:06 +08:00
QI JUN
478f668dcc
CI: update multi gpu test triggering file list (#5466)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 15:51:02 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding (#5348)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Enwei Zhu
76da7fed86
fix (NvBug 5354925): Fix static EPLB (#5411)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 13:14:40 +08:00
HuiGao-NV
da98e03747
tests: Set kv cache free memory fraction in test case (#5433)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 12:31:58 +08:00
Shunkangz
d5354897c0
feat: Dynamically remove servers in PD (#5270)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-25 09:50:04 +08:00
Lucas Liebenwein
5cffb7e0ec
[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454)
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-25 09:30:13 +08:00
bhsueh_NV
73ba4fc320
fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion (#5369)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-06-25 09:20:23 +08:00
QI JUN
241f921800
waive test_moe.py::test_moe_fp8[autotune] (#5455)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 09:14:44 +08:00
dongxuy04
699520082b
Add MTP support for Online EPLB (#5213)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 07:58:13 +08:00
Iman Tabrizian
846bbf1edc
Fix test Pytorch model engine (#5416)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-24 11:09:27 -07:00
QI JUN
d93a5e04b5
Chore: remove unused variables (#5314)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-24 22:27:32 +08:00
HuiGao-NV
35a92f6bab
Add debug hook to support dump tensor data and add new debug functions easily (#5182)
Signed-off-by: Hui Gao
2025-06-24 17:45:28 +08:00
Emma Qiao
475272046a
[Infra] - Waive failed tests in post-merge and increase some timeout setting (#5424)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-24 17:19:31 +08:00
Luis Vega
d26040e5d9
chore: delete mamba hybrid, since it is now called NemotronH (#5409)
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
2025-06-24 16:27:31 +08:00
xinhe-nv
658fb5b54e
tests: update benchmark test lists (#5365)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-24 15:23:38 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
xinhe-nv
4b32a3f1a7
test: [CI] remove closed bugs (#5400)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-24 13:39:57 +08:00
HuiGao-NV
e16c1bef6e
[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 11:43:43 +08:00
Netanel Haber
58a8a8fd37
feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-23 10:38:37 -07:00
Fanrong Li
ebadc13086
[doc] update mtp documents (#5387)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-21 16:05:52 +08:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams (#5165)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
dongxuy04
4f0f17ac8a
feat: Misc Opt for large scale EP (#5374)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00