672 Commits

Author SHA1 Message Date
Yukun He
3fc57543e2
[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
A seq_len of 4096 causes an unexplained CUDA illegal memory access when run consecutively with some other tests.
Apply a saturated upper bound to any sequence length larger than it.
2025-06-26 08:38:35 +08:00
Xianjie Qiao
1e4fa13d33
Add sleep function for disagg gen-only benchmarking (#5398)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-06-26 07:32:16 +08:00
QI JUN
3a2c4ca77b
chore: split _build_model method for TorchLlm and TrtLlm (#5418)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 04:32:46 +08:00
Mike Iovine
5bc8c894f7
[chore] Disable block reuse when draft model speculation is being used (#5448)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-26 03:51:20 +08:00
Daniel Cámpora
205c97a4ae
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-25 17:41:36 +02:00
Kaiyu Xie
c5ae3272b9
feat: Make benchmark_serving part of the library (#5428)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-25 23:13:56 +08:00
HuiGao-NV
b3a4c1f404
feat: Remove not used padding_idx in models (#5385)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 17:19:59 +08:00
Yiqing Yan
f3cfe86dd1
chore: bump version to 1.0.0rc1 (#5460)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-25 16:21:34 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding (#5348)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Enwei Zhu
76da7fed86
fix (NvBug 5354925): Fix static EPLB (#5411)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 13:14:40 +08:00
Shunkangz
d5354897c0
feat: Dynamically remove servers in PD (#5270)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-25 09:50:04 +08:00
Lucas Liebenwein
5cffb7e0ec
[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454)
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-25 09:30:13 +08:00
bhsueh_NV
73ba4fc320
fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion (#5369)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-06-25 09:20:23 +08:00
dongxuy04
699520082b
Add MTP support for Online EPLB (#5213)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 07:58:13 +08:00
QI JUN
d93a5e04b5
Chore: remove unused variables (#5314)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-24 22:27:32 +08:00
HuiGao-NV
35a92f6bab
Add debug hook to support dump tensor data and add new debug functions easily (#5182)
Signed-off-by: Hui Gao
2025-06-24 17:45:28 +08:00
Luis Vega
d26040e5d9
chore: delete mamba hybrid, since it is now called NemotronH (#5409)
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
2025-06-24 16:27:31 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
HuiGao-NV
e16c1bef6e
[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 11:43:43 +08:00
Netanel Haber
58a8a8fd37
feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-23 10:38:37 -07:00
dongxuy04
4f0f17ac8a
feat: Misc Opt for large scale EP (#5374)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla (#4762)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Yan Chunwei
9bd42ecf9b
[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-20 03:01:10 +08:00
Kaiyu Xie
7246fd75d1
feat: Support stream_interval (#5284)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 21:57:10 +08:00
Fanrong Li
c7af650d5a
Fix: fix the deterministic issue in the MTP Eagle path (#5285)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-19 18:08:40 +08:00
Frank
68687a9f56
[WAR][nvbug/5321947] Add an async sleep to unblock event loop. (#5342)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-19 17:25:18 +08:00
hlu1
b558232ce1
Refactor CutlassFusedMoE (#5344)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-19 00:04:07 -07:00
amitz-nv
1753202b61
[TRTLLM-5825][fix] Fix torch LoRA TP (#5338)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-19 09:12:00 +03:00
Yiqing Yan
dedce8ab0e
chore: bump version to 1.0.0rc0 (#5326)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-19 12:02:28 +08:00
nv-guomingz
6a388b105a
chore: remove torch_compile prefix for TorchCompileConfig field members (#5261)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-19 09:21:51 +08:00
Zongfei Jing
2b23cd56ce
[feat] Fusion finalize and allreduce for qwenmoe model (#5223)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-19 08:03:58 +08:00
Yan Chunwei
3946e798db
fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances (#4727)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-19 06:13:53 +08:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend (#5214)
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Zhanrui Sun
516bd4dc05
chore: bump version to 0.21.0rc3 (#5309)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-18 15:59:53 +08:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management (#5246)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
Yukun He
6711ad9cf3
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-18 14:33:46 +08:00
Yan Chunwei
724e495254
chore: partition LLM class into TorchLLM and TrtLLM (#4900)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-18 14:01:25 +08:00
Yi Zhang
e44f7687af
feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-18 13:37:31 +08:00
QI JUN
855036d8ee
update LlmRequest.is_dummy property (#5283)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-18 10:52:13 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management (#4450)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Mike Iovine
9bf69c9fdb
[chore] Remove BaseDraftTokenManager (#5251)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-17 11:57:52 -04:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
- Adds a new Python custom op (fp8_block_scale_moe_runner) and an FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
amirkl94
8451a87742
chore: Mass integration of release/0.20 (#5082)
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 14:32:02 +03:00
liji-nv
13eef642e6
[feat] Piecewise cuda graph support for MLA (#4467)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-17 18:58:38 +08:00
Yilin Fan
498fadceb4
[feat] Add EAGLE3 support for Qwen3 (#5206)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-17 17:07:06 +08:00
Enwei Zhu
4b82b8b4c7
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-17 15:23:24 +08:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW (#4558)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
tomeras91
cea5dd1e38
[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-16 16:29:17 +03:00
Yilin Fan
dd29063538
[feat] Add llm args to tune python gc threshold (#5141)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 17:45:22 +08:00