Bo Li
6000380a0c
perf: Avoid reswizzle_sf after allgather. ( #5504 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-29 21:25:50 +08:00
amirkl94
de9779900c
feat: Add support for YARN in NemotronNAS models ( #4906 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-06-29 09:45:49 +03:00
Lucas Liebenwein
619709fc33
[AutoDeploy] merge feat/ad-2025-06-13 ( #5556 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-29 03:52:14 +08:00
Li Min
6021a439ab
Make moe permute and final as custom op ( #5412 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 ( #5316 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
peaceh-nv
cb58073ab7
Fix : fix build for sm120 ( #5265 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-06-27 20:42:47 +08:00
Daniel Cámpora
73b8a95049
feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler ( #5538 )
2025-06-27 18:40:53 +08:00
Daniel Stokes
83a1f60556
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch ( #5410 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-27 12:29:34 +08:00
Yuxian Qiu
dc36228f52
fix: Fix block scale fp8 support for deepseek v3 on Blackwell. ( #5514 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-27 11:03:38 +08:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) ( #5475 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf ( #5318 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
dongxuy04
490d2e5819
feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) ( #5226 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
Yukun He
9ee33605bb
[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. ( #5394 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-26 13:12:03 +08:00
Netanel Haber
6aef14943c
Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format ( #4401 )" ( #5474 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-25 20:56:04 -07:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) ( #4651 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Yukun He
3fc57543e2
[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. ( #5485 )
...
The seq_len of 4096 will cause some unknown CUDA illegal memory access issue if run with some other tests consecutively.
Put a saturated upper bound for any sequence length larger than it.
2025-06-26 08:38:35 +08:00
Xianjie Qiao
1e4fa13d33
Add sleep function for disagg gen-only benchmarking ( #5398 )
...
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-06-26 07:32:16 +08:00
Mike Iovine
5bc8c894f7
[chore] Disable block reuse when draft model speculation is being used ( #5448 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-26 03:51:20 +08:00
Daniel Cámpora
205c97a4ae
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler ( #5328 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-25 17:41:36 +02:00
HuiGao-NV
b3a4c1f404
feat: Remove not used padding_idx in models ( #5385 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 17:19:59 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding ( #5348 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Enwei Zhu
76da7fed86
fix (NvBug 5354925): Fix static EPLB ( #5411 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 13:14:40 +08:00
Lucas Liebenwein
5cffb7e0ec
[AutoDeploy] Merge feat/ad_2025_06_13 feature branch ( #5454 )
...
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-25 09:30:13 +08:00
bhsueh_NV
73ba4fc320
fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion ( #5369 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-06-25 09:20:23 +08:00
dongxuy04
699520082b
Add MTP support for Online EPLB ( #5213 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 07:58:13 +08:00
QI JUN
d93a5e04b5
Chore: remove unused variables ( #5314 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-24 22:27:32 +08:00
HuiGao-NV
35a92f6bab
Add debug hook to support dump tensor data and add new debug functions easily ( #5182 )
...
Signed-off-by: Hui Gao
2025-06-24 17:45:28 +08:00
Luis Vega
d26040e5d9
chore: delete mamba hybrid, since it is now called NemotronH ( #5409 )
...
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
2025-06-24 16:27:31 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state ( #5315 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
HuiGao-NV
e16c1bef6e
[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation ( #5343 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 11:43:43 +08:00
Netanel Haber
58a8a8fd37
feature: unify new_tokens format sample state to trtllm sampler new_tokens format ( #4401 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-23 10:38:37 -07:00
dongxuy04
4f0f17ac8a
feat: Misc Opt for large scale EP ( #5374 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla ( #4762 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Yan Chunwei
9bd42ecf9b
[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default ( #5312 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-20 03:01:10 +08:00
Kaiyu Xie
7246fd75d1
feat: Support stream_interval ( #5284 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 21:57:10 +08:00
Fanrong Li
c7af650d5a
Fix: fix the deterministic issue in the MTP Eagle path ( #5285 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-19 18:08:40 +08:00
hlu1
b558232ce1
Refactor CutlassFusedMoE ( #5344 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-19 00:04:07 -07:00
amitz-nv
1753202b61
[TRTLLM-5825][fix] Fix torch LoRA TP ( #5338 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-19 09:12:00 +03:00
Zongfei Jing
2b23cd56ce
[feat] Fusion finalize and allreduce for qwenmoe model ( #5223 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-19 08:03:58 +08:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend ( #5214 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management ( #5246 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
Yukun He
6711ad9cf3
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. ( #5139 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-18 14:33:46 +08:00
Yan Chunwei
724e495254
chore: partition LLM class into TorchLLM and TrtLLM ( #4900 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-18 14:01:25 +08:00
QI JUN
855036d8ee
update LlmRequest.is_dummy property ( #5283 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-18 10:52:13 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management ( #4450 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Mike Iovine
9bf69c9fdb
[chore] Remove BaseDraftTokenManager ( #5251 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-17 11:57:52 -04:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00