e607768e45  2025-06-17 02:26:08 +08:00  Izzy Putterman
    Speculation: Draft Target in new FW (#4558)
    Signed-off-by: Izzy Putterman <iputterman@nvidia.com>

cea5dd1e38  2025-06-16 16:29:17 +03:00  tomeras91
    [TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128)
    Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

dd29063538  2025-06-16 17:45:22 +08:00  Yilin Fan
    [feat] Add llm args to tune python gc threshold (#5141)
    Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>

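The knob added in #5141 ultimately tunes CPython's generational collector. A minimal sketch of the underlying mechanism using only the standard gc module (the llm-arg name itself is not shown here, and the threshold value is illustrative):

```python
import gc

# CPython triggers a gen-0 collection once (allocations - deallocations)
# exceeds threshold0. Raising it trades memory for fewer GC pauses on the
# request-serving hot path.
threshold0, threshold1, threshold2 = gc.get_threshold()
print(f"defaults: {threshold0}, {threshold1}, {threshold2}")  # typically 700, 10, 10

# Raise the gen-0 threshold so per-token Python allocations do not trigger
# frequent collections during generation.
gc.set_threshold(20000, threshold1, threshold2)
```
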
b6ca677741  2025-06-16 09:12:30 +02:00  Robin Kobus
    refactor: remove decoder request from decoder interface (#5129)
    Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

dda64166cd  2025-06-16 08:14:58 +02:00  Robin Kobus
    refactor: Scheduling based on KV cache state (#4865)
    Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

ef3fdc8051  2025-06-16 11:30:57 +08:00  Tracin
    feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
    Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

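A sketch of how such a recipe is typically selected through the LLM API's quantization config. The QuantAlgo member name below is inferred from the commit title and should be treated as an assumption:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# Assumption: #4867 exposes the recipe as a QuantAlgo member named after the
# commit title; the exact enum name may differ.
quant_config = QuantConfig(quant_algo=QuantAlgo.W4A8_MXFP4_FP8)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quant_config=quant_config)
```
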
babdd9ce06  2025-06-16 10:03:55 +08:00  Enwei Zhu
    test: Add json_mode_eval for guided decoding evaluation (#5179)
    Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

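json_mode_eval exercises the guided-decoding path. A hedged sketch of requesting schema-constrained output through SamplingParams, following the LLM API's guided-decoding example; the backend choice and model are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

# Assumption: xgrammar is the configured guided-decoding backend.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", guided_decoding_backend="xgrammar")

schema = '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}'
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))

output = llm.generate("Reply in JSON: what is the capital of France?", params)
print(output.outputs[0].text)  # constrained to parse under the schema above
```
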
7a5e0fd300  2025-06-16 10:03:07 +08:00  Yilin Fan
    [fix] Fix Llama4 min-latency import error (#5209)
    Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>

c84e41fd9d  2025-06-15 17:51:56 -07:00  Yan Chunwei
    fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972)
    Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

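Rejecting arbitrary args, as #4972 does for TorchLlmArgs, is commonly done with pydantic's extra="forbid". A generic sketch of the pattern; the field names are illustrative, not the actual TorchLlmArgs schema:

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class StrictArgs(BaseModel):
    # extra="forbid" turns unknown keyword arguments into a hard error
    # instead of silently accepting and ignoring them.
    model_config = ConfigDict(extra="forbid")
    model: str
    max_batch_size: int = 8

StrictArgs(model="llama", max_batch_size=16)     # ok
try:
    StrictArgs(model="llama", max_batchsize=16)  # typo: unknown field
except ValidationError as e:
    print(e)  # "Extra inputs are not permitted"
```
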
109c426077  2025-06-15 18:54:04 +03:00  amitz-nv
    Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130)

39bba63758  2025-06-15 23:09:16 +08:00  Fanrong Li
    [TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802)
    Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

159ffc584e  2025-06-15 14:57:28 +08:00  Fanrong Li
    fix: fix cuda graph max batch size for spec decoding cases. (#5076)
    Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

dce1dcc4f9  2025-06-15 13:02:38 +08:00  Kaiyu Xie
    feat: Support post_proc for bench (#5122)
    Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

63bc62ddf4  2025-06-15 11:48:06 +08:00  Enwei Zhu
    feat: Enable EPLB for existing MoE models (#5203)
    Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

6bce7337a9  2025-06-15 07:45:02 +08:00  Yuan Tong
    perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110)
    Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

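The pattern behind #5110: replace an isinstance check that needs an import inside a hot function with a cheap attribute probe. A self-contained sketch; the class and attribute names are hypothetical stand-ins:

```python
import importlib

class LLMResponse:  # stand-in for the real response type
    def __init__(self):
        self.request_id = 1
        self.outputs = []

def is_llm_response_isinstance(obj) -> bool:
    # Before: dynamic import inside the hot function; module lookup and
    # attribute resolution run on every single call.
    mod = importlib.import_module(__name__)  # stand-in for the real module
    return isinstance(obj, mod.LLMResponse)

def is_llm_response(obj) -> bool:
    # After: duck typing. Probe distinctive attributes instead of importing
    # the class at all.
    return hasattr(obj, "request_id") and hasattr(obj, "outputs")

assert is_llm_response(LLMResponse())
assert not is_llm_response(object())
```
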
e055af1bc9  2025-06-15 01:28:26 +08:00  ixlmar
    chore: improve disagg test failure detection (#4738)
    Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>

1389f5a4d3  2025-06-14 06:37:48 -07:00  Aurelien Chartier
    feat: Add support for fp8 rowwise quantization (#4876)
    Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
    Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>

dc52b67492  2025-06-14 19:19:34 +08:00  2ez4bz
    linting(python): Enable ruff on more files (wave 1/N) (#5140)
    Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

0b60da2c45  2025-06-14 19:12:38 +08:00  Tailing Yuan
    feat: large-scale EP (part 7: DeepEP integration) (#4792)
    Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
    Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

b99c5ce8c1  2025-06-14 17:36:22 +08:00  yunruis
    Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560)
    Signed-off-by: yunruis <yunruis@nvidia.com>
    Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
    Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
    Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>

3b7b5a5ad5  2025-06-14 14:23:13 +08:00  nv-guomingz
    refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3 (torch_compile_config) (#5032)
    Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

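After this refactor the torch.compile knobs live under a dedicated torch_compile_config section of the llm args. A hedged sketch of wiring such a section through update_llm_args_with_extra_dict (the helper also appears in the #4890 deprecation fix further down); the field name inside the config is an assumption:

```python
from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_dict

llm_args = {"model": "meta-llama/Llama-3.1-8B-Instruct"}

# Assumption: the inner field name is illustrative, not the real schema.
extra = {"torch_compile_config": {"enable_fullgraph": True}}

llm_args = update_llm_args_with_extra_dict(llm_args, extra)
```
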
06342ffb4d  2025-06-13 08:11:41 -07:00  Yilin Fan
    [feat] Implement model-agnostic one-engine eagle3 (#4778)
    Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>

25aa3881d7  2025-06-13 11:06:36 -04:00  Mike Iovine
    [nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879)
    Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

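The fix in #4879 amounts to clamping the number of proposed tokens so the draft model never runs past its own maximum sequence length. A generic sketch of that guard (function and parameter names are hypothetical):

```python
def num_tokens_to_draft(cur_seq_len: int, draft_max_seq_len: int, max_draft_tokens: int) -> int:
    """Clamp drafting so cur_seq_len + drafted never exceeds the draft model's limit."""
    remaining = draft_max_seq_len - cur_seq_len
    return max(0, min(max_draft_tokens, remaining))

assert num_tokens_to_draft(cur_seq_len=1000, draft_max_seq_len=1024, max_draft_tokens=5) == 5
assert num_tokens_to_draft(cur_seq_len=1022, draft_max_seq_len=1024, max_draft_tokens=5) == 2
assert num_tokens_to_draft(cur_seq_len=1024, draft_max_seq_len=1024, max_draft_tokens=5) == 0
```
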
089be8912a  2025-06-13 17:27:04 +08:00  brb-nv
    feat: Basic skeleton for Gemma3 VLM (#5108)
    Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

b959618579  2025-06-13 16:34:24 +08:00  nv-guomingz
    refactor [BREAKING CHANGE]: remove the redundant use_kv_cache field from PytorchConfig (#5031)
    Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

30c5b4183a  2025-06-13 16:19:31 +08:00  yunruis
    refactoring: port customized kernels with public cutlass version (#5027)
    Signed-off-by: yunruis
    Merged to unblock others since the full CI has run through.

4d0a5ad384  2025-06-13 14:03:55 +08:00  Zheng Duan
    chore: gracefully exit disagg process in tests; better startup and logging (#5109)
    Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

b79eb34bfe  2025-06-13 11:37:50 +08:00  Yibin Li
    [fix]: Fall back to HMAC to Avoid IPC Serialization Churn (#5074)
    Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

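#5074 falls back to stdlib HMAC for authenticating IPC payloads. A minimal tag-then-verify sketch over a pickled message, using only the standard library; key distribution between endpoints is elided:

```python
import hashlib
import hmac
import os
import pickle

KEY = os.urandom(32)  # would be shared between the IPC endpoints in practice

def sign(payload: bytes) -> bytes:
    # Prepend a SHA-256 HMAC tag to the serialized payload.
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return tag + payload

def verify(message: bytes) -> bytes:
    # Recompute the tag and compare in constant time before deserializing.
    tag, payload = message[:32], message[32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: payload rejected")
    return payload

data = pickle.dumps({"request_id": 7})
assert pickle.loads(verify(sign(data))) == {"request_id": 7}
```
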
a891013e3c  2025-06-13 11:03:05 +08:00  zhhuang-nv
    [feat] Optimize KV Cache Reuse for MLA (#4869)
    Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

38a907aaca  2025-06-13 08:58:44 +08:00  Fanrong Li
    [TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance (#5119)
    Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

3a04c9fa7b  2025-06-12 15:00:08 -04:00  pcastonguay
    chore: Include prompt_token_ids only for context-only disagg requests (#5055)
    Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

690873ba1a  2025-06-12 10:28:14 -04:00  Mike Iovine
    [nvbug/5334370][fix] Fix one model EAGLE3 (#5134)
    Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

dfeeaf6746  2025-06-12 21:00:20 +08:00  HuiGao-NV
    Move allreduce_strategy from committed api to reference (#5147)
    Signed-off-by: Hui Gao <huig@nvidia.com>

8cfb567182  2025-06-12 20:45:34 +08:00  brb-nv
    fix: Updates to yarn implementation (#5105)
    Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

58d4ca2385  2025-06-12 19:48:24 +08:00  nv-guomingz
    fix: remove duplicated trust_remote_code knob from trtllm-serve (#5143)
    Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

10ab9791ec  2025-06-12 15:24:50 +08:00  liji-nv
    [fix] Do not reuse dummy request KVCache (#4804)
    Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

e46267765f  2025-06-12 15:07:01 +08:00  Daniel Cámpora
    Fix logprobs issues. (#5136)
    Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

49d7268acc  2025-06-12 13:07:27 +08:00  Lucas Liebenwein
    [nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
    Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

e692779ead  2025-06-12 12:12:46 +08:00  Netanel Haber
    Solve underallocation in VSWA+/VGQA (#4667)
    Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

43192379af  2025-06-12 11:22:49 +08:00  HuiGao-NV
    Use backend to replace macro to control enablement of MNNVL all reduce (#4635)
    Signed-off-by: Hui Gao <huig@nvidia.com>

c592798f64  2025-06-12 10:52:52 +08:00  Zheng Duan
    fix: limit process pool size when prefetching (#5088)
    Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

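The prefetching fix caps the worker count instead of letting the pool default to one process per core. A generic, self-contained sketch of the technique with concurrent.futures; the prefetch function and cap value are illustrative:

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

def prefetch_file(path: str) -> int:
    # Stand-in for the real prefetch work (e.g. paging in weight shards).
    with open(path, "rb") as f:
        return len(f.read())

if __name__ == "__main__":
    # Create a few dummy files so the sketch is self-contained.
    paths = []
    for _ in range(3):
        f = tempfile.NamedTemporaryFile(delete=False)
        f.write(b"x" * 1024)
        f.close()
        paths.append(f.name)

    # The fix: cap the pool instead of defaulting to os.cpu_count(), which
    # over-spawns processes on large hosts for I/O-bound work.
    max_workers = min(4, os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        print(list(pool.map(prefetch_file, paths)))
```
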
8282d6c1a7  2025-06-11 15:44:38 +08:00  liji-nv
    [fix] Fix llama4 min latency (#5117)
    Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

e2863a3159  2025-06-11 15:08:14 +08:00  Zhanrui Sun
    chore: bump version to 0.21.0rc2 (#5112)
    Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

fdf1c47d1d  2025-06-11 08:18:13 +02:00  Daniel Cámpora
    [TRTLLM-4995][feat] TRTLLM Sampler log probs support (#4836)
    Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

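Log probs are requested per sample through SamplingParams. A hedged sketch against the LLM API; the exact shape of the returned logprobs structure varies by version:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Ask for the top-2 log probabilities of each generated token.
params = SamplingParams(max_tokens=8, logprobs=2)

output = llm.generate("The capital of France is", params)
completion = output.outputs[0]
print(completion.text)
print(completion.logprobs)  # per-token logprob info; structure is version-dependent
```
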
7b210ae9c3  2025-06-10 12:10:26 -07:00  nvpohanh
    test: add unit tests for Llama4 min_latency code (#4980)
    Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

7ddc4d6282  2025-06-11 00:20:43 +08:00  Lucas Liebenwein
    [AutoDeploy] Merge Feature Branch Week 3 (#5054)
    Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

6c91f1c7ac  2025-06-10 22:01:37 +08:00  Tracin
    Mxfp8xmxfp4 quant mode (#4978)
    Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
    Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>

6d1f2d0fd7  2025-06-10 19:55:16 +08:00  Zongfei Jing
    [TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756)
    Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

08dc369a4d  2025-06-10 18:40:29 +08:00  Yuxian Qiu
    fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict (#4890)
    Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

f121f13ddf  2025-06-10 11:09:37 +03:00  tomeras91
    [nvbug 5325284][fix] Increase Nemotron-H warmup request robustness (#4954)
    Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>