Commit Graph

212 Commits

Author SHA1 Message Date
danielafrimi
7a617ad1fe
feat: W4A16 GEMM (#4232)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
Wei-Ming Chen
f28cd3056e
feat: AutoDeploy fp8 quantization support for bmm (#3849)
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
2025-06-30 12:36:34 -04:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgaher with AllToAllPrepare (#5570)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
Omer Ullman Argov
1db63c2546
[fix] speedup modeling unittests (#5579)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-30 06:30:45 +03:00
nv-guomingz
578430e64c
[TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) (#5014)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 11:05:40 +08:00
Omer Ullman Argov
94dc97ab10
[feat][test] reuse MPI pool executor across tests (#5566)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-29 17:23:12 +03:00
tomeras91
a1c1c6b504
[CI] reduce mamba2 ssm test parameterization (#5571)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-29 15:56:23 +03:00
Lucas Liebenwein
619709fc33
[AutoDeploy] merge feat/ad-2025-06-13 (#5556)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-29 03:52:14 +08:00
Li Min
6021a439ab
Make moe permute and final as custom op (#5412)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 (#4569)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
Enwei Zhu
7f1893f54c
ci: waive flaky test test_llama_eagle3 (#5548)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-27 19:16:07 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
dongxuy04
490d2e5819
feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
Omer Ullman Argov
61bb71fd1b
[fix][test] remove test in global scope (#5470)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-25 23:42:26 +03:00
QI JUN
3a2c4ca77b
chore: split _build_model method for TorchLlm and TrtLlm (#5418)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 04:32:46 +08:00
HuiGao-NV
314f15f0a7
Fix: fix nvbug 5356427 (#5464)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 22:24:26 +08:00
QI JUN
2901c5a5bc
CI: waive test_ad_build_small_multi (#5471)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 16:44:42 +08:00
Lucas Liebenwein
5cffb7e0ec
[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454)
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-25 09:30:13 +08:00
QI JUN
241f921800
waive test_moe.py::test_moe_fp8[autotune] (#5455)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 09:14:44 +08:00
Iman Tabrizian
846bbf1edc
Fix test Pytorch model engine (#5416)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-24 11:09:27 -07:00
HuiGao-NV
35a92f6bab
Add debug hook to support dump tensor data and add new debug functions easily (#5182)
Signed-off-by: Hui Gao
2025-06-24 17:45:28 +08:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla (#4762)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Yan Chunwei
9bd42ecf9b
[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-20 03:01:10 +08:00
hlu1
b558232ce1
Refactor CutlassFusedMoE (#5344)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-19 00:04:07 -07:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management (#5246)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
qsang-nv
134cb66a53
fix mla test (#5240)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-17 15:26:25 +08:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW (#4558)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
Yiqing Yan
8445416c39
Waive L0 tests (#5233)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-16 15:19:03 +08:00
Anthony Chang
4f9fa9f21d
feat: MoE trtllm backend kernel update (#5183)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-16 14:46:13 +08:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
Yan Chunwei
c84e41fd9d
fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-15 17:51:56 -07:00
Omer Ullman Argov
4eade3ae33
[fix][test] Speedup Nemotron NAS unittests (#5202)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-15 11:26:03 +03:00
ixlmar
e055af1bc9
chore: improve disagg test failure detection (#4738)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-15 01:28:26 +08:00
Tailing Yuan
0b60da2c45
feat: large-scale EP(part 7: DeepEP integration) (#4792)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-14 19:12:38 +08:00
yunruis
b99c5ce8c1
Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560)
Signed-off-by: yunruis <yunruis@nvidia.com>
Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-14 17:36:22 +08:00
Mike Iovine
25aa3881d7
[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-13 11:06:36 -04:00
Daniel Cámpora
dec326ba7d
[fix] Reenable test return logits (#5160)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-13 06:07:22 +02:00
Yuxian Qiu
4ae46b6714
fix: [nvbugs/5324229] Fix broken WInt4AFP8FusedMoEMethod since FusedMoE refactor. (#4930)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-13 10:21:32 +08:00
Matthias Jouanneaux
a0b6c635b1
[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-06-13 06:00:02 +08:00
tomeras91
06d9f1e2f6
[test] Use LLM API for Nemotron-H correctness test (#5097)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-12 09:54:46 +03:00
Lucas Liebenwein
49d7268acc
[nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-12 13:07:27 +08:00
HuiGao-NV
43192379af
Use backend to replace macro to control enablement of MNNVL all reduce (#4635)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
Daniel Cámpora
fdf1c47d1d
[TRTLLM-4995][feat] TRTLLM Sampler log probs support (#4836)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-11 08:18:13 +02:00
ChristinaZ
273c6b9355
[https://nvbugspro.nvidia.com/bug/5332927][fix] Fix the bug in the routing unit test (#5065)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-11 09:44:35 +08:00
Mike Iovine
fcd71921f1
[fix] Unwaive test_llama_eagle3 (#5042)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-10 18:11:07 -04:00
Jinyang Yuan
194a708d83
[fix] Fix test_attention_mla (#5084)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-06-10 14:20:11 -07:00
nvpohanh
7b210ae9c3
test: add unit tests for Llama4 min_latency code (#4980)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-06-10 12:10:26 -07:00
Lucas Liebenwein
7ddc4d6282
[AutoDeploy] Merge Feature Branch Week 3 (#5054)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-11 00:20:43 +08:00