Commit Graph

1542 Commits

Author SHA1 Message Date
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 (#4569)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
peaceh-nv
cb58073ab7
Fix : fix build for sm120 (#5265)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-06-27 20:42:47 +08:00
Omer Ullman Argov
6fc1c6fd7b
[fix][ci] correct unittests test prefix (#5547)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-27 20:34:44 +08:00
ChristinaZ
a608b00d38
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-27 20:17:40 +08:00
Enwei Zhu
7f1893f54c
ci: waive flaky test test_llama_eagle3 (#5548)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-27 19:16:07 +08:00
Daniel Cámpora
73b8a95049
feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538)
2025-06-27 18:40:53 +08:00
Emma Qiao
980030c816
[Infra] - Waive failed case in post-merge (#5536)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-27 13:55:49 +08:00
Daniel Stokes
83a1f60556
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-27 12:29:34 +08:00
Tailing Yuan
ef43b95aa1
Fix execute_process: check results using EQUAL (#5481)
2025-06-27 11:57:04 +08:00
Iman Tabrizian
49af791f66
Add testing for trtllm-llmapi-launch with tritonserver (#5528)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-27 11:19:52 +08:00
Yuxian Qiu
dc36228f52
fix: Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-27 11:03:38 +08:00
xinhe-nv
a3494bebec
tests: waive failed tests on main (#5512)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-27 10:13:22 +08:00
Yibin Li
0f3bd7800e
[TRTLLM-4971]: Use safe deserialization in ParallelConfig (#4630)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-27 09:58:41 +08:00
Frank
aa6e015ef8
Update trtllm-bench to support new Pytorch default. (#5491)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-26 17:05:43 -07:00
Venky
0083228d2a
fix: Mapping rank boundary check bug (#4935)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-27 07:27:59 +08:00
yuanjingx87
69c4ef2e0e
Update allow list 2025_06_26 (#5526)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-06-26 15:25:09 -07:00
Anthony Chang
de7cd0de05
fix: MoE autotune fallback failed to query default heuristic (#5520)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-26 17:28:48 +01:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Omer Ullman Argov
6bae76d7ca
[fix][ci] move torch tests to run under torch stage (#5473)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 14:31:38 +03:00
Omer Ullman Argov
1633bd2bef
[CI] move flashinfer llama tests to post merge (#5506)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 19:27:32 +08:00
Frank
baf7eaa1cc
Add trtllm-bench reviewers. (#5452)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-26 18:48:00 +08:00
Rashid Kaleem
3a1f4d4001
[feat] Add progress bar to benchmark (#5173)
Signed-off-by: Rashid Kaleem <rkaleem@nvidia.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-06-26 18:39:45 +08:00
Kaiyu Xie
2eb6502b1d
feat: Add support for TRTLLM CustomDataset (#5511)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-26 18:27:37 +08:00
Yao Yao
0788c5d0d6
[perf] improve XQA-MLA perf (#5468)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-06-26 18:09:13 +08:00
Kaiyu Xie
749393ec9f
doc: Fix benchmark cmd in disagg scripts (#5515)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-26 17:04:55 +08:00
xinhe-nv
ff2dd72df4
tests: waive tests (#5458)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-26 14:53:55 +08:00
Omer Ullman Argov
fa0ea92dfd
[fix][ci] trigger multigpu tests for deepseek changes (#5423)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 14:30:00 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
Alessio Netti
7e681fbe52
[chore] Allow configuring linking of NVRTC wrapper (#5189)
Signed-off-by: Alessio Netti <netti.alessio@gmail.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 07:26:10 +02:00
dongxuy04
490d2e5819
feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
amitz-nv
e0bb123ae7
[TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request (#5080)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-26 08:15:06 +03:00
Yukun He
9ee33605bb
[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. (#5394)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-26 13:12:03 +08:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation (#5222)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
qsang-nv
e9cd810071
keep sm90 headsize 128 cubins (#5320)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-26 12:14:01 +08:00
Netanel Haber
6aef14943c
Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" (#5474)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-25 20:56:04 -07:00
Emma Qiao
32d1573c43
[Infra] - Add timeout setting for long tests found in post-merge (#5501)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-26 11:31:39 +08:00
Venky
d9b75f83fd
[CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] (#5494)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-25 20:17:12 -07:00
ChristinaZ
d135f5993d
Add unit test for routing kernels (#5405)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-26 09:49:11 +08:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) (#4651)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Yukun He
3fc57543e2
[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
The seq_len of 4096 will cause some unknown CUDA illegal memory access issue if run with some other tests consecutively.
Put a saturated upper bound for any sequence length larger than it.
2025-06-26 08:38:35 +08:00
HuiGao-NV
74ae15a26b
CI: enable test cases on single device type (#5484)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-26 08:03:44 +08:00
Xianjie Qiao
1e4fa13d33
Add sleep function for disagg gen-only benchmarking (#5398)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-06-26 07:32:16 +08:00
QI JUN
feaf789342
CI: reduce BF16 test cases in B200 (#5482)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 07:18:20 +08:00
Omer Ullman Argov
bdc8dfebc3
[fix][ci] dont build wheel for cpp tests (#5443)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 00:13:47 +03:00
Omer Ullman Argov
61bb71fd1b
[fix][test] remove test in global scope (#5470)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-25 23:42:26 +03:00
QI JUN
3a2c4ca77b
chore: split _build_model method for TorchLlm and TrtLlm (#5418)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 04:32:46 +08:00
Mike Iovine
5bc8c894f7
[chore] Disable block reuse when draft model speculation is being used (#5448)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-26 03:51:20 +08:00