Commit Graph

235 Commits

Author SHA1 Message Date
Emma Qiao
d3df3f6feb
[None][infra] Waive failed cases and disable a stage on 02/02 (#11177)
Signed-off-by: qqiao <qqiao@nvidia.com>
2026-02-02 13:28:53 +08:00
Matt Lefebvre
97ab014bdb
[TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms (#11085)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2026-01-30 14:07:47 -08:00
Yiqing Yan
6fcbf15fb8
[None][fix] No need to remove the original waive list (#11060)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-29 11:10:38 +08:00
Matt Lefebvre
c26a8f764c
[TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform (#11006)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2026-01-27 12:33:16 -08:00
Linda
ce556290c9
[None][chore] Removing pybind11 bindings and references (#10550)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2026-01-26 08:19:12 -05:00
Emma Qiao
9d65b8bf24
[None][infra] Fix TRT-LLM data scratch mount point for gb10x (#10880)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-24 14:00:17 +08:00
Zhanrui Sun
df845a028b
[TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab (#10616)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2026-01-19 00:40:40 -05:00
chenfeiz0326
e97af45556
[TRTLLM-10300][feat] Upload regression info to artifactory (#10599)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-19 10:16:31 +08:00
Yanchao Lu
0096b50ba0
[None][infra] Update upgrade related docs for release 1.2 (#10760) (#10773)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
2026-01-18 00:14:27 +08:00
chenfeiz0326
56073f501a
[TRTLLM-8263][feat] Add Aggregated Perf Tests (#10598)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-17 13:16:36 +08:00
Lucas Liebenwein
62050b2381
[None][infra] separate AutoDeploy tests into own stages (#10634)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-14 23:05:26 -05:00
Emma Qiao
01083b56bf
[TRTLLM-9849][infra] Update dependencies to 25.12 (#9818)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: xxi <xxi@nvidia.com>
Signed-off-by: xxi <95731198+xxi-nv@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: xxi <xxi@nvidia.com>
Co-authored-by: xxi <95731198+xxi-nv@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-14 21:54:04 +08:00
tburt-nv
7d41475954
[None][infra] try removing shared cache dir mount (#10609)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2026-01-13 15:07:12 +08:00
Yanchao Lu
80649a8b78
[None][ci] Workaround OCI-NRT slowdown issue (#10587)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-11 22:08:19 +08:00
Emma Qiao
43839c7d9b
[TRTLLM-9642][infra] Increase pytest verbosity for failed tests (#9657)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
2026-01-08 02:33:48 -05:00
Yiqing Yan
5108a69fc0
[TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline (#9699)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-06 14:39:55 +08:00
chenfeiz0326
a65b0d4efa
[None][fix] Decrease Pre Merge Perf Tests (#10390)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-04 12:21:34 -05:00
Yanchao Lu
c4f27fa4c0
[None][ci] Some tweaks for the CI pipeline (#10359)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-04 11:10:47 -05:00
yuanjingx87
5bd37ce41e
[None][infra] add retry logic to get slurm sbatch job log when ssh dropped (#9167)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2026-01-04 10:11:37 +08:00
chenfeiz0326
5e0e48144f
[None][fix] Minor updates on Perf Test System (#10375)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-02 17:17:42 +08:00
chenfeiz0326
a23c6f1092
[TRTLLM-9834][feat] Transfer to TRTLLM-INFRA Database and Fail post-merge tests if regression (#10282)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-31 21:44:59 +08:00
Yiqing Yan
fdc03684cc
[TRTLLM-10016][infra] Use SlurmPatition attribute time as timeout threshold (#10254)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-31 15:02:24 +08:00
Emma Qiao
fb05cd769a
[None][infra] Enable single-gpu CI on spark (#9304)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Jenny Liu <JennyLiu-nv+JennyLiu@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-30 17:22:14 +08:00
Yanchao Lu
965578ca21
[None][infra] Some improvements for Slurm execution path in the CI (#10316)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-29 06:49:44 -05:00
Yanchao Lu
270be801aa
[None][ci] Move remaining DGX-B200 tests to LBD (#9876)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-28 13:55:39 +08:00
chenfeiz0326
d70aeddc7f
[TRTLLM-8952][feat] Support Multi-Node Disagg Perf Test in CI (#9138)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-26 22:50:53 +08:00
Emma Qiao
16fd781e42
[TRTLLM-9862][infra] Move single-gpu tests on rtxpro6000d to pre-merge (#9897)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-12-24 21:45:33 -05:00
shuyixiong
f4f0fe85e9
[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939)
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
2025-12-24 15:27:01 +08:00
chenfeiz0326
48c875f8ea
[None][fix] Add OpenSearch URL in slurm_launch.sh for Multinode Perf Sanity Test (#9990)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-23 16:02:38 +08:00
JunyiXu-nv
356ad4fe3a
[https://nvbugs/5722653][fix] Address port conflict by assigning different port section in the same node. (#10035)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-12-19 15:34:04 +08:00
yuanjingx87
df15be3fad
[None][infra] Fix slurm job does not catch cancelled jobs (#9722)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-18 00:32:43 -08:00
QI JUN
4ce35eacf1
[TRTLLM-9794][ci] move more test cases to gb200 (#9994)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-12-15 19:50:41 -08:00
Matt Lefebvre
1375910f1b
[None][infra] Delete container before attempting import (#9967)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-14 00:09:33 -08:00
Matt Lefebvre
df1adfbb50
[TRTINFRA-7328][infra] - Move half B200 tests to lbd (#9853)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-10 04:24:30 -08:00
Matt Lefebvre
8fefa2c9d1
[None][infra] Fail fast if SLURM entrypoint fails (#9744)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-10 02:31:29 -08:00
Matt Lefebvre
5de4e3f621
[TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts (#9600)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-09 13:34:09 -08:00
Jhao-Ting Chen
0a09465089
[https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model (#8383)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-12-08 11:16:05 -08:00
chenfeiz0326
383178c00a
[TRTLLM-9000][feat] Add multi-node Perf Tests into CI (#8800)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-08 09:00:44 +08:00
Yanchao Lu
f59d64e6c7
[None][fix] Several minor fixes to CI setting (#9765)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-07 23:07:59 +08:00
Yiqing Yan
e834f04238
[TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue (#9692)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-05 10:18:31 +08:00
Yiqing Yan
731b2eb4ef
[TRTLLM-5312][infra] Add triton trigger rules (#6440)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-05 07:35:04 +08:00
Yiqing Yan
47f650ca13
[TRTLLM-5093][infra] Write env variables to a file in the interactive debug session (#6792)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-04 11:41:27 +08:00
Yiqing Yan
e31142202e
[TRTLLM-7181][infra] Generate test results when pytest timeout happens (#9396)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-04 10:05:38 +08:00
Yiqing Yan
8c88454fa5
[TRTLLM-7101][infra] Reuse passed tests (#6894)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-03 10:07:23 +08:00
Chang Liu
73a543d78f
[None][fix] Extract GPU count from single-node stage names (#9599)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-12-02 20:58:16 +08:00
Eran Geva
1a46bb0d18
Lock the gpu clocks in L0 perf tests (#9585)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-12-02 18:13:45 +08:00
Emma Qiao
b024040df0
[None][infra] Update the pytest options after MI (#9579)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-12-02 00:11:30 +08:00
Yanchao Lu
078d3a576e
[None][ci] Minor change for Slurm scripts (#9561)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-01 22:52:08 +08:00
Enwei Zhu
34e2fa5c96
[https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-01 09:10:21 +08:00
Yanchao Lu
694b60d92d
[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (#9559)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-30 21:14:18 +08:00