Emma Qiao
d3df3f6feb
[None][infra] Waive failed cases and disable a stage on 02/02 ( #11177 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2026-02-02 13:28:53 +08:00
Matt Lefebvre
97ab014bdb
[TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms ( #11085 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2026-01-30 14:07:47 -08:00
Yiqing Yan
6fcbf15fb8
[None][fix] No need to remove the original waive list ( #11060 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-29 11:10:38 +08:00
Matt Lefebvre
c26a8f764c
[TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform ( #11006 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2026-01-27 12:33:16 -08:00
Linda
ce556290c9
[None][chore] Removing pybind11 bindings and references ( #10550 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2026-01-26 08:19:12 -05:00
Emma Qiao
9d65b8bf24
[None][infra] Fix TRT-LLM data scratch mount point for gb10x ( #10880 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-24 14:00:17 +08:00
Zhanrui Sun
df845a028b
[TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab ( #10616 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2026-01-19 00:40:40 -05:00
chenfeiz0326
e97af45556
[TRTLLM-10300][feat] Upload regression info to artifactory ( #10599 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-19 10:16:31 +08:00
Yanchao Lu
0096b50ba0
[None][infra] Update upgrade related docs for release 1.2 ( #10760 ) ( #10773 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
2026-01-18 00:14:27 +08:00
chenfeiz0326
56073f501a
[TRTLLM-8263][feat] Add Aggregated Perf Tests ( #10598 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-17 13:16:36 +08:00
Lucas Liebenwein
62050b2381
[None][infra] separate AutoDeploy tests into own stages ( #10634 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-14 23:05:26 -05:00
Emma Qiao
01083b56bf
[TRTLLM-9849][infra] Update dependencies to 25.12 ( #9818 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: xxi <xxi@nvidia.com>
Signed-off-by: xxi <95731198+xxi-nv@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: xxi <xxi@nvidia.com>
Co-authored-by: xxi <95731198+xxi-nv@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-14 21:54:04 +08:00
tburt-nv
7d41475954
[None][infra] try removing shared cache dir mount ( #10609 )
...
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2026-01-13 15:07:12 +08:00
Yanchao Lu
80649a8b78
[None][ci] Workaround OCI-NRT slowdown issue ( #10587 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-11 22:08:19 +08:00
Emma Qiao
43839c7d9b
[TRTLLM-9642][infra] Increase pytest verbosity for failed tests ( #9657 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
2026-01-08 02:33:48 -05:00
Yiqing Yan
5108a69fc0
[TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline ( #9699 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-06 14:39:55 +08:00
chenfeiz0326
a65b0d4efa
[None][fix] Decrease Pre Merge Perf Tests ( #10390 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-04 12:21:34 -05:00
Yanchao Lu
c4f27fa4c0
[None][ci] Some tweaks for the CI pipeline ( #10359 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-04 11:10:47 -05:00
yuanjingx87
5bd37ce41e
[None][infra] add retry logic to get slurm sbatch job log when ssh dropped ( #9167 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2026-01-04 10:11:37 +08:00
chenfeiz0326
5e0e48144f
[None][fix] Minor updates on Perf Test System ( #10375 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-02 17:17:42 +08:00
chenfeiz0326
a23c6f1092
[TRTLLM-9834][feat] Transfer to TRTLLM-INFRA Database and Fail post-merge tests if regression ( #10282 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-31 21:44:59 +08:00
Yiqing Yan
fdc03684cc
[TRTLLM-10016][infra] Use SlurmPatition attribute time as timeout threshold ( #10254 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-31 15:02:24 +08:00
Emma Qiao
fb05cd769a
[None][infra] Enable single-gpu CI on spark ( #9304 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Jenny Liu <JennyLiu-nv+JennyLiu@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-30 17:22:14 +08:00
Yanchao Lu
965578ca21
[None][infra] Some improvements for Slurm execution path in the CI ( #10316 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-29 06:49:44 -05:00
Yanchao Lu
270be801aa
[None][ci] Move remaining DGX-B200 tests to LBD ( #9876 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-28 13:55:39 +08:00
chenfeiz0326
d70aeddc7f
[TRTLLM-8952][feat] Support Multi-Node Disagg Perf Test in CI ( #9138 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-26 22:50:53 +08:00
Emma Qiao
16fd781e42
[TRTLLM-9862][infra] Move single-gpu tests on rtxpro6000d to pre-merge ( #9897 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-12-24 21:45:33 -05:00
shuyixiong
f4f0fe85e9
[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests ( #9939 )
...
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
2025-12-24 15:27:01 +08:00
chenfeiz0326
48c875f8ea
[None][fix] Add OpenSearch URL in slurm_launch.sh for Multinode Perf Sanity Test ( #9990 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-23 16:02:38 +08:00
JunyiXu-nv
356ad4fe3a
[https://nvbugs/5722653][fix] Address port conflict by assigning different port section in the same node. ( #10035 )
...
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-12-19 15:34:04 +08:00
yuanjingx87
df15be3fad
[None][infra] Fix slurm job does not catch cancelled jobs ( #9722 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-18 00:32:43 -08:00
QI JUN
4ce35eacf1
[TRTLLM-9794][ci] move more test cases to gb200 ( #9994 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-12-15 19:50:41 -08:00
Matt Lefebvre
1375910f1b
[None][infra] Delete container before attempting import ( #9967 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-14 00:09:33 -08:00
Matt Lefebvre
df1adfbb50
[TRTINFRA-7328][infra] - Move half B200 tests to lbd ( #9853 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-10 04:24:30 -08:00
Matt Lefebvre
8fefa2c9d1
[None][infra] Fail fast if SLURM entrypoint fails ( #9744 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-10 02:31:29 -08:00
Matt Lefebvre
5de4e3f621
[TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts ( #9600 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-09 13:34:09 -08:00
Jhao-Ting Chen
0a09465089
[https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model ( #8383 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-12-08 11:16:05 -08:00
chenfeiz0326
383178c00a
[TRTLLM-9000][feat] Add multi-node Perf Tests into CI ( #8800 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-08 09:00:44 +08:00
Yanchao Lu
f59d64e6c7
[None][fix] Several minor fixes to CI setting ( #9765 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-07 23:07:59 +08:00
Yiqing Yan
e834f04238
[TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue ( #9692 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-05 10:18:31 +08:00
Yiqing Yan
731b2eb4ef
[TRTLLM-5312][infra] Add triton trigger rules ( #6440 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-05 07:35:04 +08:00
Yiqing Yan
47f650ca13
[TRTLLM-5093][infra] Write env variables to a file in the interactive debug session ( #6792 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-04 11:41:27 +08:00
Yiqing Yan
e31142202e
[TRTLLM-7181][infra] Generate test results when pytest timeout happens ( #9396 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-04 10:05:38 +08:00
Yiqing Yan
8c88454fa5
[TRTLLM-7101][infra] Reuse passed tests ( #6894 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-03 10:07:23 +08:00
Chang Liu
73a543d78f
[None][fix] Extract GPU count from single-node stage names ( #9599 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-12-02 20:58:16 +08:00
Eran Geva
1a46bb0d18
Lock the gpu clocks in L0 perf tests ( #9585 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-12-02 18:13:45 +08:00
Emma Qiao
b024040df0
[None][infra] Update the pytest options after MI ( #9579 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-12-02 00:11:30 +08:00
Yanchao Lu
078d3a576e
[None][ci] Minor change for Slurm scripts ( #9561 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-01 22:52:08 +08:00
Enwei Zhu
34e2fa5c96
[https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL ( #9530 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-01 09:10:21 +08:00
Yanchao Lu
694b60d92d
[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage ( #9559 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-30 21:14:18 +08:00