tburt-nv
7d41475954
[None][infra] try removing shared cache dir mount ( #10609 )
...
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2026-01-13 15:07:12 +08:00
chenfeiz0326
54459377d2
[TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel ( #10489 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-12 14:23:23 +08:00
Yanchao Lu
80649a8b78
[None][ci] Workaround OCI-NRT slowdown issue ( #10587 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-11 22:08:19 +08:00
Emma Qiao
43839c7d9b
[TRTLLM-9642][infra] Increase pytest verbosity for failed tests ( #9657 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
2026-01-08 02:33:48 -05:00
Yiqing Yan
5108a69fc0
[TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline ( #9699 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-06 14:39:55 +08:00
chenfeiz0326
a65b0d4efa
[None][fix] Decrease Pre Merge Perf Tests ( #10390 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-04 12:21:34 -05:00
Yanchao Lu
c4f27fa4c0
[None][ci] Some tweaks for the CI pipeline ( #10359 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-04 11:10:47 -05:00
yuanjingx87
5bd37ce41e
[None][infra] add retry logic to get slurm sbatch job log when ssh dropped ( #9167 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2026-01-04 10:11:37 +08:00
chenfeiz0326
5e0e48144f
[None][fix] Minor updates on Perf Test System ( #10375 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-02 17:17:42 +08:00
chenfeiz0326
a23c6f1092
[TRTLLM-9834][feat] Transfer to TRTLLM-INFRA Database and Fail post-merge tests if regression ( #10282 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-31 21:44:59 +08:00
Yiqing Yan
fdc03684cc
[TRTLLM-10016][infra] Use SlurmPatition attribute time as timeout threshold ( #10254 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-31 15:02:24 +08:00
Emma Qiao
fb05cd769a
[None][infra] Enable single-gpu CI on spark ( #9304 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Jenny Liu <JennyLiu-nv+JennyLiu@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-30 17:22:14 +08:00
Yanchao Lu
965578ca21
[None][infra] Some improvements for Slurm execution path in the CI ( #10316 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-29 06:49:44 -05:00
Yanchao Lu
270be801aa
[None][ci] Move remaining DGX-B200 tests to LBD ( #9876 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-28 13:55:39 +08:00
chenfeiz0326
d70aeddc7f
[TRTLLM-8952][feat] Support Multi-Node Disagg Perf Test in CI ( #9138 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-26 22:50:53 +08:00
Iman Tabrizian
cd5cd60ee4
[None][infra] Move install_boost from install_triton.sh to install_base.sh ( #10055 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-12-25 08:09:55 -05:00
Emma Qiao
16fd781e42
[TRTLLM-9862][infra] Move single-gpu tests on rtxpro6000d to pre-merge ( #9897 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-12-24 21:45:33 -05:00
Yiqing Yan
69152c4e7c
[None][infra] Check GB200 coherent GPU mapping ( #10253 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-24 17:12:36 +08:00
shuyixiong
f4f0fe85e9
[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests ( #9939 )
...
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
2025-12-24 15:27:01 +08:00
chenfeiz0326
48c875f8ea
[None][fix] Add OpenSearch URL in slurm_launch.sh for Multinode Perf Sanity Test ( #9990 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-23 16:02:38 +08:00
JunyiXu-nv
356ad4fe3a
[https://nvbugs/5722653][fix] Address port conflict by assigning different port section in the same node. ( #10035 )
...
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-12-19 15:34:04 +08:00
Wangjue Yao
9f283f330b
[None][feat] Support Mooncake transfer engine as a cache transceiver backend ( #8309 )
...
Signed-off-by: wjueyao <wyao123@terpmail.umd.edu>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-12-19 10:09:51 +08:00
yuanjingx87
df15be3fad
[None][infra] Fix slurm job does not catch cancelled jobs ( #9722 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-18 00:32:43 -08:00
yuanjingx87
0a4c59136a
[None][infra] Fixing credential loading in lockfile generation pipeline ( #10020 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-12-16 15:38:29 +08:00
QI JUN
4ce35eacf1
[TRTLLM-9794][ci] move more test cases to gb200 ( #9994 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-12-15 19:50:41 -08:00
zackyoray
63e7a2fa70
[None][infra] Update ucx to 1.20.x ( #9977 )
...
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
2025-12-16 00:31:48 +08:00
dominicshanshan
825025b137
[None][infra] Add multi gpu Ray tests into L0 merge change request list. ( #9996 )
...
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-12-15 15:55:54 +08:00
Matt Lefebvre
1375910f1b
[None][infra] Delete container before attempting import ( #9967 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-14 00:09:33 -08:00
Yuxian Qiu
fcda1a1442
[None][fix] disable async pp send for ray cases. ( #9959 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-12-13 20:22:36 -08:00
yuanjingx87
246a877571
[None][infra] Remove generate lockfile schedule for 1.2.0rc4.post1 branch ( #9945 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-12-12 09:10:32 -08:00
zackyoray
d5b9ad91c9
[None][feat] Upgrade NIXL to v0.8.0 ( #9707 )
...
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
Signed-off-by: zackyoray
Signed-off-by: Bo Deng
Co-authored-by: Bo Deng
2025-12-12 20:21:10 +08:00
yuanjingx87
eeb03f314a
[None][infra] Replace the deprecated github token ( #9915 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-12-11 22:46:14 -08:00
Chuang Zhu
bd441e9822
[None][infra] revert ucx to 1.19 ( #9936 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-12-12 11:37:19 +08:00
Yiteng Niu
3e39afea9a
[None][infra] update nspect version for api change ( #9899 )
...
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-12-12 11:27:42 +08:00
Yiqing Yan
5065b60cd1
[None][infra] Fix mergeWaiveList stage ( #9892 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-12 11:19:42 +08:00
Chuang Zhu
4670e0c297
[None][infra] update ucx to 1.20 ( #9786 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-12-12 09:49:46 +08:00
Matt Lefebvre
df1adfbb50
[TRTINFRA-7328][infra] - Move half B200 tests to lbd ( #9853 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-10 04:24:30 -08:00
Matt Lefebvre
8fefa2c9d1
[None][infra] Fail fast if SLURM entrypoint fails ( #9744 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-10 02:31:29 -08:00
Guoming Zhang
12693a526b
[None][chore] Enable L0 multi-gpus testing for Qwen3-next ( #9789 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-12-10 17:11:32 +08:00
Zhanrui Sun
49fe089470
[TRTLLM-9811][infra] Update urllib3 version >= 2.6.0 to fix high vulnerability issue ( #9823 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-12-10 00:18:11 -08:00
Matt Lefebvre
5de4e3f621
[TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts ( #9600 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-12-09 13:34:09 -08:00
Yiqing Yan
2ddcb45b2a
[None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically ( #9829 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-09 16:34:17 +08:00
Shi Xiaowei
b050804b63
[TRTLLM-6537][infra] extend multi-gpu tests related file list ( #9614 )
...
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-12-09 12:54:53 +08:00
Jhao-Ting Chen
0a09465089
[https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model ( #8383 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-12-08 11:16:05 -08:00
Zheng Duan
e7395c6607
[None][infra] update mooncake in docker images ( #9584 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-12-08 16:56:40 +08:00
chenfeiz0326
383178c00a
[TRTLLM-9000][feat] Add multi-node Perf Tests into CI ( #8800 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-12-08 09:00:44 +08:00
Yanchao Lu
f59d64e6c7
[None][fix] Several minor fixes to CI setting ( #9765 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-07 23:07:59 +08:00
Yiqing Yan
e834f04238
[TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue ( #9692 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-05 10:18:31 +08:00
Yiqing Yan
731b2eb4ef
[TRTLLM-5312][infra] Add triton trigger rules ( #6440 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-05 07:35:04 +08:00
zackyoray
398d24232d
[None][feat] Add NIXL-LIBFABRIC support ( #9225 )
...
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
Signed-off-by: zackyoray <yorayz@nvidia.com>
2025-12-04 15:38:06 +08:00