yuanjingx87
|
0a4c59136a
|
[None][infra] Fixing credential loading in lockfile generation pipeline (#10020)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
|
2025-12-16 15:38:29 +08:00 |
|
QI JUN
|
4ce35eacf1
|
[TRTLLM-9794][ci] move more test cases to gb200 (#9994)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
|
2025-12-15 19:50:41 -08:00 |
|
zackyoray
|
63e7a2fa70
|
[None][infra] Update ucx to 1.20.x (#9977)
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
|
2025-12-16 00:31:48 +08:00 |
|
dominicshanshan
|
825025b137
|
[None][infra] Add multi gpu Ray tests into L0 merge change request list. (#9996)
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
|
2025-12-15 15:55:54 +08:00 |
|
Matt Lefebvre
|
1375910f1b
|
[None][infra] Delete container before attempting import (#9967)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
|
2025-12-14 00:09:33 -08:00 |
|
Yuxian Qiu
|
fcda1a1442
|
[None][fix] disable async pp send for ray cases. (#9959)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
|
2025-12-13 20:22:36 -08:00 |
|
yuanjingx87
|
246a877571
|
[None][infra] Remove generate lockfile schedule for 1.2.0rc4.post1 branch (#9945)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
|
2025-12-12 09:10:32 -08:00 |
|
zackyoray
|
d5b9ad91c9
|
[None][feat] Upgrade NIXL to v0.8.0 (#9707)
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
Signed-off-by: zackyoray
Signed-off-by: Bo Deng
Co-authored-by: Bo Deng
|
2025-12-12 20:21:10 +08:00 |
|
yuanjingx87
|
eeb03f314a
|
[None][infra] Replace the deprecated github token (#9915)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
|
2025-12-11 22:46:14 -08:00 |
|
Chuang Zhu
|
bd441e9822
|
[None][infra] revert ucx to 1.19 (#9936)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
|
2025-12-12 11:37:19 +08:00 |
|
Yiteng Niu
|
3e39afea9a
|
[None][infra] update nspect version for api change (#9899)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
|
2025-12-12 11:27:42 +08:00 |
|
Yiqing Yan
|
5065b60cd1
|
[None][infra] Fix mergeWaiveList stage (#9892)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-12 11:19:42 +08:00 |
|
Chuang Zhu
|
4670e0c297
|
[None][infra] update ucx to 1.20 (#9786)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
|
2025-12-12 09:49:46 +08:00 |
|
Matt Lefebvre
|
df1adfbb50
|
[TRTINFRA-7328][infra] - Move half B200 tests to lbd (#9853)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
|
2025-12-10 04:24:30 -08:00 |
|
Matt Lefebvre
|
8fefa2c9d1
|
[None][infra] Fail fast if SLURM entrypoint fails (#9744)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
|
2025-12-10 02:31:29 -08:00 |
|
Guoming Zhang
|
12693a526b
|
[None][chore] Enable L0 multi-gpus testing for Qwen3-next (#9789)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
|
2025-12-10 17:11:32 +08:00 |
|
Zhanrui Sun
|
49fe089470
|
[TRTLLM-9811][infra] Update urllib3 version >= 2.6.0 to fix high vulnerability issue (#9823)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
|
2025-12-10 00:18:11 -08:00 |
|
Matt Lefebvre
|
5de4e3f621
|
[TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts (#9600)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
|
2025-12-09 13:34:09 -08:00 |
|
Yiqing Yan
|
2ddcb45b2a
|
[None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically (#9829)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-09 16:34:17 +08:00 |
|
Shi Xiaowei
|
b050804b63
|
[TRTLLM-6537][infra] extend multi-gpu tests related file list (#9614)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
|
2025-12-09 12:54:53 +08:00 |
|
Jhao-Ting Chen
|
0a09465089
|
[https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model (#8383)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
|
2025-12-08 11:16:05 -08:00 |
|
Zheng Duan
|
e7395c6607
|
[None][infra] update mooncake in docker images (#9584)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
|
2025-12-08 16:56:40 +08:00 |
|
chenfeiz0326
|
383178c00a
|
[TRTLLM-9000][feat] Add multi-node Perf Tests into CI (#8800)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
|
2025-12-08 09:00:44 +08:00 |
|
Yanchao Lu
|
f59d64e6c7
|
[None][fix] Several minor fixes to CI setting (#9765)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-12-07 23:07:59 +08:00 |
|
Yiqing Yan
|
e834f04238
|
[TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue (#9692)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-05 10:18:31 +08:00 |
|
Yiqing Yan
|
731b2eb4ef
|
[TRTLLM-5312][infra] Add triton trigger rules (#6440)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-05 07:35:04 +08:00 |
|
zackyoray
|
398d24232d
|
[None][feat] Add NIXL-LIBFABRIC support (#9225)
Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com>
Signed-off-by: zackyoray <yorayz@nvidia.com>
|
2025-12-04 15:38:06 +08:00 |
|
Yiqing Yan
|
47f650ca13
|
[TRTLLM-5093][infra] Write env variables to a file in the interactive debug session (#6792)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-04 11:41:27 +08:00 |
|
Yiqing Yan
|
e31142202e
|
[TRTLLM-7181][infra] Generate test results when pytest timeout happens (#9396)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-04 10:05:38 +08:00 |
|
Yiqing Yan
|
8c88454fa5
|
[TRTLLM-7101][infra] Reuse passed tests (#6894)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-12-03 10:07:23 +08:00 |
|
Chang Liu
|
73a543d78f
|
[None][fix] Extract GPU count from single-node stage names (#9599)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
|
2025-12-02 20:58:16 +08:00 |
|
Eran Geva
|
1a46bb0d18
|
Lock the gpu clocks in L0 perf tests (#9585)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
|
2025-12-02 18:13:45 +08:00 |
|
Emma Qiao
|
b024040df0
|
[None][infra] Update the pytest options after MI (#9579)
Signed-off-by: qqiao <qqiao@nvidia.com>
|
2025-12-02 00:11:30 +08:00 |
|
Yiqing Yan
|
c72919980a
|
[TRTLLM-6768][infra] Fix params for not updating github status (#6747)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-12-01 23:51:21 +08:00 |
|
Yanchao Lu
|
078d3a576e
|
[None][ci] Minor change for Slurm scripts (#9561)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-12-01 22:52:08 +08:00 |
|
Martin Marciniszyn Mehringer
|
974ad56515
|
[None][chore] reduce the layers of the devel docker image (#9077)
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
|
2025-12-01 03:56:30 -08:00 |
|
Enwei Zhu
|
34e2fa5c96
|
[https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
|
2025-12-01 09:10:21 +08:00 |
|
Yanchao Lu
|
694b60d92d
|
[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (#9559)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-11-30 21:14:18 +08:00 |
|
Yanchao Lu
|
0398875d55
|
[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (#9558)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-11-30 20:27:13 +08:00 |
|
Yanchao Lu
|
f03641808b
|
[None][infra] - Request idle time exemption for OCI jobs (#9528)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-11-30 13:34:09 +08:00 |
|
Zhanrui Sun
|
930cdad054
|
[TRTLLM-9541][infra] Use artifactory mirror for download.pytorch.org (#9477)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-11-28 18:31:50 +08:00 |
|
Emma Qiao
|
658d9fc0c5
|
[TRTLLM-8970][infra] Fix generate report when has isolation test result (#8861)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
|
2025-11-28 11:26:06 +08:00 |
|
Yiqing Yan
|
1c9158fde3
|
[TRTLLM-7288][infra] Download merged waive list in slurm script (#8999)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-11-27 21:48:40 +08:00 |
|
yuanjingx87
|
3ada0bfc65
|
[None][infra] Fix Slurm job script (#9508)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
|
2025-11-27 16:41:01 +08:00 |
|
Emma Qiao
|
a21be43677
|
[TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin (#9405)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
|
2025-11-27 15:42:38 +08:00 |
|
Jiagan Cheng
|
14762e0287
|
[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning (#9294)
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
|
2025-11-27 12:22:01 +08:00 |
|
yuanjingx87
|
356f67c1cb
|
[None][infra] Fail the pipeline when slurm ssh dropped (#9157)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
|
2025-11-26 09:35:04 -08:00 |
|
Yanchao Lu
|
ff02e0f05c
|
[None][ci] Move more test stages to use OCI machines (#9395)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Matt Lefebvre <matthewelefebvre@gmail.com>
|
2025-11-25 15:59:13 +08:00 |
|
Matt Lefebvre
|
fefa02fa95
|
[TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port (#9313)
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
|
2025-11-21 18:58:00 -08:00 |
|
Yiqing Yan
|
2a27166b59
|
[TRTLLM-9183][infra] Add --waives-file in rerun pytest command (#8971)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
|
2025-11-21 13:40:45 +08:00 |
|