Commit Graph

167 Commits

Author SHA1 Message Date
Yiteng Niu
88d1bde4d3
[None][infra] update nspect version (#7552)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-09-06 18:16:55 +08:00
Yanchao Lu
2cb5b9f31b
[None][ci] Increase the number of retries in docker image generation (#7557)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-06 18:16:36 +08:00
Yuxian Qiu
559762f185
[https://nvbugs/5448754][fix] Download HF model for all nodes. (#6824)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-01 16:00:43 +08:00
HuiGao-NV
df80b1e128
[https://nvbugs/5473789][bug] install cuda-toolkit to fix sanity check (#7159)
Signed-off-by: Hui Gao <huig@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-26 18:51:21 +08:00
Yanchao Lu
6fda8ddac9
[None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-19 01:11:11 +08:00
Yanchao Lu
c39454c617
[None][infra] Avoid intermittent access broken to nvcr.io (#6715)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-12 11:48:59 +08:00
Yiqing Yan
3e41e6c077
[TRTLLM-6892][infra] Run guardwords scan first in Release Check stage (#6659)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 23:00:15 -04:00
Yanchao Lu
b7347ce7d1
[https://nvbugs/5433581][fix] Revert deep_gemm installation workaround for SBSA (#6666)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 18:50:53 +08:00
Yiqing Yan
98424f3186
[TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list (#6605)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 06:19:03 -04:00
Zhanrui Sun
6a9b4b11be
[https://nvbugs/5433581][infra] Temporarily disable Docker Image use wheel from build stage (#6630)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-05 09:33:11 -04:00
Emma Qiao
78a75c2990
[None][Infra] - Split gb200 stages for each test (#6594)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-05 07:10:00 -04:00
Zhanrui Sun
7cbe30e17d
[TRTLLM-6893][infra] fix Build Docker Image tag issue (#6555)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-05 04:33:36 -04:00
Chuang Zhu
4d040b50b7
[None][chore] ucx establish connection with zmq (#6090)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-05 02:50:45 -04:00
Yanchao Lu
d53cc2374b
[https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround (#6607)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-04 23:36:38 -04:00
Yiqing Yan
4763e94156
[TRTLLM-5563][infra] Move test_rerun.py to script folder (#6571)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-04 13:26:04 +08:00
Yiqing Yan
3f7abf87bc
[TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 (#5678)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-03 11:18:59 +08:00
Yiqing Yan
d38c26bb78
[Infra][TRTLLM-5633] - Fix merge waive list (#6504)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-31 14:57:51 +08:00
Yiqing Yan
0cf2f6f154
[TRTLLM-5633] - Merge current waive list with the TOT waive list (#5198)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-30 17:50:05 +08:00
Zhanrui Sun
c3729dbd7d
infra: [TRTLLM-5873] Use build stage wheels to speed up docker release image build (#4939)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-29 12:54:38 -04:00
Zhanrui Sun
64ba483656
infra: [TRTLLM-6499] Split L0_Test into two pipeline by single GPU and multi GPU(For SBSA) (#6132)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-28 22:54:37 -04:00
yuanjingx87
608ed89f96
[None][infra]Update slurm config keys (#6370)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-07-28 11:56:37 -07:00
Yiqing Yan
d97419805b
[TRTLLM-5312] - Add bot run rules for triton tests (#4988)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-25 10:31:12 +08:00
yuanjingx87
ef4878db05
set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only (#6234)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-07-22 11:27:54 -07:00
Lizhi Zhou
3e1a0fbac4
[TRTLLM-6537][infra] extend multi-gpu tests related file list (#6139)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-07-22 16:57:06 +08:00
Yi Zhang
f9b0a911fb
test: Enable GB200 torch compile multi gpu tests (#6145)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-21 22:17:13 +08:00
Zhanrui Sun
3cbc23f783
infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) (#4656)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-21 16:06:43 +08:00
Linda
3efad2e58c
feat: nanobind bindings (#6185)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
Venky
22d4a8c48a
enh: Add script to map tests <-> jenkins stages & vice-versa (#5177)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-19 00:50:40 +08:00
Zhanrui Sun
8454640ee1
infra: fix single-GPU stage failed will not raise error (#6165)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-18 22:39:32 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
ixlmar
d71c6fe526
[fix] Update jenkins container images (#6094)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-17 16:22:25 +01:00
Linda
5bff317abf
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Emma Qiao
1cc49494fe
[Infra] - Add wiave list for pytest when using slurm (#6130)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-17 16:53:15 +08:00
QI JUN
e821c68611
CI: update multi gpu test trigger file list (#6131)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-17 14:48:23 +08:00
Zhanrui Sun
4c364b9a73
infra: fix SBSA test stage (#6113)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-17 11:56:03 +08:00
Zhanrui Sun
e42f5a9581
infra: [TRTLLM-5879] Spilt single GPU test and multi GPU test into 2 pipelines (#5199)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-16 18:04:04 +08:00
Bo Deng
ec3ebae43e
[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 (#5991)
Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-16 13:54:42 +08:00
Iman Tabrizian
665b4469b3
[fix] Fix Triton build (#6076)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-16 11:17:22 +08:00
Yiteng Niu
9e871ca582
[infra] add more log on reuse-uploading (#6036)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-15 17:18:38 +08:00
Zhanrui Sun
d811843a08
infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… (#5945)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-15 15:39:31 +09:00
Yiqing Yan
6b35afaf1b
[Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report (#5672)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-15 12:27:21 +09:00
Zhanrui Sun
01b2def5ef
infra: [TRTLLM-6331] Support show all stage name list when stage name check failed (#5946)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-15 12:06:03 +09:00
Alex Zhang
6c30d78b78
[TRTLLM-5653][infra] Run docs build only if PR contains only doc changes (#5184)
Signed-off-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-14 21:40:33 +08:00
Zhanrui Sun
3a0ef73414
infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check (#5709)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-14 18:52:13 +09:00
Yi Zhang
e5e87ecf34
test: Move some of the test from post merge to pre-merge, update dgx b200 test case (#5640)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Zhanrui Sun
67a39dbd63
infra: [TRTLLM-6054][TRTLLM-5804] Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-10 23:24:46 +09:00
ixlmar
10e686466e
fix: use current_image_tags.properties in rename_docker_images.py (#5846)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-09 17:07:52 +09:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell (#5563)
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Yiteng Niu
3079e8cf0c
[TRTLLM-5878] update nspect version (#5832)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-07-08 22:00:09 +08:00
Tailing Yuan
035155df7c
Fix: ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-08 17:17:29 +09:00