Yanchao Lu
2cb5b9f31b
[None][ci] Increase the number of retries in docker image generation ( #7557 )
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-06 18:16:36 +08:00
Yuxian Qiu
559762f185
[ https://nvbugs/5448754 ][fix] Download HF model for all nodes. ( #6824 )
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-01 16:00:43 +08:00
HuiGao-NV
df80b1e128
[ https://nvbugs/5473789 ][bug] install cuda-toolkit to fix sanity check ( #7159 )
Signed-off-by: Hui Gao <huig@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-26 18:51:21 +08:00
Yanchao Lu
6fda8ddac9
[None][infra] Cherry-pick #6836 from main branch and improve SSH connection ( #6971 )
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-19 01:11:11 +08:00
Yanchao Lu
c39454c617
[None][infra] Avoid intermittent access broken to nvcr.io ( #6715 )
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-12 11:48:59 +08:00
Yiqing Yan
3e41e6c077
[TRTLLM-6892][infra] Run guardwords scan first in Release Check stage ( #6659 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 23:00:15 -04:00
Yanchao Lu
b7347ce7d1
[ https://nvbugs/5433581 ][fix] Revert deep_gemm installation workaround for SBSA ( #6666 )
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 18:50:53 +08:00
Yiqing Yan
98424f3186
[TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list ( #6605 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 06:19:03 -04:00
Zhanrui Sun
6a9b4b11be
[ https://nvbugs/5433581 ][infra] Temporarily disable Docker Image use wheel from build stage ( #6630 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-05 09:33:11 -04:00
Emma Qiao
78a75c2990
[None][Infra] - Split gb200 stages for each test ( #6594 )
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-05 07:10:00 -04:00
Zhanrui Sun
7cbe30e17d
[TRTLLM-6893][infra] fix Build Docker Image tag issue ( #6555 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-05 04:33:36 -04:00
Chuang Zhu
4d040b50b7
[None][chore] ucx establish connection with zmq ( #6090 )
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-05 02:50:45 -04:00
Yanchao Lu
d53cc2374b
[ https://nvbugs/5433581 ][infra] Update install docs and CI script for SBSA deep_gemm workaround ( #6607 )
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-04 23:36:38 -04:00
Yiqing Yan
4763e94156
[TRTLLM-5563][infra] Move test_rerun.py to script folder ( #6571 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-04 13:26:04 +08:00
Yiqing Yan
3f7abf87bc
[TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 ( #5678 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-03 11:18:59 +08:00
Yiqing Yan
d38c26bb78
[Infra][TRTLLM-5633] - Fix merge waive list ( #6504 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-31 14:57:51 +08:00
Yiqing Yan
0cf2f6f154
[TRTLLM-5633] - Merge current waive list with the TOT waive list ( #5198 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-30 17:50:05 +08:00
Zhanrui Sun
c3729dbd7d
infra: [TRTLLM-5873] Use build stage wheels to speed up docker release image build ( #4939 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-29 12:54:38 -04:00
Zhanrui Sun
64ba483656
infra: [TRTLLM-6499] Split L0_Test into two pipeline by single GPU and multi GPU(For SBSA) ( #6132 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-28 22:54:37 -04:00
yuanjingx87
608ed89f96
[None][infra]Update slurm config keys ( #6370 )
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-07-28 11:56:37 -07:00
Yiqing Yan
d97419805b
[TRTLLM-5312] - Add bot run rules for triton tests ( #4988 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-25 10:31:12 +08:00
yuanjingx87
ef4878db05
set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only ( #6234 )
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-07-22 11:27:54 -07:00
Lizhi Zhou
3e1a0fbac4
[TRTLLM-6537][infra] extend multi-gpu tests related file list ( #6139 )
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-07-22 16:57:06 +08:00
Yi Zhang
f9b0a911fb
test: Enable GB200 torch compile multi gpu tests ( #6145 )
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-21 22:17:13 +08:00
Zhanrui Sun
3cbc23f783
infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) ( #4656 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-21 16:06:43 +08:00
Linda
3efad2e58c
feat: nanobind bindings ( #6185 )
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
Venky
22d4a8c48a
enh: Add script to map tests <-> jenkins stages & vice-versa ( #5177 )
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-19 00:50:40 +08:00
Zhanrui Sun
8454640ee1
infra: fix single-GPU stage failed will not raise error ( #6165 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-18 22:39:32 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
ixlmar
d71c6fe526
[fix] Update jenkins container images ( #6094 )
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-17 16:22:25 +01:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Emma Qiao
1cc49494fe
[Infra] - Add waive list for pytest when using slurm ( #6130 )
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-17 16:53:15 +08:00
QI JUN
e821c68611
CI: update multi gpu test trigger file list ( #6131 )
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-17 14:48:23 +08:00
Zhanrui Sun
4c364b9a73
infra: fix SBSA test stage ( #6113 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-17 11:56:03 +08:00
Zhanrui Sun
e42f5a9581
infra: [TRTLLM-5879] Split single GPU test and multi GPU test into 2 pipelines ( #5199 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-16 18:04:04 +08:00
Bo Deng
ec3ebae43e
[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 ( #5991 )
Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-16 13:54:42 +08:00
Iman Tabrizian
665b4469b3
[fix] Fix Triton build ( #6076 )
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-16 11:17:22 +08:00
Yiteng Niu
9e871ca582
[infra] add more log on reuse-uploading ( #6036 )
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-15 17:18:38 +08:00
Zhanrui Sun
d811843a08
infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… ( #5945 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-15 15:39:31 +09:00
Yiqing Yan
6b35afaf1b
[Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report ( #5672 )
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-15 12:27:21 +09:00
Zhanrui Sun
01b2def5ef
infra: [TRTLLM-6331] Support show all stage name list when stage name check failed ( #5946 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-15 12:06:03 +09:00
Alex Zhang
6c30d78b78
[TRTLLM-5653][infra] Run docs build only if PR contains only doc changes ( #5184 )
Signed-off-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-14 21:40:33 +08:00
Zhanrui Sun
3a0ef73414
infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check ( #5709 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-14 18:52:13 +09:00
Yi Zhang
e5e87ecf34
test: Move some of the test from post merge to pre-merge, update dgx b200 test case ( #5640 )
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Zhanrui Sun
67a39dbd63
infra: [TRTLLM-6054][TRTLLM-5804] Fix two known NSPECT high vulnerability issues and reduce image size ( #5434 )
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-10 23:24:46 +09:00
ixlmar
10e686466e
fix: use current_image_tags.properties in rename_docker_images.py ( #5846 )
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-09 17:07:52 +09:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell ( #5563 )
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Yiteng Niu
3079e8cf0c
[TRTLLM-5878] update nspect version ( #5832 )
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-07-08 22:00:09 +08:00
Tailing Yuan
035155df7c
Fix: ignore nvshmem_src_*.txz from confidentiality-scan ( #5831 )
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-08 17:17:29 +09:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building ( #5534 )
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00