229 Commits

Author SHA1 Message Date
Xiwen Yu
2e61526d12 fix
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-10 10:34:18 +08:00
Xiwen Yu
5f508b7d43 Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-10 07:46:25 +08:00
QI JUN
a0e1604898
[None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline (#7629)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-09 11:06:32 -04:00
Xiwen Yu
a8b630f178 Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-09 14:34:27 +08:00
Zhanrui Sun
7a62df5f0b
[TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to dated (#5980)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-09 02:15:39 -04:00
Tomer Shmilovich
ecc0e687c6
[None][feat] Nixl support for GDS (#5488)
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
Signed-off-by: Guy Lev <glev@nvidia.com>
Co-authored-by: Guy Lev <glev@nvidia.com>
2025-09-09 13:00:38 +08:00
Zhanrui Sun
b573e07f3e
[None][infra] Disable CU12 build to save build time (cost > 5 hours on SBSA) (#7633)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-09 11:38:34 +08:00
Yiqing Yan
5c616da2fd
[TRTLLM-5877][infra] Add fmha tests and auto trigger rules (#6050)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-09 11:33:09 +08:00
yuanjingx87
1d243a8503
[None][infra] Try to fix docker container failed to be killed issue (#7388)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-08 11:28:01 -07:00
Xiwen Yu
4cf9fed1e7 Merge commit 'ed27a72bcf71f7ab0e7137f7999988c9de82386f' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 21:58:43 +08:00
Emma Qiao
dd9627d9f9
[None][infra] Add back rtx-pro-6000 stages since the node is available (#7601)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-08 05:45:11 -04:00
Yanchao Lu
ed27a72bcf [None][ci] Fix a typo in the Slurm command
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 17:07:09 +08:00
Xiwen Yu
fdaf4e2985 Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 15:14:54 +08:00
Xiwen Yu
d4d9e778a1 reset build memory
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 12:04:30 +08:00
Xiwen Yu
caea58aba4 increase build memory
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 11:28:39 +08:00
Xiwen Yu
d42201e235 remove waivers and cleanup
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 10:24:52 +08:00
Xiwen Yu
77657de972 fix build args
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 09:52:41 +08:00
BatshevaBlack
7c76dde76d
[TRTLLM-7187][fix] Build wheel with NIXL (#7472)
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-09-07 19:05:37 -04:00
Yanchao Lu
045d2cf761
[None][ci] Block some nodes to avoid unstable network access (#7593)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 00:25:38 +08:00
Emma Qiao
5c4711fb2b
[None][infra] Skip RTX Pro 6000 test stages due to HW are offline (#7592)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 09:49:06 -04:00
Emma Qiao
aea8ac1649
[TRTLLM-5950][infra] Removing remaining turtle keywords from the code base (#7086)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 14:26:18 +08:00
Xiwen Yu
322db710dc Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-06 23:58:04 +08:00
Xiwen Yu
d12eb4b2cc fix CI build archs
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-06 18:41:28 +08:00
Yanchao Lu
caf9b9cd42
[None][ci] Improve SSH connection stability (#7567)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-06 17:08:19 +08:00
Xiwen Yu
2c3f4cbeee Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 15:53:43 +08:00
Zhanrui Sun
5ca3376d6f Support DLFW sanity check use CU13 image
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-09-05 00:04:22 -07:00
Yiteng Niu
163b1fc84f
[None][infra] update nspect version (#7552)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-09-05 14:59:22 +08:00
Yanchao Lu
4195010e13
[None][ci] Increase the number of retries in docker image generation (#7557)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-05 14:47:14 +08:00
Zhanrui Sun
0de3f83805
[TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage (#6729)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 07:20:15 -04:00
Yanchao Lu
c622f61609
[None][fix] Fix a typo in the Slurm CI codes (#7485)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 01:56:27 -04:00
Emma Qiao
931816fee1
[TRTLLM-6199][infra] Update for using open driver from BSL (#7430)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-04 11:47:40 +08:00
Yanchao Lu
a07bb163f7
[None][ci] Correct docker args for GPU devices and remove some stale CI codes (#7417)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-02 04:06:51 -04:00
Yiqing Yan
ff2439ff48
[None][infra] Using local variables in rerun function (#7198)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-02 13:55:26 +08:00
Xiwen Yu
62a78973a8 Merge remote-tracking branch 'origin/main' into user/xiweny/merge_0901
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-02 10:12:30 +08:00
yuanjingx87
2b286ae613
[None][infra] Disable GB200-PyTorch-1 due to OOM issue (#7386)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-01 01:56:31 -04:00
Xiwen Yu
38ef850552 Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_0901
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-01 11:46:44 +08:00
Yanchao Lu
c5148f52d5
[None][ci] Some improvements for Slurm CI setup (#7407)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-01 10:57:36 +08:00
Pengbo Wang @ NVIDIA
62459d533d
[None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss (#7192)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-29 17:03:46 +08:00
Yiqing Yan
3c06303542 [TRTLLM-7755][infra] Add DGX_B300 and GB300 tests in CI
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-28 22:45:00 -07:00
Yanchao Lu
460a34c671
[None][chore] Some improvements for CI stability (#7199)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-28 16:19:20 -04:00
Zhanrui Sun
ee37589c8c infra: update DLFW 25.08 GA, triton 25.08 GA
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-08-27 20:17:56 -07:00
Martin Marciniszyn Mehringer
7cfa475e05
[None][fix] Remove the wheel from intermediate docker storage (#7175)
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-08-27 11:32:17 -04:00
QI JUN
baef70e67e
[None][ci] move qwen3 tests from b200 to gb200 (#7257)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-26 11:50:53 -04:00
Emma Qiao
a142c0c4de
[None][infra] Add retry 3 times if ssh cluster failed (#6859)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-26 05:11:50 -04:00
Xiwen Yu
ab7febd4d8 Merge commit '31979aefacbf80d2742c98ef30385db162788c84' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-26 10:31:35 +08:00
Yiqing Yan
486bc763c3
[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge (#7074)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:09:04 -04:00
Robin Kobus
31979aefac
[None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests (#6754)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-24 20:53:17 +02:00
Yanchao Lu
ec35481b0a
[None][infra] Prepare for single GPU GB200 test pipeline (#7073)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:46:39 +08:00
Xiwen Yu
808059da34 Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 16:13:30 +08:00
Xiwen Yu
b7cc06cd6a disable merge waive list stage
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:57 +08:00