Commit Graph

219 Commits

Author SHA1 Message Date
Xiwen Yu
4cf9fed1e7 Merge commit 'ed27a72bcf71f7ab0e7137f7999988c9de82386f' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 21:58:43 +08:00
Yanchao Lu
ed27a72bcf [None][ci] Fix a typo in the Slurm command
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 17:07:09 +08:00
Xiwen Yu
fdaf4e2985 Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 15:14:54 +08:00
Xiwen Yu
d4d9e778a1 reset build memory
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 12:04:30 +08:00
Xiwen Yu
caea58aba4 increase build memory
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 11:28:39 +08:00
Xiwen Yu
d42201e235 remove waivers and cleanup
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 10:24:52 +08:00
Xiwen Yu
77657de972 fix build args
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-08 09:52:41 +08:00
BatshevaBlack
7c76dde76d [TRTLLM-7187][fix] Build wheel with NIXL (#7472)
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-09-07 19:05:37 -04:00
Yanchao Lu
045d2cf761 [None][ci] Block some nodes to avoid unstable network access (#7593)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 00:25:38 +08:00
Emma Qiao
5c4711fb2b [None][infra] Skip RTX Pro 6000 test stages due to HW are offline (#7592)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 09:49:06 -04:00
Emma Qiao
aea8ac1649 [TRTLLM-5950][infra] Removing remaining turtle keywords from the code base (#7086)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 14:26:18 +08:00
Xiwen Yu
322db710dc Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-06 23:58:04 +08:00
Xiwen Yu
d12eb4b2cc fix CI build archs
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-06 18:41:28 +08:00
Yanchao Lu
caf9b9cd42 [None][ci] Improve SSH connection stability (#7567)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-06 17:08:19 +08:00
Xiwen Yu
2c3f4cbeee Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 15:53:43 +08:00
Zhanrui Sun
5ca3376d6f Support DLFW sanity check use CU13 image
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-09-05 00:04:22 -07:00
Yiteng Niu
163b1fc84f [None][infra] update nspect version (#7552)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-09-05 14:59:22 +08:00
Yanchao Lu
4195010e13 [None][ci] Increase the number of retries in docker image generation (#7557)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-05 14:47:14 +08:00
Zhanrui Sun
0de3f83805 [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage (#6729)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 07:20:15 -04:00
Yanchao Lu
c622f61609 [None][fix] Fix a typo in the Slurm CI codes (#7485)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 01:56:27 -04:00
Emma Qiao
931816fee1 [TRTLLM-6199][infra] Update for using open driver from BSL (#7430)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-04 11:47:40 +08:00
Yanchao Lu
a07bb163f7 [None][ci] Correct docker args for GPU devices and remove some stale CI codes (#7417)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-02 04:06:51 -04:00
Yiqing Yan
ff2439ff48 [None][infra] Using local variables in rerun function (#7198)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-02 13:55:26 +08:00
Xiwen Yu
62a78973a8 Merge remote-tracking branch 'origin/main' into user/xiweny/merge_0901
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-02 10:12:30 +08:00
yuanjingx87
2b286ae613 [None][infra] Disable GB200-PyTorch-1 due to OOM issue (#7386)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-01 01:56:31 -04:00
Xiwen Yu
38ef850552 Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_0901
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-01 11:46:44 +08:00
Yanchao Lu
c5148f52d5 [None][ci] Some improvements for Slurm CI setup (#7407)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-01 10:57:36 +08:00
Pengbo Wang @ NVIDIA
62459d533d [None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss (#7192)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-29 17:03:46 +08:00
Yiqing Yan
3c06303542 [TRTLLM-7755][infra] Add DGX_B300 and GB300 tests in CI
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-28 22:45:00 -07:00
Yanchao Lu
460a34c671 [None][chore] Some improvements for CI stability (#7199)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-28 16:19:20 -04:00
Zhanrui Sun
ee37589c8c infra: update DLFW 25.08 GA, triton 25.08 GA
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-08-27 20:17:56 -07:00
Martin Marciniszyn Mehringer
7cfa475e05 [None][fix] Remove the wheel from intermediate docker storage (#7175)
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-08-27 11:32:17 -04:00
QI JUN
baef70e67e [None][ci] move qwen3 tests from b200 to gb200 (#7257)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-26 11:50:53 -04:00
Emma Qiao
a142c0c4de [None][infra] Add retry 3 times if ssh cluster failed (#6859)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-26 05:11:50 -04:00
Xiwen Yu
ab7febd4d8 Merge commit '31979aefacbf80d2742c98ef30385db162788c84' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-26 10:31:35 +08:00
Yiqing Yan
486bc763c3 [None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge (#7074)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:09:04 -04:00
Robin Kobus
31979aefac [None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests (#6754)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-24 20:53:17 +02:00
Yanchao Lu
ec35481b0a [None][infra] Prepare for single GPU GB200 test pipeline (#7073)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:46:39 +08:00
Xiwen Yu
808059da34 Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 16:13:30 +08:00
Xiwen Yu
b7cc06cd6a disable merge waive list stage
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:57 +08:00
Xiwen Yu
f4de8840ec Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:48 +08:00
QI JUN
1388e84793 [None][ci] move all B200 TensorRT test cases to post merge (#7165)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-22 06:47:23 -04:00
Linda
898f37faa0 [None][feat] Enable nanobind as the default binding library (#6608)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-22 09:48:41 +02:00
Emma Qiao
a49cf684f8 [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages (#5126)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-22 03:12:02 -04:00
Yuan Tong
90bfc8cc29 [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL (#6984)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-21 17:58:30 +08:00
BatshevaBlack
9f51f8d20c [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 (#7024)
Signed-off-by: Batsheva Black <132911331+BatshevaBlack@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
2025-08-20 22:49:55 -04:00
QI JUN
a918de710a [None][ci] move some tests of b200 to post merge (#7093)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-20 19:43:40 -04:00
Xiwen Yu
8b532363ce Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-19 17:02:34 +08:00
Fanrong Li
816a120af6 [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-08-19 00:03:03 -04:00
Yanchao Lu
d1d17dbeba [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) (#7005)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-19 01:35:30 +08:00