Xiwen Yu
2c3f4cbeee
Merge remote-tracking branch 'origin/main' into feat/b300_cu13
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 15:53:43 +08:00
Zhanrui Sun
5ca3376d6f
Support DLFW sanity check use CU13 image
...
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-09-05 00:04:22 -07:00
Yiteng Niu
163b1fc84f
[None][infra] update nspect version (#7552)
...
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-09-05 14:59:22 +08:00
Yanchao Lu
4195010e13
[None][ci] Increase the number of retries in docker image generation (#7557)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-05 14:47:14 +08:00
Zhanrui Sun
0de3f83805
[TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage (#6729)
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 07:20:15 -04:00
Yanchao Lu
c622f61609
[None][fix] Fix a typo in the Slurm CI codes (#7485)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 01:56:27 -04:00
Emma Qiao
931816fee1
[TRTLLM-6199][infra] Update for using open driver from BSL (#7430)
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-04 11:47:40 +08:00
Yanchao Lu
a07bb163f7
[None][ci] Correct docker args for GPU devices and remove some stale CI codes (#7417)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-02 04:06:51 -04:00
Yiqing Yan
ff2439ff48
[None][infra] Using local variables in rerun function (#7198)
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-02 13:55:26 +08:00
Xiwen Yu
62a78973a8
Merge remote-tracking branch 'origin/main' into user/xiweny/merge_0901
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-02 10:12:30 +08:00
yuanjingx87
2b286ae613
[None][infra] Disable GB200-PyTorch-1 due to OOM issue (#7386)
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-01 01:56:31 -04:00
Xiwen Yu
38ef850552
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_0901
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-01 11:46:44 +08:00
Yanchao Lu
c5148f52d5
[None][ci] Some improvements for Slurm CI setup (#7407)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-01 10:57:36 +08:00
Pengbo Wang @ NVIDIA
62459d533d
[None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss (#7192)
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-29 17:03:46 +08:00
Yiqing Yan
3c06303542
[TRTLLM-7755][infra] Add DGX_B300 and GB300 tests in CI
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-28 22:45:00 -07:00
Yanchao Lu
460a34c671
[None][chore] Some improvements for CI stability (#7199)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-28 16:19:20 -04:00
Zhanrui Sun
ee37589c8c
infra: update DLFW 25.08 GA, triton 25.08 GA
...
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-08-27 20:17:56 -07:00
Martin Marciniszyn Mehringer
7cfa475e05
[None][fix] Remove the wheel from intermediate docker storage (#7175)
...
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-08-27 11:32:17 -04:00
QI JUN
baef70e67e
[None][ci] move qwen3 tests from b200 to gb200 (#7257)
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-26 11:50:53 -04:00
Emma Qiao
a142c0c4de
[None][infra] Add retry 3 times if ssh cluster failed (#6859)
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-26 05:11:50 -04:00
Xiwen Yu
ab7febd4d8
Merge commit '31979aefacbf80d2742c98ef30385db162788c84' into feat/b300_cu13
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-26 10:31:35 +08:00
Yiqing Yan
486bc763c3
[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge (#7074)
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:09:04 -04:00
Robin Kobus
31979aefac
[None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests (#6754)
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-24 20:53:17 +02:00
Yanchao Lu
ec35481b0a
[None][infra] Prepare for single GPU GB200 test pipeline (#7073)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:46:39 +08:00
Xiwen Yu
808059da34
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 16:13:30 +08:00
Xiwen Yu
b7cc06cd6a
disable merge waive list stage
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:57 +08:00
Xiwen Yu
f4de8840ec
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-23 15:17:48 +08:00
QI JUN
1388e84793
[None][ci] move all B200 TensorRT test cases to post merge (#7165)
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-22 06:47:23 -04:00
Linda
898f37faa0
[None][feat] Enable nanobind as the default binding library (#6608)
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-22 09:48:41 +02:00
Emma Qiao
a49cf684f8
[TRTLLM-5801][infra] Add more RTX Pro 6000 test stages (#5126)
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-22 03:12:02 -04:00
Yuan Tong
90bfc8cc29
[https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL (#6984)
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-21 17:58:30 +08:00
BatshevaBlack
9f51f8d20c
[None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 (#7024)
...
Signed-off-by: Batsheva Black <132911331+BatshevaBlack@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
2025-08-20 22:49:55 -04:00
QI JUN
a918de710a
[None][ci] move some tests of b200 to post merge (#7093)
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-20 19:43:40 -04:00
Xiwen Yu
8b532363ce
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_main_0819
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-19 17:02:34 +08:00
Fanrong Li
816a120af6
[TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-08-19 00:03:03 -04:00
Yanchao Lu
d1d17dbeba
[None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) (#7005)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-19 01:35:30 +08:00
Zhanrui Sun
8c998533af
infra: Support build for both CU12 and CU13
...
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-08-17 22:16:57 -07:00
Xiwen Yu
0bf6a18627
Fix and waive to clean L0
...
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
2025-08-15 04:37:43 -07:00
Yanchao Lu
3a987891d8
[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures (#6836)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-15 11:16:07 +08:00
Wanli Jiang
9a133e9b41
[https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 (#6501)
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-15 11:10:59 +08:00
Zhanrui Sun
ebec4ea5ee
infra: upgrade to DLFW 25.08-pre and TRT 10.13.2.4
...
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
2025-08-11 19:27:09 -07:00
Yiqing Yan
62d6c98d68
[TRTLLM-5633][infra] Force set changed file diff to empty string for post-merge CI (#6777)
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-11 02:38:05 -04:00
Yiqing Yan
3e41e6c077
[TRTLLM-6892][infra] Run guardwords scan first in Release Check stage (#6659)
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 23:00:15 -04:00
Xiwen Yu
97a3788dcf
update triton image
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-07 00:16:24 +08:00
Yanchao Lu
b7347ce7d1
[https://nvbugs/5433581][fix] Revert deep_gemm installation workaround for SBSA (#6666)
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 18:50:53 +08:00
Yiqing Yan
98424f3186
[TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list (#6605)
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 06:19:03 -04:00
Xiwen Yu
303604f82d
upgrade to base image and new TRT, fix many dependency issues
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-06 14:24:37 +08:00
Zhanrui Sun
6a9b4b11be
[https://nvbugs/5433581][infra] Temporarily disable Docker Image use wheel from build stage (#6630)
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-05 09:33:11 -04:00
Emma Qiao
78a75c2990
[None][Infra] - Split gb200 stages for each test (#6594)
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-05 07:10:00 -04:00
Zhanrui Sun
7cbe30e17d
[TRTLLM-6893][infra] fix Build Docker Image tag issue (#6555)
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-05 04:33:36 -04:00