Emma Qiao
b024040df0
[None][infra] Update the pytest options after MI ( #9579 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-12-02 00:11:30 +08:00
Yiqing Yan
c72919980a
[TRTLLM-6768][infra] Fix params for not updating github status ( #6747 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-12-01 23:51:21 +08:00
Yanchao Lu
078d3a576e
[None][ci] Minor change for Slurm scripts ( #9561 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-12-01 22:52:08 +08:00
Martin Marciniszyn Mehringer
974ad56515
[None][chore] reduce the layers of the devel docker image ( #9077 )
...
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-12-01 03:56:30 -08:00
Enwei Zhu
34e2fa5c96
[https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL ( #9530 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-12-01 09:10:21 +08:00
Yanchao Lu
694b60d92d
[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage ( #9559 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-30 21:14:18 +08:00
Yanchao Lu
0398875d55
[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage ( #9558 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-30 20:27:13 +08:00
Yanchao Lu
f03641808b
[None][infra] - Request idle time exemption for OCI jobs ( #9528 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-30 13:34:09 +08:00
Zhanrui Sun
930cdad054
[TRTLLM-9541][infra] Use artifactory mirror for download.pytorch.org ( #9477 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-28 18:31:50 +08:00
Emma Qiao
658d9fc0c5
[TRTLLM-8970][infra] Fix generate report when has isolation test result ( #8861 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
2025-11-28 11:26:06 +08:00
Yiqing Yan
1c9158fde3
[TRTLLM-7288][infra] Download merged waive list in slurm script ( #8999 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-27 21:48:40 +08:00
yuanjingx87
3ada0bfc65
[None][infra] Fix Slurm job script ( #9508 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-11-27 16:41:01 +08:00
Emma Qiao
a21be43677
[TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin ( #9405 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-27 15:42:38 +08:00
Jiagan Cheng
14762e0287
[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning ( #9294 )
...
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-11-27 12:22:01 +08:00
yuanjingx87
356f67c1cb
[None][infra] Fail the pipeline when slurm ssh dropped ( #9157 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-11-26 09:35:04 -08:00
Yanchao Lu
ff02e0f05c
[None][ci] Move more test stages to use OCI machines ( #9395 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Matt Lefebvre <matthewelefebvre@gmail.com>
2025-11-25 15:59:13 +08:00
Matt Lefebvre
fefa02fa95
[TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port ( #9313 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-11-21 18:58:00 -08:00
Yiqing Yan
2a27166b59
[TRTLLM-9183][infra] Add --waives-file in rerun pytest command ( #8971 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-11-21 13:40:45 +08:00
Zhanrui Sun
5138ef3227
[None][infra] Add fallback when get wheel from build stage is fail ( #9290 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-11-21 13:26:20 +08:00
Simeng Liu
9286223288
[https://nvbugs/5515753][ci] Add NCCL_DEBUG=INFO flag to collect more info with CI failure. ( #8440 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-11-20 12:43:13 -05:00
Bo Deng
2128f73d58
[TRTLLM-9247][infra] Upgrade NIXL to 0.7.1 ( #9055 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
2025-11-20 11:01:02 +08:00
Kanghwan
41e5870a70
[#8476][chore] Update license ( #8807 )
...
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
2025-11-19 15:05:25 -08:00
Matt Lefebvre
470d777744
[TRTINFRA-7280][infra] Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge ( #9117 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-11-17 10:59:30 -08:00
Yiqing Yan
24f5cd7493
[TRTLLM-8000][infra] Catch error in merge waive list stage ( #7289 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-17 13:28:50 +08:00
Kaiyu Xie
04be5a704e
[None] [fix] Fix missing ActivationType issue ( #9171 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-11-17 10:43:25 +08:00
Zhanrui Sun
bdcf837784
[TRTLLM-9079][infra] upgrade tritonserver DLFW 25.10 ( #8929 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-11-14 20:22:10 -08:00
yuanjingx87
05b5336ab6
[None][infra] Lock generation pipeline update ( #9084 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-11-14 10:12:25 -08:00
Bo Deng
0b9bc5aae8
[None][infra] install mooncake in docker images ( #8447 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
Co-authored-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-11-11 13:34:27 +08:00
Emma Qiao
183778d58a
[None][infra] Waive failed tests for main 11/07 ( #9008 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-11-08 08:51:35 -08:00
Emma Qiao
2af6a537ad
[TRTLLM-8999][infra] Reduce gb200 multi-node test stages ( #8778 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
2025-11-08 06:34:24 -08:00
yuanjingx87
18a4b985f1
[None][infra] allow to choose repo when generate lock files ( #8659 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-11-05 19:06:29 -08:00
Yiteng Niu
1ce83582f9
[None][infra] update github token name ( #8907 )
2025-11-05 00:55:28 -08:00
Zhanrui Sun
4de31bece2
[TRTLLM-8994][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 ( #8838 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-04 18:59:34 +08:00
Matt Lefebvre
0f6763680a
[TRTINFRA-7215][infra] - Move half of the DGX H100 premerge tests to SLURM ( #8849 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
2025-11-04 00:11:26 +08:00
Emma Qiao
14bc8571ae
[TRTLLM-8435][infra] Test existing rtxpro6000 stages on rtxpro6000d ( #8319 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-03 05:26:17 -08:00
chenfeiz0326
cc4ab8d9d1
[TRTLLM-8825][feat] Support Pytest Perf Results uploading to Database ( #8653 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-11-03 16:23:13 +08:00
Yanchao Lu
da73410d3b
[None][fix] WAR for tensorrt depending on the archived nvidia-cuda-runtime-cu13 package ( #8857 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-11-02 09:57:37 +08:00
dongxuy04
bba2519726
[TRTLLM-7008][fix] Enable GDRCopy and unwaive online eplb tests ( #8720 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-10-31 16:39:51 -07:00
Matt Lefebvre
da2dca58aa
[TRTINFRA-7215][infra] Add support for enroot SLURM clusters ( #8770 )
...
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-10-31 12:22:21 -07:00
Zhanrui Sun
a6a3de8e35
[TRTLLM-9003][infra] Add python OpenSearchDB query / push. ( #8506 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-10-30 19:43:51 -07:00
Zhanrui Sun
547d799111
[TRTLLM-8930][infra] Force Blossom perf test stages to use 'tensorrt/test_type: perf' in the K8S template ( #8752 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-10-30 06:30:10 -07:00
yuanjingx87
e689a73c83
[None][infra] fix slurm results path ( #8751 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-10-30 13:09:46 +08:00
Bo Li
9c4432f8a4
[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. ( #7499 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-27 13:23:06 -04:00
QI JUN
cc5b8b6d28
[None][ci] move some time-consuming benchmark test cases to post merge ( #8641 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-26 22:47:17 -04:00
Yiqing Yan
602b059180
[None][chore] Disable GB300 stages due to nodes will be offline temporarily ( #8643 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-10-24 05:32:05 -04:00
yuanjingx87
e7ad5e4d6a
[None][infra] enable lfs for generateLockFile pipeline ( #8547 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-10-24 12:59:27 +08:00
Emma Qiao
ee21ea3e91
[None][infra] Disable rtxpro6000 stages due to nodes will be offline ( #8613 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-23 10:24:05 -04:00
Emma Qiao
7c1bca4563
[None][infra] Fix slurm exitcode ( #8585 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
2025-10-23 09:46:00 -04:00
Emma Qiao
2b4e812aea
[None][infra] Let CI continue running other isolation tests when an isolation test get hanging ( #8471 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-10-22 00:07:35 -04:00
chenfeiz0326
6cf1c3fba4
[TRTLLM-8260][feat] Add Server-Client Perf Test in pytest for B200 and B300 ( #7985 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-10-22 10:17:22 +08:00
Emma Qiao
c72f6d1dcc
[None][infra] Add split algorithm for slurm ( #8516 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-21 02:56:22 -04:00
QI JUN
0acd10e3de
[None][ci] rebalance H100 stages ( #8491 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-21 02:03:48 -04:00
yuanjingx87
1e3e1474c6
[TRTLLM-6055][infra] Slurm Test refactor ( #7176 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-10-20 09:46:44 -07:00
QI JUN
d05079ba4b
[None][ci] move some test cases from H100 to A10 ( #8449 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-20 01:58:34 -04:00
zhhuang-nv
7a2bab93f0
[None][test] Add post merge test for Seed-OSS-36B-Instruct ( #8321 )
...
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-10-17 02:30:33 -07:00
Yanchao Lu
e72ade33c2
[None][chore] Update commit msg for adding lock files ( #8448 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-10-17 00:24:26 -07:00
yuanjingx87
3481d03470
[None][infra] Fix for generate lockfile pipeline ( #7820 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-10-16 14:17:18 -07:00
Zhanrui Sun
19241626d0
[https://nvbugs/5563653][infra] reduce docker image layers ( #8250 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Emma Qiao
493da020c1
[TRTLLM-7351][infra] Add isolate marker for L0 ( #7497 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-10-14 16:58:14 -07:00
Emma Qiao
fe17e78f27
[None][infra] Add back gb200 multi-node test stage to pre-merge ( #8281 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-12 23:56:07 -07:00
Zhanrui Sun
5798a12199
[None][infra] Remove WAR code for GH200 node ( #8266 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-10-11 20:33:14 -07:00
Zhenhuan Chen
84d2f12818
[TRTLLM-6748][feat] add PDL support for more kernels ( #7977 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-10-11 08:32:05 +08:00
Jonas Yang CN
88ea2c4ee9
[TRTLLM-7349][feat] Adding new orchestrator type -- ray ( #7520 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-04 08:12:24 +08:00
Nikita Korobov
9b3d7cc3e6
[None][feat] Update TRT-LLM Gen MoE kernels ( #7970 )
...
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-10-03 09:22:45 +08:00
mpikulski
fc7f78c400
[TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling ( #8110 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-02 10:20:32 +02:00
Cheng Hang
cdce68c3e0
[TRTLLM-6741][fix] Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True ( #7891 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-30 09:24:35 +08:00
HuiGao-NV
1339beb04e
[None][ci] Disable tensorRT cases in post-merge ( #8028 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-29 14:21:52 +08:00
Eran Geva
9cea6bfb30
[#7288][feat] Added AutoDeploy backend support to test_perf.py ( #7588 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-28 21:21:27 -07:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path ( #6348 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00
Yiqing Yan
108248ece1
[TRTLLM-7999][infra] Add B300/GB300 single gpu test ( #7951 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-26 09:59:11 +08:00
Yanchao Lu
7e2521a7f0
[None][chore] Some clean-ups for CUDA 13.0 dependencies ( #7979 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-26 08:46:11 +08:00
Tracin
1f2761e67b
[None][feat] Enable gpt oss on DGX H100. ( #6775 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-09-23 09:35:19 -07:00
Pengbo Wang
a4b4ed4535
[None][fix] Fix and add test for TRTLLM MoE backend ( #7755 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 11:26:25 +08:00
Bo Deng
8cf95681e6
[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package ( #7766 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-09-22 16:43:35 +08:00
Yuxian Qiu
2d46dda6a7
[https://nvbugs/5448754][fix] Download HF model for all nodes. ( #6824 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
yuanjingx87
eeb89a167c
[None][infra] Add nightly pipeline to generate lock files ( #5798 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-16 15:00:03 -07:00
Yanchao Lu
e5cead1eb9
[TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing ( #7739 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-16 09:59:18 +08:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices ( #7568 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
QI JUN
44d5ccfdd9
[None][ci] move qwen3 tests from GB200 to B200 ( #7733 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-16 08:12:28 +08:00
Yanchao Lu
70aa4e28c1
[None][ci] Test waives for the main branch 09/14 ( #7698 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-14 23:48:04 +08:00
Yanchao Lu
89fc136972
[None][ci] Some improvements for Slurm CI ( #7689 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-14 16:56:32 +08:00
Zhanrui Sun
1f43854496
[TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file ( #6742 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-13 01:15:33 +08:00
Zhanrui Sun
7d73a89ad0
[TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name ( #6856 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-12 18:46:19 +08:00
v-shobhit
0652514c6d
[None][feat] Use a shell context to install dependancies ( #7383 )
...
Signed-off-by: Shobhit Verma <shobhitv@nvidia.com>
Signed-off-by: v-shobhit <161510941+v-shobhit@users.noreply.github.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
2025-09-10 09:57:37 -07:00
QI JUN
a0e1604898
[None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline ( #7629 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-09 11:06:32 -04:00
Zhanrui Sun
7a62df5f0b
[TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to dated ( #5980 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-09 02:15:39 -04:00
Tomer Shmilovich
ecc0e687c6
[None][feat] Nixl support for GDS ( #5488 )
...
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
Signed-off-by: Guy Lev <glev@nvidia.com>
Co-authored-by: Guy Lev <glev@nvidia.com>
2025-09-09 13:00:38 +08:00
Yiqing Yan
5c616da2fd
[TRTLLM-5877][infra] Add fmha tests and auto trigger rules ( #6050 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-09 11:33:09 +08:00
yuanjingx87
1d243a8503
[None][infra] Try to fix docker container failed to be killed issue ( #7388 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-08 11:28:01 -07:00
Emma Qiao
dd9627d9f9
[None][infra] Add back rtx-pro-6000 stages since the node is available ( #7601 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-08 05:45:11 -04:00
Yanchao Lu
ed27a72bcf
[None][ci] Fix a typo in the Slurm command
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 17:07:09 +08:00
BatshevaBlack
7c76dde76d
[TRTLLM-7187][fix] Build wheel with NIXL ( #7472 )
...
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-09-07 19:05:37 -04:00
Yanchao Lu
045d2cf761
[None][ci] Block some nodes to avoid unstable network access ( #7593 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 00:25:38 +08:00
Emma Qiao
5c4711fb2b
[None][infra] Skip RTX Pro 6000 test stages due to HW are offline ( #7592 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 09:49:06 -04:00
Emma Qiao
aea8ac1649
[TRTLLM-5950][infra] Removing remaining turtle keywords from the code base ( #7086 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 14:26:18 +08:00
Yanchao Lu
caf9b9cd42
[None][ci] Improve SSH connection stability ( #7567 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-06 17:08:19 +08:00
Yiteng Niu
163b1fc84f
[None][infra] update nspect version ( #7552 )
...
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-09-05 14:59:22 +08:00
Yanchao Lu
4195010e13
[None][ci] Increase the number of retries in docker image generation ( #7557 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-05 14:47:14 +08:00
Zhanrui Sun
0de3f83805
[TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage ( #6729 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 07:20:15 -04:00
Yanchao Lu
c622f61609
[None][fix] Fix a typo in the Slurm CI codes ( #7485 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 01:56:27 -04:00