Pengbo Wang @ NVIDIA
ef0d06df58
[None][chore] Fix kernel launch param and add TRTLLM MoE backend test (#7524)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-09 23:45:35 +08:00
Yanchao Lu
bc90a34a0e
[None][ci] Fix a typo in the Slurm command
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 17:15:15 +08:00
Yanchao Lu
2d5f0e1038
[None][ci] Block some nodes to avoid unstable network access (#7593)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-08 00:34:20 +08:00
Yanchao Lu
2b02dd7891
[None][ci] Improve SSH connection stability (#7567)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-06 17:12:39 +08:00
Yanchao Lu
d1b0c87d41
[None][fix] Fix a typo in the Slurm CI codes (#7485) (#7538)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-04 21:49:18 +08:00
Yanchao Lu
c3f23462ab
[None][ci] Cherry-pick some improvements for Slurm CI setup from main branch (#7479)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-03 18:42:28 -04:00
Pengbo Wang @ NVIDIA
62459d533d
[None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss (#7192)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-29 17:03:46 +08:00
Yanchao Lu
460a34c671
[None][chore] Some improvements for CI stability (#7199)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-28 16:19:20 -04:00
QI JUN
baef70e67e
[None][ci] move qwen3 tests from b200 to gb200 (#7257)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-26 11:50:53 -04:00
Emma Qiao
a142c0c4de
[None][infra] Add retry 3 times if ssh cluster failed (#6859)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-26 05:11:50 -04:00
Yiqing Yan
486bc763c3
[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge (#7074)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:09:04 -04:00
Yanchao Lu
ec35481b0a
[None][infra] Prepare for single GPU GB200 test pipeline (#7073)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:46:39 +08:00
QI JUN
1388e84793
[None][ci] move all B200 TensorRT test cases to post merge (#7165)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-22 06:47:23 -04:00
Linda
898f37faa0
[None][feat] Enable nanobind as the default binding library (#6608)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-22 09:48:41 +02:00
Emma Qiao
a49cf684f8
[TRTLLM-5801][infra] Add more RTX Pro 6000 test stages (#5126)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-22 03:12:02 -04:00
Yuan Tong
90bfc8cc29
[https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL (#6984)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-21 17:58:30 +08:00
QI JUN
a918de710a
[None][ci] move some tests of b200 to post merge (#7093)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-20 19:43:40 -04:00
Fanrong Li
816a120af6
[TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-08-19 00:03:03 -04:00
Yanchao Lu
d1d17dbeba
[None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) (#7005)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-19 01:35:30 +08:00
Yanchao Lu
3a987891d8
[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures (#6836)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-15 11:16:07 +08:00
Yanchao Lu
b7347ce7d1
[https://nvbugs/5433581][fix] Revert deep_gemm installation workaround for SBSA (#6666)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-06 18:50:53 +08:00
Emma Qiao
78a75c2990
[None][Infra] - Split gb200 stages for each test (#6594)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-05 07:10:00 -04:00
Yanchao Lu
d53cc2374b
[https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround (#6607)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-04 23:36:38 -04:00
Yiqing Yan
4763e94156
[TRTLLM-5563][infra] Move test_rerun.py to script folder (#6571)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-04 13:26:04 +08:00
Yiqing Yan
3f7abf87bc
[TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 (#5678)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-03 11:18:59 +08:00
Yiqing Yan
0cf2f6f154
[TRTLLM-5633] - Merge current waive list with the TOT waive list (#5198)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-30 17:50:05 +08:00
Zhanrui Sun
64ba483656
infra: [TRTLLM-6499] Split L0_Test into two pipeline by single GPU and multi GPU(For SBSA) (#6132)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-28 22:54:37 -04:00
yuanjingx87
608ed89f96
[None][infra]Update slurm config keys (#6370)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-07-28 11:56:37 -07:00
Yiqing Yan
d97419805b
[TRTLLM-5312] - Add bot run rules for triton tests (#4988)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-25 10:31:12 +08:00
yuanjingx87
ef4878db05
set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only (#6234)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-07-22 11:27:54 -07:00
Yi Zhang
f9b0a911fb
test: Enable GB200 torch compile multi gpu tests (#6145)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-21 22:17:13 +08:00
Zhanrui Sun
3cbc23f783
infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) (#4656)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-21 16:06:43 +08:00
Linda
3efad2e58c
feat: nanobind bindings (#6185)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
Venky
22d4a8c48a
enh: Add script to map tests <-> jenkins stages & vice-versa (#5177)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-19 00:50:40 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Linda
5bff317abf
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Emma Qiao
1cc49494fe
[Infra] - Add wiave list for pytest when using slurm (#6130)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-17 16:53:15 +08:00
Zhanrui Sun
4c364b9a73
infra: fix SBSA test stage (#6113)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-17 11:56:03 +08:00
Zhanrui Sun
e42f5a9581
infra: [TRTLLM-5879] Spilt single GPU test and multi GPU test into 2 pipelines (#5199)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-16 18:04:04 +08:00
Zhanrui Sun
d811843a08
infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… (#5945)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-15 15:39:31 +09:00
Yiqing Yan
6b35afaf1b
[Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report (#5672)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-15 12:27:21 +09:00
Zhanrui Sun
01b2def5ef
infra: [TRTLLM-6331] Support show all stage name list when stage name check failed (#5946)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-15 12:06:03 +09:00
Alex Zhang
6c30d78b78
[TRTLLM-5653][infra] Run docs build only if PR contains only doc changes (#5184)
Signed-off-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-14 21:40:33 +08:00
Zhanrui Sun
3a0ef73414
infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check (#5709)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-14 18:52:13 +09:00
Yi Zhang
e5e87ecf34
test: Move some of the test from post merge to pre-merge, update dgx b200 test case (#5640)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell (#5563)
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building (#5534)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00
Yanchao Lu
2013034948
[Test] - Waive or fix few known test failures (#5769)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-06 21:14:16 +08:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation (#5476)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Yi Zhang
73d30a23c7
test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests (#5397)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00