Commit Graph

121 Commits

Author SHA1 Message Date
ixlmar
10e686466e
fix: use current_image_tags.properties in rename_docker_images.py (#5846)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-09 17:07:52 +09:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell (#5563)
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Yiteng Niu
3079e8cf0c
[TRTLLM-5878] update nspect version (#5832)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-07-08 22:00:09 +08:00
Tailing Yuan
035155df7c
Fix: ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-08 17:17:29 +09:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building (#5534)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00
Yanchao Lu
092e0eb86a
[Infra] - Fix a syntax issue in the image check (#5775)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-07 11:19:59 +09:00
Yiteng Niu
66f299a205
[TRTLLM-5878] add stage for image registration to nspect (#5699)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-06 23:52:54 +08:00
Yanchao Lu
2013034948
[Test] - Waive or fix few known test failures (#5769)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-06 21:14:16 +08:00
Yanchao Lu
d95ae1378b
[Infra] - Always use x86 image for the Jenkins agent and few clean-ups (#5753)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-06 10:25:57 +08:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation (#5476)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Yi Zhang
73d30a23c7 test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests (#5397)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Yiqing Yan
de0b522dfd
[Infra] - Fix test stage check for the package sanity check stage (#5694)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-03 16:39:46 +08:00
ixlmar
04fa6c0cfc
[TRTLLM-6143] feat: Improve dev container tagging (#5551)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-02 14:56:34 +02:00
Emma Qiao
31699cbeb1
[Infra] - Set default timeout to 1hr and remove some specific settings (#5667)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-02 08:37:54 -04:00
Void
7992869798
perf: better heuristic for allreduce (#5432)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-01 22:56:06 -04:00
ixlmar
48eee338bf fix: constrain grepping in docker/Makefile (#5493)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Omer Ullman Argov
3b19634a5c
[fix][ci] missing class names in post-merge test reports (#5603)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-30 22:13:29 +08:00
Emma Qiao
b8a568d3c6
[Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539) (#5587)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-30 18:12:08 +08:00
amirkl94
a985c0b7e6
tests: Move stress tests to be Post-Merge only (#5166)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-06-29 09:44:47 +03:00
Iman Tabrizian
49af791f66
Add testing for trtllm-llmapi-launch with tritonserver (#5528)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-27 11:19:52 +08:00
Omer Ullman Argov
fa0ea92dfd
[fix][ci] trigger multigpu tests for deepseek changes (#5423)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 14:30:00 +08:00
Emma Qiao
32d1573c43
[Infra] - Add timeout setting for long tests found in post-merge (#5501)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-26 11:31:39 +08:00
QI JUN
478f668dcc
CI: update multi gpu test triggering file list (#5466)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 15:51:02 +08:00
Emma Qiao
7f68de3e3f
Refactor test timeout for individual long case (#4757)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-19 13:52:11 +08:00
yunruis
b3e886074e
Fix CI build time increase (#5337)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-06-19 13:49:42 +08:00
Robin Kobus
1a7c6e7974
ci: Split long running jobs into multiple jobs (#5268)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-19 06:24:29 +08:00
yuanjingx87
a1c5704055
[feat] Multi-node CI testing support via Slurm (#4771)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-19 01:11:12 +08:00
Yiqing Yan
a3a48410f3
Fix rerun step (#5319)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-18 16:38:45 +08:00
QI JUN
9ea7bb67a4
CI: fix TensorRT H200 tests (#5301)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-18 14:40:57 +08:00
Emma Qiao
ff32caf4d7
[Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 (#4885)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-17 23:48:34 +08:00
Yiteng Niu
dcf18c4bcf
infra[TRTLLM-5635] remove package stage in CI build (#5075)
Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>
2025-06-17 23:44:47 +08:00
Yanchao Lu
f4cdbfcdf0
None - Some clean-ups for the automation pipeline (#5245)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-17 21:08:24 +08:00
QI JUN
ccd9adbe33
CI: move multi-gpu test cases of tensorrt backend to h200 (#5272)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 17:37:37 +08:00
QI JUN
517c1ecf72
move some test cases of TensorRT backend back (#5232)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 17:03:11 +08:00
Tailing Yuan
0b60da2c45
feat: large-scale EP(part 7: DeepEP integration) (#4792)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-14 19:12:38 +08:00
QI JUN
952f33dcad
CI: move all test cases of TensorRT backend into post merge (#5186)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-13 20:48:48 +08:00
yunruis
30c5b4183a
refactoring: port customized kernels with public cutlass version (#5027)
Signed-off-by: yunruis 

Merge this to unblock others since the full CI has been run through
2025-06-13 16:19:31 +08:00
Zhanrui Sun
a97f4581d2
infra: upload imageTag info to artifactory and add ngc_staging to save ngc image (#4764)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-12 15:38:47 +08:00
Yiqing Yan
a90dd571f8
[TRTLLM-5082] - Add a bot run option for detailed logs (#4390)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-11 15:44:57 +08:00
Omer Ullman Argov
8731f5f14f
chore: Mass integration of release/0.20 (#4898)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-08 23:26:26 +08:00
Yanchao Lu
9e05613679
[Infra] - Update JNLP container config (#5008)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-08 16:44:09 +08:00
dongxuy04
1e369658f1
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
Yiteng Niu
d2c311c9d3
infra: update jnlp version in container image (#4944) 2025-06-05 22:36:10 +08:00
Iman Tabrizian
141467d4b6
Add pre-merge Triton backend tests (#4842)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-03 00:47:58 -04:00
Yanchao Lu
8166649d03
[Infra] - Minor clean-up and test Ubuntu mirrors (#4829)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-02 20:18:20 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
amirkl94
8039ef45d3
CI: Performance regression tests update (#3531) 2025-06-01 09:47:55 +03:00
Emma Qiao
202813f054
Check test names in waive list (#4292)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-01 14:39:30 +08:00
Emma Qiao
c945e92fdb
[Infra]Remove some old keyword (#4552)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-31 13:50:45 +08:00
yuanjingx87
2c48ff5898
[feat] add b200 support via slurm (#4709)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-05-29 14:49:46 +08:00