TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
yuanjingx87	3ada0bfc65	[None][infra] Fix Slurm job script (#9508 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-11-27 16:41:01 +08:00
Emma Qiao	a21be43677	[TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin (#9405 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-27 15:42:38 +08:00
Jiagan Cheng	14762e0287	[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning (#9294 ) Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>	2025-11-27 12:22:01 +08:00
yuanjingx87	356f67c1cb	[None][infra] Fail the pipeline when slurm ssh dropped (#9157 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-11-26 09:35:04 -08:00
Yanchao Lu	ff02e0f05c	[None][ci] Move more test stages to use OCI machines (#9395 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Matt Lefebvre <matthewelefebvre@gmail.com>	2025-11-25 15:59:13 +08:00
Matt Lefebvre	fefa02fa95	[TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port (#9313 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-11-21 18:58:00 -08:00
Yiqing Yan	2a27166b59	[TRTLLM-9183][infra] Add --waives-file in rerun pytest command (#8971 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-11-21 13:40:45 +08:00
Zhanrui Sun	5138ef3227	[None][infra] Add fallback when get wheel from build stage is fail (#9290 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-11-21 13:26:20 +08:00
Simeng Liu	9286223288	[https://nvbugs/5515753 ][ci] Add NCCL_DEBUG=INFO flag to collect more info with CI failure. (#8440 ) Signed-off-by: Simeng Liu <simengl@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Bo Deng	2128f73d58	[TRTLLM-9247][infra] Upgrade NIXL to 0.7.1 (#9055 ) Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: jthomson04 <jwillthomson19@gmail.com> Co-authored-by: jthomson04 <jwillthomson19@gmail.com>	2025-11-20 11:01:02 +08:00
Kanghwan	41e5870a70	[#8476 ][chore] Update license (#8807 ) Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>	2025-11-19 15:05:25 -08:00
Matt Lefebvre	470d777744	[TRTINFRA-7280][infra] Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-11-17 10:59:30 -08:00
Yiqing Yan	24f5cd7493	[TRTLLM-8000][infra] Catch error in merge waive list stage (#7289 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-17 13:28:50 +08:00
Kaiyu Xie	04be5a704e	[None] [fix] Fix missing ActivationType issue (#9171 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-11-17 10:43:25 +08:00
Zhanrui Sun	bdcf837784	[TRTLLM-9079][infra] upgrade tritonserver DLFW 25.10 (#8929 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-11-14 20:22:10 -08:00
yuanjingx87	05b5336ab6	[None][infra] Lock generation pipeline update (#9084 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-11-14 10:12:25 -08:00
Bo Deng	0b9bc5aae8	[None][infra] install mooncake in docker images (#8447 ) Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com> Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com> Co-authored-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-11-11 13:34:27 +08:00
Emma Qiao	183778d58a	[None][infra] Waive failed tests for main 11/07 (#9008 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-11-08 08:51:35 -08:00
Emma Qiao	2af6a537ad	[TRTLLM-8999][infra] Reduce gb200 multi-node test stages (#8778 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com>	2025-11-08 06:34:24 -08:00
yuanjingx87	18a4b985f1	[None][infra] allow to choose repo when generate lock files (#8659 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-11-05 19:06:29 -08:00
Yiteng Niu	1ce83582f9	[None][infra] update github token name (#8907 )	2025-11-05 00:55:28 -08:00
Zhanrui Sun	4de31bece2	[TRTLLM-8994][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 (#8838 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-04 18:59:34 +08:00
Matt Lefebvre	0f6763680a	[TRTINFRA-7215][infra] - Move half of the DGX H100 premerge tests to SLURM (#8849 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-11-04 00:11:26 +08:00
Emma Qiao	14bc8571ae	[TRTLLM-8435][infra] Test existing rtxpro6000 stages on rtxpro6000d (#8319 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-03 05:26:17 -08:00
chenfeiz0326	cc4ab8d9d1	[TRTLLM-8825][feat] Support Pytest Perf Results uploading to Database (#8653 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-11-03 16:23:13 +08:00
Yanchao Lu	da73410d3b	[None][fix] WAR for tensorrt depending on the archived nvidia-cuda-runtime-cu13 package (#8857 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-02 09:57:37 +08:00
dongxuy04	bba2519726	[TRTLLM-7008][fix] Enable GDRCopy and unwaive online eplb tests (#8720 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-31 16:39:51 -07:00
Matt Lefebvre	da2dca58aa	[TRTINFRA-7215][infra] Add support for enroot SLURM clusters (#8770 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-31 12:22:21 -07:00
Zhanrui Sun	a6a3de8e35	[TRTLLM-9003][infra] Add python OpenSearchDB query / push. (#8506 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-30 19:43:51 -07:00
Zhanrui Sun	547d799111	[TRTLLM-8930][infra] Force Blossom perf test stages to use 'tensorrt/test_type: perf' in the K8S template (#8752 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-30 06:30:10 -07:00
yuanjingx87	e689a73c83	[None][infra] fix slurm results path (#8751 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-10-30 13:09:46 +08:00
Bo Li	9c4432f8a4	[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. (#7499 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-10-27 13:23:06 -04:00
QI JUN	cc5b8b6d28	[None][ci] move some time-consuming benchmark test cases to post merge (#8641 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-10-26 22:47:17 -04:00
Yiqing Yan	602b059180	[None][chore] Disable GB300 stages due to nodes will be offline temporarily (#8643 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-10-24 05:32:05 -04:00
yuanjingx87	e7ad5e4d6a	[None][infra] enable lfs for generateLockFile pipeline (#8547 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-10-24 12:59:27 +08:00
Emma Qiao	ee21ea3e91	[None][infra] Disable rtxpro6000 stages due to nodes will be offline (#8613 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-23 10:24:05 -04:00
Emma Qiao	7c1bca4563	[None][infra] Fix slurm exitcode (#8585 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com>	2025-10-23 09:46:00 -04:00
Emma Qiao	2b4e812aea	[None][infra] Let CI continue running other isolation tests when an isolation test get hanging (#8471 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-22 00:07:35 -04:00
chenfeiz0326	6cf1c3fba4	[TRTLLM-8260][feat] Add Server-Client Perf Test in pytest for B200 and B300 (#7985 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-10-22 10:17:22 +08:00
Emma Qiao	c72f6d1dcc	[None][infra] Add split algorithm for slurm (#8516 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-21 02:56:22 -04:00
QI JUN	0acd10e3de	[None][ci] rebalance H100 stages (#8491 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-10-21 02:03:48 -04:00
yuanjingx87	1e3e1474c6	[TRTLLM-6055][infra] Slurm Test refactor (#7176 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-20 09:46:44 -07:00
QI JUN	d05079ba4b	[None][ci] move some test cases from H100 to A10 (#8449 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-10-20 01:58:34 -04:00
zhhuang-nv	7a2bab93f0	[None][test] Add post merge test for Seed-OSS-36B-Instruct (#8321 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-10-17 02:30:33 -07:00
Yanchao Lu	e72ade33c2	[None][chore] Update commit msg for adding lock files (#8448 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-17 00:24:26 -07:00
yuanjingx87	3481d03470	[None][infra] Fix for generate lockfile pipeline (#7820 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-10-16 14:17:18 -07:00
Zhanrui Sun	19241626d0	[https://nvbugs/5563653 ][infra] reduce docker image layers (#8250 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-10-16 22:46:19 +08:00
Emma Qiao	493da020c1	[TRTLLM-7351][infra] Add isolate marker for L0 (#7497 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-14 16:58:14 -07:00
Emma Qiao	fe17e78f27	[None][infra] Add back gb200 multi-node test stage to pre-merge (#8281 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-12 23:56:07 -07:00
Zhanrui Sun	5798a12199	[None][infra] Remove WAR code for GH200 node (#8266 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-11 20:33:14 -07:00

274 Commits