TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-14 06:53:50 +08:00

Author	SHA1	Message	Date
JunyiXu-nv	356ad4fe3a	[https://nvbugs/5722653 ][fix] Address port conflict by assigning different port section in the same node. (#10035 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-19 15:34:04 +08:00
yuanjingx87	df15be3fad	[None][infra] Fix slurm job does not catch cancelled jobs (#9722 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-18 00:32:43 -08:00
QI JUN	4ce35eacf1	[TRTLLM-9794][ci] move more test cases to gb200 (#9994 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-12-15 19:50:41 -08:00
Matt Lefebvre	1375910f1b	[None][infra] Delete container before attempting import (#9967 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-14 00:09:33 -08:00
Matt Lefebvre	df1adfbb50	[TRTINFRA-7328][infra] - Move half B200 tests to lbd (#9853 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-10 04:24:30 -08:00
Matt Lefebvre	8fefa2c9d1	[None][infra] Fail fast if SLURM entrypoint fails (#9744 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-10 02:31:29 -08:00
Matt Lefebvre	5de4e3f621	[TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts (#9600 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-09 13:34:09 -08:00
Jhao-Ting Chen	0a09465089	[https://nvbugs/5567586 ][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model (#8383 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-12-08 11:16:05 -08:00
chenfeiz0326	383178c00a	[TRTLLM-9000][feat] Add multi-node Perf Tests into CI (#8800 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-12-08 09:00:44 +08:00
Yanchao Lu	f59d64e6c7	[None][fix] Several minor fixes to CI setting (#9765 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-07 23:07:59 +08:00
Yiqing Yan	e834f04238	[TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue (#9692 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-05 10:18:31 +08:00
Yiqing Yan	731b2eb4ef	[TRTLLM-5312][infra] Add triton trigger rules (#6440 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-05 07:35:04 +08:00
Yiqing Yan	47f650ca13	[TRTLLM-5093][infra] Write env variables to a file in the interactive debug session (#6792 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-04 11:41:27 +08:00
Yiqing Yan	e31142202e	[TRTLLM-7181][infra] Generate test results when pytest timeout happens (#9396 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-04 10:05:38 +08:00
Yiqing Yan	8c88454fa5	[TRTLLM-7101][infra] Reuse passed tests (#6894 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-03 10:07:23 +08:00
Chang Liu	73a543d78f	[None][fix] Extract GPU count from single-node stage names (#9599 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-12-02 20:58:16 +08:00
Eran Geva	1a46bb0d18	Lock the gpu clocks in L0 perf tests (#9585 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-12-02 18:13:45 +08:00
Emma Qiao	b024040df0	[None][infra] Update the pytest options after MI (#9579 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-12-02 00:11:30 +08:00
Yanchao Lu	078d3a576e	[None][ci] Minor change for Slurm scripts (#9561 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-01 22:52:08 +08:00
Enwei Zhu	34e2fa5c96	[https://nvbugs/5690172 ][fix] Fix Qwen3-235B ATP accuracy issue with PDL (#9530 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-01 09:10:21 +08:00
Yanchao Lu	694b60d92d	[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (#9559 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-30 21:14:18 +08:00
Yanchao Lu	0398875d55	[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (#9558 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-30 20:27:13 +08:00
Yanchao Lu	f03641808b	[None][infra] - Request idle time exemption for OCI jobs (#9528 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-30 13:34:09 +08:00
Zhanrui Sun	930cdad054	[TRTLLM-9541][infra] Use artifactory mirror for download.pytorch.org (#9477 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-28 18:31:50 +08:00
Emma Qiao	658d9fc0c5	[TRTLLM-8970][infra] Fix generate report when has isolation test result (#8861 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com>	2025-11-28 11:26:06 +08:00
Yiqing Yan	1c9158fde3	[TRTLLM-7288][infra] Download merged waive list in slurm script (#8999 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-27 21:48:40 +08:00
yuanjingx87	3ada0bfc65	[None][infra] Fix Slurm job script (#9508 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-11-27 16:41:01 +08:00
Emma Qiao	a21be43677	[TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin (#9405 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-27 15:42:38 +08:00
yuanjingx87	356f67c1cb	[None][infra] Fail the pipeline when slurm ssh dropped (#9157 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-11-26 09:35:04 -08:00
Yanchao Lu	ff02e0f05c	[None][ci] Move more test stages to use OCI machines (#9395 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Matt Lefebvre <matthewelefebvre@gmail.com>	2025-11-25 15:59:13 +08:00
Matt Lefebvre	fefa02fa95	[TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port (#9313 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-11-21 18:58:00 -08:00
Yiqing Yan	2a27166b59	[TRTLLM-9183][infra] Add --waives-file in rerun pytest command (#8971 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-11-21 13:40:45 +08:00
Simeng Liu	9286223288	[https://nvbugs/5515753 ][ci] Add NCCL_DEBUG=INFO flag to collect more info with CI failure. (#8440 ) Signed-off-by: Simeng Liu <simengl@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Matt Lefebvre	470d777744	[TRTINFRA-7280][infra] Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-11-17 10:59:30 -08:00
Zhanrui Sun	bdcf837784	[TRTLLM-9079][infra] upgrade tritonserver DLFW 25.10 (#8929 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-11-14 20:22:10 -08:00
Emma Qiao	183778d58a	[None][infra] Waive failed tests for main 11/07 (#9008 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-11-08 08:51:35 -08:00
Emma Qiao	2af6a537ad	[TRTLLM-8999][infra] Reduce gb200 multi-node test stages (#8778 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com>	2025-11-08 06:34:24 -08:00
Zhanrui Sun	4de31bece2	[TRTLLM-8994][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 (#8838 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-04 18:59:34 +08:00
Matt Lefebvre	0f6763680a	[TRTINFRA-7215][infra] - Move half of the DGX H100 premerge tests to SLURM (#8849 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-11-04 00:11:26 +08:00
Emma Qiao	14bc8571ae	[TRTLLM-8435][infra] Test existing rtxpro6000 stages on rtxpro6000d (#8319 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-03 05:26:17 -08:00
Yanchao Lu	da73410d3b	[None][fix] WAR for tensorrt depending on the archived nvidia-cuda-runtime-cu13 package (#8857 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-02 09:57:37 +08:00
dongxuy04	bba2519726	[TRTLLM-7008][fix] Enable GDRCopy and unwaive online eplb tests (#8720 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-31 16:39:51 -07:00
Matt Lefebvre	da2dca58aa	[TRTINFRA-7215][infra] Add support for enroot SLURM clusters (#8770 ) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-31 12:22:21 -07:00
Zhanrui Sun	a6a3de8e35	[TRTLLM-9003][infra] Add python OpenSearchDB query / push. (#8506 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-30 19:43:51 -07:00
Zhanrui Sun	547d799111	[TRTLLM-8930][infra] Force Blossom perf test stages to use 'tensorrt/test_type: perf' in the K8S template (#8752 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-30 06:30:10 -07:00
yuanjingx87	e689a73c83	[None][infra] fix slurm results path (#8751 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-10-30 13:09:46 +08:00
QI JUN	cc5b8b6d28	[None][ci] move some time-consuming benchmark test cases to post merge (#8641 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-10-26 22:47:17 -04:00
Yiqing Yan	602b059180	[None][chore] Disable GB300 stages due to nodes will be offline temporarily (#8643 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-10-24 05:32:05 -04:00
Emma Qiao	ee21ea3e91	[None][infra] Disable rtxpro6000 stages due to nodes will be offline (#8613 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-23 10:24:05 -04:00
Emma Qiao	2b4e812aea	[None][infra] Let CI continue running other isolation tests when an isolation test get hanging (#8471 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-22 00:07:35 -04:00

206 Commits