TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-14 06:53:50 +08:00

Author	SHA1	Message	Date
Emma Qiao	d3df3f6feb	[None][infra] Waive failed cases and disable a stage on 02/02 (#11177) Signed-off-by: qqiao <qqiao@nvidia.com>	2026-02-02 13:28:53 +08:00
Matt Lefebvre	97ab014bdb	[TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms (#11085) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2026-01-30 14:07:47 -08:00
Yiqing Yan	6fcbf15fb8	[None][fix] No need to remove the original waive list (#11060) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2026-01-29 11:10:38 +08:00
Matt Lefebvre	c26a8f764c	[TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform (#11006) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2026-01-27 12:33:16 -08:00
Linda	ce556290c9	[None][chore] Removing pybind11 bindings and references (#10550) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2026-01-26 08:19:12 -05:00
Emma Qiao	9d65b8bf24	[None][infra] Fix TRT-LLM data scratch mount point for gb10x (#10880) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2026-01-24 14:00:17 +08:00
Zhanrui Sun	df845a028b	[TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab (#10616) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2026-01-19 00:40:40 -05:00
chenfeiz0326	e97af45556	[TRTLLM-10300][feat] Upload regression info to artifactory (#10599) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2026-01-19 10:16:31 +08:00
Yanchao Lu	0096b50ba0	[None][infra] Update upgrade related docs for release 1.2 (#10760) (#10773) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Emma Qiao <qqiao@nvidia.com>	2026-01-18 00:14:27 +08:00
chenfeiz0326	56073f501a	[TRTLLM-8263][feat] Add Aggregated Perf Tests (#10598) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2026-01-17 13:16:36 +08:00
Lucas Liebenwein	62050b2381	[None][infra] separate AutoDeploy tests into own stages (#10634) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-01-14 23:05:26 -05:00
Emma Qiao	01083b56bf	[TRTLLM-9849][infra] Update dependencies to 25.12 (#9818) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Signed-off-by: xxi <xxi@nvidia.com> Signed-off-by: xxi <95731198+xxi-nv@users.noreply.github.com> Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: xxi <xxi@nvidia.com> Co-authored-by: xxi <95731198+xxi-nv@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2026-01-14 21:54:04 +08:00
tburt-nv	7d41475954	[None][infra] try removing shared cache dir mount (#10609) Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2026-01-13 15:07:12 +08:00
Yanchao Lu	80649a8b78	[None][ci] Workaround OCI-NRT slowdown issue (#10587) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2026-01-11 22:08:19 +08:00
Emma Qiao	43839c7d9b	[TRTLLM-9642][infra] Increase pytest verbosity for failed tests (#9657) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com>	2026-01-08 02:33:48 -05:00
Yiqing Yan	5108a69fc0	[TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline (#9699) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2026-01-06 14:39:55 +08:00
chenfeiz0326	a65b0d4efa	[None][fix] Decrease Pre Merge Perf Tests (#10390) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2026-01-04 12:21:34 -05:00
Yanchao Lu	c4f27fa4c0	[None][ci] Some tweaks for the CI pipeline (#10359) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2026-01-04 11:10:47 -05:00
yuanjingx87	5bd37ce41e	[None][infra] add retry logic to get slurm sbatch job log when ssh dropped (#9167) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2026-01-04 10:11:37 +08:00
chenfeiz0326	5e0e48144f	[None][fix] Minor updates on Perf Test System (#10375) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2026-01-02 17:17:42 +08:00
chenfeiz0326	a23c6f1092	[TRTLLM-9834][feat] Transfer to TRTLLM-INFRA Database and Fail post-merge tests if regression (#10282) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-12-31 21:44:59 +08:00
Yiqing Yan	fdc03684cc	[TRTLLM-10016][infra] Use SlurmPatition attribute time as timeout threshold (#10254) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-31 15:02:24 +08:00
Emma Qiao	fb05cd769a	[None][infra] Enable single-gpu CI on spark (#9304) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Signed-off-by: Jenny Liu <JennyLiu-nv+JennyLiu@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-30 17:22:14 +08:00
Yanchao Lu	965578ca21	[None][infra] Some improvements for Slurm execution path in the CI (#10316) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-29 06:49:44 -05:00
Yanchao Lu	270be801aa	[None][ci] Move remaining DGX-B200 tests to LBD (#9876) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-28 13:55:39 +08:00
chenfeiz0326	d70aeddc7f	[TRTLLM-8952][feat] Support Multi-Node Disagg Perf Test in CI (#9138) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-12-26 22:50:53 +08:00
Emma Qiao	16fd781e42	[TRTLLM-9862][infra] Move single-gpu tests on rtxpro6000d to pre-merge (#9897) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-12-24 21:45:33 -05:00
shuyixiong	f4f0fe85e9	[TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>	2025-12-24 15:27:01 +08:00
chenfeiz0326	48c875f8ea	[None][fix] Add OpenSearch URL in slurm_launch.sh for Multinode Perf Sanity Test (#9990) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-12-23 16:02:38 +08:00
JunyiXu-nv	356ad4fe3a	[https://nvbugs/5722653][fix] Address port conflict by assigning different port section in the same node. (#10035) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>	2025-12-19 15:34:04 +08:00
yuanjingx87	df15be3fad	[None][infra] Fix slurm job does not catch cancelled jobs (#9722) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-18 00:32:43 -08:00
QI JUN	4ce35eacf1	[TRTLLM-9794][ci] move more test cases to gb200 (#9994) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-12-15 19:50:41 -08:00
Matt Lefebvre	1375910f1b	[None][infra] Delete container before attempting import (#9967) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-14 00:09:33 -08:00
Matt Lefebvre	df1adfbb50	[TRTINFRA-7328][infra] - Move half B200 tests to lbd (#9853) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-10 04:24:30 -08:00
Matt Lefebvre	8fefa2c9d1	[None][infra] Fail fast if SLURM entrypoint fails (#9744) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-10 02:31:29 -08:00
Matt Lefebvre	5de4e3f621	[TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts (#9600) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>	2025-12-09 13:34:09 -08:00
Jhao-Ting Chen	0a09465089	[https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model (#8383) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-12-08 11:16:05 -08:00
chenfeiz0326	383178c00a	[TRTLLM-9000][feat] Add multi-node Perf Tests into CI (#8800) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-12-08 09:00:44 +08:00
Yanchao Lu	f59d64e6c7	[None][fix] Several minor fixes to CI setting (#9765) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-07 23:07:59 +08:00
Yiqing Yan	e834f04238	[TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue (#9692) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-05 10:18:31 +08:00
Yiqing Yan	731b2eb4ef	[TRTLLM-5312][infra] Add triton trigger rules (#6440) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-05 07:35:04 +08:00
Yiqing Yan	47f650ca13	[TRTLLM-5093][infra] Write env variables to a file in the interactive debug session (#6792) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-04 11:41:27 +08:00
Yiqing Yan	e31142202e	[TRTLLM-7181][infra] Generate test results when pytest timeout happens (#9396) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-12-04 10:05:38 +08:00
Yiqing Yan	8c88454fa5	[TRTLLM-7101][infra] Reuse passed tests (#6894) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-03 10:07:23 +08:00
Chang Liu	73a543d78f	[None][fix] Extract GPU count from single-node stage names (#9599) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-12-02 20:58:16 +08:00
Eran Geva	1a46bb0d18	Lock the gpu clocks in L0 perf tests (#9585) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-12-02 18:13:45 +08:00
Emma Qiao	b024040df0	[None][infra] Update the pytest options after MI (#9579) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-12-02 00:11:30 +08:00
Yanchao Lu	078d3a576e	[None][ci] Minor change for Slurm scripts (#9561) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-12-01 22:52:08 +08:00
Enwei Zhu	34e2fa5c96	[https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL (#9530) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-01 09:10:21 +08:00
Yanchao Lu	694b60d92d	[None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (#9559) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-11-30 21:14:18 +08:00

235 Commits