TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Yiqing Yan	25ec125726	[None][chore] Disable GB300 stages in release branch due to nodes will be offline temporarily (#8645 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-10-24 05:21:14 -04:00
Emma Qiao	4e11e0bd20	[None][infra] Disable rtxpro6000 stages due to nodes will be offline temporarily (#8616 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-23 10:21:21 -04:00
Simeng Liu	1375b9f074	[https://nvbugs/5515753 ][ci] Add NCCL_DEBUG=INFO flag to collect more info with CI failure. (#8440 ) Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-10-21 18:12:05 -07:00
Jin Li	3860a674d5	[https://nvbugs/5543770 ][fix] Update to Cutlass v4.2.1 (#8055 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-10-13 22:39:25 -07:00
Zhanrui Sun	02080e199d	[https://nvbugs/5563653 ][infra] reduce docker image layers (#8250 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-13 01:38:27 -07:00
Zhanrui Sun	4c36bba2ec	[None][infra] Remove WAR code for GH200 node (#8267 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-11 20:40:16 -07:00
Yiqing Yan	108248ece1	[TRTLLM-7999][infra] Add B300/GB300 single gpu test (#7951 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-09-26 09:59:11 +08:00
Yanchao Lu	7e2521a7f0	[None][chore] Some clean-ups for CUDA 13.0 dependencies (#7979 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-26 08:46:11 +08:00
Tracin	1f2761e67b	[None][feat] Enable gpt oss on DGX H100. (#6775 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-09-23 09:35:19 -07:00
Pengbo Wang	a4b4ed4535	[None][fix] Fix and add test for TRTLLM MoE backend (#7755 ) Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>	2025-09-23 11:26:25 +08:00
Bo Deng	8cf95681e6	[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package (#7766 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-09-22 16:43:35 +08:00
Yuxian Qiu	2d46dda6a7	[https://nvbugs/5448754 ][fix] Download HF model for all nodes. (#6824 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-22 14:28:38 +08:00
yuanjingx87	eeb89a167c	[None][infra] Add nightly pipeline to generate lock files (#5798 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-09-16 15:00:03 -07:00
Yanchao Lu	e5cead1eb9	[TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing (#7739 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-16 09:59:18 +08:00
xiweny	c076a02b38	[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Signed-off-by: Daniel Stokes <dastokes@nvidia.com> Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> Signed-off-by: Xiwen Yu <xiweny@nvidia.com> Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Co-authored-by: Daniel Stokes <dastokes@nvidia.com> Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com> Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-09-16 09:56:18 +08:00
QI JUN	44d5ccfdd9	[None][ci] move qwen3 tests from GB200 to B200 (#7733 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-09-16 08:12:28 +08:00
Yanchao Lu	70aa4e28c1	[None][ci] Test waives for the main branch 09/14 (#7698 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-14 23:48:04 +08:00
Yanchao Lu	89fc136972	[None][ci] Some improvements for Slurm CI (#7689 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-14 16:56:32 +08:00
Zhanrui Sun	1f43854496	[TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file (#6742 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-13 01:15:33 +08:00
Zhanrui Sun	7d73a89ad0	[TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name (#6856 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-09-12 18:46:19 +08:00
v-shobhit	0652514c6d	[None][feat] Use a shell context to install dependancies (#7383 ) Signed-off-by: Shobhit Verma <shobhitv@nvidia.com> Signed-off-by: v-shobhit <161510941+v-shobhit@users.noreply.github.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>	2025-09-10 09:57:37 -07:00
QI JUN	a0e1604898	[None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline (#7629 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-09-09 11:06:32 -04:00
Zhanrui Sun	7a62df5f0b	[TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to dated (#5980 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-09 02:15:39 -04:00
Tomer Shmilovich	ecc0e687c6	[None][feat] Nixl support for GDS (#5488 ) Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com> Signed-off-by: Guy Lev <glev@nvidia.com> Co-authored-by: Guy Lev <glev@nvidia.com>	2025-09-09 13:00:38 +08:00
Yiqing Yan	5c616da2fd	[TRTLLM-5877][infra] Add fmha tests and auto trigger rules (#6050 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-09 11:33:09 +08:00
yuanjingx87	1d243a8503	[None][infra] Try to fix docker container failed to be killed issue (#7388 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-09-08 11:28:01 -07:00
Emma Qiao	dd9627d9f9	[None][infra] Add back rtx-pro-6000 stages since the node is available (#7601 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-08 05:45:11 -04:00
Yanchao Lu	ed27a72bcf	[None][ci] Fix a typo in the Slurm command Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-08 17:07:09 +08:00
BatshevaBlack	7c76dde76d	[TRTLLM-7187][fix] Build wheel with NIXL (#7472 ) Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>	2025-09-07 19:05:37 -04:00
Yanchao Lu	045d2cf761	[None][ci] Block some nodes to avoid unstable network access (#7593 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-08 00:25:38 +08:00
Emma Qiao	5c4711fb2b	[None][infra] Skip RTX Pro 6000 test stages due to HW are offline (#7592 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-07 09:49:06 -04:00
Emma Qiao	aea8ac1649	[TRTLLM-5950][infra] Removing remaining turtle keywords from the code base (#7086 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-07 14:26:18 +08:00
Yanchao Lu	caf9b9cd42	[None][ci] Improve SSH connection stability (#7567 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-06 17:08:19 +08:00
Yiteng Niu	163b1fc84f	[None][infra] update nspect version (#7552 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>	2025-09-05 14:59:22 +08:00
Yanchao Lu	4195010e13	[None][ci] Increase the number of retries in docker image generation (#7557 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-05 14:47:14 +08:00
Zhanrui Sun	0de3f83805	[TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage (#6729 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-04 07:20:15 -04:00
Yanchao Lu	c622f61609	[None][fix] Fix a typo in the Slurm CI codes (#7485 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-04 01:56:27 -04:00
Emma Qiao	931816fee1	[TRTLLM-6199][infra] Update for using open driver from BSL (#7430 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-04 11:47:40 +08:00
Yanchao Lu	a07bb163f7	[None][ci] Correct docker args for GPU devices and remove some stale CI codes (#7417 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-02 04:06:51 -04:00
Yiqing Yan	ff2439ff48	[None][infra] Using local variables in rerun function (#7198 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-09-02 13:55:26 +08:00
yuanjingx87	2b286ae613	[None][infra] Disable GB200-PyTorch-1 due to OOM issue (#7386 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-09-01 01:56:31 -04:00
Yanchao Lu	c5148f52d5	[None][ci] Some improvements for Slurm CI setup (#7407 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-01 10:57:36 +08:00
Pengbo Wang @ NVIDIA	62459d533d	[None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss (#7192 ) Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com> Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-08-29 17:03:46 +08:00
Yanchao Lu	460a34c671	[None][chore] Some improvements for CI stability (#7199 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-28 16:19:20 -04:00
Martin Marciniszyn Mehringer	7cfa475e05	[None][fix] Remove the wheel from intermediate docker storage (#7175 ) Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>	2025-08-27 11:32:17 -04:00
QI JUN	baef70e67e	[None][ci] move qwen3 tests from b200 to gb200 (#7257 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-08-26 11:50:53 -04:00
Emma Qiao	a142c0c4de	[None][infra] Add retry 3 times if ssh cluster failed (#6859 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-26 05:11:50 -04:00
Yiqing Yan	486bc763c3	[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge (#7074 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-24 21:09:04 -04:00
Robin Kobus	31979aefac	[None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests (#6754 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-24 20:53:17 +02:00
Yanchao Lu	ec35481b0a	[None][infra] Prepare for single GPU GB200 test pipeline (#7073 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-24 21:46:39 +08:00

222 Commits