TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Emma Qiao	c72f6d1dcc	[None][infra] Add split algorithm for slurm (#8516 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-21 02:56:22 -04:00
QI JUN	0acd10e3de	[None][ci] rebalance H100 stages (#8491 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-10-21 02:03:48 -04:00
yuanjingx87	1e3e1474c6	[TRTLLM-6055][infra] Slurm Test refactor (#7176 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-20 09:46:44 -07:00
QI JUN	d05079ba4b	[None][ci] move some test cases from H100 to A10 (#8449 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-10-20 01:58:34 -04:00
zhhuang-nv	7a2bab93f0	[None][test] Add post merge test for Seed-OSS-36B-Instruct (#8321 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-10-17 02:30:33 -07:00
Yanchao Lu	e72ade33c2	[None][chore] Update commit msg for adding lock files (#8448 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-17 00:24:26 -07:00
yuanjingx87	3481d03470	[None][infra] Fix for generate lockfile pipeline (#7820 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-10-16 14:17:18 -07:00
Zhanrui Sun	19241626d0	[https://nvbugs/5563653 ][infra] reduce docker image layers (#8250 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-10-16 22:46:19 +08:00
Emma Qiao	493da020c1	[TRTLLM-7351][infra] Add isolate marker for L0 (#7497 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-10-14 16:58:14 -07:00
Emma Qiao	fe17e78f27	[None][infra] Add back gb200 multi-node test stage to pre-merge (#8281 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-10-12 23:56:07 -07:00
Zhanrui Sun	5798a12199	[None][infra] Remove WAR code for GH200 node (#8266 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-10-11 20:33:14 -07:00
Zhenhuan Chen	84d2f12818	[TRTLLM-6748][feat] add PDL support for more kernels (#7977 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-10-11 08:32:05 +08:00
Jonas Yang CN	88ea2c4ee9	[TRTLLM-7349][feat] Adding new orchestrator type -- ray (#7520 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-10-04 08:12:24 +08:00
Nikita Korobov	9b3d7cc3e6	[None][feat] Update TRT-LLM Gen MoE kernels (#7970 ) Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-10-03 09:22:45 +08:00
mpikulski	fc7f78c400	[TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling (#8110 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-10-02 10:20:32 +02:00
Cheng Hang	cdce68c3e0	[TRTLLM-6741][fix] Add heuristics for lm head tp size when `enable_lm_head_tp_in_adp=True` (#7891 ) Signed-off-by: Cheng Hang <chang@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-30 09:24:35 +08:00
HuiGao-NV	1339beb04e	[None][ci] Disable tensorRT cases in post-merge (#8028 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-09-29 14:21:52 +08:00
Eran Geva	9cea6bfb30	[#7288 ][feat] Added AutoDeploy backend support to test_perf.py (#7588 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-09-28 21:21:27 -07:00
Iman Tabrizian	33282351a2	[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-09-27 19:29:30 -04:00
Yiqing Yan	108248ece1	[TRTLLM-7999][infra] Add B300/GB300 single gpu test (#7951 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-09-26 09:59:11 +08:00
Yanchao Lu	7e2521a7f0	[None][chore] Some clean-ups for CUDA 13.0 dependencies (#7979 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-26 08:46:11 +08:00
Tracin	1f2761e67b	[None][feat] Enable gpt oss on DGX H100. (#6775 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-09-23 09:35:19 -07:00
Pengbo Wang	a4b4ed4535	[None][fix] Fix and add test for TRTLLM MoE backend (#7755 ) Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>	2025-09-23 11:26:25 +08:00
Bo Deng	8cf95681e6	[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package (#7766 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-09-22 16:43:35 +08:00
Yuxian Qiu	2d46dda6a7	[https://nvbugs/5448754 ][fix] Download HF model for all nodes. (#6824 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-22 14:28:38 +08:00
yuanjingx87	eeb89a167c	[None][infra] Add nightly pipeline to generate lock files (#5798 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-09-16 15:00:03 -07:00
Yanchao Lu	e5cead1eb9	[TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing (#7739 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-16 09:59:18 +08:00
xiweny	c076a02b38	[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Signed-off-by: Daniel Stokes <dastokes@nvidia.com> Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> Signed-off-by: Xiwen Yu <xiweny@nvidia.com> Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Co-authored-by: Daniel Stokes <dastokes@nvidia.com> Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com> Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-09-16 09:56:18 +08:00
QI JUN	44d5ccfdd9	[None][ci] move qwen3 tests from GB200 to B200 (#7733 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-09-16 08:12:28 +08:00
Yanchao Lu	70aa4e28c1	[None][ci] Test waives for the main branch 09/14 (#7698 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-14 23:48:04 +08:00
Yanchao Lu	89fc136972	[None][ci] Some improvements for Slurm CI (#7689 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-14 16:56:32 +08:00
Zhanrui Sun	1f43854496	[TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file (#6742 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-13 01:15:33 +08:00
Zhanrui Sun	7d73a89ad0	[TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name (#6856 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-09-12 18:46:19 +08:00
v-shobhit	0652514c6d	[None][feat] Use a shell context to install dependancies (#7383 ) Signed-off-by: Shobhit Verma <shobhitv@nvidia.com> Signed-off-by: v-shobhit <161510941+v-shobhit@users.noreply.github.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>	2025-09-10 09:57:37 -07:00
QI JUN	a0e1604898	[None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline (#7629 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-09-09 11:06:32 -04:00
Zhanrui Sun	7a62df5f0b	[TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to dated (#5980 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-09 02:15:39 -04:00
Tomer Shmilovich	ecc0e687c6	[None][feat] Nixl support for GDS (#5488 ) Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com> Signed-off-by: Guy Lev <glev@nvidia.com> Co-authored-by: Guy Lev <glev@nvidia.com>	2025-09-09 13:00:38 +08:00
Yiqing Yan	5c616da2fd	[TRTLLM-5877][infra] Add fmha tests and auto trigger rules (#6050 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-09 11:33:09 +08:00
yuanjingx87	1d243a8503	[None][infra] Try to fix docker container failed to be killed issue (#7388 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-09-08 11:28:01 -07:00
Emma Qiao	dd9627d9f9	[None][infra] Add back rtx-pro-6000 stages since the node is available (#7601 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-08 05:45:11 -04:00
Yanchao Lu	ed27a72bcf	[None][ci] Fix a typo in the Slurm command Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-08 17:07:09 +08:00
BatshevaBlack	7c76dde76d	[TRTLLM-7187][fix] Build wheel with NIXL (#7472 ) Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>	2025-09-07 19:05:37 -04:00
Yanchao Lu	045d2cf761	[None][ci] Block some nodes to avoid unstable network access (#7593 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-08 00:25:38 +08:00
Emma Qiao	5c4711fb2b	[None][infra] Skip RTX Pro 6000 test stages due to HW are offline (#7592 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-07 09:49:06 -04:00
Emma Qiao	aea8ac1649	[TRTLLM-5950][infra] Removing remaining turtle keywords from the code base (#7086 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-07 14:26:18 +08:00
Yanchao Lu	caf9b9cd42	[None][ci] Improve SSH connection stability (#7567 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-06 17:08:19 +08:00
Yiteng Niu	163b1fc84f	[None][infra] update nspect version (#7552 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>	2025-09-05 14:59:22 +08:00
Yanchao Lu	4195010e13	[None][ci] Increase the number of retries in docker image generation (#7557 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-05 14:47:14 +08:00
Zhanrui Sun	0de3f83805	[TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage (#6729 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-04 07:20:15 -04:00
Yanchao Lu	c622f61609	[None][fix] Fix a typo in the Slurm CI codes (#7485 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-04 01:56:27 -04:00
Emma Qiao	931816fee1	[TRTLLM-6199][infra] Update for using open driver from BSL (#7430 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-09-04 11:47:40 +08:00
Yanchao Lu	a07bb163f7	[None][ci] Correct docker args for GPU devices and remove some stale CI codes (#7417 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-02 04:06:51 -04:00
Yiqing Yan	ff2439ff48	[None][infra] Using local variables in rerun function (#7198 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-09-02 13:55:26 +08:00
yuanjingx87	2b286ae613	[None][infra] Disable GB200-PyTorch-1 due to OOM issue (#7386 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-09-01 01:56:31 -04:00
Yanchao Lu	c5148f52d5	[None][ci] Some improvements for Slurm CI setup (#7407 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-09-01 10:57:36 +08:00
Pengbo Wang @ NVIDIA	62459d533d	[None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss (#7192 ) Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com> Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-08-29 17:03:46 +08:00
Yanchao Lu	460a34c671	[None][chore] Some improvements for CI stability (#7199 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-28 16:19:20 -04:00
Martin Marciniszyn Mehringer	7cfa475e05	[None][fix] Remove the wheel from intermediate docker storage (#7175 ) Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>	2025-08-27 11:32:17 -04:00
QI JUN	baef70e67e	[None][ci] move qwen3 tests from b200 to gb200 (#7257 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-08-26 11:50:53 -04:00
Emma Qiao	a142c0c4de	[None][infra] Add retry 3 times if ssh cluster failed (#6859 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-26 05:11:50 -04:00
Yiqing Yan	486bc763c3	[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge (#7074 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-24 21:09:04 -04:00
Robin Kobus	31979aefac	[None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests (#6754 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-24 20:53:17 +02:00
Yanchao Lu	ec35481b0a	[None][infra] Prepare for single GPU GB200 test pipeline (#7073 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-24 21:46:39 +08:00
QI JUN	1388e84793	[None][ci] move all B200 TensorRT test cases to post merge (#7165 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-08-22 06:47:23 -04:00
Linda	898f37faa0	[None][feat] Enable nanobind as the default binding library (#6608 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-08-22 09:48:41 +02:00
Emma Qiao	a49cf684f8	[TRTLLM-5801][infra] Add more RTX Pro 6000 test stages (#5126 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-08-22 03:12:02 -04:00
Yuan Tong	90bfc8cc29	[https://nvbugs/5453827 ][fix] Fix RPATH of th_common shared library to find pip-installed NCCL (#6984 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-08-21 17:58:30 +08:00
BatshevaBlack	9f51f8d20c	[None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 (#7024 ) Signed-off-by: Batsheva Black <132911331+BatshevaBlack@users.noreply.github.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Bo Deng <deemod@nvidia.com>	2025-08-20 22:49:55 -04:00
QI JUN	a918de710a	[None][ci] move some tests of b200 to post merge (#7093 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-08-20 19:43:40 -04:00
Fanrong Li	816a120af6	[TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-08-19 00:03:03 -04:00
Yanchao Lu	d1d17dbeba	[None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971 ) (#7005 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-08-19 01:35:30 +08:00
Yanchao Lu	3a987891d8	[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures (#6836 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-08-15 11:16:07 +08:00
Wanli Jiang	9a133e9b41	[https://nvbugs/5415862 ][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 (#6501 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-08-15 11:10:59 +08:00
Yiqing Yan	62d6c98d68	[TRTLLM-5633][infra] Force set changed file diff to empty string for post-merge CI (#6777 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-08-11 02:38:05 -04:00
Yiqing Yan	3e41e6c077	[TRTLLM-6892][infra] Run guardwords scan first in Release Check stage (#6659 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-06 23:00:15 -04:00
Yanchao Lu	b7347ce7d1	[https://nvbugs/5433581 ][fix] Revert deep_gemm installation workaround for SBSA (#6666 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-06 18:50:53 +08:00
Yiqing Yan	98424f3186	[TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list (#6605 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-06 06:19:03 -04:00
Zhanrui Sun	6a9b4b11be	[https://nvbugs/5433581 ][infra] Temporarily disable Docker Image use wheel from build stage (#6630 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-08-05 09:33:11 -04:00
Emma Qiao	78a75c2990	[None][Infra] - Split gb200 stages for each test (#6594 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-08-05 07:10:00 -04:00
Zhanrui Sun	7cbe30e17d	[TRTLLM-6893][infra] fix Build Docker Image tag issue (#6555 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-08-05 04:33:36 -04:00
Chuang Zhu	4d040b50b7	[None][chore] ucx establish connection with zmq (#6090 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-05 02:50:45 -04:00
Yanchao Lu	d53cc2374b	[https://nvbugs/5433581 ][infra] Update install docs and CI script for SBSA deep_gemm workaround (#6607 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-08-04 23:36:38 -04:00
Yiqing Yan	4763e94156	[TRTLLM-5563][infra] Move test_rerun.py to script folder (#6571 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-08-04 13:26:04 +08:00
Yiqing Yan	3f7abf87bc	[TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 (#5678 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-08-03 11:18:59 +08:00
Yiqing Yan	d38c26bb78	[Infra][TRTLLM-5633] - Fix merge waive list (#6504 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-07-31 14:57:51 +08:00
Yiqing Yan	0cf2f6f154	[TRTLLM-5633] - Merge current waive list with the TOT waive list (#5198 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-07-30 17:50:05 +08:00
Zhanrui Sun	c3729dbd7d	infra: [TRTLLM-5873] Use build stage wheels to speed up docker release image build (#4939 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-29 12:54:38 -04:00
Zhanrui Sun	64ba483656	infra: [TRTLLM-6499] Split L0_Test into two pipeline by single GPU and multi GPU(For SBSA) (#6132 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-28 22:54:37 -04:00
yuanjingx87	608ed89f96	[None][infra]Update slurm config keys (#6370 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-07-28 11:56:37 -07:00
Yiqing Yan	d97419805b	[TRTLLM-5312] - Add bot run rules for triton tests (#4988 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-07-25 10:31:12 +08:00
yuanjingx87	ef4878db05	set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only (#6234 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>	2025-07-22 11:27:54 -07:00
Lizhi Zhou	3e1a0fbac4	[TRTLLM-6537][infra] extend multi-gpu tests related file list (#6139 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2025-07-22 16:57:06 +08:00
Yi Zhang	f9b0a911fb	test: Enable GB200 torch compile multi gpu tests (#6145 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-07-21 22:17:13 +08:00
Zhanrui Sun	3cbc23f783	infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) (#4656 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-21 16:06:43 +08:00
Linda	3efad2e58c	feat: nanobind bindings (#6185 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-21 08:56:57 +01:00
Venky	22d4a8c48a	enh: Add script to map tests <-> jenkins stages & vice-versa (#5177 ) Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-19 00:50:40 +08:00
Zhanrui Sun	8454640ee1	infra: fix single-GPU stage failed will not raise error (#6165 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-18 22:39:32 +08:00
Iman Tabrizian	b75e53ab69	Revert "feat: nanobind bindings (#5961 )" (#6160 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-18 10:12:54 +08:00
ixlmar	d71c6fe526	[fix] Update jenkins container images (#6094 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-07-17 16:22:25 +01:00
Linda	5bff317abf	feat: nanobind bindings (#5961 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-17 22:42:52 +08:00
Emma Qiao	1cc49494fe	[Infra] - Add wiave list for pytest when using slurm (#6130 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-07-17 16:53:15 +08:00
QI JUN	e821c68611	CI: update multi gpu test trigger file list (#6131 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-17 14:48:23 +08:00
Zhanrui Sun	4c364b9a73	infra: fix SBSA test stage (#6113 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-17 11:56:03 +08:00
Zhanrui Sun	e42f5a9581	infra: [TRTLLM-5879] Spilt single GPU test and multi GPU test into 2 pipelines (#5199 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-16 18:04:04 +08:00
Bo Deng	ec3ebae43e	[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 (#5991 ) Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com> Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com> Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-07-16 13:54:42 +08:00
Iman Tabrizian	665b4469b3	[fix] Fix Triton build (#6076 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-16 11:17:22 +08:00
Yiteng Niu	9e871ca582	[infra] add more log on reuse-uploading (#6036 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-15 17:18:38 +08:00
Zhanrui Sun	d811843a08	infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… (#5945 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-15 15:39:31 +09:00
Yiqing Yan	6b35afaf1b	[Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report (#5672 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-15 12:27:21 +09:00
Zhanrui Sun	01b2def5ef	infra: [TRTLLM-6331] Support show all stage name list when stage name check failed (#5946 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-15 12:06:03 +09:00
Alex Zhang	6c30d78b78	[TRTLLM-5653][infra] Run docs build only if PR contains only doc changes (#5184 ) Signed-off-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Alex Zhang <13271672+zhanga5@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-14 21:40:33 +08:00
Zhanrui Sun	3a0ef73414	infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check (#5709 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-14 18:52:13 +09:00
Yi Zhang	e5e87ecf34	test: Move some of the test from post merge to pre-merge, update dgx b200 test case (#5640 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
Zhanrui Sun	67a39dbd63	infra: [TRTLLM-6054][TRTLLM-5804] Fix two known NSPECT high vulnerability issues and reduce image size (#5434 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-07-10 23:24:46 +09:00
ixlmar	10e686466e	fix: use current_image_tags.properties in rename_docker_images.py (#5846 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-07-09 17:07:52 +09:00
xavier-nvidia	b6013da198	Fix GEMM+AR fusion on blackwell (#5563 ) Signed-off-by: xsimmons <xsimmons@nvidia.com>	2025-07-09 08:48:47 +08:00
Yiteng Niu	3079e8cf0c	[TRTLLM-5878] update nspect version (#5832 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>	2025-07-08 22:00:09 +08:00
Tailing Yuan	035155df7c	Fix: ignore nvshmem_src_*.txz from `confidentiality-scan` (#5831 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-07-08 17:17:29 +09:00
Tailing Yuan	85b4a6808d	Refactor: move DeepEP from Docker images to wheel building (#5534 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-07-07 22:57:03 +09:00
Yanchao Lu	092e0eb86a	[Infra] - Fix a syntax issue in the image check (#5775 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-07 11:19:59 +09:00
Yiteng Niu	66f299a205	[TRTLLM-5878] add stage for image registration to nspect (#5699 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-06 23:52:54 +08:00
Yanchao Lu	2013034948	[Test] - Waive or fix few known test failures (#5769 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-06 21:14:16 +08:00
Yanchao Lu	d95ae1378b	[Infra] - Always use x86 image for the Jenkins agent and few clean-ups (#5753 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-07-06 10:25:57 +08:00
Yuan Tong	32b244af38	feat: reduce unnecessary kernel generation (#5476 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-07-04 14:37:49 +08:00
Yi Zhang	73d30a23c7	test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests (#5397 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-07-04 13:14:13 +08:00
Yiqing Yan	de0b522dfd	[Infra] - Fix test stage check for the package sanity check stage (#5694 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-07-03 16:39:46 +08:00
ixlmar	04fa6c0cfc	[TRTLLM-6143] feat: Improve dev container tagging (#5551 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-07-02 14:56:34 +02:00
Emma Qiao	31699cbeb1	[Infra] - Set default timeout to 1hr and remove some specific settings (#5667 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-07-02 08:37:54 -04:00
Void	7992869798	perf: better heuristic for allreduce (#5432 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-07-01 22:56:06 -04:00
ixlmar	48eee338bf	fix: constrain grepping in docker/Makefile (#5493 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-07-01 20:12:55 +08:00
Omer Ullman Argov	3b19634a5c	[fix][ci] missing class names in post-merge test reports (#5603 ) Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>	2025-06-30 22:13:29 +08:00
Emma Qiao	b8a568d3c6	[Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539 ) (#5587 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-06-30 18:12:08 +08:00
amirkl94	a985c0b7e6	tests: Move stress tests to be Post-Merge only (#5166 ) Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>	2025-06-29 09:44:47 +03:00
Iman Tabrizian	49af791f66	Add testing for trtllm-llmapi-launch with tritonserver (#5528 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-06-27 11:19:52 +08:00
Omer Ullman Argov	fa0ea92dfd	[fix][ci] trigger multigpu tests for deepseek changes (#5423 ) Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>	2025-06-26 14:30:00 +08:00
Emma Qiao	32d1573c43	[Infra] - Add timeout setting for long tests found in post-merge (#5501 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-06-26 11:31:39 +08:00
QI JUN	478f668dcc	CI: update multi gpu test triggering file list (#5466 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-25 15:51:02 +08:00
Emma Qiao	7f68de3e3f	Refactor test timeout for individual long case (#4757 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-06-19 13:52:11 +08:00
yunruis	b3e886074e	Fix CI build time increase (#5337 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-19 13:49:42 +08:00
Robin Kobus	1a7c6e7974	ci: Split long running jobs into multiple jobs (#5268 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-19 06:24:29 +08:00
yuanjingx87	a1c5704055	[feat] Multi-node CI testing support via Slurm (#4771 ) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-06-19 01:11:12 +08:00
Yiqing Yan	a3a48410f3	Fix rerun step (#5319 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-06-18 16:38:45 +08:00
QI JUN	9ea7bb67a4	CI: fix TensorRT H200 tests (#5301 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-18 14:40:57 +08:00
Emma Qiao	ff32caf4d7	[Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 (#4885 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-06-17 23:48:34 +08:00
Yiteng Niu	dcf18c4bcf	infra[TRTLLM-5635] remove package stage in CI build (#5075 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>	2025-06-17 23:44:47 +08:00
Yanchao Lu	f4cdbfcdf0	None - Some clean-ups for the automation pipeline (#5245 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-06-17 21:08:24 +08:00
QI JUN	ccd9adbe33	CI: move multi-gpu test cases of tensorrt backend to h200 (#5272 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 17:37:37 +08:00
QI JUN	517c1ecf72	move some test cases of TensorRT backend back (#5232 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 17:03:11 +08:00
Tailing Yuan	0b60da2c45	feat: large-scale EP(part 7: DeepEP integration) (#4792 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-14 19:12:38 +08:00
QI JUN	952f33dcad	CI: move all test cases of TensorRT backend into post merge (#5186 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-13 20:48:48 +08:00

335 Commits