Commit Graph

4746 Commits

Author SHA1 Message Date
Xianjie Qiao
87073d1ce4
[None][fix] Fix copy start_logs in disagg slurm scripts (#10840)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2026-01-21 13:31:25 +08:00
Yibin Li
9116dfbacd
[https://nvbugs/5775021] [fix] Replace pickle.load with restricted Unpickler (#10622)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2026-01-21 11:42:54 +08:00
TensorRT LLM
ffd2ed51dd
[None][infra] Check in most recent lock file from nightly pipeline
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
2026-01-21 03:14:33 +00:00
Yanchao Lu
ccf4d79c6c
[None][chore] Revert NVIDIA/TensorRT-LLM#10847 (#10869)
2026-01-21 11:08:40 +08:00
shuyixiong
c381790d15
[https://nvbugs/5670458][chore] Unwaive reward model test (#10831)
Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
2026-01-21 10:34:01 +08:00
Daniel Stokes
2f3b2a3172
[None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-21 10:14:39 +08:00
Yan Chunwei
3c39b1faa9
[https://nvbugs/5759698][fix] unwaive test_base_worker (#10669)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2026-01-20 21:14:03 -05:00
Zheng Duan
26c23cf99f
[https://nvbugs/5760737][test] only skip mooncake+indexerkcache test (#10266)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2026-01-21 09:48:39 +08:00
Simeng Liu
3c8ed19440
[https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… (#10610)
Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
2026-01-20 10:56:56 -08:00
TensorRT LLM
c6163e2b70
[None][infra] Check in most recent lock file from nightly pipeline
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
2026-01-20 18:36:19 +00:00
Izzy Putterman
864b61cadd
[None][feat] Speculative One Model: FlashInfer sampling (#10284)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2026-01-20 12:56:43 -05:00
Lucas Liebenwein
66b239a9a9
[None][fix] fix duplicate entry in waives.txt (#10853)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-20 19:48:01 +02:00
jthomson04
2db3d7eeba
[None][chore] Async Transfer Manager (#9891)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2026-01-20 12:12:47 -05:00
Gal Hubara-Agam
e61c942d1f
[#10707][fix] AutoDeploy: Super accuracy test fixes (#10717)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
2026-01-20 18:16:13 +02:00
Yanchao Lu
ae8f74b620
[None][chore] Reduce tedious logs (#10847)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-20 22:56:24 +08:00
Emma Qiao
3a894951e7
[None][infra] Waive failed cases for main branch on 01/20 (#10829)
Signed-off-by: qqiao <qqiao@nvidia.com>
2026-01-20 17:58:58 +08:00
Bo Deng
338b29d5ae
[None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is mod… (#10624)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2026-01-20 17:55:32 +08:00
Yuxian Qiu
c8a200486d
[https://nvbugs/5701445][chore] unwaive test. (#10806)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-01-20 16:30:32 +08:00
Grzegorz Kwasniewski
eb326073d8
[TRTLLM-10785][feat] Fix sharding dashboard errors (#10786)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2026-01-20 09:25:36 +01:00
Yi Zhang
58311b2345
[None][fix] Remove unused params in attn (#10652)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
2026-01-20 03:08:59 -05:00
xinhe-nv
47e0ec2527
[None][test] Update sanity test list (#10825)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2026-01-20 02:11:42 -05:00
Yiqing Yan
99e8cb0999
[None][fix] Fix vulnerability urllib3 and nbconvert (#10551)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-20 14:51:36 +08:00
xinhe-nv
fc467d06c3
[TRTLLM-8638][fix] Add failed cases into waives.txt (#10787)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2026-01-20 00:48:19 -05:00
benzh-2025
4c8468c5d3
[None][fix] default disable gemm+allreduce fusion (#10656)
2026-01-20 12:31:17 +08:00
xinhe-nv
26bc16842e
[None][chore] Add failed cases into waives.txt (#10776)
Signed-off-by: Jie Li <lijie@nvidia.com>
Co-authored-by: Jie Li <lijie@nvidia.com>
2026-01-19 22:45:40 -05:00
TensorRT LLM
44c5af88dc
[None][infra] Check in most recent lock file from nightly pipeline
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
2026-01-20 03:15:53 +00:00
Bo Li
f3a985ce27
[TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. (#10539)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-20 11:08:04 +08:00
Liao Lanyu
dbb858ae0c
[TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Co-authored-by: junq <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-01-20 10:31:13 +08:00
Lizhi Zhou
c6320d924d
[https://nvbugs/5776445][chore] unwaive test (#10667)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-01-19 21:22:47 -05:00
Zhenhuan Chen
066fa4cd93
[None][chore] update config.yaml of slurm scripts to align with submit.py change (#10802)
Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
2026-01-19 14:46:23 -05:00
Jie Li
ed95e70150
[None][chore] Remove trt flow tests in NIM (#10731)
Signed-off-by: Jie Li <lijie@nvidia.com>
2026-01-19 05:25:39 -05:00
SamareshSingh
64ff5cac52
[None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) (#10320)
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2026-01-19 04:38:00 -05:00
Shi Xiaowei
442d2e8a15
[None][test] adjust the dis-agg test timeout threshold (#10800)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2026-01-19 17:02:00 +08:00
Xianjie Qiao
cc0bbde745
[None][feat] Update disagg slurm scripts (#10712)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2026-01-19 15:53:48 +08:00
Eran Geva
32ab809f36
[#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Eran Geva <egeva@cw-dfw-cs-001-vscode-01.cm.cluster>
Co-authored-by: Eran Geva <egeva@cw-dfw-cs-001-vscode-01.cm.cluster>
2026-01-19 08:48:07 +02:00
TensorRT LLM
baa250d1d6
[None][infra] Check in most recent lock file from nightly pipeline
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
2026-01-19 06:21:05 +00:00
Emma Qiao
935c174283
[None][infra] Waive failed cases for main on 01/19 (#10794)
Signed-off-by: qqiao <qqiao@nvidia.com>
2026-01-19 00:55:26 -05:00
Zhanrui Sun
df845a028b
[TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab (#10616)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2026-01-19 00:40:40 -05:00
Yiqing Yan
68ab1a47c4
[None][chore] Add release/1.2 branch into lockfile generation schedule (#10790)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-19 11:32:08 +08:00
chenfeiz0326
e97af45556
[TRTLLM-10300][feat] Upload regression info to artifactory (#10599)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-01-19 10:16:31 +08:00
Lucas Liebenwein
a6a63f5a36
[https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests (#10769)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-19 10:00:54 +08:00
Chuang Zhu
4f04532ce7
[https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx (#10602)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-01-19 09:20:12 +08:00
Lucas Liebenwein
9879400479
[#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] (#10675)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-18 13:42:30 -05:00
Eran Geva
4d2916d683
[#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size (#10687)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2026-01-18 13:31:01 -05:00
Lucas Liebenwein
b64052539d
[https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test (#10461)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-18 13:20:55 -05:00
TensorRT LLM
3aaed62cfc
[None][infra] Check in most recent lock file from nightly pipeline
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
2026-01-18 09:43:30 +00:00
yuanjingx87
e1cc8d2337
[None][infra] Add sonarqube scanning in lockfile generation pipeline (#10700)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2026-01-18 01:11:28 -08:00
Eran Geva
a11f0dbd61
[#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 (#10697)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2026-01-18 10:42:49 +02:00
Yanchao Lu
0af1a0e478
[None][test] Waive main post-merge test failures 1/18 (#10777)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2026-01-18 15:34:48 +08:00
TensorRT LLM
f8c26409f9
[None][infra] Check in most recent lock file from nightly pipeline
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
2026-01-18 03:07:08 +00:00