Commit Graph

2829 Commits

Author SHA1 Message Date
Ivy Zhang
6b33bcced2
[None][test] Add accuracy benchmark in stress test (#7561)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-09-19 16:09:46 +08:00
dominicshanshan
451475e0dc
[None][ci] Waive llama3 auto dtype test bug in https://nvbugs/5527956. (#7853)
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-19 14:54:59 +08:00
Emma Qiao
ea079fa530
[None][infra] Waive failed tests in post-merge (#7859)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-19 14:16:12 +08:00
Kyungmin Lee
6fcc0540f0
[None][fix] fix load_model_on_cpu on qwen/convert_checkpoint.py (#2382)
Signed-off-by: lkm2835 <lkm2835@gmail.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
2025-09-18 21:54:26 -07:00
QI JUN
f1b362faac
[None][chore] polish error message in cute_dsl_utils.py (#7852)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-19 12:05:11 +08:00
ruodil
c5453103d6
[None][test] add deepseek r1/v3 model with chunked prefill cases (#7124)
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-09-19 11:12:53 +08:00
HuiGao-NV
a6370fd143
[https://nvbugs/5481434][feat] cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-19 10:25:21 +08:00
fredricz-20070104
fc4e6d3702
[TRTLLM-7183][test] Fix model issue for disagg serving (#7785)
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-09-19 10:12:55 +08:00
Chuang Zhu
c98b9468af
[None][fix] get local IP by connecting to a remote host (#7719)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-19 10:01:03 +08:00
xiweny
423e5f6a3c
[TRTLLM-6286] [feat] Update CUTLASS to 4.2 and enable SM103 group gemm (#7832)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-19 09:50:54 +08:00
Yuxian Qiu
d6ebcf7c4a
[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) (#7610)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 09:40:49 +08:00
Ziyi Xiong
420f0fbcf5
[https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda (#7790)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-19 08:11:29 +08:00
QI JUN
7646da2d85
[None][ci] set TORCHINDUCTOR_COMPILE_THREADS correctly (#7800)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-19 07:19:50 +08:00
sunnyqgg
80dd8fe197
[TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle (#7001)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-18 12:05:36 -04:00
dongfengy
026f22eb50
[None][doc] Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch (#7774)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-18 22:50:26 +08:00
Li Min
d921fc3352
[TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-18 21:20:04 +08:00
bhsueh_NV
c65457db8a
[None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" (#7780)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-09-18 20:11:05 +08:00
QI JUN
7f87b278bc
[None][chore] remove generated fmha_cubin.h from source tree (#7836)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-18 20:10:04 +08:00
xinhe-nv
d3a907131a
[https://nvbugs/5519462][fix] Add failed cases into waives.txt (#7817)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-18 20:01:06 +08:00
Wanli Jiang
fe104dc20d
[TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm (#7723)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 17:37:16 +08:00
xinhe-nv
d909f80379
[TRTLLM-7250][fix] Add failed cases into waives.txt (#7807)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-09-18 17:13:07 +08:00
Stefan Niebler
a55251bf75
[None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod (#7732)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-09-18 10:30:50 +02:00
Wanli Jiang
a7ca0fff54
[TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend (#7207)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 16:26:20 +08:00
dongfengy
2ae08bd1b8
[https://nvbugs/5519530][fix] Fix gptoss 2-gpu test (#7819)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-18 16:01:53 +08:00
xinhe-nv
236f71ea05
[None][chore] Add failed cases into waives.txt (#7801)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-09-18 14:48:16 +08:00
Leslie Fang
870cfcf9a0
[None][chore] Remove executor config in create_py_executor (#7599)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-18 14:24:58 +08:00
yuanjingx87
b6e916b762
[None][infra] update ci allow list 2025/09/17 (#7816)
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-09-17 23:21:40 -07:00
mpikulski
1c7f601265
[https://nvbugs/5508890][fix] gen. result cleanup when using PostprocWorker (#7771)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-18 14:01:18 +08:00
Li Min
14e455da3e
[None][fix] Fix CI issue for dsl pkg install (#7784)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-09-18 13:58:20 +08:00
Barry Kang
4f0e6b5f96
[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-09-18 13:51:48 +08:00
Ziyi Xiong
28469dbf27
[https://nvbugs/5523080][fix] Correct the batch index in device tensors (#7803)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-18 13:45:37 +08:00
Ivy Zhang
26d50eb539
[TRTLLM-8070][test] add generation logits case for llama3 (#7759)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-09-18 13:33:16 +08:00
Guoming Zhang
e0423bfaab
[https://nvbugs/5519544][fix] fix invalid expression for disabling pa… (#7806)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-18 12:54:52 +08:00
Yanchao Lu
f8e811d134
[None][chore] Version bump for 1.1.0rc6 (#7824)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-18 11:13:56 +08:00
Yukun He
cd80e0a7f1
[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. (#7529)
tile_tokens_dim depends directly on num_tokens, which is a dynamic shape during both tuning and inference. When the AutoTuner prepares dummy tensors with different num_tokens, it does not update tile_tokens_dim automatically, so the value stored in the AutoTuner cache becomes misaligned and causes frequent cache misses during inference, which hurts performance significantly.

To avoid this, we move the tile_tokens_dim calculation to just before kernel launch, so that tile_tokens_dim always matches the num_tokens of the current input tensor used by the kernel runner (see the sketch after this entry).

In addition, tile_tokens_dim is now computed from the token count of the tuned bucket rather than the raw input token count: values are only tuned per bucket, not per raw token count, so this avoids unexpected misalignment between tile_tokens_dim and the token number.

This PR also removes the warmup requests with extra input shapes that were triggered during the CUDA graph warmup phase.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-18 10:58:52 +08:00
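
A minimal, hypothetical sketch of the just-in-time tile_tokens_dim computation the commit above describes. The helper names and bucket values are assumptions for illustration, not TensorRT-LLM APIs; the point is only the scheme: round the runtime num_tokens to its tuned bucket first, then derive tile_tokens_dim from the bucketed value immediately before launching the kernel.

import bisect

# Hypothetical token buckets used during tuning (power-of-two spaced);
# the real tuner's bucket set may differ.
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

def round_to_bucket(num_tokens: int) -> int:
    """Map a raw token count to the smallest tuned bucket covering it."""
    idx = bisect.bisect_left(BUCKETS, num_tokens)
    return BUCKETS[min(idx, len(BUCKETS) - 1)]

def compute_tile_tokens_dim(num_tokens: int, max_tile: int = 64) -> int:
    """Derive tile_tokens_dim from the *bucketed* token count.

    Called right before kernel launch, so the result always agrees with
    the num_tokens of the tensor actually handed to the kernel runner,
    instead of a value cached once during tuning.
    """
    bucketed = round_to_bucket(num_tokens)
    tile = 1
    while tile < bucketed and tile < max_tile:
        tile *= 2
    return tile

# At launch time: recompute instead of reusing a possibly stale cached value.
num_tokens = 300                              # dynamic shape at this step
print(compute_tile_tokens_dim(num_tokens))   # 64, via the 512-token bucket
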
Yan Chunwei
327e5e5eed
[None][ci] restore unwaive list (#7802)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-18 10:50:34 +08:00
Lucas Liebenwein
39eb120b96
[#7308] [feat] AutoDeploy: graph-less transformers mode for HF (#7635)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-09-18 10:44:24 +08:00
Netanel Haber
a5cfc8368f
[https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) (#7796)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-09-17 21:27:01 -04:00
yunruis
7c03eb9ea2
[https://nvbugs/5516661][fix] Drop waive case 5516661 (#7791)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-09-18 08:55:32 +08:00
Matthias Jouanneaux
022d77807d
[TRTLLM-5966][feat] Helix: make softmax stats pointer available to attention gen (#6865)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-18 05:01:24 +08:00
Anu
2b1472fb0a
[None][doc] Update Documentation link to point to docs instead of docs source code (#6495)
Signed-off-by: Anu <asrivastava274@gmail.com>
2025-09-18 04:39:18 +08:00
Emma Qiao
c4abca323e
[None][infra] Waive failed tests on main (#7812)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-17 23:44:36 +08:00
William Zhang
2614d71994
[TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-17 08:11:16 -07:00
QI JUN
d3467f9f12
[None][doc] fix section header of llm_kv_cache_offloading example (#7795)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-17 17:26:11 +08:00
xinhe-nv
f918302b3a
[TRTLLM-7250][fix] waive block tests (#7782)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-17 15:31:03 +08:00
ruodil
e6073b3911
[None][test] add gpt oss model for trtllm perf test (#7328)
Signed-off-by: Ruodi Lu <ruodil@nvidia.com>
Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-09-17 15:23:21 +08:00
xinhe-nv
7801d0992b
[None][chore] Remove closed bugs (#7697)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-09-17 15:14:09 +08:00
QI JUN
d3e680b3c3
[None][ci] waive test_llama_eagle3[True-FLASHINFER-False-False-False-False-True] (#7788)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-17 15:12:55 +08:00
Fanrong Li
523a17d990
[https://nvbugs/5485325][fix] Cherry-pick #7373: fix the CUDA graph warmup issue when using speculative decoding (#7734)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-09-17 13:57:39 +08:00
QI JUN
39248320d4
[None][feat] add an example of KV cache host offloading (#7767)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-17 13:51:15 +08:00