Commit Graph

658 Commits

Author SHA1 Message Date
Wanli Jiang
2a30f11d63
[None][chore] Upgrade transformers to 4.56.0 (#7523)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 22:20:16 +08:00
Emma Qiao
324301ccba
[None][infra] Skip failed test for nvbugs 5532023 (#7905)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-22 03:49:44 -07:00
Yechan Kim
f77aca9f2c
[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-09-22 03:40:02 -07:00
Bo Deng
8cf95681e6
[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package (#7766)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-09-22 16:43:35 +08:00
xinhe-nv
9c1b75e978
[TRTLLM-7070][feat] add gpt-oss chunked prefill tests (#7779)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-22 00:12:43 -07:00
Yanchao Lu
5c8b022d1e [None][ci] Test waives for the release/1.0 branch 09/15 (#7700)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Simeng Liu
99995846b3 [https://nvbugs/5470782][chore] Remove the skip statement in 1.0 rele… (#7573)
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Yan Chunwei
afca2fcbe0 [https://nvbugs/5351244][fix] test_mpi_session (#7501)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-22 14:28:38 +08:00
Stefan Niebler
8aead224fb
[https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow (#7808)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
2025-09-21 22:15:18 -07:00
Ziyi Xiong
897c4dd23b
[https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec (#7728)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-21 08:20:48 +08:00
Yan Chunwei
4509d97780
[TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reusing (#7840)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-09-20 06:24:22 -07:00
Chang Liu
2e317a7db6
[https://nvbugs/5520490][fix] Fix intermittent test failures by avoiding external web data pulls (#7879)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-19 17:24:13 -07:00
Liao Lanyu
18095a7cb8
[https://nvbugs/5503440][fix] Fix potential hang due to wrong type of ZMQ socket and protocol for worker_init_status_queue (#7646)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-09-19 18:13:33 +08:00
xinhe-nv
efb763402f
[None][chore] Add failed cases into waives.txt (#7841)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-19 17:59:47 +08:00
Ziyi Xiong
420f0fbcf5
[https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda (#7790)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-19 08:11:29 +08:00
QI JUN
7646da2d85
[None][ci] set TORCHINDUCTOR_COMPILE_THREADS correctly (#7800)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-19 07:19:50 +08:00
Li Min
d921fc3352
[TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-18 21:20:04 +08:00
Leslie Fang
870cfcf9a0
[None][chore] Remove executor config in create_py_executor (#7599)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-18 14:24:58 +08:00
Li Min
14e455da3e
[None][fix] Fix CI issue for dsl pkg install (#7784)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-09-18 13:58:20 +08:00
Yukun He
cd80e0a7f1
[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. (#7529)
tile_tokens_dim directly depends on the num_token, which is a dynamic shape during tuning and inference. When AutoTuner prepares dummy tensors with different num_tokens, it does not update the value of tile_tokens_dim automatically. Therefore, the value stored in the AutoTuner cache is misaligned, which will introduce a lot of cache misses during inference, which hurts perf a lot.

To avoid this issue, we move the calculation of tile_tokens_dim right before kernel launching, so that the value of tile_tokens_dim is always up to date with the num_tokens of the current input tensor used for the kernel runner.

Also, the tile_tokens_dim is calculated based on the number of tokens of a tuned bucket, instead of the original token number. Because we only tune the value for the buckets, not for the raw input token number, to avoid unexpected misalignment between tile_tokens_dim and the token number.

This PR also removes the warmup requests with the extra input shapes, which are triggered in the CUDA graph warmup phase.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-18 10:58:52 +08:00
Lucas Liebenwein
39eb120b96
[#7308] [feat] AutoDeploy: graph-less transformers mode for HF (#7635)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-09-18 10:44:24 +08:00
Netanel Haber
a5cfc8368f
[https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) (#7796)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-09-17 21:27:01 -04:00
Emma Qiao
c4abca323e
[None][infra] Waive failed tests on main (#7812)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-17 23:44:36 +08:00
William Zhang
2614d71994
[TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-17 08:11:16 -07:00
QI JUN
d3e680b3c3
[None][ci] waive test_llama_eagle3[True-FLASHINFER-False-False-False-False-True] (#7788)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-17 15:12:55 +08:00
Lucas Liebenwein
4c3dc89f84
[None][chore] AutoDeploy: clean up of model unit test configuration (#7742)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-17 10:42:01 +08:00
Kaiyu Xie
62042a9733
[TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Cheng Hang <chang@nvidia.com>
2025-09-17 09:41:32 +08:00
Iman Tabrizian
a91453de34
[None][waive] Waive tests (#7775)
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-09-16 19:42:32 -04:00
xinhe-nv
1fbea497ff
[TRTLLM-7070][feat] add gpt-oss serve benchmark tests (#7638)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-16 16:39:31 +08:00
Li Min
b278d06481
[TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op (#7632)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-16 14:25:26 +08:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
Ziyi Xiong
536e8776cd
[TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding (#7651)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-16 07:33:44 +08:00
ixlmar
965a3dab90
[None][test] add test for min_tokens (#7678)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-15 08:59:23 +01:00
JunyiXu-nv
a2c45d82c3
[None][chore] Enable multiple postprocess workers tests for chat completions api (#7602)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-09-15 12:16:44 +08:00
Chang Liu
47e37755a3
[TRTLLM-6903][feat] Support chunked prefill for multimodal models (#6843)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-14 20:10:10 -07:00
Yanchao Lu
70aa4e28c1
[None][ci] Test waives for the main branch 09/14 (#7698)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-14 23:48:04 +08:00
Pengyun Lin
c2bc39af63
[TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend (#6097)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-09-12 15:32:34 +08:00
Guoming Zhang
ef676fc71f
[https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… (#7679)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-11 19:00:16 +08:00
Emma Qiao
9986070044
[None][infra] Waive failed cases on main 0910 (#7676)
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-11 01:43:29 +08:00
xinhe-nv
207c5258c4
[https://nvbugs/5494698][fix] skip gemma3 27b on blackwell (#7505)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-10 21:09:27 +08:00
Frida Hou
bbb5ae3349
[#5861][autodeploy] Refactor: Quantization Transforms with Inheritance (#7227)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-09-10 13:00:06 +08:00
Zheyu Fu
c353ff342e
[None][feat] Make the should_use_spec_decode logic a bit smarter (#7112)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-10 12:53:59 +08:00
Guoming Zhang
beefd6413e
[None][fix] fix post-merge issue raised by #5488 (#7655)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-10 09:26:27 +08:00
Jin Li
d49374bc45
[TRTLLM-7408][feat] Wrap MOE with custom op. (#7277)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-09 12:18:56 -04:00
QI JUN
a0e1604898
[None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline (#7629)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-09 11:06:32 -04:00
William Zhang
c53d1814a7
[None][feat] Extend VLM factory and add Mistral3 factory (#7583)
This commit:

* extends existing factory interfaces to enable Mistral3 in AutoDeploy.
* adds a Mistral3 VLM factory.
* adds various model patches for pixtral (the vision model) and mistral3
  to make the VLM export compliant.
* adjusts checkpoint loading code to take possible parameter name
  conversions into account.
* fixes a sampling bug (the `end_id` needs to be take into account when
  sampling, but it is not included in the stop words' token IDs).

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-09 02:47:18 -04:00
Eran Geva
5f2a42b3df
[TRTLLM-6142][feat] AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-08 08:45:58 -04:00
Chang Liu
4a1e13897f
[None][feat] Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-08 07:42:46 -04:00
dominicshanshan
c9dca69e1b
[None][chore] Mass integration of release/1.0 - 3rd (#7519)
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Nave Assaf <55059536+Naveassaf@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: yifeizhang-c <219273404+yifeizhang-c@users.noreply.github.com>
Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: milesial <milesial@users.noreply.github.com>
Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-09-08 14:03:04 +08:00
JunyiXu-nv
504bb7ffa9
[TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API (#7508)
Signed-off-by: Junyi Xu 
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-09-08 11:11:35 +08:00