Commit Graph

1988 Commits

Author SHA1 Message Date
Yuening Li
1f8ae2b2db
[TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-08-15 17:15:49 -04:00
dongfengy
0ad0b967bb
[None][fix] Make TP work for Triton MOE (in addition to the EP we are using) (#6722)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-15 16:58:42 -04:00
ajrasane
4162d2d746
[None][test] Add accuracy evaluation for AutoDeploy (#6764)
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-15 13:46:09 -04:00
yifeizhang-c
4127d77678
[https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 (#6537)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-08-15 09:52:06 -07:00
liji-nv
18ccd053d3
[https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… (#6858)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-15 11:14:20 -04:00
tomeras91
f7dbc1435a
[None] [chore] Mamba cache in separate file (#6796)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-08-15 13:42:51 +03:00
Bo Li
15aabc1540
[None][fix] Fix perfect router. (#6797)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-08-14 20:09:08 -07:00
Frank
2cc59aacb3
[None][fix] Correct reporting of torch_dtype for ModelConfig class. (#6800)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-08-14 22:46:20 -04:00
qianbiao
5c2f0fd03d
[None] [feat] Add Tencent HunYuanMoEV1 model support (#5521)
Signed-off-by: sorenwu <sorenwu@tencent.com>
Co-authored-by: sorenwu <sorenwu@tencent.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-08-15 06:56:44 +08:00
Mike Iovine
078e907b16
[https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell (#6873)
Signed-off-by: Michael Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <mike.iovine7@gmail.com>
2025-08-14 18:36:19 -04:00
Bo Li
26f413ad90
[https://nvbugs/5450262][fix] Fix unsupported alltoall use case (#6882)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-08-14 17:46:54 -04:00
Matthias Jouanneaux
69574ad730
[TRTLLM-5966][feat] Helix: extend mapping to support different CP types (#6816)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-08-14 09:00:02 -07:00
kris1025
4aed7a7d19
[TRTLLM-6853][feat] refactor deepseekv3 model (#6698)
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-08-14 11:03:17 -04:00
Pengbo Wang @ NVIDIA
ffc976ceaf
[https://nvbugs/5445466][fix] fix deepseek r1 hang by not enabling mnnvl by default (#6860)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-14 22:36:56 +08:00
Shi Xiaowei
1095dfd03c
[None][fix] BREAKING CHANGE: Mismatch between docs and actual commands (#6323)
2025-08-14 03:48:57 -04:00
Yan Chunwei
0132c1db84
[https://nvbugs/5427043][fix] request length exceeds max_num_tokens (#6821)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-08-14 13:31:12 +08:00
Bo Deng
d8acca495b
[TRTLLM-6675][infra] Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6623 (#6735)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-14 04:36:38 +00:00
jmydurant
4200fa46d1
[None][feat] Add support for Hopper MLA chunked prefill (#6655)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-08-14 10:39:26 +08:00
Izzy Putterman
ef53de8eef
[None][feat] Add test for speculative rejection sampler (2-model) (#6542)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-08-13 22:09:35 -04:00
Tin-Yin Lai
6c52bb07ff
[https://nvbugs/5302040][feat] Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
Signed-off-by: tinyinl <tinyinl@nvidia.com>
2025-08-13 11:19:13 -07:00
danielafrimi
bda42f8c3a
[None][feat] Support running heterogeneous model execution for Nemotron-H (#6866)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-08-13 19:51:19 +03:00
Anthony Chang
2198587b35
[https://nvbugs/5378031] [feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-08-13 21:24:40 +08:00
Yukun He
bc5f766e0e
[TRTLLM-4501][feat] AutoTuner tuning config refactor and valid tactic generalization. (#6545)
* Generalize the definition of tactics so that users can implement more customizable tactic types, making the configuration clearer for each kernel run (see the sketch after this entry).
* Make the `gen_tuning_buckets` and `map_to_tuning_buckets` functions optional.
* Other code refactoring.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-13 16:25:22 +08:00
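The entry above describes the shape of the new tuning config rather than its exact API. A purely illustrative sketch, with hypothetical names (`TuningConfig`, `valid_tactics`) standing in for whatever the AutoTuner actually exposes:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence

@dataclass
class TuningConfig:
    # Tactics are opaque, kernel-defined descriptors rather than bare
    # indices, so each kernel's configuration space stays self-describing.
    valid_tactics: Sequence[Any] = ()
    # Both bucket callbacks are optional; when omitted, the tuner can
    # simply profile the exact shapes it encounters.
    gen_tuning_buckets: Optional[Callable[[int], Sequence[int]]] = None
    map_to_tuning_buckets: Optional[Callable[[int], int]] = None

# e.g. a GEMM-like kernel exposing tile shapes as tactics, with
# power-of-two bucketing for the dynamic M dimension:
cfg = TuningConfig(
    valid_tactics=({"tile_m": 64}, {"tile_m": 128}),
    map_to_tuning_buckets=lambda m: 1 << max(m - 1, 0).bit_length(),
)
```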
Void
1d80df0955
[None][feat] DeepEP LL combine FP4 (#6822)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-08-13 04:20:21 -04:00
Mike Iovine
f68e03e646
[https://nvbugs/5452167][fix] Fix ngram padding issue (#6837)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-13 11:23:16 +08:00
Yechan Kim
12102e2d48
[TRTLLM-6772][feat] Multimodal benchmark_serving support (#6622)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-12 19:34:02 -07:00
Fanrong Li
1bbc0e323b
[None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc (#6811)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-13 10:27:57 +08:00
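The fix above applies a standard pattern: hold one persistent workspace and grow it monotonically, so steady-state forward passes trigger no cudaMalloc/cudaFree. A minimal sketch of that pattern (not the repo's actual workspace manager):

```python
import torch

class Workspace:
    """Grow-only scratch buffer reused across MoE invocations."""

    def __init__(self, device: str = "cuda"):  # use "cpu" to test without a GPU
        self.buf = torch.empty(0, dtype=torch.uint8, device=device)

    def get(self, nbytes: int) -> torch.Tensor:
        # Reallocate only when the request outgrows the current buffer;
        # afterwards every call is a zero-cost slice of existing memory.
        if self.buf.numel() < nbytes:
            self.buf = torch.empty(nbytes, dtype=torch.uint8,
                                   device=self.buf.device)
        return self.buf[:nbytes]
```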
rakib-hasan
2923eb88a1
[None][fix] Refactoring input prep to allow out-of-tree models (#6497)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-08-12 20:29:10 -04:00
dongxuy04
bd9a6dd9ab
[TRTLLM-7008][fix] fix wideEP weights loading and args (#6789)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-08-12 19:14:20 -04:00
Robin Kobus
45c7518032
[None][refactor] Simplify decoder state initialization (#6559)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:44:41 +02:00
Robin Kobus
dd11e08d26
[#6187][feat] add LayerNorm module (#6625)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:43:30 +02:00
nvchenghaoz
81f0ded1c4
[None][feat] Add GPT OSS support for AutoDeploy (#6641)
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
2025-08-12 14:03:22 -04:00
Jhao-Ting Chen
a060e12041
[https://nvbugs/5438869][fix] Set nvfp4 expert w1 w3 weight scale to the same value if they're not (#6656)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-08-12 20:47:10 +08:00
Shunkangz
ab0d768acf
[None][fix] Fix attention dp log (#6570)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-08-12 04:53:09 -04:00
Liao Lanyu
f7c13a4aa7
[TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp (#6745)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-08-12 16:45:16 +08:00
Sergey Klevtsov
27fc35175e
[None][feat] CUTLASS MoE FC2+Finalize fusion (#3294)
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-12 15:56:48 +08:00
Fridah-nv
0dc4b4e699
[#4403][autodeploy] Refactor: Move more transformations to new inf optimizer, Add quantization_source to factory interface (#6760)
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-08-11 22:02:46 -07:00
Enwei Zhu
7c686ba8de
[TRTLLM-2285][feat] Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-12 09:30:06 +08:00
Ziyi Xiong
b4fcd5f592
[https://nvbugs/5441438][fix] Set correct draft length for the cuda graph dummy request (#6701)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-08-12 09:28:47 +08:00
Jinyang Yuan
ead89a0e40
[None][perf] Improve the performance of online EPLB on Hopper by better overlapping (#6624)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-12 09:25:13 +08:00
Chang Liu
be9dd4713c
[https://nvbugs/5385987][fix] Fix Qwen2 quantization issue by pinning transformers version (#6673)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-08-11 17:16:49 -07:00
rakib-hasan
7ab8112450
[None][fix] Refactoring to avoid circular import when importing torch models (#6720)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-08-11 18:00:42 -04:00
bhsueh_NV
83dbc6c75d
[TRTLLM-5532][feat] store the block of context request into kv cache (#6683)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-11 16:14:52 +08:00
Tracin
49bcaa4e95
Add gpt-oss GSM8K test. (#6732)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-08-10 22:45:43 -04:00
Zero Zeng
4b4b91ab51
[None][feat] improve dataloading for benchmark_dataset by using batch… (#6548)
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-11 09:50:41 +08:00
Yechan Kim
60073a7ad9
[None][feat] Support SharedTensor on MultimodalParams (#6254)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-10 17:48:24 -07:00
shaharmor98
b6baa9ed9b
[TRTLLM-6823][doc] Add checkpoint refactor docs (#6592)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-08-10 19:47:39 -04:00
shaharmor98
14b36e07d7
[TRTLLM-6174][feat] Enable FP32 mamba ssm cache (#6574)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-08-10 16:27:51 -04:00
Gal Hubara-Agam
3c5aec19c2
[#5048][enhance] AutoDeploy: Optimize prepare_inputs (#6634)
Optimize prepare_inputs routine in AutoDeploy, as part of the effort to reduce the performance gap compared to the default backend.
This PR includes two major fixes, and some other minor tweaks:
1. Avoid back and forth data copies
2. Optimize position ids update by separating the implementation for generation mode and context mode.

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-10 13:55:04 +03:00
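An illustrative sketch of fix 2 above, with hypothetical names rather than AutoDeploy's actual code: generation mode advances cached position ids in place on device, while only context mode rebuilds them from prompt lengths.

```python
import torch

def update_position_ids(position_ids: torch.Tensor,
                        seq_lens: torch.Tensor,
                        is_generate: bool) -> torch.Tensor:
    if is_generate:
        # One new token per sequence: a cheap in-place increment with
        # no host-device copy.
        position_ids += 1
        return position_ids
    # Context mode: build [0..len) per prompt once, on device.
    return torch.cat([torch.arange(int(n), device=seq_lens.device)
                      for n in seq_lens])
```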
Ziyi Xiong
de472828b9
[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-08-09 23:15:04 +08:00
Yilin Fan
d643aef73c
[Perf] Improve Llama4 performance for small max_seqlen cases (#6306)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-08-09 02:58:31 -04:00
Ye Zhang
bcf5ec0c9a
[None][feat] Core Metrics Implementation (#5785)
Signed-off-by: Ye Zhang <zhysishu@gmail.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-08-09 02:48:53 -04:00
Yibin Li
97787883c3
[TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-08-08 21:40:48 -04:00
dongfengy
d06675071e
[None][fix] WAR GPT OSS on H20 with Triton MOE (#6721)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-08 19:47:09 -04:00
Mike Iovine
90145cf557
[None][feat] Optimize CUDA graph memory usage for spec decode cases (#6718)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-08 13:56:53 -04:00
Wanli Jiang
d45236b253
[TRTLLM-6308][feat] Support Aggregate mode for phi4-mm (#6184)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-08 20:09:26 +08:00
Stefan Niebler
b8f036f264
[TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding (#6665)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-08-08 14:00:33 +02:00
Liao Lanyu
32ad7f3c12
[None][fix] Remove lock related typo in py_executor (#6653)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-08-08 17:48:57 +08:00
JunyiXu-nv
5f45227a93
[https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend (#6690)
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-08 17:48:23 +08:00
Yuxian Qiu
9ff4e75f14
[None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout (#6654)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-08 17:11:41 +08:00
Li Min
d913955952
[TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell (#6616)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-08-08 15:03:48 +08:00
2ez4bz
064eb7a70f
[TRTLLM-5252][fix] Propagate mapping to intermediate layers (#6611)
This commit propagates the mapping to intermediate layers to enable
tensor parallelism (amongst other things) in them.

It also fixes issues with a unit test for TP for pixtral, and adds it to a
test list.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-08-08 01:50:36 -04:00
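A schematic sketch of the commit's shape, with illustrative names only: the parallel mapping now reaches intermediate layers instead of stopping at the top-level module, so their projections actually shard under TP.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    tp_size: int = 1
    tp_rank: int = 0

class IntermediateLayer:
    def __init__(self, width: int, mapping: Mapping):
        # Each TP rank owns width // tp_size of the projection; with a
        # default Mapping() the layer silently stays unsharded.
        self.local_width = width // mapping.tp_size

class VisionTower:
    def __init__(self, width: int, mapping: Mapping):
        # The fix, in spirit: forward the caller's mapping instead of
        # constructing a fresh single-rank Mapping() here.
        self.layers = [IntermediateLayer(width, mapping) for _ in range(4)]
```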
Enwei Zhu
aee828d98a
[TRTLLM-6854][feat] Enable guided decoding with disagg serving (#6704)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-08 12:10:36 +08:00
zhanghaotong
1cf669496a
[None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference (#6626)
Signed-off-by: 皓聪 <zhanghaotong.zht@alibaba-inc.com>
Co-authored-by: 皓聪 <zhanghaotong.zht@alibaba-inc.com>
2025-08-07 23:44:47 -04:00
NVJiangShao
2f2f5cc72c
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE (#6231)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-08 11:13:42 +08:00
Daniel Cámpora
efca359b66
[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-07 22:19:37 -04:00
Iman Tabrizian
82276167e6
[None][feat] Add NCCL Symmetric Integration for All Reduce (#4500)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-07 17:28:14 -07:00
Haohang Huang
980929e1a9
[https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave (#6708)
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-08-07 15:30:16 -07:00
Yuan Tong
db8dc97b7b
[None][fix] Migrate to new cuda binding package name (#6700)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-07 16:29:55 -04:00
Mike Iovine
e968f98b43
[None][feat] Clean up ngram auto mode, add max_concurrency to configs (#6676)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-07 12:51:47 -04:00
Emma Qiao
3c44b44e45
[None][infra] Fix guardwords (#6711)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-07 21:06:47 +08:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
Enwei Zhu
1b9781e8e7
[TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-07 05:53:48 -04:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss (#6645)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
Yiqing Yan
5fa1914cab
[None][chore] Bump version to 1.1.0rc0 (#6651)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-07 13:39:49 +08:00
Izzy Putterman
7e0158b583
Qwen3: Fix eagle hidden states (#6199)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-08-06 17:05:18 -04:00
Hanjun Cho
80f918cc22
[None][feat] Add Qwen3 MoE support to TensorRT backend (#6470)
Signed-off-by: gkswns0531 <gkswns0531@gmail.com>
Signed-off-by: hanjuncho <gkswns0531@gmail.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-08-06 17:02:35 +08:00
Zongfei Jing
0ff8df95b7
[https://nvbugs/5433581][fix] DeepGEMM installation on SBSA (#6588)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-06 16:44:21 +08:00
Netanel Haber
83ee91e17b
[None][fix] Fix 6522 mpi.pkl5.intracomm.Request has wait not Wait (#6646)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-08-06 14:18:09 +08:00
JunyiXu-nv
13e0214fe0
[TRTLLM-6263][feat] Enable fp8 SwiGLU to minimize host overhead (#6540)
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-06 10:42:19 +08:00
brb-nv
9a01934dbf
[None][feat] Switch to internal version of MMProjector in Gemma3 (#6572)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-05 21:48:23 -04:00
yunruis
3ff4f503ad
[None][opt] ADP schedule balance optimization (#6061)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-08-06 09:38:02 +08:00
Yechan Kim
c17f4984e2
[None][feat] Refactor Llava-Next (#6478)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-05 17:53:53 -07:00
Aurelien Chartier
6da95f29a9
[None][feat] Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-08-05 11:22:32 -07:00
Wanli Jiang
46df8712c8
[https://nvbugs/5355007][fix] Set enable_chunked_context as True by default in trtllm bench (#6582)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-05 11:11:36 -07:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling (#5170)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
liji-nv
dcbfa7e509
[https://nvbugs/5252313][fix] Fix torch compile + MTP (#6554)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-05 10:31:29 -04:00
Venky
61da2daeb4
[TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system (#6464)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-08-05 07:14:24 -07:00
Pengbo Wang @ NVIDIA
c289880afb
[None][fix] fix kimi k2 serving and add test for Kimi-K2 (#6589)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-08-05 18:05:33 +08:00
amitz-nv
dc84695520
[TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-05 11:28:26 +03:00
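mpi4py's documented pkl5 module wraps a communicator so large pickled objects are streamed via pickle-protocol-5 out-of-band buffers, lifting the 2 GiB message limit. A minimal sketch, assuming mpi4py >= 3.1; note the lowercase send/recv/wait convention of this wrapper, which the follow-up fix in #6646 also touches:

```python
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)  # drop-in wrapper around COMM_WORLD

if comm.Get_rank() == 0:
    payload = bytearray(3 * 1024**3)   # > 2 GiB, fine under pkl5
    comm.send(payload, dest=1, tag=11)
elif comm.Get_rank() == 1:
    payload = comm.recv(source=0, tag=11)
```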
danielafrimi
ed801ff74b
[None][fix] Remove expand configuration from mamba2 mixer (#6521)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-08-05 04:18:25 -04:00
Haohang Huang
c9eebcb454
[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Signed-off-by: symphonylyh <31998628+symphonylyh@users.noreply.github.com>
2025-08-05 07:47:41 +00:00
kris1025
6a3a921284
[TRTLLM-6685][feat] Add speculative metrics for trt llm bench (#6476)
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-08-04 15:22:57 -07:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow (#5997)
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
brb-nv
87e4e9f468
[None][chore] Add unit test for Gemma3 lora (#6560)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-04 04:56:57 -04:00
Yiqing Yan
3916dbd98b
[None][chore] Bump version to 1.0.0rc6 (#6597)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-04 04:39:15 -04:00
Pengyun Lin
a15e33351d
[None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens (#6259)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-08-04 15:09:51 +08:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory (#5034)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Yechan Kim
ee6ab5be96
chore: add EXAONE4 accuracy test (#6397)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-04 10:14:16 +08:00
Jinyang Yuan
df90202b51
[fix] Fix DeepSeek w4a8 weight loading (#6498)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-04 10:12:06 +08:00
Chuang Zhu
542f552d0b
use cudaSetDevice to create context, fix nvbug 5394497 (#6403)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-03 13:32:55 -04:00
Shunkangz
67a3fd858b
[None][feat] Add support of scheduling attention dp request (#6246)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-01 20:38:01 -04:00
Richard Huo
31802de0b0
[None][fix] Serialize the window_size in the kv event (#6526)
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
2025-08-01 15:25:18 -07:00
Lucas Liebenwein
5247df6ae2
[AutoDeploy] merge feat/ad-2025-07-22 (#6520)
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Gal Agam <ghubaraagam@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: haoguo <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Gal Agam <ghubaraagam@cw-dfw-h100-004-328-012.cm.cluster>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-08-01 08:51:08 -07:00
brb-nv
7447d6ed85
[TRTLLM-6657][feat] Add LoRA support for Gemma3 (#6371)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-01 09:19:54 -04:00
liji-nv
1daa8c3232
[https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… (#6355)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-01 07:38:06 -04:00
Yukun He
90856bf97d
[https://nvbugs/5419069][fix] Fix the mismatched layer name components. (#6417)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-01 16:32:39 +08:00
Zero Zeng
48768fd720
fix: Fix missing key (#6471)
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-01 14:25:58 +08:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler (#6484)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
Jaedeok Kim
fbee279909
fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-31 22:34:34 -04:00
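For context on the fix above: the usual per-sequence KV cache sizing counts the layer dimension exactly once, so a duplicated multiplication inflates the estimate by a factor of num_layers. A sketch of the standard formula (illustrative names, not the repo's actual helper):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # Factor 2 covers the K and V tensors; num_layers appears once.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# e.g. 32 layers, 8 KV heads, head_dim 128, 4096 tokens, fp16:
print(kv_cache_bytes(32, 8, 128, 4096) / 2**30)  # 0.5 GiB per sequence
```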
Zongfei Jing
7bb0a78631
Deepseek R1 FP8 Support on Blackwell (#6486)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-01 10:26:28 +08:00
Venky
8c165fd27a
[TRTLLM-6611][feat] Add warnings and stricter validation to LoraManager adapter loading (#6453)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-07-31 22:22:51 -04:00
Yukun He
00059de380
chore: Improve the AutoTuner log information. (#6368)
* Change the fallback alert from DEBUG to WARNING level and emit it only once (see the warn-once sketch after this entry).
* Add debug information for the profiling cache right after the warmup phase.
* Lower the tactic-profiling exception messages from ERROR to WARNING; full exception details are pushed to DEBUG.
* Other trivial refinements and cleanups.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-01 09:19:52 +08:00
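The first bullet above is the classic warn-once pattern; a generic sketch of it, not the AutoTuner's actual logger:

```python
import logging

logger = logging.getLogger("autotuner")
_seen: set = set()

def warn_once(key: str, msg: str) -> None:
    if key not in _seen:
        _seen.add(key)
        logger.warning(msg)   # fallback alert surfaces once per key
    else:
        logger.debug(msg)     # repeats stay at DEBUG level
```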
brb-nv
2eca0d5925
fix: Fix poor generation with FP8 Gemma3 1B checkpoint (#6499)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-31 17:18:23 -07:00
Simeng Liu
8cf3faa26a
[feat] Auto-enable ngram with concurrency <= 32. (#6232)
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <mike.iovine7@gmail.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
Co-authored-by: Mike Iovine <mike.iovine7@gmail.com>
2025-07-31 18:45:51 -04:00
Ziyi Xiong
8062e0fe7c
[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically (#6363)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-31 15:31:39 -04:00
shaharmor98
0c42f54a39
Bugfix/fix nemotron nas lora support (#6380)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-31 13:39:35 -04:00
amitz-nv
1ee7a08d2b
[5830][feat] Improve LoRA cache memory control (#6220)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-31 09:26:38 +03:00
dongjiyingdjy
17e0d0fb1a
fix: fix illegal memory access (#6437)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-07-31 10:01:34 +08:00
Enwei Zhu
4b299cb77e
feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-31 09:53:52 +08:00
Vadim Gimpelson
25cd4f215e
[PERF] Move calculation Qwen2-VL's rotary_cos_sin to LLM worker process (#6004)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-07-31 09:35:24 +09:00
shaharmor98
f9cf683e39
add propagation of trust_remote_code to OpenAIServer (#6446)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-30 15:25:41 -04:00
Wanli Jiang
9632dba02e
feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-30 09:20:16 -07:00
NVShreyas
e67f4da9b5
[Perf]: Add residual, norm for nemotron_nas models (#6455)
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-07-30 09:10:38 -07:00
Chang Liu
b4065d8ca6
[TRTLLM-6654][feat] Add support for external multimodal embeddings (#6263)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-07-30 10:00:15 -04:00
pcastonguay
e7ae5e2824
feat: Add support for disaggregation with pp with pytorch backend (#6369)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-07-30 09:42:13 -04:00
tomeras91
a2514d93fc
[nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 (#6447)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-30 07:22:32 -04:00
QI JUN
2fe9cc0889
chore: remove draft_model_engine from init parameter list of PyExecutor (#6325)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-30 03:31:49 -04:00
QI JUN
1f39a11af0
chore: clean code of PyExecutor (#6445)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-30 02:11:43 -04:00
2ez4bz
d6eed1b624
[fix] Switch placement of image placeholder for mistral 3.1 (#6435)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-30 14:10:36 +08:00
Jinyang Yuan
a427f5bece
[fix] Fix wide EP when using DeepEP with online EPLB (#6429)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-30 00:13:18 -04:00
Zheng Duan
c9ed1ab436
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure (#6135)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-30 10:39:40 +08:00
peaceh-nv
5b420ad267
Rename layer to comply with deepseek (#6393)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-30 10:00:48 +08:00
Yechan Kim
d6eb8e2366
fix: support mixture of text & multimodal prompts (#6345)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-30 08:52:31 +08:00
Yunfan Fan
1a8e28d295
[FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
Signed-off-by: fanyunfan <2569548856@qq.com>
Co-authored-by: fanyunfan <2569658856@qq.com>
2025-07-30 07:13:44 +08:00
Yan Chunwei
ad662ddcdd
chore: disallow arbitrary in llm_args.Configs (#6367)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-29 16:16:52 -04:00
Michal Guzek
7efe3cb0cd
[fix] Add detokenization-based stop word logic to LLM API (#5948)
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-07-29 10:16:59 -07:00
Yukun He
0eee2e2850
[5385981] fix: Update the usage of VisionAttention init API. (#6413)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-29 16:41:48 +08:00
QI JUN
13e24ab1cb
chore: remove unused code in PyExecutor (#6351)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-29 16:24:26 +08:00
Frank
d2a04abb95
[fix] Fixes to parameter usage and low latency configuration. (#6343)
2025-07-29 01:36:13 -04:00
nv-guomingz
49044733e1
chore: delete useless gitkeep files. (#6400)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-28 11:38:30 -04:00
QI JUN
4efc6496b7
chore: add _prepare_and_schedule_batch function in PyExecutor (#6365)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-28 05:50:27 -04:00
Yan Chunwei
45d441e60c
[TRTLLM-5061] chore: add status tags to LLM API reference (#5707)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-28 15:57:07 +08:00
Zero Zeng
c9b8b6180f
Add Acceptance Rate calculation to benchmark_serving (#6240)
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-07-28 14:00:58 +08:00
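For reference, speculative-decoding acceptance rate is conventionally accepted draft tokens over proposed draft tokens; a one-line sketch with illustrative field names (not necessarily benchmark_serving's schema):

```python
def acceptance_rate(accepted_draft_tokens: int,
                    proposed_draft_tokens: int) -> float:
    # Guard against division by zero for requests with no draft tokens.
    return accepted_draft_tokens / max(proposed_draft_tokens, 1)
```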
Jinyang Yuan
97f7e12588
[fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency (#6288)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-28 01:37:11 -04:00
Chang Liu
dc757799e1
[nvbugs/5401156][fix] Avoid importing all models when importing trtllm._common (#6266)
2025-07-27 23:29:21 -04:00
Void
f172face98
DeepEP LL dispatch FP4 (#6296)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-28 11:25:42 +08:00
Yukun He
93a0fd0a23
[TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-28 09:36:26 +08:00
YueWeng
2dd3186727
fix: remove cudaStreamSynchronize when using relaxed acceptance (#5262)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-07-28 09:18:41 +08:00
Ziyi Xiong
d853811190
[https://nvbugs/5402719][fix]: Add cuda graph dummy requests to the spec_resource_manager (#6258)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-26 20:32:39 -04:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
ameynaik-hub
1e5e71aa42
MTP optimizations round 1 (#5689)
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-07-25 13:48:27 -04:00
nv-guomingz
b8d4cb8beb
feat: Support JSON Schema in OpenAI-Compatible API (#6321)
Signed-off-by: noiji <52301388+noiji@users.noreply.github.com>
2025-07-25 12:55:56 -04:00
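A hedged sketch of how such a request typically looks against an OpenAI-compatible endpoint; the model name, port, and exact field support in trtllm-serve are assumptions to verify against the docs:

```python
import json
import urllib.request

payload = {
    "model": "local-model",  # placeholder
    "messages": [{"role": "user", "content": "Return a user as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "schema": {
                "type": "object",
                "properties": {"name": {"type": "string"},
                               "age": {"type": "integer"}},
                "required": ["name", "age"],
            },
        },
    },
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed local server
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```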
xiaoqi
a0aecf0476
[feat]: support logit_bias (#5354)
Signed-off-by: xq25478 <xq25478@qq.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-25 09:37:41 +00:00
liji-nv
e07fff4f78
[https://nvbugs/5340941] - fix: Correct custom ops used by Qwen3 Moe … (#6285)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-25 14:49:45 +08:00
Mike Iovine
0f2f11f90b
[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model (#6104)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-24 21:50:11 -04:00
Linda
9a99e6d6d7
fix: integration tests with nanobind (#6326)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-25 09:23:20 +08:00
Shiyu Li
375f74ecb2
[fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. (#6237)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-07-25 08:01:40 +08:00
Frank
f8f5ba65fc
[fix] Update to remove popping of KV cache and other args. (#6310)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-24 15:54:33 -04:00
Stefan Niebler
0df758ec9f
[TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration (#6217)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-24 18:04:41 +02:00
bhsueh_NV
7b6aadc800
[Fix][nvbug 5401163][nvbug 5404726][Qwen3] Fix bug of MoE on tp > 1 with trtllm moe backend (#6235)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-24 21:47:37 +08:00
liji-nv
14d94a3856
feat: Add non UB AR + Residual + Norm + Quant fusion (#6320)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-24 05:51:43 -04:00
Lizhi Zhou
a63a1ac7f9
[TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs (#6085)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-07-24 16:21:01 +08:00
QI JUN
428e34080f
chore: remove unused variables in pyexecutor (#6280)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-24 13:16:15 +08:00
Stefan Niebler
2486eb778e
[TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-23 12:30:50 +02:00
YueWeng
ed62a06eef
[nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue (#6136)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-07-23 14:53:37 +08:00
QI JUN
a8253b942f
chore: remove duplicate should_stop_processing check (#6242)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-23 14:11:23 +08:00
Yechan Kim
83c3ed128b
chore: set default device to cpu on Multimodal models (#5994)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-22 21:45:31 -07:00
Erin
5636c67388
fix: nvbug_5398806 (#6239)
2025-07-23 11:45:11 +08:00
Venky
9538c8d0e5
Add basic NeMo ckpt LoRA loading in PyTorch flow (#6019)
2025-07-22 19:42:45 -07:00
wili
8ecdeee300
[refactor] Simplification of Speculative decoding configs - Part 2 (#5936)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-23 09:20:27 +08:00
Lucas Liebenwein
41fb8aa8b1
[AutoDeploy] merge feat/ad-2025-07-07 (#6196)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou  <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-07-23 05:11:04 +08:00
2ez4bz
ab7434ac62
[feat] Enable TP and batching for PixtralVisionModel / Mistral3VLM (#6152)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-22 11:06:41 -07:00
John Calderon
b7c8a672da
[Issue 6193] Fix gemma3vl weight loader (#6233)
Signed-off-by: John Calderon <johncalesp@gmail.com>
2025-07-22 10:32:18 -07:00
danielafrimi
ff9963978a
Add register_fake for finegrained_mixed_dtype_gemm torch_op (#6255)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-22 16:59:55 +03:00
Yiqing Yan
3e18ee5fe1
chore: bump version to 1.0.0rc5 (#6252)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-22 16:24:28 +08:00
Pengyun Lin
48ddc3d4b9
[fix]: Revert commit 388b491 (#6143)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Yi Zhang
eb7d0f84b5
[nvbugs/5368410][fix] Disable moe allreduce for multi node (#5918)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Fanrong Li
c66941036f
fix: fix index out of bounds error in spec decoding (#5954)
2025-07-22 12:48:00 +08:00
Shunkangz
ee45e0c63f
feat: Refactor the fetching request logic (#5786)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-07-22 09:16:28 +08:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models (#5444)
Only supports Qwen in this PR
2025-07-21 16:11:58 -07:00
Ziyi Xiong
d7f0b0ab68
[fix] Correct the returned value of has_spec_drafter (#6178)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-21 11:38:59 -04:00
Pengyun Lin
9832bef07d
[BREAKING CHANGE]: change default backend to PyTorch in trtllm-serve (#5717)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-21 21:09:43 +08:00
liji-nv
3e0fb60e50
[TRTLLM-4279] feat: Multistream initial support for torch compile flow (#5847)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-21 19:10:22 +08:00
Linda
3efad2e58c
feat: nanobind bindings (#6185)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
Yuening Li
e8c068b4b1
[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow (#5850)
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-07-21 15:17:35 +08:00
Jinyang Yuan
88076eecd0
[fix] Fix can_use_alltoall in fused_moe_wide_ep.py (#6173)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-21 10:53:07 +08:00
brb-nv
ca9bc5727e
fix: Flush stale PlanParams with custom attention mask (#6163)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-21 09:55:09 +08:00
brb-nv
a433ebad2b
enh: Lift expectation of single image per sample in Gemma3 VLM (#6195)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-21 08:43:07 +08:00
danielafrimi
5300a99bd8
W4A8 GEMM (#6005)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
amitz-nv
98428f330e
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-20 08:00:14 +03:00
Void
118307c224
DeepEP LL support variable hidden size and tokens num (#6141)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-20 09:32:41 +08:00
Pengyun Lin
69e9f6d489
[fix]: Skip prompt length checking for generation only requests (#6146)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-19 21:26:37 +08:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support (#6133)
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
Rashid Kaleem
152e2df43b
[Disaggregated] Add retry knobs and handling (#5808)
Signed-off-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-19 07:27:59 +08:00
John Calderon
fc8b29c4ff
[Issue 5927][fix] Avoid memory calls during broadcast for single GPU (#6010)
Signed-off-by: John Calderon <johncalesp@gmail.com>
2025-07-18 14:21:03 -07:00
Netanel Haber
d9a3530048
[nvbug/5393888][nvbug/5393042] Always use py_seq_slot (#6147)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-07-18 22:45:16 +03:00
Stefan Niebler
6d7874a467
[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:40:46 +08:00
xiaoqi
28858c8711
feat(eagle3):support qwen3 dense model (#5879)
Signed-off-by: xq25478 <xq25478@qq.com>
2025-07-19 01:24:32 +08:00
Bo Li
07e8813984
feat: Remove padding in attention DP. (#6064)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-18 23:30:34 +08:00
Stefan Niebler
fd6ce7f20e
[ci] Speedup beam search unit tests with fixtures for LLM (#5843)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-18 22:54:49 +08:00
Erin
9522cde464
fix: NVBug 5385576 py_batch_idx issue (#6153)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-18 22:36:43 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Aurelien Chartier
812243bdd6
feat: add support for Modelopt fp8_pb_wo quantization scheme (#6106)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-07-18 10:35:12 +08:00
Zhenhuan Chen
992b273045
[https://nvbugs/5387375] fix(scaffolding): fix scaffolding aime test in test_e2e (#6140)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-18 10:34:37 +08:00
yifeizhang-c
0155e7a3a1
[TRTLLM-6368] Update deepep dispatch API (#6037)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-07-18 10:13:31 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
qixiang-99
2c90203c36
Refactor KVCacheManager: Simplify token availability calculation and … (#6134)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-17 13:33:33 -07:00
Frank
161490f039
[fix] Fixes KV Cache overrides in trtllm-bench (#6103)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-18 03:44:44 +08:00
2ez4bz
8480c120b1
[fix] Fix Mistral3VLM weight-loading & enable in pre-merge (#6105)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-17 11:04:17 -07:00
Iman Tabrizian
10dbf4f0f4
[fix] Remove duplicated KVCache transmission check (#6022)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-17 12:02:19 -04:00
Linda
5bff317abf
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Ziyi Xiong
58d22a72f1
[TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
2025-07-17 21:15:01 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler (#6000)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore: [BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
Iman Tabrizian
d4d21a106e
[fix] Release slots with spec decode + disagg (#5975) (#6032)
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-17 12:58:18 +08:00
Shiyu Li
6e1aee6fd6
[fix] Performance Optimization for MNNVL TwoShot Kernel (#5934)
Signed-off-by: Shiyu Li <shili@nvidia.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-07-17 10:49:51 +08:00
chenfeiz0326
fe070a0168
test: Update Llama4 Scout FP4 & FP8 accuracy tests (#5901)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-07-17 09:41:18 +08:00
Wanli Jiang
2d2b8bae32
feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support (#5644)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-17 06:30:58 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations (#5868)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
Mike Iovine
fa34cb7234
[refactor] Clean up drafter/resource manager creation logic (#5805)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-16 12:45:46 -07:00
shaharmor98
e0836f9ca9
[TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats (#5372)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-17 00:50:30 +08:00
Wanli Jiang
9354114f68
fix: Update trtllm args issues with extra nested config (#5996)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-16 12:41:45 -04:00
Bo Li
fc2347eaf5
chore: Cleanup disable_fp4_allgather. (#6006)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-16 17:54:36 +08:00
Yan Chunwei
a02606a9e2
[TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend (#5752)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:42:59 +08:00
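A minimal sketch of the unified surface, assuming the post-refactor llmapi names; check the release docs for the exact fields:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# One KvCacheConfig object configures the cache for the PyTorch backend.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
)
```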
Yan Chunwei
7568deb2f1
[nvbug/5387226] chore: add propagation for trust_remote_code to AutoConfig (#6001)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:05:38 +08:00
Yiqing Yan
e51c541617
chore: Bump version to 1.0.0rc4 (#6086)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-16 13:02:23 +08:00
Wanli Jiang
8679a058a3
fix: Unable to load phi4-model with tp_size>1 (#5962)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-16 11:39:41 +08:00
danielafrimi
edab7532dd
feat/add latency support for trtllm bench (#3730)
Signed-off-by: Ubuntu <dafrimi@nvidia.com>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Daniel Afrimi <dafrimi@nvidia.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-07-15 13:13:49 -07:00
Fanrong Li
7a1af1c738
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/5947 (#5989)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-07-16 01:33:12 +09:00
Xiaodong (Vincent) Huang
0523f77b36
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running deep-ep on memory-con… (#5684)
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
2025-07-15 18:34:21 +03:00
Tailing Yuan
4a26bd6500
Fix: pad DeepEP fp4 recv tensors if empty (#6048)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-15 23:14:01 +09:00
MinaHuai
9ebc3ab9c4
[nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision (#5998)
Signed-off-by: Mina Huai <121143971+MinaHuai@users.noreply.github.com>
2025-07-15 10:01:35 -04:00
Jaedeok Kim
ab1c54709d
fix: adjust window sizes of VSWA at torch backend (#5880)
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-15 17:41:54 +08:00
nv-guomingz
4e4d18826f
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#6003)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-15 15:50:03 +09:00
Lucas Liebenwein
e499f6c44a
[Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils (#6026)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-07-15 14:31:35 +09:00
Rashid Kaleem
2ea4077993
[Model load] Fix llama min-latency model load (#5883)
Signed-off-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
2025-07-15 09:29:19 +08:00
ixlmar
f225f5cd2e
[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs (#5964)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-15 06:49:42 +08:00
brb-nv
f5f5be9e94
enh: Bidirectional mask with multiple images for Gemma3 (#5976)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 22:39:18 +08:00
brb-nv
1a2d96919c
feat: Update Gemma3 Vision Encoder (#5973)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 22:38:10 +08:00
Yechan Kim
63139fdcff
feat: EXAONE4.0 support (#5696)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-14 22:28:10 +09:00
Zhenhuan Chen
30608a5e6d
[https://nvbugs/5355316] fix: update torch.compile option to fix triton store_cubin error (#5865)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-14 17:17:30 +08:00
Robin Kobus
5a61d64b5b
[nvbugs/5345391] fix: chunked prefill + overlap scheduling (#5761)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Pengyun Lin
388b4919b8
[nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Pengyun Lin
6992616c1f
[nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens (#5201)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Iman Tabrizian
c8874a7f94
[nvbug/5337601][fix] Fix disagg + speculative decoding (#5558)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access (#5676)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
WeiHaocheng
4d8920982a
fix: set allreduce strategy to model config (#5955)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-07-14 17:59:11 +09:00
dominicshanshan
c9e7f831dc
Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default (#5480)
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-07-14 16:42:23 +08:00
wili
cfcb97af0e
[BUG5388075][fix] Fix error in post-merge-tests (#5949)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-14 14:33:39 +09:00
QI JUN
ce39409530
fix cancel request logic (#5800)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-14 10:23:20 +08:00
Mike Iovine
8950223f6f
[fix] Remove SpecConfig and fix thread leak issues (#5931)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-12 21:03:24 +09:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
Thor Johnsen
041f1fa513
[TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora (#5885)
Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
2025-07-11 16:20:41 -07:00
2ez4bz
6304866ce8
[refactor] Move vision parts from processor to model for Gemma3 (#5888)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-11 15:13:51 -07:00
brb-nv
0385f89abc
test: Fix Gemma3 unit tests due to transformers upgrade (#5921)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-10 17:24:10 -07:00
Void
854655f2f7
deepEP fp4 post quant all2all dispatch (#5881)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-11 08:18:54 +08:00
Frank
aa4eebe973
[enhance] Add the ability to write a request timeline. (#5258)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-07-10 17:15:30 -07:00
wili
2e3cf42e03
[refactor] Simplification of Speculative decoding configs (#5639)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-10 11:37:30 -04:00
Kaiyu Xie
7b09a415c1
fix: Make the bench serving script compatible with different usages (#5905)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-10 19:36:26 +08:00
Enwei Zhu
055c4a9fe6
[NvBug 5370718, 5371538] fix: Fix incremental detokenization (#5825)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-10 16:30:00 +08:00
CarstyYou
dc32f9ae73
[fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
Yan Chunwei
07f6da763d
[TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner (#5876)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-10 11:31:35 +08:00
Hanjun Cho
6490a27ad7
[feat] Add TensorRT-Engine Qwen3 (dense) model support (#5650)
Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
2025-07-10 10:26:06 +08:00
brb-nv
3209b31665
feat: Custom masking utils for Gemma3 VLM (#5853)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-10 06:18:04 +09:00
2ez4bz
87fe44fd29
feat(models): Mistral3.1 VLM pytorch backend support (#5529)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-09 13:17:40 -07:00
Chang Liu
b61a717275
[1/N][TRTLLM-5195][feat] Share PyTorch tensor between processes (#5396)
2025-07-10 05:12:53 +09:00
Wanli Jiang
3f7cedec7c
Update transformers to 4.53.0 (#5747)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-09 09:32:24 -07:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
tomeras91
5aa958a11a
[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-09 11:30:15 +03:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
dongxuy04
dd3c736c7e
chore: some refactor on WideEP (#5727)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-07-09 14:26:57 +08:00
chenfeiz0326
64fd64fcf2
[TRTLLM-6262] Fix Llama4 Scout FP4 crash issue (#5834)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-07-09 14:23:21 +08:00
Chang Liu
4df5f96c8d
[Bugfix] Llama4: fix for llama4 multimodal support (#5809) 2025-07-09 13:03:40 +09:00
Xianjie Qiao
5ab1cf5ae6
Remove unnecessary benchmarking results (#5852)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-07-09 11:19:06 +08:00
brb-nv
2bd09ed2d4
fix: Skip rope scaling for local layers in Gemma3 VLM (#5857)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-09 10:10:33 +08:00
Omer Ullman Argov
d6d2ab2c99
[fix] Catch inference failures in trtllm-bench (#5841)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-07-09 03:53:03 +03:00
Iman Tabrizian
c508b994b6
Fix lost requests for disaggregated serving (#5815)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-09 08:42:45 +09:00
Kaiyu Xie
bb5b16fcb9
feat: Return context response immediately when stream_interval > 1 (#5836)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-09 00:19:57 +09:00
Raayan Dhar
e3268a4221
[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg (#5732)
Signed-off-by: raayandhar <rdhar@nvidia.com>
2025-07-08 09:39:58 -04:00
Yukun He
e104f8bbb5
[5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-08 19:51:05 +08:00
Yegor
b01d1c28f7
[feat] Detokenize option in /v1/completions request (#5382)
Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com>
Signed-off-by: Yegor Yershov <yegor6741@gmail.com>
2025-07-08 19:36:04 +08:00
xiweny
eaf8bec88b
fix: Disaggregated serving with attention DP (#4993)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-07-08 16:15:03 +08:00
Yiqing Yan
5203a0f6df
chore: bump version to 1.0.0rc3 (#5819)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-08 16:04:40 +09:00
Zhenhuan Chen
dee6644ed9
feat(scaffolding): add streaming scaffolding_llm.generate_async support (#5345)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-08 15:08:40 +09:00
nv-guomingz
0be41b6524
Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" (#5818) 2025-07-08 13:15:30 +09:00
Yechan Kim
5bc3a15f10
feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL (#5522)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-07 18:03:12 -07:00
nv-guomingz
5a8173c121
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#5795)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-08 08:52:36 +08:00
Robin Kobus
30a19fcf7c
[TRTLLM-6291] feat: Add user-provided speculative decoding support (#5204)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-07 16:30:43 +02:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building (#5534)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00
Daniel Cámpora
1260e2f33f
feat: Optimize TRTLLM Sampler perf for single beam, single step (#5550)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-07-07 15:44:47 +02:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
Yan Chunwei
dfce61f4b9
[TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler (#5751)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-07 17:05:14 +08:00
Zheng Duan
de10774c2e
chore: log stack trace on error in openai server (#5749)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-07 14:54:36 +08:00
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch (#5535)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
Robin Kobus
ae27261094
refactor: decoding inputs (#5679)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-06 08:21:02 +02:00
Xianjie Qiao
b1976c2add
Add wide-ep benchmarking scripts (#5760)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-05 19:29:39 +08:00
Xianjie Qiao
089fd55eda
Add dummy all_reduce for kernel breakdown (#5745)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-07-05 13:08:58 +09:00
Frank
d61893dc77
[fix] Update to properly set cuda graphs in trtllm-bench overrides. (#5634)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-05 05:19:16 +09:00
Stefan Niebler
d1112aac37
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-05 01:35:13 +09:00
HuiGao-NV
3ed3bbcb5d
Fix: pass allreduce strategy to pytorchConfig (#5746)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-07-04 21:32:13 +09:00
Shunkangz
32339d1b20
Raise shutdown error for each request (#4936)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-07-04 18:58:24 +09:00
Tailing Yuan
e134a52e07
Perf: reduce DeepEPLowLatency memory and time (#5712)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-04 14:46:28 +08:00
Shunkangz
a79d8c9f5e
Fix none response in PD (#5422)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-07-04 14:25:10 +08:00
brb-nv
cdaa6abce7 fix: Investigate Gemma3 1B decoder output discrepancy (#5564)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Frank
819ae903de [https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. (#5625)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Clay
7a319524da
feat: support more parameters in openai worker of scaffolding (#5115)
Signed-off-by: Clay <ccs96307@gmail.com>
2025-07-04 09:35:34 +08:00
Lucas Liebenwein
24ac9b5f69
[AutoDeploy] merge feat/ad-2025-06-29 (#5737)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-07-04 10:21:18 +09:00
Netanel Haber
aa72d39b72
MTP and derivatives: Align sample state with trtllm sampler sample state (#5675)
This PR moves MTPSampler and derivatives to use the universal seq_slot indexing for sampling.
This is the last piece of the puzzle: After this, all of the samplers will use this format.
See: 6ee94c7
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-03 19:55:48 +02:00
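To make the seq_slot indexing concrete, here is a minimal sketch, assuming a [steps, max_seq_slots, beam] token tensor; the helper name and shapes are illustrative, not the repository's API:

```python
# Minimal sketch (not actual TRT-LLM code) of a seq_slot-indexed new_tokens
# layout: each sampler scatters its sampled tokens into a shared tensor keyed
# by the request's seq_slot rather than by its position in the batch.
import torch

max_tokens_per_step = 4   # assumed: e.g. max_draft_len + 1 for MTP
max_seq_slots = 8         # assumed scheduler capacity
beam_width = 1

new_tokens = torch.zeros(max_tokens_per_step, max_seq_slots, beam_width,
                         dtype=torch.long)

def write_sampled_tokens(seq_slot: int, sampled: torch.Tensor) -> None:
    """Hypothetical helper: scatter one request's tokens into its slot."""
    new_tokens[:sampled.shape[0], seq_slot, 0] = sampled

# Requests may occupy arbitrary slots; batch order no longer matters.
write_sampled_tokens(seq_slot=5, sampled=torch.tensor([101, 102]))
write_sampled_tokens(seq_slot=2, sampled=torch.tensor([7]))
```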
Zhenhuan Chen
528ff52ef4
[https://nvbugs/5365714] fix(scaffolding): use default LLM rather than trt backend LLM (#5705)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-03 23:54:20 +09:00
Rashid Kaleem
2b0c87e613
[ModelLoad] Concurrent model load (#5291)
Signed-off-by: Rashid K <rkaleem@nvidia.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
2025-07-03 22:18:04 +08:00
nv-guomingz
8dad22cbe7
chore: refine the default value by using pydantic default instead of … (#5695)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-03 22:41:29 +09:00
tomeras91
7dbecf7272
[TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H (#5646)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-03 11:07:51 +03:00
Yiqing Yan
3c9dd5cd66
chore: bump version to 1.0.0rc2 (#5645)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-03 12:35:28 +08:00
Enwei Zhu
3a46cf275b
fix: Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-02 21:41:55 -04:00
Jhao-Ting Chen
77082cde38
[https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op (#5146)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
qixiang-99
ca7b6ec8d8
Feat/pytorch vswa kvcachemanager (#5151)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Yan Chunwei
2d69b55fe8
chore: enhance yaml loading arbitrary options in LlmArgs (#5610)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-02 14:21:37 +08:00
HuiGao-NV
10c50515c2
fix: Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-07-02 09:49:20 +08:00
Perkz Zheng
ba2ab5098b
[Bug] attention DP doesn't work with embedding TP (#5642)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-07-02 08:57:46 +08:00
Aurelien Chartier
efef911f5e
fix: add missing self. from PR #5346 (#5653)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-01 20:38:55 -04:00
Aurelien Chartier
fa95e402a5
feat: add LLmArgs option to force using dynamic quantization (#5346)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-01 12:16:09 -07:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp (#5086)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
Kaiyu Xie
f9a455651b
perf: Use tokenizers API to optimize incremental detokenization perf (#5574)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-01 09:35:25 -04:00
Anurag Mukkara
93edfea2b8 [nvbug/5354825] Fix nougat test image url (#5496)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Wanli Jiang
3789ba1d37 feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
brb-nv
4ef60d5fbb nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 (#5453)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM (#4232)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
Netanel Haber
6ee94c7ac8
Reintroduce with perf fixes: feature: unify new_tokens format sample state to trtllm sampler tokens format (#5513)
58a8a8f - these changes were previously merged to main here.
6aef149 - the changes were temporarily reverted in main, due to a significant perf regression in models using the TorchSampler (observed by @byshiue).
This PR is meant to re-merge these changes along with a fix to prevent the regression.

The first commit of this PR is actually just the reverted revert - filter it out of the changes to see previously unmerged changes.

Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-06-30 11:58:59 -07:00
Wei-Ming Chen
f28cd3056e
feat: AutoDeploy fp8 quantization support for bmm (#3849)
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
2025-06-30 12:36:34 -04:00
nv-guomingz
6e48ac25a6
chore: remove cuda_graph_ prefix from cuda_graph_config field members. (#5585)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 12:23:14 -04:00
Li Min
16fc99391f
refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code (#5557)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-30 08:48:04 -07:00
Yan Chunwei
98a7c24062
chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs (#5595)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-30 20:40:23 +08:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup (#5093)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Fanrong Li
6cbc9a5297
[nvbug/5354946][fix] Fix mtp vanilla draft inputs (#5568)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-30 15:59:12 +08:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare (#5570)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
dongjiyingdjy
852b79053d
feat : support duplicate_kv_weight for qwen3 blockwise scale (#5459)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-06-30 11:49:22 +08:00
nv-guomingz
578430e64c
[TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) (#5014)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 11:05:40 +08:00
Bo Li
6000380a0c
perf: Avoid reswizzle_sf after allgather. (#5504)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-29 21:25:50 +08:00
Talor Abramovich
70e34a3291
[TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve (#5376)
Signed-off-by: Talor Abramovich <talora@nvidia.com>
2025-06-29 12:46:30 +00:00
amirkl94
de9779900c
feat: Add support for YARN in NemotronNAS models (#4906)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-06-29 09:45:49 +03:00
Lucas Liebenwein
619709fc33
[AutoDeploy] merge feat/ad-2025-06-13 (#5556)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-29 03:52:14 +08:00
Li Min
6021a439ab
Make moe permute and finalize as custom op (#5412)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Darragh Hanley
5437075def
ReDrafter support for Qwen (#4875)
Signed-off-by: darraghdog <darragh.hanley@gmail.com>
Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
2025-06-28 02:33:10 +08:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 (#4569)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
peaceh-nv
cb58073ab7
Fix: fix build for sm120 (#5265)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-06-27 20:42:47 +08:00
Daniel Cámpora
73b8a95049
feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538) 2025-06-27 18:40:53 +08:00
Daniel Stokes
83a1f60556
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-27 12:29:34 +08:00
Yuxian Qiu
dc36228f52
fix: Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-27 11:03:38 +08:00
Yibin Li
0f3bd7800e
[TRTLLM-4971]: Use safe deserialization in ParallelConfig (#4630)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-27 09:58:41 +08:00
Frank
aa6e015ef8
Update trtllm-bench to support new Pytorch default. (#5491)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-26 17:05:43 -07:00
Venky
0083228d2a
fix: Mapping rank boundary check bug (#4935)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-27 07:27:59 +08:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Rashid Kaleem
3a1f4d4001
[feat] Add progress bar to benchmark (#5173)
Signed-off-by: Rashid Kaleem <rkaleem@nvidia.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-06-26 18:39:45 +08:00
Kaiyu Xie
2eb6502b1d
feat: Add support for TRTLLM CustomDataset (#5511)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-26 18:27:37 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
dongxuy04
490d2e5819
feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
amitz-nv
e0bb123ae7
[TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request (#5080)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-26 08:15:06 +03:00
Yukun He
9ee33605bb
[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. (#5394)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-26 13:12:03 +08:00
Netanel Haber
6aef14943c
Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" (#5474)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-25 20:56:04 -07:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) (#4651)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Yukun He
3fc57543e2
[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
The seq_len of 4096 can cause an unknown CUDA illegal memory access issue when run consecutively with some other tests.
A saturated upper bound is applied instead: any sequence length larger than the largest remaining config maps to that config.
2025-06-26 08:38:35 +08:00
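The saturated-upper-bound behavior can be sketched as follows; the bucket values are assumptions for illustration, not the actual tuning configs:

```python
# Illustrative only: tuning profiles exist for a fixed set of seq_len buckets,
# and anything above the largest bucket saturates to that bucket instead of
# getting its own (here, removed) 4096 config.
TUNED_SEQ_LEN_BUCKETS = [64, 128, 256, 512, 1024, 2048]  # assumed values

def select_tuning_bucket(seq_len: int) -> int:
    for bucket in TUNED_SEQ_LEN_BUCKETS:
        if seq_len <= bucket:
            return bucket
    return TUNED_SEQ_LEN_BUCKETS[-1]  # saturate rather than extrapolate

assert select_tuning_bucket(300) == 512
assert select_tuning_bucket(4096) == 2048    # saturated upper bound
assert select_tuning_bucket(100_000) == 2048
```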
Xianjie Qiao
1e4fa13d33
Add sleep function for disagg gen-only benchmarking (#5398)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-06-26 07:32:16 +08:00
QI JUN
3a2c4ca77b
chore: split _build_model method for TorchLlm and TrtLlm (#5418)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 04:32:46 +08:00
Mike Iovine
5bc8c894f7
[chore] Disable block reuse when draft model speculation is being used (#5448)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-26 03:51:20 +08:00
Daniel Cámpora
205c97a4ae
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-25 17:41:36 +02:00
Kaiyu Xie
c5ae3272b9
feat: Make benchmark_serving part of the library (#5428)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-25 23:13:56 +08:00
HuiGao-NV
b3a4c1f404
feat: Remove unused padding_idx in models (#5385)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 17:19:59 +08:00
Yiqing Yan
f3cfe86dd1
chore: bump version to 1.0.0rc1 (#5460)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-25 16:21:34 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding (#5348)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Enwei Zhu
76da7fed86
fix (NvBug 5354925): Fix static EPLB (#5411)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 13:14:40 +08:00
Shunkangz
d5354897c0
feat: Dynamically remove servers in PD (#5270)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-25 09:50:04 +08:00
Lucas Liebenwein
5cffb7e0ec
[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454)
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-25 09:30:13 +08:00
bhsueh_NV
73ba4fc320
fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion (#5369)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-06-25 09:20:23 +08:00
dongxuy04
699520082b
Add MTP support for Online EPLB (#5213)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 07:58:13 +08:00
QI JUN
d93a5e04b5
Chore: remove unused variables (#5314)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-24 22:27:32 +08:00
HuiGao-NV
35a92f6bab
Add debug hook to support dumping tensor data and adding new debug functions easily (#5182)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 17:45:28 +08:00
Luis Vega
d26040e5d9
chore: delete mamba hybrid, since it is now called NemotronH (#5409)
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
2025-06-24 16:27:31 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
HuiGao-NV
e16c1bef6e
[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 11:43:43 +08:00
Netanel Haber
58a8a8fd37
feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-23 10:38:37 -07:00
dongxuy04
4f0f17ac8a
feat: Misc Opt for large scale EP (#5374)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla (#4762)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Yan Chunwei
9bd42ecf9b
[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-20 03:01:10 +08:00
Kaiyu Xie
7246fd75d1
feat: Support stream_interval (#5284)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 21:57:10 +08:00
Fanrong Li
c7af650d5a
Fix: fix the deterministic issue in the MTP Eagle path (#5285)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-19 18:08:40 +08:00
Frank
68687a9f56
[WAR][nvbug/5321947] Add an async sleep to unblock event loop. (#5342)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-19 17:25:18 +08:00
hlu1
b558232ce1
Refactor CutlassFusedMoE (#5344)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-19 00:04:07 -07:00
amitz-nv
1753202b61
[TRTLLM-5825][fix] Fix torch LoRA TP (#5338)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-19 09:12:00 +03:00
Yiqing Yan
dedce8ab0e
chore: bump version to 1.0.0rc0 (#5326)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-19 12:02:28 +08:00
nv-guomingz
6a388b105a
chore: remove torch_compile prefix for TorchCompileConfig field members (#5261)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-19 09:21:51 +08:00
Zongfei Jing
2b23cd56ce
[feat] Fuse finalize and allreduce for qwenmoe model (#5223)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-19 08:03:58 +08:00
Yan Chunwei
3946e798db
fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances (#4727)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-19 06:13:53 +08:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend (#5214)
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Zhanrui Sun
516bd4dc05
chore: bump version to 0.21.0rc3 (#5309)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-18 15:59:53 +08:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management (#5246)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
Yukun He
6711ad9cf3
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-18 14:33:46 +08:00
Yan Chunwei
724e495254
chore: partition LLM class into TorchLLM and TrtLLM (#4900)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-18 14:01:25 +08:00
Yi Zhang
e44f7687af
feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-18 13:37:31 +08:00
QI JUN
855036d8ee
update LlmRequest.is_dummy property (#5283)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-18 10:52:13 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management (#4450)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Mike Iovine
9bf69c9fdb
[chore] Remove BaseDraftTokenManager (#5251)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-17 11:57:52 -04:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
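As a rough sketch of the configIndex flow described above (only fp8_block_scale_moe_runner and FP8BlockScaleMoERunner are names from the commit; everything below is invented for illustration):

```python
# Schematic of the autotuning pattern: a runner enumerates candidate tactics,
# the tuner picks one, and the chosen configIndex is reused for both
# workspace sizing and execution. Not the real TRT-LLM classes.
from dataclasses import dataclass

@dataclass(frozen=True)
class TacticSketch:
    tile_m: int
    num_stages: int

class Fp8BlockScaleMoeRunnerSketch:
    def __init__(self) -> None:
        self._tactics = [TacticSketch(64, 3), TacticSketch(128, 4)]

    def get_valid_tactics(self, num_tokens: int) -> list[int]:
        # Small problems may not support the largest tile.
        return [i for i, t in enumerate(self._tactics)
                if t.tile_m <= max(num_tokens, 64)]

    def workspace_size(self, config_index: int, num_tokens: int) -> int:
        t = self._tactics[config_index]
        return t.tile_m * t.num_stages * num_tokens  # stand-in formula

    def run(self, config_index: int, num_tokens: int) -> None:
        t = self._tactics[config_index]
        print(f"dispatch MoE kernel: tile_m={t.tile_m}, stages={t.num_stages}")

runner = Fp8BlockScaleMoeRunnerSketch()
# The real autotuner profiles each valid tactic and keeps the fastest;
# minimizing workspace here is just a stand-in selection rule.
best = min(runner.get_valid_tactics(256),
           key=lambda i: runner.workspace_size(i, 256))
runner.run(best, num_tokens=256)
```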
amirkl94
8451a87742
chore: Mass integration of release/0.20 (#5082)
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 14:32:02 +03:00
liji-nv
13eef642e6
[feat] Piecewise cuda graph support for MLA (#4467)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-17 18:58:38 +08:00
Yilin Fan
498fadceb4
[feat] Add EAGLE3 support for Qwen3 (#5206)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-17 17:07:06 +08:00
Enwei Zhu
4b82b8b4c7
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-17 15:23:24 +08:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW (#4558)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
tomeras91
cea5dd1e38
[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-16 16:29:17 +03:00
Yilin Fan
dd29063538
[feat] Add llm args to tune python gc threshold (#5141)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 17:45:22 +08:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface (#5129)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state (#4865)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
Enwei Zhu
babdd9ce06
test: Add json_mode_eval for guided decoding evaluation (#5179)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-16 10:03:55 +08:00
Yilin Fan
7a5e0fd300
[fix] Fix Llama4 min-latency import error (#5209)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 10:03:07 +08:00
Yan Chunwei
c84e41fd9d
fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-15 17:51:56 -07:00
amitz-nv
109c426077
Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130) 2025-06-15 18:54:04 +03:00
Fanrong Li
39bba63758
[TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 23:09:16 +08:00
Fanrong Li
159ffc584e
fix: fix cuda graph max batch size for spec decoding cases. (#5076)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 14:57:28 +08:00
Kaiyu Xie
dce1dcc4f9
feat: Support post_proc for bench (#5122)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-15 13:02:38 +08:00
Enwei Zhu
63bc62ddf4
feat: Enable EPLB to existing MoE models (#5203)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-15 11:48:06 +08:00
Yuan Tong
6bce7337a9
perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-06-15 07:45:02 +08:00
ixlmar
e055af1bc9
chore: improve disagg test failure detection (#4738)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-15 01:28:26 +08:00
Aurelien Chartier
1389f5a4d3
feat: Add support for fp8 rowwise quantization (#4876)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>
2025-06-14 06:37:48 -07:00
2ez4bz
dc52b67492
linting(python): Enable ruff on more files (wave 1/N) (#5140)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-06-14 19:19:34 +08:00
Tailing Yuan
0b60da2c45
feat: large-scale EP(part 7: DeepEP integration) (#4792)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-14 19:12:38 +08:00
yunruis
b99c5ce8c1
Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560)
Signed-off-by: yunruis <yunruis@nvidia.com>
Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-14 17:36:22 +08:00
nv-guomingz
3b7b5a5ad5
refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3 (torch_compile_config) (#5032)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-14 14:23:13 +08:00
Yilin Fan
06342ffb4d
[feat] Implement model-agnostic one-engine eagle3 (#4778)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-13 08:11:41 -07:00
Mike Iovine
25aa3881d7
[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-13 11:06:36 -04:00
brb-nv
089be8912a
feat: Basic skeleton for Gemma3 VLM (#5108)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-06-13 17:27:04 +08:00
nv-guomingz
b959618579
refactor [BREAKING CHANGE]: remove the redundant use_kv_cache field from PytorchConfig (#5031)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-13 16:34:24 +08:00
yunruis
30c5b4183a
refactoring: port customized kernels with public cutlass version (#5027)
Signed-off-by: yunruis <yunruis@nvidia.com>

Merging this to unblock others, since the full CI has already been run.
2025-06-13 16:19:31 +08:00
Zheng Duan
4d0a5ad384
chore: gracefully exit disagg process in tests; better startup and logging (#5109)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-13 14:03:55 +08:00
Yibin Li
b79eb34bfe
[fix]: Fall back to HMAC to Avoid IPC Serialization Churn (#5074)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-13 11:37:50 +08:00
zhhuang-nv
a891013e3c
[feat] Optimize KV Cache Reuse for MLA (#4869)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-06-13 11:03:05 +08:00
Fanrong Li
38a907aaca
[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance (#5119)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-13 08:58:44 +08:00
pcastonguay
3a04c9fa7b
chore: Include prompt_token_ids only for context-only disagg requests (#5055)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-12 15:00:08 -04:00
Mike Iovine
690873ba1a
[nvbug/5334370][fix] Fix one model EAGLE3 (#5134)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-12 10:28:14 -04:00
HuiGao-NV
dfeeaf6746
Move allreduce_strategy from committed api to reference (#5147)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 21:00:20 +08:00
brb-nv
8cfb567182
fix: Updates to yarn implementation (#5105)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-06-12 20:45:34 +08:00
nv-guomingz
58d4ca2385
fix: remove duplicated trust_remote_code knob from trtllm-serve (#5143)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-12 19:48:24 +08:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache (#4804)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Daniel Cámpora
e46267765f
Fix logprobs issues. (#5136)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-12 15:07:01 +08:00
Lucas Liebenwein
49d7268acc
[nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-12 13:07:27 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
HuiGao-NV
43192379af
Use backend to replace macro to control enablement of MNNVL all reduce (#4635)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
Zheng Duan
c592798f64
fix: limit process pool size when prefetching (#5088)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-12 10:52:52 +08:00
liji-nv
8282d6c1a7
[fix] Fix llama4 min latency (#5117)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-11 15:44:38 +08:00
Zhanrui Sun
e2863a3159
chore: bump version to 0.21.0rc2 (#5112)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-11 15:08:14 +08:00
Daniel Cámpora
fdf1c47d1d
[TRTLLM-4995][feat] TRTLLM Sampler log probs support (#4836)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-11 08:18:13 +02:00
nvpohanh
7b210ae9c3
test: add unit tests for Llama4 min_latency code (#4980)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-06-10 12:10:26 -07:00
Lucas Liebenwein
7ddc4d6282
[AutoDeploy] Merge Feature Branch Week 3 (#5054)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-11 00:20:43 +08:00
Tracin
6c91f1c7ac
MXFP8xMXFP4 quant mode (#4978)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
Zongfei Jing
6d1f2d0fd7
[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-06-10 19:55:16 +08:00
Yuxian Qiu
08dc369a4d
fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict. (#4890)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-10 18:40:29 +08:00
tomeras91
f121f13ddf
[nvbug 5325284][fix] Increase Nemotron-H warmup request robustness (#4954)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-10 11:09:37 +03:00
Xiaowei Wang
ec6b1821c7
[fix] Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod (#5026)
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2025-06-10 15:09:06 +08:00
Daniel Cámpora
d68b8180d3
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-10 07:28:34 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Yuxian Qiu
e79527d195
chore: Refine weight prefetching. (#4893)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 21:24:16 +08:00
Mike Iovine
f4d9c87c51
[nvbug/5314469][feat] Include the executor's max batch size in CUDA g… (#4843)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-09 08:31:35 -04:00
Yukun He
137fe35539
fix: Fix warmup phase batch size out of range. (#4986)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-09 19:19:16 +08:00
Yuxian Qiu
88480197da
ci: [nvbugs/5280806] Unwaive unittests/_torch. (#4951)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 19:04:11 +08:00
Dom Brown
9c012d5bf8
[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-09 11:02:48 +01:00
ChristinaZ
f45aff2b7d
Add customized renormalized moe routing kernel for moe cutlass backend (#4955)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-09 17:38:50 +08:00
Bo Li
c104388d37
chore: Refactor apply_rope. (#4918)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-09 16:51:59 +08:00
Daniel Stokes
3a4851b7c3
feat: Add Mixture of Experts FP8xMXFP4 support (#4750)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-09 13:25:04 +08:00
amitz-nv
77e8d739f1
[TRTLLM-4987][feat] Support generation logits in TRTLLMSampler (#4819) 2025-06-09 06:30:01 +03:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way (#4799)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
Omer Ullman Argov
8731f5f14f
chore: Mass integration of release/0.20 (#4898)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-08 23:26:26 +08:00
Mike Iovine
ec0d984656
[nvbug/5280806][fix] Fix 2 model spec decode flow (#4807)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-08 07:40:02 -04:00
dongxuy04
1e369658f1
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
QI JUN
5ee0de7f2a
Resubmit #4894 (#4969)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-08 04:42:15 +08:00
Bo Li
f414a079ad
chore: Change the type annotations of input_ids and position_ids to int32. (#4632)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-07 16:10:47 +08:00
nv-guomingz
0c7dd660d8
fix: https://nvbugs/5324248 (#4973)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-07 04:14:07 +08:00
Fanrong Li
75d020cf07
fix: fix cuda graph padding for spec decoding (#4853)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-06 22:21:42 +08:00
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
QI JUN
ec50684d80
Revert "fix a bug of global cuda graph dummy request" (#4970) 2025-06-06 08:54:45 +08:00
QI JUN
bfa877a22e
Fix: fix autodeploy (#4957)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 21:06:55 +08:00
QI JUN
154f7cc40a
fix a bug of global cuda graph dummy request (#4894)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 19:47:40 +08:00
Shunkangz
3eae58ca36
Add disaggregated unittest (#4899)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-05 19:14:31 +08:00
QI JUN
b8c5e3892b
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 17:43:30 +08:00
Lucas Liebenwein
743fb0a159
[AutoDeploy] _AutoDeployLlmArgs as primary config object (#4891)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 17:20:55 +08:00
ixlmar
6437756da8
fix: handle OOMs during KV cache estimation (#4690)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-05 10:02:26 +02:00
Shiyu Li
b0d287c9b7
[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-06-04 18:26:13 -07:00
Yuxian Qiu
6b3242654e
fix: Fix broken vanilla moe since FusedMoE refactor. (#4897)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-05 03:56:41 +08:00
Yi Zhang
1fca654bfd
tests: Update gb200 test case (#4754)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-04 18:49:20 +08:00
ixlmar
2bbb6b5976
chore: introduce KvCacheCreator (#4581)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-04 11:03:17 +02:00
Xianjie Qiao
325ccaae3d
Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors. (#4827)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-06-04 16:36:07 +08:00
Zhanrui Sun
35e87b99f3
chore: bump version to 0.21.0rc1 (#4896)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-04 14:31:18 +08:00
tomeras91
8d31e16877
[TRTLLM-4923][feat] Paged mamba cache (#4822)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-04 09:27:08 +03:00
Omer Ullman Argov
e71de2a13e
chore: Mass integration of release/0.20. (#4871)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-06-04 14:12:27 +08:00
Yan Chunwei
ac20159d32
fix: build_config in TorchLlmArgs and avoid invalid args (#4600)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-04 13:17:29 +08:00
QI JUN
e2eea80c1d
Chore: refine comments on the prepare-inputs method of model engine (#4837)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-04 12:14:13 +08:00
Yukun He
5fa6fbd989
feat: Enhance AutoTuner inference path and code readability (#4466)
Fix AutoTuner warmup request generation.
* The warmup phase previously created a single request, which is insufficient to cover max_num_tokens. Revise the warmup phase to use a batch of requests that covers max_num_tokens, eliminating potential fallback cases.

Refactor the AutoTuner API and reduce host overhead.
* Refine the (min, opt, max) values of the optimization profile setup for get_valid_tactics to achieve the correct canImplement definition.
* Refine the cache key assembly process to reduce host overhead and simplify the API.
* Fix lru_cache usage to reduce host overhead.
* Make tuning config initialization a one-time object in the tunable runner to reduce host overhead.

Improve tuning config readability.
* Use a dataclass to define the tuning config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-04 10:53:11 +08:00
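Two of the ideas in this commit body lend themselves to a short sketch; all names below are illustrative, not the AutoTuner's real API:

```python
# (1) Warm up with a batch of requests that together cover max_num_tokens,
#     so tuning sees the largest shapes and avoids runtime fallback.
# (2) A frozen dataclass tuning config is hashable, so it can serve as a
#     cheap lru_cache key instead of reassembling a key on every call.
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)              # frozen -> hashable -> valid cache key
class TuningConfigSketch:
    op_name: str
    dynamic_dim: str = "num_tokens"

@lru_cache(maxsize=None)
def choose_tactic(config: TuningConfigSketch, num_tokens_bucket: int) -> int:
    # Stand-in for an expensive profile lookup, cached per (config, bucket).
    return 0 if num_tokens_bucket <= 512 else 1

def warmup_request_sizes(max_num_tokens: int, max_batch: int) -> list[int]:
    """Split max_num_tokens across a batch of warmup requests, not just one."""
    per_req = max(1, max_num_tokens // max_batch)
    sizes = [per_req] * max_batch
    sizes[-1] += max_num_tokens - sum(sizes)  # make the totals match exactly
    return sizes

assert sum(warmup_request_sizes(8192, 16)) == 8192
print(choose_tactic(TuningConfigSketch("fp8_gemm"), 256))  # cached next call
```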
Shi Xiaowei
b13f8c9cba
Fix: NVBug 5302895 (#4835)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-04 09:31:39 +08:00
Shunkangz
c835f06371
Refactor the first token response in PD (#4692)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-04 09:11:23 +08:00
Mike Iovine
73389d6531
[fix] Fix llama 4 long context (#4809)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-04 07:48:08 +08:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
rakib-hasan
d0eb47d33a
[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation (#4506)
* refactoring the multimodal input prep
* adding out-of-tree override option
* adding exceptional case for llava-next
* fixing typo
* addressing review comments, adding placement option, handling tokenizer variations
* addressing pytest-asyncio behavior change

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-06-03 12:02:07 -07:00
hlu1
b4ed4b22f3
[Arch] Freeze model_config (#4814)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-04 02:51:35 +08:00
Yan Chunwei
80b4026775
chore: remove request_error ipc in LLM.submit (#4763)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 20:55:59 +08:00
pcastonguay
01f29ce38b
[nvbug 5294316] fix: Fix queued request stats (#4714)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-03 08:33:08 -04:00
Shunkangz
ae9a6cf24f
feat: Add integration of etcd (#3738)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Batsheva Black <bblack@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-06-03 20:01:44 +08:00
Enwei Zhu
3fe4a1842a
fix: Register MoeLoadBalancerConfig to serialization.py (#4864)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-03 19:22:36 +08:00
Frank
80f9989a1e
[enhancement] Add beam width to low latency. (#4812)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-03 17:24:55 +08:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched (#4700)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling (#4532)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
hlu1
320195dc0d
[Architecture] Refactor FusedMoE (#4790)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-03 14:02:19 +08:00
Yuxian Qiu
ec796e44e4
feat: add heuristics for checkpoint file prefetching. (#4765)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-03 12:10:37 +08:00
Yan Chunwei
e013c8cbc2
fix [nvbug5256044]: bench hang due to llmapi ipc (#4798)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 10:10:53 +08:00
Fanrong Li
380a5d1690
[https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue (#4536)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
Yilin Fan
eb2d51a429
[fix] Fix llama4 min-latency mode (#4810) 2025-06-02 08:50:01 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Fanrong Li
7d356efc7d
fix: fix accuracy and illegal memory access issues when using mtp + attention dp (#4379)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 00:35:52 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Lucas Liebenwein
491a09b0c6
[AutoDeploy] Increased Model Coverage Mass Migration Week 2 (#4817)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-06-01 14:40:29 +08:00
Enwei Zhu
0087bd27ba
[fix] Fix SamplingParams check on n and best_of (#4655)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 09:11:55 +08:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank (#4767)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Yuxian Qiu
a02df6aa4b
fix: re-enable tp/pp for quickstart_advanced.py. (#4766)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-31 19:13:46 +08:00
Yan Chunwei
93c0632ee4
opt: improve the performance for dist-agg streaming generation (#4214)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-31 17:40:32 +08:00
Mike Iovine
8cb6163a57
[fix] Fix Llama 3.3 70b EAGLE (#4772)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-30 10:08:08 -04:00
Yuxian Qiu
f82e44bbb9
fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. (#4758)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-30 18:04:08 +08:00
Pengyun Lin
bac22ff7b5
[feat] support sharegpt downloading in benchmark_serving (#4578)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-30 17:27:53 +08:00
QI JUN
99fdef20c4
[TRTLLM-5516] perf: replicate dummy request for cuda graph padding (#4729)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-30 17:14:23 +08:00
ixlmar
c026dda400
fix: iteration logging and typing in PyExecutor (#4734)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 11:01:20 +02:00
ixlmar
7e6d06d5d7
feat: estimate GPU mem. usage w/ minimal KV cache (#4574)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 10:40:45 +02:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
hlu1
3093c747b7
[Architecture] Redesign Linear module (#4721)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-05-29 16:05:46 -07:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main (#4739)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
QI JUN
255779a91d
Chore: fuse _merge_requests method into _fetch_new_requests method (#4689)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-29 17:47:44 +08:00
Yan Chunwei
33a9ba55f5
fix: test trtllm-bench mgmn (#4613)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 14:43:47 +08:00
Yan Chunwei
ac17142495
chore: rename ExecutorBindingsWorker/Proxy (#4716)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 10:32:35 +08:00
Arthur Rasmusson
812b1abf86
feature: KV Cache GPUDirect Storage (#3209)
Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-28 23:27:43 +00:00
Yuxian Qiu
bf691b3d28
feat: support packed weights in vanilla moe (#4719)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-29 06:24:24 +08:00
Yan Chunwei
5506f60037
chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs (#4603)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-28 18:43:04 +08:00
ixlmar
fbe4db207d
feat: forward exceptions to Python and catch OOMs (#4497)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-28 11:58:10 +02:00
Iman Tabrizian
c875184f78
Add missing serialization classes (#4642)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-28 16:40:23 +08:00
amirkl94
fbec0c3552
Release 0.20 to main (#4577)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-28 16:25:33 +08:00
Pengyun Lin
971d16a2ee
[TRTLLM-1658][feat] Enable multiple response in trtllm-serve for TRT backend (#4623)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-28 11:36:44 +08:00
Bo Li
9c4b8f66b4
feat: Integration of Fused QKNorm+RoPE. (#4611)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-28 11:20:45 +08:00
Shunkangz
6493401986
Fix handle cancel request for attentionDP (#4648)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-05-28 11:04:02 +08:00
Yuxian Qiu
5700a4ffcd
feat: Add vanilla MOE. (#4682)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-28 10:44:14 +08:00
Yuxian Qiu
e538b0d95e
refactor: extract and reuse filter_weights. (#4681)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-27 19:48:01 +08:00
Lucas Liebenwein
5cdd6bb10f
[AutoDeploy] Increased Model Coverage Mass Migration Week 1 (#4468)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-27 16:43:15 +08:00
QI JUN
1582361400
Chore: only pad one dummy request for attention dp scenario (#4664)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-27 14:56:22 +08:00
Tracin
268171bc66
[NVBUG 5301980] Fix fp4 gemm padding. (#4662)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-27 11:30:53 +08:00
Chuang Zhu
4318037ca3
fix disagg config params (#4646)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-26 23:28:52 +08:00
Enwei Zhu
88190faa34
feat: large-scale EP(part 4: Static EP load balancer integration) (#4615)
* MoeLoadBalancerConfig

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* MoeLoadBalancer integration

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* config file

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-26 18:25:11 +08:00
QI JUN
44eb053b95
introduce RequestQueueItem class instead of using tuple (#4649)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-26 17:34:53 +08:00
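
The commit above swaps an anonymous tuple for a small class in the executor's request queue. A minimal sketch of that pattern; the field names and shutdown-sentinel convention here are assumed for illustration, not taken from the actual change:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class RequestQueueItem:
    request_id: int
    request: Optional[Any] = None  # None doubles as a shutdown sentinel

    @property
    def is_shutdown(self) -> bool:
        return self.request is None

# Named fields replace positional tuple indexing at every consumer site.
item = RequestQueueItem(request_id=7)
assert item.is_shutdown
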
Shunkangz
fd27f89df6
fix: Remove duplicate tokenization in generation server (#4492)
* Add nvtx

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

* Add draft change

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

* Refactor and add support of chat

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-05-26 16:43:07 +08:00
QI JUN
4a81991b65
Chore: refine shutdown signal of PyExecutor (#4614)
* refine shutdown signal of PyExecutor

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-26 11:14:54 +08:00
Yuxian Qiu
8f055f5d14
feat: Skip sampler for intermediate pp stages. (#4514)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-26 10:08:51 +08:00
Yibin Li
bb2f545729
fix pipeline tests due to rebase (#4640)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-26 08:38:08 +08:00
shaharmor98
2b8f6d2871
Fix snake case format (#4559)
fix snake case format

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-25 17:57:17 +08:00
Anton
5dff0bff8f
[#4633][doc] Fixed typo in scaffolding README.md (#4634)
* Fixed typos in the scaffolding README.md

Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>

* Fixed links for 'More examples' and 'Contribute Guide'

Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>

---------

Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>
2025-05-25 09:04:12 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Consolidate request generation in CreateNewDecoderRequests

- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Simplify request handling in CreateNewDecoderRequests

- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests

- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests

- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters

- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests

- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA

write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* support fp8 kv cache for MLA kv cache reuse

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
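
The commit body above describes writing the KV cache first and calling the up-projection GEMM only once, returning two contiguous tensors on reload. A hedged sketch of that dataflow; tensor shapes and weight names are assumed, and the real FP8 kernels are not shown:

import torch

def reload_mla_kv(compressed_kv: torch.Tensor,  # [tokens, kv_lora_rank]
                  k_pe: torch.Tensor,           # [tokens, rope_dim]
                  w_uk: torch.Tensor,           # [kv_lora_rank, n_heads * head_dim]
                  w_uv: torch.Tensor):          # [kv_lora_rank, n_heads * head_dim]
    # Only the compressed latent and rope keys live in the cache, so a
    # reused block needs exactly one up-projection GEMM each for K and V.
    k_nope = compressed_kv @ w_uk
    v = compressed_kv @ w_uv
    # Contiguous outputs mirror the "two contiguous tensors when loading
    # MLA KV Cache" point in the commit body.
    return k_nope.contiguous(), k_pe.contiguous(), v.contiguous()
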
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend (#4530)
* MoE TRTLLM backend for Qwen3

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* add extra moe_backend to test

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* address comments

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* conditionally compile kernels on newer archs

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* missing positional arg

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* Update the routing kernels

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Revise usage of TLLM_LOG_ERROR

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Add unit test for Qwen3 moe (trtllm_gen backend)

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* improve weight processing speed of moe_backend=TRTLLM; roughly 2x

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* tidy and minor fix

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* temporarily disable accuracy test that has known issue

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

---------

Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
bhsueh_NV
d69c662215
[Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP (#4575)
* fix bug of qwen3 fp4 workflow with EP

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of qwen3_moe with ep

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-23 13:34:05 +08:00
coldwaterq
1cf0e672e7
fix: [nvbugs/5066257] serialization improvements (#3869)
* added a restricted pickler and unpickler in a separate serialization function.

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>

* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects (which made debugging harder), added tests.

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>

* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* cleaned up a couple files to reduce conflicts with main.

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* fix unit tests

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* reorder BASE_ZMQ_CLASSES list alphabetically

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* revert changes to import log of tensorrt_llm._torch.models

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* additional comments for multiprocess approved list sync

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* add dataclass from tests

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

---------

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-23 13:06:29 +08:00
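
The approved-class list described above is the standard restricted-unpickler pattern. A minimal self-contained sketch; the BASE_ZMQ_CLASSES contents here are placeholders, not the project's real list:

import io
import pickle

BASE_ZMQ_CLASSES = {("builtins", "dict"), ("builtins", "list")}  # placeholder allow-list

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Reject anything not on the explicit allow-list instead of importing it.
        if (module, name) not in BASE_ZMQ_CLASSES:
            raise pickle.UnpicklingError(f"{module}.{name} is not approved")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

class Evil:  # stand-in for an unexpected class arriving over IPC
    pass

try:
    restricted_loads(pickle.dumps(Evil()))
except pickle.UnpicklingError:
    pass  # unapproved class rejected, as intended
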
Yuxian Qiu
38241b2346
fix: Fix moe_ep_groups/moe_cluster_groups in Mapping. (#4555)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-23 10:41:49 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243)
* feat: Enabling disagg serving with the TRT backend using the Python runtime

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing formatting

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Kunyao Wu
60a6c20174
Scaffoldingllm supports MCP (#4410)
* support mcp

# Conflicts:
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* move all into contrib/mcp

# Conflicts:
#	examples/scaffolding/contrib/mcp/mcptest.py
#	tensorrt_llm/scaffolding/__init__.py
#	tensorrt_llm/scaffolding/contrib/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py
#	tensorrt_llm/scaffolding/task.py
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* support sandbox, websearch

# Conflicts:
#	examples/scaffolding/contrib/mcp/mcptest.py
#	examples/scaffolding/contrib/mcp/weather/weather.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* remove pics

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* pre-commit fix

# Conflicts:
#	tensorrt_llm/scaffolding/contrib/mcp/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* fix spell

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* rebase

Signed-off-by: wu1du2 <wu1du2@gmail.com>

---------

Signed-off-by: wu1du2 <wu1du2@gmail.com>
2025-05-23 01:54:49 +00:00
QI JUN
1e55d616da
Chore: clean up _gather_dp_requests_num method of PyExecutor (#4571)
clean up _gather_dp_requests_num method of PyExecutor

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-23 08:37:39 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels (#4330)
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix cache key

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix lint

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

---------

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
Robin Kobus
e5c90883a9
fix: Move cv2 import to load_video function (#4541)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-22 17:56:07 +02:00
QI JUN
1e5d526db4
Chore: clean up _merge_dummy_request method of PyExecutor (#4438)
* clean up _merge_dummy_request method of PyExecutor

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update comment

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-22 18:19:07 +08:00
Chuang Zhu
3410508020
cache_transceiver_config (#4556)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 13:59:51 +08:00
Kaiyu Xie
2898d268f9
feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856) (#4349)
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>
2025-05-22 11:46:06 +08:00
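
A sketch of what such a route can look like in a FastAPI-style server; only the route name comes from the commit title, and the handler body is assumed:

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health_generate")
async def health_generate() -> JSONResponse:
    # A real probe would push a one-token generation through the engine
    # and return 500 on failure; this stub only shows the route shape.
    return JSONResponse(content={"status": "healthy"})
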
Yan Chunwei
4798d088d9
chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs (#3823)
* partition LlmArgs

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* update backend

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-22 09:40:56 +08:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL (#4125)
* agentConnection

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

recv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

agentState

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

NIXL interfaces

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update cmakelists

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

nixl improve

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

remove cppzmq

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

transferAgent remove register

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

work for cache Test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

reduce sleep time

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

integrate

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

nixl env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix rebase error

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

cpp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

stash for send metaData

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

test_env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

avoid port conflict in test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use std::string

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* typo

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Zongfei Jing
dbaddb3a29
Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216)
* Adding two-shot allreduce kernel and mnnvl multicasting buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

Adding comments

Signed-off-by: Shiyu Li <shili@nvidia.com>

Add unittest of the twoshot kernel.

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update dispatch logic

Signed-off-by: Shiyu Li <shili@nvidia.com>

Use cpu barrier instead of GPU at init

Signed-off-by: Shiyu Li <shili@nvidia.com>

Merge dispatch logic fix

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update the kernel to use GPU-managed buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean up

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Simplify AllReduce interface

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix warning

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Skip ut for no_fusion

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Shiyu Li <shili@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Shiyu Li <shili@nvidia.com>
2025-05-22 03:42:36 +08:00
dongxuy04
4018806742
feat: large-scale EP(part 3 - refactor: FusedMoe for redundant expert) (#4495)
refactor fused_moe for redundant expert

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-21 17:17:49 +08:00
WeiHaocheng
a201ce9d53
docs: update the introduction for scaffolding (#4360)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-21 14:54:01 +08:00
Thor Johnsen
5d438be59a
[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936)
* v1.5

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

v1.5.4 Add back draft_overhead to spec dec stats

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v1.5.5: fix CI error

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.6: fix CI error 8196 > 8192

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* precommit run

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v2.0: Address reviewer concerns

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v2.1: add fix from wili

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Revert changes that require use of TypeAlias because that requires python version >= 3.10

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

---------

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-21 10:40:00 +08:00
Yan Chunwei
9199793848
fix: llmapi-launch add add trtllm-bench test with engine building (#4091)
* add trtllm-bench mgmn test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-21 10:18:01 +08:00
Yuxian Qiu
62c16b6d37
fix: skip weights defined in create_weights for pp. (#4447)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-21 10:13:20 +08:00
djns99
a030a898d1
perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-21 10:00:36 +08:00
Zheng Duan
77a0189554
feat: conditional disaggregation in disagg server (#3974) 2025-05-21 09:57:46 +08:00
Rohan Varma
3d940e77f0
[TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug (#4290)
* add bidirectional support and fix EarlyStopDecoder unsqueeze to be compatible with LogitsStorage

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

* run pre-commit

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

* instead of bidirectional flag use ModelConfig.is_generation

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

* fix unit test to extract logits from correct dim

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

---------

Signed-off-by: Rohan Varma <rohanv@nvidia.com>
2025-05-20 10:15:36 -07:00
Robin Kobus
8564c5a41f
refactor: Unify request order in TRT and PyTorch workflow (#4096)
* chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-20 18:49:27 +02:00
Daniel Cámpora
f038218f83
fix: Fix TRTLLMSampler beam width bug. (#4473)
* Fix TRTLLMSampler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added type hint.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-20 18:00:14 +02:00
Yan Chunwei
174c5188a2
fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu (#4428)
* add test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-20 20:16:14 +08:00
dongxuy04
21aff2e313
feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384)
* first commit of cpp moe loadbalance code

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python bindings for moe load balance

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python wrapper, ut and bug fixes

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add binding for layerId and update binding test

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add host tensor sharing and ut

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
Zhanrui Sun
f2c0565577
chore: bump version to 0.21.0rc0 (#4465)
* chore: bump version to 0.21.0rc0

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update CODEOWNERS

Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-20 12:19:50 +08:00
Lucas Liebenwein
de409e8468
[AutoDeploy] HF factory improvements (#4371)
* [AutoDeploy] HF factory improvements

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* improve monkey-patches and add unit tests

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-19 20:13:43 -07:00
kanghui0204
6f3922f318
feat: Low Precision Allreduce for PCIe based GPU (#4344)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce handles communication between PCIe-based GPUs using low-precision quantization, which accelerates the allreduce over PCIe.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-20 06:53:46 +08:00
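
The description above is enough to sketch the idea: agree on a shared scale, quantize, reduce fewer bytes over the PCIe link, dequantize. A conceptual single-scale reference, assuming an initialized torch.distributed process group; the real kernels fuse these steps and transmit the quantized int8 payload directly, whereas this sketch reduces in int32 purely for overflow-safe clarity:

import torch
import torch.distributed as dist

def low_precision_allreduce(x: torch.Tensor) -> torch.Tensor:
    # Agree on one shared scale across ranks so summed values stay comparable.
    amax = x.detach().abs().max()
    dist.all_reduce(amax, op=dist.ReduceOp.MAX)
    scale = (amax / 127.0).clamp(min=1e-8)
    # int32 accumulation: the cross-rank sum of int8-range values cannot overflow.
    q = torch.round(x / scale).to(torch.int32)
    dist.all_reduce(q)
    return q.to(x.dtype) * scale
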
Yuxian Qiu
c8e062bfd3
fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-19 14:25:36 -07:00
Venky
bb02d86b54
test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) (#4128)
* changes to run llama-v3.3-nemotron-super-49b

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* yapf

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* address review comments pt 1

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* re-add cpp super tests 

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-19 12:00:48 -07:00
liji-nv
58e405624a
[https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 (#3952)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Iman Tabrizian
c6074c47da
Add llama4 disagg accuracy tests (#4336)
* Add llama4 disagg accuracy tests

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Make it async and add GSM8K benchmark

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-19 21:55:08 +08:00
Yukun He
98018f3bb9
Downgrade the logger level for fallback tactic warning. (#4440)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-19 18:26:54 +08:00
Zhenhuan Chen
e70a205dab
[TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split (#4337)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-05-19 17:53:41 +08:00
Yuxian Qiu
cf6cd940e5
feat: Add pp support for hybrid attn/mamba model (#4358)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-19 14:47:45 +08:00
Yan Chunwei
5b1c88de8d
chore: cleanup perf_evaluator code (#3833)
* chore: cleanup perf_evaluator code

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* up

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-19 13:21:36 +08:00
Pengyun Lin
039f7e3118
[https://nvbugspro.nvidia.com/bug/5243740][fix] deduce default max_tokens for trtllm-serve (#4265)
* Deduce default max_tokens for trtllm-serve

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Improve executor_config.max_seq_len assignment in TRT workflow

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Enhance error message

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Add deduced max_tokens test

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

---------

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-19 00:34:40 +08:00
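
The deduction described above amounts to budget arithmetic against the engine's maximum sequence length. A hedged sketch; the function name and signature are illustrative, not the served endpoint's actual code:

from typing import Optional

def deduce_max_tokens(prompt_len: int, max_seq_len: int,
                      requested: Optional[int] = None) -> int:
    # Whatever the prompt leaves of max_seq_len is the hard generation ceiling.
    budget = max_seq_len - prompt_len
    if budget <= 0:
        raise ValueError("prompt already fills max_seq_len")
    return budget if requested is None else min(requested, budget)

assert deduce_max_tokens(prompt_len=1000, max_seq_len=4096) == 3096
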
shaharmor98
27afcb9928
add changes for fp8, nemotron-nas, API (#4180)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-18 23:27:25 +08:00
Kaiyu Xie
3e08cd231c
fix: Remove real size allocation (#4396)
Remove real size allocation

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-18 19:13:22 +08:00
rakib-hasan
49f993d862
Removing the outdated argument (#4408)
removing the outdated argument

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-18 15:52:15 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible (#3439)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
Lucas Liebenwein
7c85890ec7
[AutoDeploy] eager pattern matcher new pattern (#4370)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:44 -04:00
Lucas Liebenwein
0e872ef0b0
[AutoDeploy] fix: proper process group clean up (#4373)
[AutoDeploy] proper process group clean up

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:25 -04:00
Netanel Haber
9cd8148f28
API Breaking Change + Readability: "decoder"->"sampler" (#4121)
* *decoder*->*sampler*; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors

* **Breaking Change**, as it changes public interfaces, main changes:
* PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler.
* Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py.

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-05-16 23:52:25 +08:00
ixlmar
13b61405e8
fix: improve PyExecutor resource allocations (#4299)
chore: restore symmetry of worker start/shutdown
chore: fix return type of cal_max_tokens
chore: type some more return values
fix: free resources before re-claiming

Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 16:28:10 +01:00
Tracin
7b19acfab1
fix: Fix chat template kwargs bug. (#4387)
* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 23:07:46 +08:00
Lucas Liebenwein
8e4320ede5
[AutoDeploy] configurable cache resize (#4372)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 10:07:09 -04:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
Fridah-nv
bce281d592
feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) (#3638)
* add docstring to summarize current rope support

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor: replace call_method, adjust inserting order of cos_sin_cache calculation node

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* add unit test for triton rope and ds rope

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update rope matcher to match DS RoPE, add custom op for reference, add unit test case

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* cache cos[pos_idx].unsqueeze and sin[pos_idxs].unsqueeze

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor doc update

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate pattern matching and optimization for explicit and complex rope + minor updates

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* clean rope impl in repo

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* replace fused_flattened_mla_with_cache's rope impl with torch_apply_rope_with_qk_interleaving, update unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate layout infer and transpose to a new transformation

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update rope_with_explicit_freqs and rope_with_input_interleaved to expose unsqueeze_dim and support match_rope_layout, add unit tests

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* solve merge conflict in transform.py, need to fix optimize_rope with cuda graph capture

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor clean up after rebase

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* fix pre-commit

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* support map to bnsd layout and infer unsqueeze_dim from op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix cos/sin not the same across prompts in the same batch issue when mapping to flashinfer op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix for unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix custom op input/output node ordering issue for DeepSeek V3 rope

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* clean code

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* move flattening of cos_sin_cache to the graph, update flashinfer op docstring and test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* debug transform unit test failure

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-05-16 09:55:32 -04:00
NVJiangShao
a6f2a1e918
Fix test_fused_moe_w4afp8 (#4393)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-05-16 17:21:33 +08:00
Daniel Cámpora
df19430629
chore: Mass Integration 0.19 (#4255)
* fix: Fix/fused moe 0.19 (#3799)

* fix bug of stream init

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix: Add pre-download of checkpoint before benchmark. (#3772)

* Add pre-download of checkpoint before benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add missing remote code flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move from_pretrained to throughput benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move download and use snapshot_download.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Removed trusted flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix benchmark command in iteration log test.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding with overlap scheduler (#3839)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fuse

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* TRTLLM-4875 feat: Add version switcher to doc (#3871)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* waive a test (#3897)

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939)

Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>

* fix: remote mpi session abort (#3884)

* fix remote mpi session

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* skip fp8 gemm for pre-hopper (#3931)

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with overlap scheduler (#3975)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update multigpu list

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix namings

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* Doc: Fix H200 DeepSeek R1 perf doc (#4006)

* fix doc

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* update perf number

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* Fix the perf regression caused by insufficient cache warmup. (#4042)

Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* doc: Update 0.19.0 release notes (#3976)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* Optimize the AutoTuner cache access code to reduce host code overhead. (#4060)

The NVFP4 Linear op is very sensitive to the host overhead.
This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Update switcher (#4098)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* doc: update release notes (#4108)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* docs:update 0.19 doc. (#4120)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* docs:add torch flow supported model list. (#4129)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* doc: Release V0.19 Perf Overview Update (#4166)

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>

* Fix readme of autodeploy.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update tensorrt_llm/_torch/pyexecutor/llm_request.py

Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Revert mgmn worker node.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Change to disable_overlap_scheduler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>
2025-05-16 10:53:25 +02:00
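
One commit folded into this mass integration describes customizable AutoTuner hooks (find_nearest_profile and a cache-key override) for the host-overhead-sensitive NVFP4 linear op. An illustrative sketch of that pattern; the class, signatures, and key layout here are assumptions, not the project's real interface:

class ProfileSnappingTuner:
    """Illustrative only: snaps shapes to profiled buckets for cache-key reuse."""

    def find_nearest_profile(self, seq_len: int, profiles: list) -> int:
        # Snap to the smallest profiled length covering seq_len so nearby
        # shapes hit one cache entry instead of re-tuning per exact shape.
        covering = [p for p in profiles if p >= seq_len]
        return min(covering) if covering else max(profiles)

    def get_cache_key(self, op_name: str, seq_len: int, profiles: list):
        return (op_name, self.find_nearest_profile(seq_len, profiles))

tuner = ProfileSnappingTuner()
assert tuner.get_cache_key("nvfp4_linear", 3000, [1024, 4096, 8192]) == ("nvfp4_linear", 4096)
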
Tracin
46c5a56444
Support dynamic per-tensor FP8 (#4250)
* Support dynamic per-tensor FP8

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Update test cases.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 13:33:58 +08:00
WeiHaocheng
54d28718c7
feat: support benchmark on scaffolding (#3328) (#4286)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-16 12:28:49 +08:00
yuxianq
a1daa22970
doc: Add docstring for Attention and MLA module. (#4354)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-16 09:37:04 +08:00
Suyog Gupta
b0f7522c82
[AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile (#4240)
* add a torch-compile backend

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* readme changes

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* add torch-cudagraph backend

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* update readme

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* update readme

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* further enhanced compiler backends

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* further enhance readme

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* better specified defaults in simple_config.py

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* fix typo in simple_config.py

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* updated deepseek-v3 support

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* revert accidental deletion in AD Readme

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 08:38:15 +08:00
rakib-hasan
25407249a5
[TRTLLM-5054][fix] Removing repeated loading of input processor (#4161)
removing repeated loading of input processor

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-16 08:04:58 +08:00
Lucas Liebenwein
4883121477
[AutoDeploy] fix: disable overlap scheduler until supported (#4365)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-15 16:19:30 -07:00
Yechan Kim
c6e2111f4e
feat: enhance trtllm serve multimodal (#3757)
* feat: enhance trtllm serve multimodal

1. made the load_image and load_video asynchronous
2. add image_encoded input support to be compatible with genai-perf
3. support text-only input on multimodal models (currently, Qwen2-VL & Qwen2.5-VL)

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix bandit

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* trimming utils

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* trimming for test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* genai perf command fix

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* command fix

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* refactor chat_utils

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* stress test genai-perf command

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-05-15 16:16:31 -07:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
Venky
5ebe32f06f
enh: Enable option in trtllm-bench build subcommand to avoid loading weights (#4142)
* expose load_format 

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

* yapf

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-05-16 03:50:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
ixlmar
4ee82fc0fd
chore: reduce code duplication (#4297)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-15 09:25:37 +01:00
Zongfei Jing
f0ca60a95d
Add allreduce and rmsnorm fusion for qwen3 (#4304)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-05-15 16:22:11 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe from the reused cache and do the up-projection (see the sketch below)
use the 192/128 head-size MLA context kernel
support Blackwell and Hopper for now
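
A rough sketch of the up-projection with DeepSeek-V3-like dimensions; the dims and weight names are illustrative assumptions, not the exact kernel interface:

```python
# Reuse path: the cache stores the compressed latent + rotary part per token;
# full per-head K/V are recovered by up-projection instead of re-running prefill.
import torch

num_heads, kv_lora_rank, rope_dim, nope_dim, v_dim = 128, 512, 64, 128, 128
seq = 16

compressed_kv = torch.randn(seq, kv_lora_rank)   # cached latent
k_pe = torch.randn(seq, rope_dim)                # cached rotary part (1 head)

w_uk = torch.randn(kv_lora_rank, num_heads * nope_dim)  # up-projection weights
w_uv = torch.randn(kv_lora_rank, num_heads * v_dim)

k_nope = (compressed_kv @ w_uk).view(seq, num_heads, nope_dim)
v = (compressed_kv @ w_uv).view(seq, num_heads, v_dim)
# k_pe is shared across heads; concatenation yields the 192-wide K head
# (128 no-RoPE + 64 RoPE) paired with the 128-wide V head -> the 192/128 kernel.
k = torch.cat([k_nope, k_pe.unsqueeze(1).expand(-1, num_heads, -1)], dim=-1)
assert k.shape[-1] == 192 and v.shape[-1] == 128
```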

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
Kaiyu Xie
b4e5df0ee0
Breaking change: perf: Enable scheduling overlap by default (#4174)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-15 14:27:36 +08:00
Fridah-nv
d008d6412f
feat:[AutoDeploy] Update MoE pattern matcher to drop expert selection logic (#3283)
* update matcher to match expert compute first, then extract other args with LCA

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* support 3D and 2D input in torch.ops.moe.trtllm_fused_moe

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update custom ops to support 3D and 2D inputs

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* update deepseek patch

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-05-15 13:53:09 +08:00
nv-guomingz
e76cf9d9fe
fix:https://nvbugs/5234033 enable starcoder trt-flow with transforme… (#3909)
fix: https://nvbugs/5234033 enable starcoder trt-flow with transformers 4.51.3.

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-15 11:16:45 +08:00
Zeyu WANG
2681b26e48
[TRTLLM-2795] feat: Add yarn support for other models in trt-flow (#3840)
Add YaRN support for general models (e.g. Llama, Qwen) other than DeepSeek in trt-flow.

Signed-off-by: Zeyu Wang <zeyuw@nvidia.com>
2025-05-15 11:03:57 +08:00
Mike Iovine
f9adac3dea
[feat] Enable chunked context for flashinfer (#4132)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-15 10:59:38 +08:00
QI JUN
498ce8a056
Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340)
Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)"

This reverts commit 5e634dd1bd.
2025-05-15 09:52:39 +08:00
sugunav14
7c828d767f
feat: [AutoDeploy] DSV3 mla attn ref op (#4272)
* raw ref op + new patch untested

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* Added mla attn ref op and unit tests for attn + module patches

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* update stray changes in deepseek.py

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* Updated stale documentation

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* removed stray update in sdpa return shapes

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

---------

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-05-15 01:58:20 +08:00
HuiGao-NV
f4059c6e2e
Add test case for kv memory estimation (#4158)
* Add test case for kv memory estimation
* Dump running log into file and parse kv cache memory size from file
* Set a bigger peak memory size for the mixed-precision case and the test_ptp_quickstart_advanced_eagle3 case
* Revert change to usage of fraction
* use context manager to guard temp files

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-14 18:39:25 +08:00
kanghui0204
5e634dd1bd
feat: Low Precision Allreduce for PCIe based GPU (#3851)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce quantizes communication to low precision on PCIe-based GPUs, which accelerates the PCIe allreduce process (see the sketch below).
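
A conceptual sketch of the low-precision idea, written as an all-gather-based allreduce for clarity; the actual feature is a fused CUDA kernel and its wire format may differ:

```python
# Exchange int8-quantized shards (1 byte/element over PCIe instead of 2 for
# fp16), then dequantize and accumulate locally in full precision.
import torch
import torch.distributed as dist


def low_precision_allreduce(x: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()
    # Per-rank symmetric quantization to int8.
    scale = (x.abs().amax() / 127.0).clamp(min=1e-8).reshape(1)
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)

    gathered_q = [torch.empty_like(q) for _ in range(world)]
    gathered_s = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(gathered_q, q)
    dist.all_gather(gathered_s, scale)

    out = torch.zeros_like(x)
    for qi, si in zip(gathered_q, gathered_s):
        out += qi.to(x.dtype) * si  # dequantize each rank's contribution
    return out
```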

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-14 16:45:43 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123)
Support DeepSeek-R1 W4A8 on Hopper

Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
brb-nv
8280c3d4f2
feat: Support Gemma3-1b-it in Pytorch workflow (#3999)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 14:02:44 +08:00
Fridah-nv
21dbd163a7
[TRTLLM-5188] fix: [AutoDeploy] unwaive AD build test (#4273)
* unwaive small build test

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* unwaive mutigpu/integration tests

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* fix for torch.compile+flashinfer attention

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
2025-05-14 10:40:12 +08:00
Zhanrui Sun
23b9705bf4
chore: bump version to 0.20.0rc3 (#4261)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-14 10:15:25 +08:00
Anurag Mukkara
b0a03a289c
fix: Merge PP overlap and non-overlap executor loop (#3878)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-05-14 06:04:36 +08:00
brb-nv
cd5b3d21a0
feat: Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 03:47:22 +08:00
Frank
c0c3c7f68c
[TRTLLM-5233][feat]: Add chunking to PyT heuristic for trtllm-bench. (#4133)
* Add chunking to PyT heuristic.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Cast tokens and batch size to ints.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-13 21:47:06 +08:00
Yukun He
cbca6505ff
[nvbugs/5268808][fix] Fix the list-out-of-range access issue of AllReduce workspace on multi-node. (#4159)
This issue was found for tp=ep=8 on multi-node machines due to inconsistent PP sizes.
* Reform the workspace allocation implementation to avoid the list-out-of-range issues.
* Disable min_latency_mode under the multi-node scenario to avoid the illegal memory access issue.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-13 17:17:25 +08:00
Perkz Zheng
e8d7834c50
fix: [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled (#4101)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-13 14:58:54 +08:00
v-shobhit
1770dd96d8
Fix Pipeline Parallelism in Llama4 (#4106)
Signed-off-by: Shobhit Verma <shobhitv@nvidia.com>
2025-05-12 22:54:37 -07:00
nvpohanh
13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetching safetensors files so that they are stored in the system file
cache. This significantly speeds up the model weight loading for the
very first run after entering the docker container.

This is beneficial because model weight loading is done layer-by-layer,
which means reading from the safetensors chunk-by-chunk, and that cannot
utilize the internet bandwidth very well, assuming that these files are
stored in some network drives. Instead, loading the whole files in bulk
can achieve higher internet bandwidth utilization.

When running with world_size>1, all ranks collaboratively prefetch these
files.

In theory, we should add heuristics to decide whether to prefetch the
files or not, but that is beyond the scope of this commit.

For example, when the CPU memory is small, doing prefetching may result
in file cache thrashing, resulting in slower weight loading time.
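
A simple sketch of the prefetch pattern described above; the rank striding and chunk size are illustrative choices:

```python
# Bulk sequential reads warm the OS page cache so the later layer-by-layer
# safetensors reads are served from memory rather than a network drive.
import glob
import os


def prefetch_safetensors(model_dir: str, rank: int = 0, world_size: int = 1,
                         chunk_bytes: int = 64 << 20) -> None:
    files = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
    for path in files[rank::world_size]:  # each rank takes a disjoint subset
        with open(path, "rb") as f:
            while f.read(chunk_bytes):
                pass
```

Striding by rank is one simple way to split the work; the open heuristics question (e.g. skipping prefetch on low-memory hosts) is left for later, as the commit notes.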

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.

The Python operator group_rms_norm uses the heuristic by default.
Users can also explicitly pick the base or large-batch kernels.
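
A sketch of what a logistic-regression selector like this can look like; the feature set mirrors the description above, while the coefficients are placeholders rather than the trained per-architecture models:

```python
# Illustrative selector: predict whether the large-batch kernel wins.
import math


def use_large_batch_kernel(batch_size: int, allocated_warps: int,
                           num_sms: int) -> bool:
    # Scheduling efficiency: how evenly full waves of blocks cover the SMs.
    sched_eff = (batch_size % num_sms) / num_sms
    features = [math.log2(batch_size), float(allocated_warps), sched_eff]
    weights = [0.9, -0.05, 1.2]  # placeholder coefficients
    bias = -5.0                  # placeholder intercept
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z)) > 0.5  # sigmoid + 0.5 threshold
```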

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
Erin
4becf32360
fix: reshape token_ids for lp in torch backend (#4239)
reshape token_ids

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-13 08:43:47 +08:00
wili
eba3623a54
Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979)
* feat/vbws-part4-v1.8: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* feat/vbws-part4-v1.9: fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.1: remove useless variables

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.2:fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.3: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.4: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.5: remove API change

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-12 22:32:29 +02:00
yuxianq
a4c3359513
fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-12 23:25:54 +08:00
Fridah-nv
3dbb087292
[TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake (#4199)
* update output shape of fake kernel prepare_fused_mha_metadata_fake

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-05-12 11:11:40 -04:00
yuxianq
b35f9a67f9
refactor: Allow models to override apply_qk_norm. (#4078)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-12 19:38:24 +08:00
Zheng Duan
c9e2a963e0
feat: add kv cache aware router (#3831)
* kv cache aware router

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* add tests

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* router config

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* eviction test

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

add test

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* eviction detect in worker test

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* move worker tests to single gpu

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* reduce memory fraction

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* fix partial block

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

---------

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-12 07:23:57 -04:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding (#4066)
* finish

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* exc overlap scheduler

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix api ref

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Zhenhuan Chen
9212e9a740
[TRTLLM-4911] feat(scaffolding): make sampling_params only settable by controller (#4151)
feat(scaffolding): make sampling_params only settable by controller

Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-05-12 15:29:09 +08:00
Chuang Zhu
1333f4f5d5
remove cache_transceiver_prealloc_size (#4153)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-12 11:53:53 +08:00
Frank
0dcf47f1c2
[TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. (#3875)
* Set cuda graph max batch size.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Set padding.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-09 23:20:52 +08:00
Mike Iovine
4b8ba7ad61
[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue (#4069)
[fix] Fix llama 4 test lists

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-09 22:45:14 +08:00
chenfeiz0326
ffc13bd325
Cherry-pick: Use multi-threading to load MoE expert weights (#4137)
* Use multi-threading to load MoE expert weights

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

* Update code formatting

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update code formatting

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-09 17:29:24 +08:00
WeiHaocheng
0f01826dde
feat: support task collection to collect information (#3328) (#3824)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-09 17:09:01 +08:00
Fanrong Li
0cf0fce5d3
[fix] Fix add_dummy_requests for spec decoding cases (#4084)
* fix add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add max_seq_len to eagle3 test and fix add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix prompt_len in add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add prepare_resource condition in add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add some description of token_nums to add_dummy_requests and fix token_nums in torch compile warmup.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix available_tokens.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 16:52:51 +08:00
Fanrong Li
77f8e43592
[fix] Fix relaxed acceptance to support enabling it in context phase (#4126)
* fix relaxed acceptance to support enable this feature in context phase.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix sample_and_accept_draft_tokens unit test.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 14:11:14 +08:00
Yukun He
c9cac432dc
chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse (#4169)
Fix import break caused by rebase.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 12:51:02 +08:00
Mike Iovine
d80dc40135
[nvbug/5262268][fix] Fix trtllm-bench for llama 4 (#4104)
[fix] Fix trtllm-bench for llama 4

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
2025-05-08 21:27:57 -07:00
Erin
cdf5ae1547
fix: change pp broadcast pattern for LPs (#4130)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-08 20:07:13 -07:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
Add Piecewise CUDA Graph Support

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00
Yukun He
5b61486d87
chore: Clean up the legacy DeepseekAllreduceFusionOp. (#4081)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 10:20:41 +08:00
bhsueh_NV
700d09ab65
[TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model (#4141)
* fix bug of attention dp on qwen3

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix pre-commit changes

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of attention dp 8

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-09 09:29:39 +08:00
pcastonguay
836c142e1b
[feat] Allow overriding cli args with yaml file in trtllm-serve (#4164)
feat: Allow overriding cli args with yaml file in trtllm-serve
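
A sketch of the override semantics, assuming a simple dict merge where YAML values win; the real trtllm-serve option handling may differ:

```python
# YAML values override CLI-provided ones in this sketch.
from typing import Optional

import yaml


def merge_options(cli_args: dict, yaml_path: Optional[str]) -> dict:
    merged = dict(cli_args)
    if yaml_path:
        with open(yaml_path) as f:
            merged.update(yaml.safe_load(f) or {})
    return merged


cli = {"max_batch_size": 256, "kv_cache_free_gpu_memory_fraction": 0.9}
args = merge_options(cli, yaml_path=None)  # without a YAML file, CLI values stand
```

With this precedence, a YAML file can pin deployment settings while ad-hoc CLI runs still work without one.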

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-08 21:19:05 -04:00
dongxuy04
7147efb2e8
fix: alltoall padding for chunked MoE (#4157)
fix alltoall padding for chunked MoE

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-09 09:01:35 +08:00
forrestl
9477661f4c
Support RingAttention in the BertAttention plugin and the DiT model (#3661)
support ring attn for bert_attention plugin and dit model

Signed-off-by: ChunhuanLin <lch_xdu@163.com>
2025-05-09 08:06:54 +08:00
Mike Iovine
9afe510367
[fix] Fix llama4 + eagle3 (#3998)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-08 19:20:27 -04:00
Frank
57afbf6b79
Fix incorrect conversion. (#4112)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-09 06:34:52 +08:00
Lucas Liebenwein
48ed38a2ac
[fix] [AutoDeploy] flashinfer usage on H100 (#4162)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-09 06:00:57 +08:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main (#4086)
* feat: TRT-LLM Gen FP8 MoE Llama4

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* feat: TRT-LLM Gen llama4 MoE Top1 routing

Signed-off-by: Jiqun Tu <jtu@nvidia.com>

* feat: add per tensor FP8 TRT-LLM Gen GEMMs

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add guard for routingIndicesClusterKernel

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
Yukun He
bb7bcc75c2
feat: Fallback to NCCL for various patterns when input size is large. (#4080)
* Fallback to NCCL for various patterns when input size is large.
Move the previous implementation to cpp side.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Revising.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-08 11:13:13 -07:00
shaharmor98
7d94c9561f
feat: support multi lora adapters and TP (#3885)
* support multi lora, tp

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-08 23:45:45 +08:00
Yuan Tong
5b93273156
feat: adopt new logprob definition in PyTorch flow (#4057)
feat: align logprob definition of PyTorch flow

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
2025-05-08 20:16:40 +08:00
Enwei Zhu
74df12bbaa
[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946)
* fix formula

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update doc

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* 1st version

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* polish

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-08 19:35:23 +08:00
Tracin
b0dd581e6b
Fix TP8 for NVFP4 kv duplication. (#4143)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-08 17:30:02 +08:00
zihaok
81cc60a0fd
[feat/] enable attention DP in Llama4 maverick model - part 1 (#4065)
* add feature

cosmetic changes

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

address precommit fix

cosmetic

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* add feature

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* fix bug

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* address comments

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* remove WAR

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* fix format precommit

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* Update tensorrt_llm/_torch/models/modeling_llama.py

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>

---------

Signed-off-by: Zihao Kong <zihaok@nvidia.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-08 05:06:40 +08:00
hlu1
26a2679217
[Deepseek] Refactor Deepseek Decoder layer (#4016)
Refactor Deepseek Decoder layer

Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-08 01:43:10 +08:00
Pengyun Lin
721f84a0ac
fix: Align default setting & remove unnecessary check for chat and completion (#3888)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-07 14:42:53 +08:00
Yan Chunwei
0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
* beam_width and max_new_token

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove beam_width

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove min_length

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove return_num_sequences

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
rakib-hasan
bf9ac96de3
Adding option to specify a set of token ids for multimodal tokens (#4107)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-07 12:15:41 +08:00
bhsueh_NV
f670a036df
[Qwen3] chore: fix bug of fused_moe on tp > 1 (#4093)
* fix bug of fused_moe on tp > 1

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* refine codes

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-07 11:06:37 +08:00
Enwei Zhu
c28b90984f
[TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API (#3985)
* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-07 11:06:23 +08:00
Kaiyu Xie
52d4302dda
bench: TRTLLM-4936 Port benchmark_serving.py (#4011)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-05-07 09:45:14 +08:00
milesial
001e666fc5
fix: Pass local dir to processor creation (#4018)
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-06 12:25:04 -07:00
Erin
cba1793cda
cleanup logprob params (#4039)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-07 00:50:16 +08:00
Daniel Cámpora
c56a2aca46
fix: Properly get decoding mode according to same logic as cpp. (#4026)
* Properly get decoding mode according to same logic as cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Cross reference getDecodingMode implementations in pytorch - cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Better bindings for DecodingMode.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert to version in main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert configuration.py.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-06 21:53:17 +08:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
* disable overlap in encoder

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: overlap same batch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Improve early exit

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
yuxianq
b6cfe08c52
fix: Fix NVLink version decoding. (#3996)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-06 13:56:50 +08:00
HuiGao-NV
5a4794b387
fix: skip add new slot if request has slot 0 (#3991)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-06 07:46:39 +02:00
Suyog Gupta
ac2ab9ba36
[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy (#4024)
* reuse batch_indices, positions across layers

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* fix flashinfer unit tests

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* simplify call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* fix call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

---------

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-05-06 10:46:36 +08:00
bhsueh_NV
e053cb651b
Fix: fix bug of qwen3 moe (#4058)
* fix bug of qwen3 moe

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* update threshold

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-06 08:20:15 +08:00
pansicheng
e84dc6b3c7
feat: add deepseek-r1 reasoning parser to trtllm-serve (#3354)
* add deepseek-r1 reasoning parser

Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>

* fix test

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

---------

Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-06 08:13:04 +08:00
Iman Tabrizian
85867d76dd
test: Add disaggregated serving accuracy tests (#4036)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-05 08:56:59 -07:00
Daniel Cámpora
aa980dc92f
fix: instantiate decoder early in pytorch (#4029)
* Instantiate decoder early to have better mem estimation.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Improve mem estimation by instantiating decoder earlier.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-05 10:31:53 +02:00
yuxianq
017701343e
fix: apply rope twice in Qwen3. (#4040)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-05 15:12:45 +08:00
Yukun He
aa38e28cfa
fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988)
* Fix AllReduce kernel hang issue when both tp and pp are enabled.
Allocate one workspace for each pp rank to avoid potential race.
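
A sketch of the per-rank allocation idea; the class and sizing are illustrative, not the actual implementation:

```python
# One allreduce workspace per pipeline-parallel rank, so microbatches running
# on different pp ranks never share (and race on) the same buffer.
import torch


class PerPpRankWorkspace:
    def __init__(self, pp_size: int, workspace_bytes: int, device: str = "cuda"):
        self._buffers = {
            pp_rank: torch.empty(workspace_bytes, dtype=torch.uint8, device=device)
            for pp_rank in range(pp_size)
        }

    def get(self, pp_rank: int) -> torch.Tensor:
        return self._buffers[pp_rank]
```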

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* update waive list

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-05 11:33:25 +08:00
yuxianq
266fef88f2
feat: support to trace executor loop. (#3983)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-05 10:26:33 +08:00
qixiang-99
bf4f7ad744
feat: add Pytorch support of Vision Encoder for multimodal models (#3791)
* feat: Add rename_weights_with_regex function for dynamic weight key renaming

Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability.
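
A minimal sketch of such a regex renamer; the pattern table is a made-up example:

```python
import re


def rename_weights_with_regex(weights: dict, patterns: dict) -> dict:
    """Map HF-style keys to TRT-LLM-style keys via ordered regex substitutions."""
    renamed = {}
    for key, tensor in weights.items():
        for pattern, repl in patterns.items():
            key = re.sub(pattern, repl, key)
        renamed[key] = tensor
    return renamed


# Made-up mapping: "vision_model.encoder.layers.0.self_attn.q_proj.weight"
# becomes "encoder.layers.0.attn.q_proj.weight".
mapping = {r"^vision_model\.": "", r"\.self_attn\.": ".attn."}
```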

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Implement SiglipVisionModel and related components

Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder.
Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs.

Currently SiglipVisionModel supports batch sizes larger than one. Also, input and output shapes are kept the same as HF for compatibility.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Add CLIPVisionModel and associated components

Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests

Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests

Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes

Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading

Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components

Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Consolidate weight loading logic into a shared implementation

Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter

Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization

Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture.

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

---------

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-03 05:13:47 +08:00
Daniel Cámpora
cb2c1cc829
[https://nvbugs/5248923] fix: Correctly sizes seqslotmanager considering pp. (#3984)
* Correctly sizes seqslotmanager considering pp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Adapt order.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 20:06:32 +02:00
Daniel Cámpora
c7cf032b89
fix: Move all casters to customCasters. (#3945)
* Move all casters to customCasters.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use customCasters in all bindings.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added customCasters to userbuffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 19:08:28 +08:00
hlu1
52edabab30
Fix Deepseek MTP with moe_backend=TRTLLM (#4001)
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-02 14:47:22 +08:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
Lucas Liebenwein
be916b19e0
feat: [AutoDeploy] unfusing attention for native support (#3668)
* [AutoDeploy] unfused streamlined attention + caching

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* improved unit testing

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* reviewer feedback

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* some updates to attn_mask handling

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* updated manual benchmarking and cudagraph capture

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-02 09:06:49 +08:00
Yukun He
a1645c922b
Fallback to NCCL for various patterns when input size is large. (#4009)
When the input size is larger than the max workspace size, we fall back to NCCL plus the corresponding pre/post functions to preserve AllReduce functionality (see the sketch below).
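
A sketch of the size-based dispatch, with the fused op stubbed out; the threshold and helper names are assumptions:

```python
# Fused custom allreduce only when the message fits the preallocated
# workspace; otherwise NCCL plus unfused pre/post processing.
import torch
import torch.distributed as dist

MAX_WORKSPACE_BYTES = 16 << 20  # assumed workspace capacity


def _rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


def fused_allreduce_residual_norm(x, residual):
    # Stand-in for the one-shot fused custom kernel used on the fast path.
    dist.all_reduce(x)
    return _rms_norm(x + residual)


def allreduce_residual_norm(x, residual):
    if x.numel() * x.element_size() <= MAX_WORKSPACE_BYTES:
        return fused_allreduce_residual_norm(x, residual)
    # Oversized input: plain NCCL allreduce, then the unfused post ops.
    dist.all_reduce(x)
    return _rms_norm(x + residual)
```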

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 15:17:16 -07:00
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend (#3145)
* support lp in pytorch backend

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* fix tp

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-01 14:15:30 -07:00
Suyog Gupta
f94af0fb86
[AutoDeploy] Make all ranks agree on kv-cache size (#4007)
* make all ranks agree on kv-cache size

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* minor cleanups

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* use all_gather_object wrapper

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

---------

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
2025-05-02 04:07:28 +08:00
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388)
* support return logprob in llmapi

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

update and add test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

stability test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* revert removal of old flag

Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
2025-05-01 12:47:14 -04:00
bhsueh_NV
129bf19980
model: support Qwen3 (#4010)
* add qwen3 dense model pytorch backend support, initial commit

fix the incorrect results issue

add qwen3 moe model pytorch backend support

reformat the code

* perf - use flash_infer rmsnorm for qwen3

* feat - support qwen3 moe rmsnorm

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B.

* Update the remaining modifications for the single-stream Q/K norm change that were missed in the previous commit.

* fix bugs of running qwen3 public models and fp8 models

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs due to rebase

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs captured by pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of attention

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Keddy Jin <jin.gq@aliyun.com>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Co-authored-by: shao <shao@nvidia.com>
2025-05-01 23:12:41 +08:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS (#3865)
* add relaxed acceptance for DS R1

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* clean and update docs

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* Modified based on review

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix mtp manager issue

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

---------

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
milesial
6ded5f984b
Llama4 processor fixes (#3994)
* fix: Propagate sampling params

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>

* fix: type hints

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>

---------

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:45:53 +08:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest (#3855)
* Add a new param to LlmRequest and Request to natively support mm

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* update comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update tests to match the new LlmRequest constructor parameters

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify unit test and mm_embedding's dict name in llama4

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix LlmRequest initialization in kvCacheManagerTest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up code for prompt_tuning_config

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up prompt_tuning_config in GenerationRequest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Frank
1e317c98c6
[feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. (#3776)
* Move world options to a different group for clarity.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add eos_id option.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-01 09:42:46 +08:00
Yukun He
9cc5922a0b
Clean up allreduce op in Deepseek V3 model. (#3829)
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 07:56:36 +08:00
Mike Iovine
8c2c969fcb
[fix] Pad requests to maximum draft length in spec decode (#3957)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-30 11:02:18 -04:00
Julien Debache
83670571dd
feat: Mistral-Large-2 support in the PyTorch workflow
- Added a modelling file for models configured by a `MistralConfiguration` object, as it is slightly different from the Llama one
2025-04-30 20:12:39 +08:00
Zhanrui Sun
86e7474a9b
chore: bump version to 0.20.0rc2 (#3949)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-30 11:44:43 +08:00
yuxianq
f568cbb671
chore: Remove duplicated get_sm_version. (#3935)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-30 11:43:53 +08:00
Fanrong Li
e6b482ef47
fix: change the seq_lens sync copy to an async one (#3786)
---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-29 23:56:49 +08:00
tomeras91
35010e8073
Support NemotronH FP8 Quantization
(1) Match quantization exclude-module names to TRT-LLM names (see the sketch below).
(2) No special weight loading is needed for quantization scale weights. (#3891)
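
A sketch of the name-matching idea in (1); the mapping table is a made-up example, not the PR's actual table:

```python
# Translate HF/ModelOpt-style exclude-module prefixes to TRT-LLM naming so
# the quantization config excludes the intended modules.
HF_TO_TRTLLM = {
    "backbone.embeddings": "transformer.vocab_embedding",
    "backbone.norm_f": "transformer.ln_f",
    "lm_head": "lm_head",
}


def translate_exclude_modules(exclude_modules: list[str]) -> list[str]:
    translated = []
    for name in exclude_modules:
        for hf_prefix, trtllm_prefix in HF_TO_TRTLLM.items():
            if name.startswith(hf_prefix):
                name = trtllm_prefix + name[len(hf_prefix):]
                break
        translated.append(name)
    return translated
```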

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-04-29 18:51:43 +03:00
yuxianq
0f8ec693b2
fix: get head_dim from model’s config. (#3916)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 23:04:29 +08:00
HuiGao-NV
8e6eead6a5
refactor: (part1) Add constraints doc for fusedMoe module. (#3882)
* Add doc string for FusedMoe module
* Address comments.

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-29 22:23:02 +08:00
Junhong Liu
06e76020d7
feat: parallel q_b_proj and concat (#3917)
* add parallel_q_b_proj_and_concat

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* code cleanup

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* one gemm/concat and then split the latent_cache and pass them separately to context/gen

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

---------

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
2025-04-29 22:07:05 +08:00
Dom Brown
8709fe8b53
chore: bump version to 0.19.0 (#3598) (#3841)
test: add test cases for 0.19 release (#3608)

* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info

squash (#3642)

fix: nvbugs/5187237: fix deterministic mode crash (#3448)

* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"

  This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.

* revert ar fusion

update fp8 doc (#3647)

tests: change qa perf test to trtllm-bench (#3619)

fix: FP8 quantized lm_head (NvBug 5214229) (#3567)

infra: Add PR approval protection for the release branch (#3634)

fix: nvbugs/5231298: pytorch allreduce issue (#3673)

Fix: nvbugs/5222698 variable not defined (#3630)

* Fix: nvbugs/5222698 variable not defined
* Tidy code

test: sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)

test: restore fp8 kv cache testing for L0 (#3671)

doc: Update DeepSeek perf docs (#3693)

* Update DeepSeek perf docs
* update
* Apply suggestions from code review

tests: waive test_llm_multi_node (#3664)

fix: update test_user_buffers_mm_add_prologue atol (#3711)

Fix: cherry-pick hmac encryption from main branch (#3635)

* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)

Un-waive DS-V3-Lite tests. (#3621)

fix: FP8 kv accuracy (#3675)

* fix FP8 kv accuracy
* update doc

Fix script options for engines. (#3622)

unwaive multi-node test (#3721)

chore: Split more tests out of gpt tests (#3524) (#3674)

doc: add torch examples link into torch backend documentation (#3749)

test: Get Eagle tests working (#3593) (#3722)

Waive L0 test (#3756)

waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)

Update ds v3 parameters in stress test. (#3676)

waive gemma on L20 (#3766)

https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)

Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.

fix: PP4 fixes and cleanup (#3688)

remove benchmark test list (#3643)

skip disagg deepseek test if sm!=90 (#3720)

test: skip failed cases on B200 (#3710)

* add skip condition to tests
* fix error

test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)

* skip_pre_ada for fp8 cases
* update
* update after rebase

add known issue to deepseek doc. (#3800)

Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)

Waive L0 tests (#3826)

fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)

* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace the pre-defined bucket size strategy with a generating function based on tune_max_num_tokens.
* Add free_memory logic of workspace in the min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)

  min_latency_mode is only set to False during the warmup phase. Thus, when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.

[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)

Fix pre-commit

Fix again

Address some review comments for the MI

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-29 16:57:22 +08:00
bhsueh_NV
2e230b73ec
change log level of some text from info to debug (#3930)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 13:38:34 +08:00
yuxianq
adfa04745e
fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 11:26:13 +08:00
bhsueh_NV
0610d0ff84
add num_scheduled_requests into print_log (#3914)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 11:22:22 +08:00
Frank
cf15efa15e
[TRTLLM-4883][fix]: Update output speed calculation. (#3923)
* Update gen tps calculation.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add back output speed for comparison.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix issue with f-string.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix some spacing.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Replace output speed with per-request genphase tput.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add gen TPS breakdown.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Update some tagging.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-29 11:04:12 +08:00
Perkz Zheng
35c5e4f1c5
feat: add CGA reduction fmha kernels on Blackwell. (#3763)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address the comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-04-29 10:43:54 +08:00
hlu1
d2f312b8e4
Fix fp8 kvcache (#3877)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-29 10:31:10 +08:00
WeiHaocheng
8a994d879f
feat: fix errors on scaffolding README (#3899)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-29 10:15:06 +08:00
yuxianq
b91da764de
chore: remove DummyKvCacheManager. (#3896)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 09:59:37 +08:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues (#3686)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00
Mike Iovine
e6f7ff3a46
[chore] Make llama4 MoE use maybe_execute_in_parallel (#3779)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 10:58:03 -04:00
Zhenhuan Chen
ad15e45f07
[TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding (#3807)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-04-28 17:15:33 +08:00
bhsueh_NV
f77252e9ff
fix bug of creating cuda stream as default parameter, which will be initialized during importing (#3764)
* fix bug of creating cuda stream as default parameter, which will be initialized during importing

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* add torch.cuda.Stream() for the leader node

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix pre-commit issue

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-28 08:16:03 +08:00
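
The bug fixed above is the classic Python default-argument pitfall: default values are evaluated once, at function definition time, so a CUDA stream created in a signature forces CUDA initialization as a side effect of merely importing the module. A minimal sketch of the pitfall and the usual sentinel fix (illustrative only, not the actual py_executor code):

    import torch

    # Buggy: torch.cuda.Stream() runs when this module is imported, because
    # Python evaluates default arguments once, at definition time.
    def run_buggy(fn, stream=torch.cuda.Stream()):
        with torch.cuda.stream(stream):
            fn()

    # Fixed: use None as a sentinel and create the stream lazily, per call.
    def run_fixed(fn, stream=None):
        if stream is None:
            stream = torch.cuda.Stream()
        with torch.cuda.stream(stream):
            fn()
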
Yan Chunwei
ad4226d946
fix: trtllm-bench build trt engine on slurm (#3825)
* add submit_sync to RemoteMpiSessionClient

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* add barrier

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix comment

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* disable test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-27 22:26:23 +08:00
bhsueh_NV
76f2c631fb
fix: add warmup flag into py_executor to prevent enabling profiler during warmup (#3852)
* add warmup flag into py_executor to prevent enabling profiler during warmup

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* change setting warmup to all ranks

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-27 19:22:42 +08:00
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager (#3798)
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix args

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00
HuiGao-NV
136aab5c54
fix: Update num_of_ctx_tokens in iteration stats (#3785)
* Update num_of_ctx_tokens in iteration stats
* Revert unnecessary change of importing module
2025-04-27 10:24:47 +08:00
bhsueh_NV
e9fab4f3d9
fix bug of deepseek group_size setting (#3860)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-27 09:10:37 +08:00
yuxianq
e6c14ca97a
fix: Detect pmix and raise error when mpirun is not used. (#3858)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-26 21:49:41 +08:00
milesial
362a8272f8
feat: llama4 input processor (#3383)
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-25 16:47:14 -07:00
sugunav14
5b9897a8cd
fix: [AutoDeploy] update hf loading for e_score_correction_bias (#3847)
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-26 02:03:47 +08:00
dongxuy04
16535991b2
feat: Add MNNVL MoE A2A support (#3504)
* add MNNVL memory mapping support

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add more MPI environment for trtllm-llmapi-launch

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MoE communication and prepare kernels

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MNNVL AlltoAll support for DeepSeekV3

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add output dump for throughput benchmark

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* support dynamic kernel launch grid

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments #2

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-04-25 17:29:08 +08:00
Yuan Tong
57944206ba
feat: return logits in PyTorch flow (#3221)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:56:03 -07:00
hlu1
d72add1794
[Deepseek] Pass hidden_states_fp4 to shared_experts (#3819)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-24 13:12:12 -07:00
HuiGao-NV
7420ddc3d0
fix: fix lora case failure (#3838)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-24 07:29:08 -07:00
WeiHaocheng
3fc2a16920
feat(part 2): Enhance the integration robustness of scaffolding with __init__.py #3305 (#3731)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-24 18:47:03 +08:00
Zhanrui Sun
ae34d60108
chore: bump version to 0.20.0rc1 (#3834)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-24 17:43:37 +08:00
hlu1
cd2bcdc1a9
Fix create_weights in attention (#3692)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-24 07:30:00 +08:00
Kaiyu Xie
dfbcb543ce
doc: fix path after examples migration (#3814)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-24 02:36:45 +08:00
Daniel Cámpora
1299f27c74
fix: Fix C++ decoder synchronization in PyTorch (#3106)
* Use updateDecoderBuffers in python decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix synchronize in trtllm decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Enable by default.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use guided_decoder to setup seqslots and free them.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use always decode_async and update_requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update decoder buffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix speculative decoding tests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Send new_tensors_host instead of assuming dict.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make default False in enable_trtllm_decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Partially fix mtp, partially fix py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update request states before sending disagg ctx cache.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix disagg test for torch decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add disagg serving case to guided decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Get overlap scheduling to work.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update cutlass to main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update after rebasing.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update to use decode async and update requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Properly pass information to update_requests

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make disaggregated serving a step closer to working.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase and format.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Copy new device tokens more pythonic.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Restore MTP add dummy reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add ordereddict import to py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added seq slot manager. Add test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use transmission for single tensor except when list of tensors is received.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add TRTLLMDecoder allocation to estimate max kv cache tokens.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add stream synchronization

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Format

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add decoder creation to estimate max kv.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update submodule UCXX inline with main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-04-23 23:55:27 +08:00
Mike Iovine
0bc520f15e
fix: Limit llama4 context length to 8k (#3778)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-23 08:55:10 -07:00
shaharmor98
49262a62a5
add passing E2E LoRA flow (#3788)

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-23 18:38:06 +03:00
Enwei Zhu
a51b3cf7a6
[TRTLLM-4763][test] Accuracy test improvement (Part 3.6): Deprecate mmlu_llmapi.py (#3802)
* cleanup mmlu_llmapi.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* polish

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-23 23:05:13 +08:00
Fanrong Li
bc1c4ddcb5
fix: remove the unnecessary metadata changes in mtp. (#3787)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-23 16:01:28 +08:00
Zongfei Jing
1e5af736ea
Add smart router for moe (#3641)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-23 12:21:59 +08:00
shaharmor98
5fff8f0935
Add running E2E LoRA flow (#3648)
* add passing E2E LoRA flow

Signed-off-by: Shahar Mor <smor@nvidia.com>

* add experimental feature

Signed-off-by: Shahar Mor <smor@nvidia.com>

* fix llm_args definition

Signed-off-by: Shahar Mor <smor@nvidia.com>

* manually decreased max loras size to address OOM

Signed-off-by: Shahar Mor <smor@nvidia.com>

---------

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-23 11:19:41 +08:00
Alessio Netti
4728256bb6
chore: Move cv2 import inside load_video() function (#3768)
Signed-off-by: Netti, Alessio <netti.alessio@gmail.com>
2025-04-22 21:48:35 +02:00
Lucas Liebenwein
06b914e0f9
feat: [AutoDeploy] generalizing cudagraph to multiple dynamic inputs (#3589)
* generalizing cudagraph to multiple dynamic inputs

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* fix for failing test

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-04-23 03:38:51 +08:00
Xianjie Qiao
ba4131f176
Add log_level for disaggregated_mpi_worker (#3765)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-04-22 09:14:46 -07:00
Zongfei Jing
7eee9a9d28
doc: Update doc for Deepseek min latency (#3717)
* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Update doc for min latency deepseek

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Throw exception for RouterKernel when not running on sm90+

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-22 23:07:59 +08:00
Yukun He
0ae7017342
Unify two versions of AllReduce custom op (#3032)
* Rewrite unit test for unified allreduce op. Removing the legacy unit test.
* Revise formats, fusion_op bindings. Put all tensors as optional inputs.
* Move the MoeAllreduceOp to a separate custom op.
* Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions.
* Add more TODOs, fixing minor bugs, and remove legacy code.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-22 21:58:42 +08:00
bhsueh_NV
b87f26ee2a
chore: remove useless allgather (#3751)
* remove useless allgather

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix pre-commit issue

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-22 21:26:22 +08:00
Enwei Zhu
353699a3b3
fix: fnmatch usage in modeling_utils.py (#3754)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-22 13:13:53 +08:00
Yi Zhang
98966cb45e
test: Unwaive Llama 3.1 with torch compile test (#3475)
* Fix log info

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Revert "test: Waive torch compile tests (#3471)"

This reverts commit 410f56357e.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Update test_llm_api_pytorch.py

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

---------

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-04-22 10:41:56 +08:00
Kaiyu Xie
a32389b4cd
fix: Remove unnecessary max call (#3574)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-22 10:33:50 +08:00
Enwei Zhu
3fa19ffa4e
test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483)
* add gsm8k

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix gsm8k

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add gpqa

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* conditional import lm_eval

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* gpqa in lm_eval

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* system prompt

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* shuffle

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update AA prompt and regex

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* revert AA prompt and regex

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* integration to tests

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add DS-R1

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix and clean

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update tests

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* clean up

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* free_gpu_memory_fraction=0.8

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-22 07:38:16 +08:00
bhsueh_NV
0c07d4dc21
Fix/executor bugs (#3681)
* fix bugs of py executor

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs of py executor

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* revert changes about mpi_barrier()

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-22 07:23:27 +08:00
Kaiyu Xie
943f3ff8f6
Revert "Report number of context tokens in one iteration (#3691)" (#3740)
This reverts commit e0446a4dc0.

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-22 01:21:43 +08:00
Iman Tabrizian
af04b6f6aa
bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)
* Fix hang bug when KV cache is low

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Review comments

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Fix attentiondp typo

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add CI test for this case

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* fix: Fix the insertion order for responder futures

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* fix: Fix disagg CPP

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-21 15:16:55 +08:00
katec846
eeb605abd6
feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)
* Feat: Offload ptable to cpu if enable_chunk_context

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Feat: offload ptable to cpu for chunk context mode

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix and add comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update Readme for multimodal and add a new param mm_embedding_offloading

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* fix: Correct prompt table offloading condition in PromptTuningBuffers

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up the code

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Add comments to explain copy from cpu <-> gpu using pinned memory

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix namings based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix format based on precommit

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify --mm_embedding_offloading flag

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-21 14:31:01 +08:00
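
The pinned-memory copies mentioned in the commit above are what make CPU offloading affordable: page-locked host buffers let host<->device transfers run asynchronously and overlap with compute. A rough sketch of the pattern (hypothetical shapes and names, not the actual PromptTuningBuffers code):

    import torch

    gpu_table = torch.randn(4096, 1024, device="cuda")  # multimodal embedding table

    # Offload: keep the table in page-locked (pinned) host memory.
    cpu_table = torch.empty(gpu_table.shape, dtype=gpu_table.dtype,
                            device="cpu", pin_memory=True)
    cpu_table.copy_(gpu_table)

    # Per chunk, copy back only the rows the current prefill step needs;
    # non_blocking=True is effective because the source is pinned.
    chunk = torch.empty(256, 1024, device="cuda")
    chunk.copy_(cpu_table[:256], non_blocking=True)
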
yuxianq
faef37782a
fix: Remove ParallelConfig. (#3678)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-21 14:14:08 +08:00
HuiGao-NV
e0446a4dc0
Report number of context tokens in one iteration (#3691)
2025-04-21 13:45:28 +08:00
yuxianq
591f3d2be8
fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. (#3679)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-21 12:28:56 +08:00
hlu1
31624b079a
feat: [Deepseek] Add trtllm-gen FP4 MOE backend (#3387)
* Add TRT-LLM Gen MOE to Deepseek

fix fused moe rebase bug.

Fix atol in test_fp4_gemm_quantize.py

fix fused moe rebase bug.

Fix FusedMoe.

Disable 2nd routing kernel preexit

Bump routing reduction to fp32

Disable PDL for fc1

[DEBUG] Lift token limit to 16k

[Bugfix] Token limit to 16k + fp32 routing + tanh

Make fp8 tileN 8

Fix FP8 MoE + Remove redundant temp output for FP4

[FP8-only] Avoid wasting CTAs for activation kernel

fix: unblock FP8 weightloading with trtllm-gen

Remove max_token limit for trtllm-gen path

perf: avoid type-conversion and fill_ from aten

Minor fix

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix rebase issues

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix compile issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* CI clean

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Hao Lu <haolu@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-21 10:01:33 +08:00
hlu1
17eba98445
Refactor Deepseek tp_size calculation (#3695)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-19 23:55:19 -07:00
brb-nv
c35d2a7532
test: Get Eagle tests working (#3593)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-20 00:50:57 +08:00
yuxianq
5346f53250
feat: Introduce feature properties for attention backend. (#3659)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-19 12:37:27 +08:00
hlu1
c861b6cf17
Clean up modeling_deepseek.py (#3640)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-18 17:54:33 -07:00
Yechan Kim
5460d18b10
feat: trtllm-serve multimodal support (#3590)
* feat: trtllm-serve multimodal support

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* remove disable argument

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* remove disable

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add and separate tests and move the doc

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* remove block_resue arg from serve.py

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-19 05:01:28 +08:00
pcastonguay
ae5671644a
feat: Disaggregated router class (#3584)
* Add draft scheduler class

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

* Refactor the design

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

* feat: Introduce router class for disaggregated server

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Add unit tests for router class

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding tests for disagg_utils

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing missing import

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg integration tests

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Addressing MR review comments

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-04-19 00:34:12 +08:00
Zheng Duan
bce7ea8c38
test: add kv cache event tests for disagg workers (#3602) 2025-04-18 18:30:19 +08:00
Yan Chunwei
2a09826ec4
fix hmac in remote mpi session (#3649)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-04-18 17:47:51 +08:00
HuiGao-NV
d3608d6818
Remove dummy forward path (#3669)
2025-04-18 16:17:50 +08:00
Dom Brown
dbd9a83b0d
feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582)
* feat: Integrate GPUDirect Storage (GDS) into Executor API

Squash of several dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-18 15:59:21 +08:00
Chang Liu
b8818b45be
fix: llama4: address a couple of issues in the llama4 attention module (#3491)
* fix attn module for llama4

* Address comments

* Rebase to accommodate latest attn refactor and refactor l4attn

* Remove aux_stream from classic attn

* Use RMSNorm for L2Norm

* Update tensorrt_llm/_torch/models/modeling_llama.py

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>

* Add typing informations for _attn_qkv

* Remove redundant comment

* Simplify llama4 DecoderLayer logic

---------

Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-18 01:54:59 +00:00
rakib-hasan
ff3b741045
feat: adding multimodal (only image for now) support in trtllm-bench (#3490)
* feat: adding multimodal (only image for now) support in trtllm-bench

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* fix: add  in load_dataset() calls to maintain the v2.19.2 behavior

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* re-adding prompt_token_ids and using that for prompt_len

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* updating the datasets version in examples as well

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* api changes are not needed

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* moving datasets requirement and removing a missed api change

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* addressing review comments

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* refactoring the quickstart example

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

---------

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-04-18 07:06:16 +08:00
Frank
5a6cb2b985
fix: Correct reporting of text dtype for Llama 4 (#3494)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-18 00:07:49 +08:00
Yukun He
83b36ebecd
Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-17 23:17:04 +08:00
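
The regression described above is a general autotuning hazard: if warmup only ever profiles one mode, every other mode misses the tactic cache at inference time and silently falls back to the default tactic. A toy illustration (hypothetical names, not the fused-MoE implementation):

    DEFAULT_TACTIC = 0
    best_tactic = {}  # (min_latency_mode, num_tokens) -> tactic id

    def warmup(token_buckets, profile):
        # Bug: only min_latency_mode=False is ever profiled here, so the
        # cache never gains entries for min_latency_mode=True.
        for n in token_buckets:
            best_tactic[(False, n)] = profile(False, n)

    def run_moe(num_tokens, min_latency_mode):
        # With min_latency_mode=True this lookup always misses and quietly
        # falls back to the slow default tactic -- the reported regression.
        return best_tactic.get((min_latency_mode, num_tokens), DEFAULT_TACTIC)
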
yuxianq
b9b1c1368c
feat: Support unfused rope in MLA. (#3610)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-17 16:50:49 +08:00
Netanel Haber
3c52ac098f
feat: allocate minimal blocks per window size (#3028)
* implement variable window attention by breaking the block manager into window block managers per window size

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* revert isCyclic to be true if the min attention window is reached, not per window size

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* add explanatory comment to mCyclicThreshold

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* load correct gemma config

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* if TYPE_CHECKING

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* set temp_attention_window_inputs to None explicitly

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* set temp_attention_window_inputs to None explicitly

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* pass dtype as well

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* test_gemma variable sliding window attention

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers)

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove || mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* turn off request delaying for MaxUtil

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* make comments better

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* windowSizesTotalSum using std::accumulate

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix comments

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove assert that kills disagg tests, since it isn't necessary

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?', which broke the boolean logic. Main is correct

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* add Gemma3 to SUPPORTED_HF_ARCHITECTURES

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* support Gemma3

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix kvfactor field for deepseek

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix comment

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix gemma-3 entries in testlist to include vswa

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* only quantize gemma2 VSWA

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

remove misleading comment

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix: disable KV cache reuse if using attention sink (#3021)

* fix: disable KV cache reuse if using attention sink

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: disable KV cache reuse if sink bubble

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* add comment

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-17 16:04:57 +08:00
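
The "allot a fraction of primary/secondaryBlocks" bullet in the commit above reduces to a proportional split: each window size's heap receives a share of the block pool weighted by that window's contribution to the total KV-cache size across all layers. A sketch of that arithmetic under those assumptions (names are illustrative):

    def split_blocks(total_blocks, bytes_per_window):
        # bytes_per_window maps window size -> KV-cache bytes it accounts
        # for, summed over every layer that uses that window.
        total = sum(bytes_per_window.values())
        return {w: (total_blocks * b) // total for w, b in bytes_per_window.items()}

    # A 1k sliding window used by two layers vs. one full 8k-window layer:
    print(split_blocks(1000, {1024: 2 * 1024, 8192: 8192}))  # {1024: 200, 8192: 800}
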
danielafrimi
0f084d9566
added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455)
LoraOp integration into torch modules

Signed-off-by: Ubuntu <dafrimi@nvidia.com>
2025-04-17 12:48:27 +08:00
yuxianq
239fe0ff26
chore: Use ellipsis as default value to detect whether residual argument is provided (#3626)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-17 12:31:58 +08:00
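
The Ellipsis trick above is useful whenever None is itself a legal argument value, so it cannot double as the "not provided" marker. A minimal sketch of the idiom (not the actual module code):

    def forward(hidden_states, residual=...):
        if residual is ...:
            # Caller did not pass residual at all: use the default behavior.
            residual = hidden_states
        elif residual is None:
            # Caller explicitly opted out of the residual path.
            return hidden_states
        return hidden_states + residual
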
Luis Vega
a06bff5052
Fix rotary_emb param in NemotronH attention (#3646)
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
2025-04-16 21:03:07 -07:00
Luis Vega
0bda1f9780
feat: Nemotron-H model support (#3430)
* added files for nemotron-h

Signed-off-by: Luis Vega <lvega@nvidia.com>

* use try/except to import RMSNorm

Signed-off-by: Luis Vega <lvega@nvidia.com>

---------

Signed-off-by: Luis Vega <lvega@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-16 14:05:56 -07:00
Mike Iovine
41a6c98544
Support CUDA graphs for EAGLE3 (#3176)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-17 04:53:50 +08:00
hlu1
b6bae33453
Clean up linear.py, mlp.py, gated_mlp.py (#3553)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-16 12:21:44 -07:00
Yibin Li
351808efeb
fix: Use hmac authentication for pickle encryption (#3384)
* hmac initial implementation to encrypt worker and proxy queue

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* set different hmac key for each pair of server/client queue

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix comments

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix style

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

---------

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-04-17 00:40:13 +08:00
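
The HMAC scheme above is the standard way to harden pickle over an IPC channel: sign the serialized bytes with a shared secret and refuse to unpickle anything whose tag fails verification. A generic standard-library sketch (the actual worker/proxy queue code differs):

    import hashlib, hmac, os, pickle

    KEY = os.urandom(32)  # per the commit, each server/client queue pair gets its own key

    def dumps_signed(obj) -> bytes:
        payload = pickle.dumps(obj)
        return hmac.new(KEY, payload, hashlib.sha256).digest() + payload

    def loads_verified(blob: bytes):
        tag, payload = blob[:32], blob[32:]  # SHA-256 tag is 32 bytes
        if not hmac.compare_digest(tag, hmac.new(KEY, payload, hashlib.sha256).digest()):
            raise ValueError("HMAC verification failed; refusing to unpickle")
        return pickle.loads(payload)

    assert loads_verified(dumps_signed({"ok": True})) == {"ok": True}
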
yuxianq
fd8ded2b2b
feat: Support cos_sin_cache in all cases. (#3517)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-16 13:48:44 +08:00
Jinyang Yuan
efabf6b443
chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs (#3572)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 21:51:42 -07:00
Zhanrui Sun
9d88ee3e45
chore: bump version to 0.20.0rc0 (#3561)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-16 11:41:21 +08:00
Enwei Zhu
44da0e8d60
fix: LLM API _hf_model_dir for non-cached case (#3562)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-16 10:39:34 +08:00
Daniel Cámpora
41ce5440fe
chore: Mass integration of release/0.18 (#3421)
* [Infra][TRTLLM-4063] - Branch out for the TRT-LLM v0.18.0 release

Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit de90312020e51c22ba5e75b3502c7ee90c059265)

* [Infra][TRTLLM-3652] - Update dependencies to TRT 10.9 / CUDA 12.8.1 / DLFW 25.03(Internal)

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 58db1340ef7db22f1910f878d220a92be5b830d1)

* [None][Doc] - Update docs for v0.18.0

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit d23e75bc95619ce3b116213d55319272888e0c88)

* [Infra] - Fix or WAR issues in the package sanity check stages

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit e874e2b127515c52ba10c8df1cc2631627f74ffe)

* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path

Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 731811d4e182d70a66193d646152cb71dfafe83a)

* cherry-pick 'test: Update cluster and multi node test lists and trtllm-bench' test to fix perf drop issue

Signed-off-by: Ruodi Lu <ruodil@nvidia.com>
(cherry picked from commit 5214616283fbc15ae98871a1d84c78d8e1f2e6e8)

* Revert "Merge branch 'user/yukih/fix_5173454_5173432' into 'release/0.18'"

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 8d34831cb2b81ee2dfa8021b68e7158b33789a5f)

* [Infra]Restrict setuptools version to avoid sasb pip install issue

Signed-off-by: Emma Qiao <qqiao@nvidia.com>
(cherry picked from commit 1e60ad29e0dafec0e295bedb5d89b716a02a707c)

* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path

Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 3ed8164e5bfea1d5aa2039b5408439fd6cf59dac)

* WAR for bug 5173448

Signed-off-by: Thor Johnsen <tjohnsen@nvidia.com>
(cherry picked from commit b6528b2ba15322b6c6a4c81a8b74c04d4973de4f)

* [Infra][TRTLLM-3652] - Update dependencies to CUDA 12.8.1 / DLFW 25.03

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 6560983d132d9d257ee15849664eb055e94adaa9)

* [Docs] - Doc changes for v0.18.0

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 26769b61218a947c8f9d070f73b63d576fcc20c4)

* [Doc] - Doc change for v0.18.0

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4b3b5ed6bfbc2300e3775fe75456083faad7b235)

* [Infra] update version to 0.18.1

Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit 59e8326c75639275837d34de8e140358737a3365)

* Add back nemotron file.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix recurrentgemma reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Adding WAR for bug 5173448.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Remove duplicated file.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update examples/prompt_lookup/requirements.txt

Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Remove glm-4-9b from model dir in chatglm test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Remove indent change.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Revert changes on l0_test.groovy.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update dev images

Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

* Remove duplicated import.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix custom op

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Fix flashinfer & vanilla backend

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Skip problematic case.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Skip problematic test_moe_w4a8_1_14336_4096_8_bfloat16_True_False case.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Ruodi Lu <ruodil@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Thor Johnsen <tjohnsen@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-04-16 10:03:29 +08:00
xiweny
da47d5f27e
fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 (#3585)
* fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* remove waiver

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

---------

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-04-16 08:31:33 +08:00
Kaiyu Xie
e037d3e99b
chore: Unify Python NVTX call (#3450)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-15 23:25:36 +08:00
bhsueh_NV
3aa37e6b72
fix bug (#3570)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-15 16:50:22 +08:00
Yuan Tong
d4c0423cdb
refactor: collect executor and decoder states into dataclass (#3234)
* fix: Proper error bubbling for PyExecutor

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-15 16:31:45 +08:00
shaharmor98
ede7058544
Feat/ Integrate peftCacheManager in PyExecutor creation (#3372)
* integrate peftCacheManager in PyExecutor creation

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-15 15:14:43 +08:00
Yuan Tong
668a0335e4
fix: Proper error bubbling for PyExecutor (#3321)
* fix: Proper error bubbling for PyExecutor
* fix: Proper shutdown
* fix: multi gpu proper shutdown

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-15 14:49:46 +08:00
Jinyang Yuan
0305942808
chore: Modifications that should have been included but were mistakenly overwritten in PR #3467 (#3557)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 14:08:07 +08:00
yuxianq
0e7e949feb
refactor: Split llama4 model from llama model. (#3530)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-15 13:41:05 +08:00
Jinyang Yuan
175adb94ab
chore: Log memory sizes of weights and activations separately (#3467)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 09:48:35 +08:00
nv-guomingz
b32ae7ac92
test:add fp8_kv_cache functionality test case. (#3457)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-15 09:16:46 +08:00
QI JUN
112f716155
chore: move all distributed related codes into _torch.distributed directory (#3511)
* move all distributed related codes into _torch.distributed directory

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-15 08:39:17 +08:00
brb-nv
098ca7f68c
test: Fix breaking Phi3 multimodal tests (#3544)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-15 08:02:34 +08:00
Pamela Peng
6cdfc54883
feat: Add FP8 support for SM 120 (#3248)
* Allow FP8 on SM120

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix sm121

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix pre-commit

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* review update

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

---------

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-04-14 16:05:41 -07:00
Aurelien Chartier
8cf2785bc6
chore: unify pp_layers helpers (#3429)
* chore: unify pp_layers helpers

Fix assumptions about equal number of layers per PP rank
in prepare_attention_inputs

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-04-15 04:49:17 +08:00
Chang Liu
01cb3ccb04
use global expert idx to load expert weights (#3386)
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:58:30 -07:00
Chang Liu
1902d73eb5
fix: llama4: add an option apply_router_weight_on_input in FusedMoE (#3492)
* apply a tentative fix to moe bypass kernel update

* Pass none to disable final stage in moe

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>

---------

Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:56:42 -07:00
Kaiyu Xie
b286b51118
feat: Support torch profiler (#3470)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-14 22:06:06 +08:00
Zhanrui Sun
714ff3eedd
chore: bump version to 0.19.0rc0 (#3535)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-14 18:11:20 +08:00
Zhanrui Sun
ee4ce0379d
chore: bump version to 0.19.0rc0 (#3514)
* chore: bump version to 0.19.0.rc0

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update README

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-14 17:32:30 +08:00
dongjiyingdjy
2fb1d65d43
fix: fix max_seq_len in executor_config (#3487)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-04-14 15:13:29 +08:00
HuiGao-NV
9f41e826bf
fix: remove one duplicated line of code (#3523)
Signed-off-by: Hui Gao <huig@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-14 14:52:46 +08:00
brb-nv
44090a5388
Add support for Phi-4-MM (#3296)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-14 14:24:10 +08:00
yuxianq
9d64b6b890
Cache sin cos in model instead of global LRU cache. (#3378)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 11:19:09 +08:00
pcastonguay
fe6f14b2b1
fix: Fixing issue with first gen token being returned twice in streaming (#3427)
* fix: Fixing issue with first gen token being returned twice with streaming

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing not_expectring_strings in test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-13 22:45:09 -04:00
yuxianq
baeec63dda
refactor: Remove _pp_forward. (#3496)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 09:49:44 +08:00
HuiGao-NV
d0f83d19f1
fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache (#3497)
* fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache

Signed-off-by: Hui Gao <huig@nvidia.com>

* Fix code to get model_config of draft model

Signed-off-by: Hui Gao <huig@nvidia.com>

---------

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-13 23:02:14 +08:00
Yan Chunwei
b37c5c0a4d
make LLM-API slurm examples executable (#3402)
Signed-off-by: chunweiy <328693+Superjomn@users.noreply.github.com>
2025-04-13 21:42:45 +08:00
Yan Chunwei
74850c61e9
fix: switch ZMQ from file socket to tcp socket in RemoteMpiCommSession (#3462)
* switch ZMQ from file socket to tcp

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix comment

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-13 09:15:55 +08:00
WeiHaocheng
c6081abb0e
feat: Make scaffolding Controller more generic #3408 (#3416)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-12 21:35:38 +08:00
QI JUN
012fb9a1c4
remove useless max_num_tokens member in PyTorchConfig (#3493)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-12 21:09:58 +08:00
Robin Kobus
2ab71f9a80
refactor: decoder buffers (#3307)
* refactor: remove cumLogProbs and logProbs from DecoderBuffers

- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.

These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched

- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.

These changes enhance clarity and maintainability in the decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: integrate DecoderState for sequence length management in decoding process

- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.

refactor: update SlotDecoderBuffers to manage sequence lengths directly

- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.

These changes improve maintainability and streamline the decoding workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Delegate to asyncSend method in SlotDecoderBuffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-12 11:41:24 +02:00
yuxianq
29c5085400
fix: Fix PP for llama. (#3449)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-12 17:20:27 +08:00
nv-guomingz
adf60a8723
fix: update the default excluded_modules value for fp8rowwise recipe. (#3477)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-12 16:00:21 +08:00
hlu1
4855431d3d
[Deepseek] Redesign multi-stream API (#3459)
Signed-off-by: Hao Lu <haolu@nvidia.com>
2025-04-11 23:00:25 -07:00
Iman Tabrizian
3041bbdab3
fix: Fix disagg MTP with overlap (#3406)
* fix: disagg overlap with MTP

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Review comment

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
2025-04-12 12:27:24 +08:00
HuiGao-NV
c51e90d7d7
fix: don't perform memory estimation for start_attention (#3485)
* fix: don't perform memory estimation for start_attention

* Enable tests of unittest/_torch/multi_gpu

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-12 11:34:46 +08:00
Iman Tabrizian
c539750d42
fix: Allow context_and_generation request type in disagg overlap (#3489)
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
2025-04-11 16:15:01 -07:00
QI JUN
d167cbd5bb
refactor: remove ParallelConfig in tensorrt_llm._torch.distributed module (#3370)
* remove tensorrt_llm._torch.distributed.ParallelConfig

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix embedding test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix comments

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* polish

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* rebase

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-11 15:34:20 -07:00
Enwei Zhu
cf9ceea890
test: Add DeepSeek-V3-Lite PP=4 cases (#3454)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-12 00:09:12 +08:00
Fridah-nv
ec723fa993
feat:[AutoDeploy] Enhance RoPE support (#3115)
* add test to map flashinfer rope op with triton custom rope ops and pytorch rope in fused_mha

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* add rope matcher and unit tests

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* capture cos and sin from graph

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* revert fuse_mha op change

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor update to address comment and remove redundant unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* move view and transpose into graph nodes and update unit test to test custom op directly

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* move view into custom op, update bfs with bound, update custom op return type to be half precision

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* custom op update to support 3D input

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* handle bnsd and bsnd format, update tests, handle 3D cos/sin input to the custom op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* add llama4 rope test, update custom op with is_neox flag

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* add llama4 style rope to matcher and update unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate into two transformations

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix when num_head != num_kv_head; add support for cached position_ids and cos_sin_cache in graph; update unit tests

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor update, cache locally and propagate meta info of qk nodes

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor: fix cos_sin_cache not float

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor: move cache into matcher

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-04-11 23:51:24 +08:00
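For reference, a minimal sketch of the rotary-embedding application the matcher above targets, with an is_neox flag selecting between the two common layouts; the function name and shapes are illustrative assumptions, not the AutoDeploy custom-op signature:

```python
import torch

def apply_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
               is_neox: bool = True) -> torch.Tensor:
    """Apply rotary embedding to q of shape [batch, seq, num_heads, head_dim].

    cos/sin have shape [seq, head_dim] and must match the chosen layout:
    half-duplicated for the neox style, pair-interleaved for the gptj style.
    """
    if is_neox:
        # neox style: rotate the two contiguous halves of head_dim.
        x1, x2 = q.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
    else:
        # gptj style: rotate interleaved even/odd channel pairs.
        rotated = torch.stack((-q[..., 1::2], q[..., 0::2]), dim=-1).flatten(-2)
    # Broadcast cos/sin over the batch and head dimensions.
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]
    return q * cos + rotated * sin
```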
Yukun He
ff82aef99b
Fix the issues related to fused moe path. (#3435)
* One of the tactics is not supported during dispatch.
* final_hidden_states should be unpacked if it is not min_latency_mode.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-11 21:41:15 +08:00
liji-nv
b168adba70
feat: Add NVFP4 UB pattern optimization pass in torch compile (#3371)
* feat: Add NVFP4 UB pattern optimization pass in torch compile

* Add an additional flag for the UB fp4 pattern to avoid inverting the scale
* Add NVFP4 related UB patterns

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

* Update atol; some points fail for B200 umbriel.

Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>

---------

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
2025-04-11 21:25:29 +08:00
Shunkangz
ea050084ad
feat: Add support of chat completion in PD (#2985)
* Add support of chat completion in PD

Add support of include_usage in PD

Reformat


* Remove redundant code

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Refactor code

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Add chat completion test

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Refactor code

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-11 17:53:28 +08:00
Yechan Kim
5bc6f093c8
fix: mllama e2e pytorch flow fix (#3397)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-04-11 17:33:15 +08:00
Ivy Zhang
d998832b33
test: add torch flow test case in qa test list (#3404)
Signed-off-by: Ivy Zhang <yanzh@nvidia.com>
2025-04-11 16:57:41 +08:00
Zhihan Jiang
8300218d21
feat: support llama4 nope layers; support FP8 checkpoint loading; (#3382)
* Enable NOPE, fix a rotary embedding bug for gptj_style_rope, address PR comments, properly skip the rotary_embedding for Llama4 ROPE layers

* Add support for FP8 checkpoints, fix ckpt weight loading for FP8

* Temporarily disable min_latency_mode for llama4

---------

Co-authored-by: Yilin Fan <yilinf@nvidia.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-04-10 10:16:42 -07:00
amitz-nv
a6a2ae6cc1
chore: Rename nvsmall to nemotron nas (#3447)
* Rename nvsmall to nemotron NAS

* Revert the nvsmall to nemotron_nas rename in test paths that access llm_models_root/nvsmall/tests

* Add NemotronNAS to pytorch supported models table

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-04-10 23:16:52 +08:00
wm2012011492
af05749e90
feat: add qwen2 moe to torch flow; fix wrong imported KvCacheConfig in gpqa… (#3369)
* add qwen2 moe to torch flow; fix wrongly imported KvCacheConfig in gpqa_llmapi.py

Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>

* fix coding style

Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>

* add unittest

Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>

---------

Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com>
Co-authored-by: mengw <12670782+wm2012011492@users.noreply.github.com>
2025-04-10 22:45:57 +08:00
Yan Chunwei
c5e803ba48
chore: code cleanup for error logging and SharedMemory in proxy.py (#3432)
* cleanup log

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove shared-memory

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove ExecutorResponse

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* add assert for postproc

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-10 21:57:06 +08:00
HuiGao-NV
3ade9375ba
feat: Run PyExecutor's inference flow to estimate max_num_tokens for kv_cache_manager (#3092)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-10 18:29:40 +08:00
yuxianq
16c8f39fc5
feat: Support TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD for debugging (#3417)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-10 13:18:30 +08:00
hlu1
fbcf954d9c
[MLA] Deallocate tensors after use (#3286)
Signed-off-by: Hao Lu <haolu@nvidia.com>
2025-04-09 21:36:07 -07:00
brb-nv
c59abae436
feat: Add Gemma3 text-only model support (#3247)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-10 12:34:58 +08:00
Frank
9307ff95ae
fix: Add nested aliases for Llama 4 (#3381)
* Add nested aliases for Llama 4

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix missed alias.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-04-10 10:18:53 +08:00
Yechan Kim
943218b54a
feat: Add Qwen2.5-VL and refactor Qwen2-VL (#3156)
* feat: Add Qwen2.5-VL and refactor Qwen2-VL

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix yapf and codespell

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix test_e2e

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* generalize get_rope_index

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix qwen2.5-vl in README

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix image test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-10 04:09:03 +08:00
Maximiliano Levi
996696203f
fix: #3137 speculative decoding and multimodal input support (#3276)
* fix: broadcast embeddings input when using speculative decoding

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>

* fix: use shape tensor instead of tuple

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>

* fix: comment

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>

---------

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-09 23:40:19 +08:00
danielafrimi
47f5cf6c0d
lora_tests (#3201)
LoRA tests and layers

Signed-off-by: Ubuntu <dafrimi@nvidia.com>
Co-authored-by: Ubuntu <dafrimi@nvidia.com>
2025-04-09 18:06:52 +03:00
WeiHaocheng
6eee15900e
feat: Enhance the integrated robustness of scaffolding with __init__.py #3305 (#3312)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-09 21:13:47 +08:00
Tracin
2a2b7bfc66
Fix missing bias add for FP4Linear. (#3361)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-09 09:17:54 +08:00
Mike Iovine
5bdf997963
Add Llama 4 (#3302)
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-04-09 03:35:21 +08:00
yuxianq
7225bd8b91
chore: Refine attention backend interface. (#3271)
Refine attention backend interface.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-09 02:34:53 +08:00
wili
54ad95eaa8
Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338)
* feat/Variable-Beam-Width-Search-Part3, v1.0

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.1

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>
2025-04-08 23:51:27 +08:00
sugunav14
84fc07b011
feat: [TRTLLM-3510] DeepseekV3 support in AutoDeploy (#3281)
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-08 21:47:57 +08:00
Zhanrui Sun
63b0194c50
chore: bump version to 0.19.0.dev2025041500 (#3360)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-08 20:45:27 +08:00
yuxianq
7b03350527
Add thread leak check and fix thread/memory leak issues. (#3270)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-08 19:03:18 +08:00
liji-nv
dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
  are now managed by a global allocator. The allocator dynamically assigns a
  free UB buffer, or allocates a new one, for a torch tensor. This makes
  UserBuffers easier to use.

* In the common use case, the UserBuffers are allocated correctly during the
  warm-up stage, so there is no dynamic allocation during inference (see the
  sketch after this entry).

* The UB fusion pattern is rewritten using the new UB allocator. It contains
  the following passes:

1. Fuse quant with allreduce, replace it with the UB impl, and insert a
   copy_to_userbuffers. Currently the normal allreduce still does not
   support FP8 quant, so this needs to be done in the UB pass.
2. Convert all supported allreduce ops to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so the op
   directly writes to the userbuffer.
4. Remove the userbuffers finalize if the output is connected to another UB
   allreduce.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
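A rough sketch of the pooled-allocation idea the entry above describes, using hypothetical names rather than the actual UserBuffers API: hand out a free registered buffer when a large-enough one exists, otherwise allocate a new one, so that after warm-up every request is served from the pool.

```python
import torch

class UBPool:
    """Toy global buffer pool: reuse a free buffer if possible, else allocate."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self._free: list[torch.Tensor] = []

    def acquire(self, numel: int, dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
        for i, buf in enumerate(self._free):
            if buf.numel() >= numel and buf.dtype == dtype:
                # Reuse a registered buffer; caller views the first numel elements.
                return self._free.pop(i)
        # No reusable buffer: allocate a new one (only expected during warm-up).
        return torch.empty(numel, dtype=dtype, device=self.device)

    def release(self, buf: torch.Tensor) -> None:
        # Return the buffer to the pool so a later tensor can reuse it.
        self._free.append(buf)
```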
Yan Chunwei
deb876ecdb
clean up trtllm-llmapi-launch logs (#3358)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-08 16:00:59 +08:00
Pengyun Lin
60e02a3684
Use llm.tokenizer in OpenAIServer (#3199)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-04-08 14:55:02 +08:00
Yukun He
c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixes for the Autotuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python-side Autotuner to the current linear op for the nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python-side Autotuner to the MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use the do_preparation keyword to select which part should be executed before evaluating any tactic.
* Remove the try-catch inside the moe profiling process.
* Move the default tactic -1 to 0 transform into the cpp runner.
* Revise relevant tests.
* Predefine the bucketizing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fix and revise according to the reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference perf optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to suit it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
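In spirit, the tuner described above does something like the following sketch (hypothetical names, greatly simplified): profile each tactic once per shape bucket, cache the winner under a serializable (runner_id, bucket) key, and skip profiling on later calls. Keying on a runner id rather than the runner object is what keeps the cache serializable, matching the motivation stated in the commit.

```python
import time
import torch

_tuning_cache: dict[tuple, int] = {}  # (runner_id, bucket) -> best tactic

def choose_tactic(runner_id: str, tactics: list[int], run, x: torch.Tensor) -> int:
    # Bucketize the dynamic dimension so nearby shapes share one cache entry.
    bucket = 1 << (x.shape[0] - 1).bit_length()
    key = (runner_id, bucket)
    if key not in _tuning_cache:
        timings = {}
        for tactic in tactics:
            torch.cuda.synchronize()
            start = time.perf_counter()
            run(x, tactic)  # profile this tactic on the representative input
            torch.cuda.synchronize()
            timings[tactic] = time.perf_counter() - start
        _tuning_cache[key] = min(timings, key=timings.get)
    return _tuning_cache[key]
```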
Kaiyu Xie
0a4e1d5a55
breaking change: perf: Make ipc_periodically the default responses_handler (#3102)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-08 10:36:39 +08:00
pcastonguay
add5e5cd93
feat: Add option to run disaggregated serving without ctx servers,… (#3243)
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing comment in sanity check

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-07 21:56:03 -04:00
Ivan Sorokin
d40fce474a
fix: redrafter sampling (#3278)
* Fix redrafter sampling

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Rename redrafter beam search var

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Remove _beam_search_candidates_v0

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Remove unused import

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

---------

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>
2025-04-08 07:49:32 +08:00
amitz-nv
e5407ea89a
Fix torch nvsmall through pyexecutor and fix its TP support (#3238)
* Fix NemotronNAS support

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-04-07 11:53:17 +03:00
pansicheng
ef1ba468a1
feat: support abort disconnected requests (#3214)
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
2025-04-07 16:14:58 +08:00
Bo Li
515dd0d78f
feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190)
* fp8 kv + bf16 ctx MLA + fp8 gen MLA

Use BF16 for context MLA.
mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together.

Allow mSM==90 for mFP8GenerationMLA==true.
For FMHA, dataTypeKv should be FP8.

For FP8 MLA generation, the output is still in BF16.

Refine debug info for FMHA kernel metadata.

Use inputType, outputType, SM together to hash kernel list.

Add FP8 MLA generation FMHA kernel.

Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel.

Separate the implementation of fused_multihead_attention_v2.h into a .cpp file and print some debug info if checkIfKernelExist fails.

Refine debug info in fused_multihead_attention_v2.cpp

Correct FP8 MLA metadata.

New kernel provided by Yuxin, which outputs BF16.

smem size is not set correctly, which will lead to illegal mem access.

Yuxin fixed the error in the FMHA MLA kernel: previously the BF16 output wasn't correctly written: some parts were repeatedly written, while others were untouched.

There are two bmm1 scales that should be set correctly.

New kernel generated by Yuxin.

Modifications to common/attentionOp for FP8 MLA on Hopper using FMHA.

Not necessary: if mFP8GenerationMLA is set, is_fp8_out is false, so mFP8ContextFMHA is false.

Skip a check in fmhaDispatcher.

Modifications in fmhaRunner:
- Debug dump.
- if (!isFP8GenerationMLA) skips a lot of flag setting.
- TMA descriptor modification for qo (by Yuxin).

Cleanup debug output.

Clean up o tma descriptor modifications.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Resolve conflicts.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Apply the patch of FP8 FlashMLA and resolve conflicts.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compilation error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compile error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* pick blackwell support

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* Add copyright notice to fused_multihead_attention_v2.cpp.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Add license.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Add missing license.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Exclude building flashMLA kernels under sm90.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Revert "Exclude building flashMLA kernels under sm90."

This reverts commit f0c859d459.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Use macro to skip compiling FlashMLA for non sm90 targets.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

---------

Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: Dylan Chen <ziqingc@nvidia.com>
Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-07 15:14:13 +08:00
Shunkangz
62bc13430e
fix: fix attentionDP padding request type (#3299)
* Fix attentionDP padding request type

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Refactor import

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-07 13:28:21 +08:00
Fanrong Li
e8b97341de
fix the py_decoding_iter update in decoder. (#3297)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-07 11:18:33 +08:00
tburt-nv
7a659885e3
chore: remove usernames from comments (#3291)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-05 13:44:28 +08:00
Yan Chunwei
b21cfcfed1
chore: refactor the LlmArgs with Pydantic and migrate remaining pybinding configs to python (#3025)
* make LlmArgs Pydantic

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* amending doc

fix api_stability

fix tests

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* restore yaml groups

refine StackTrace

singleton

clean tests

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix trtllm-bench

fix pytorch

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix serve distagg

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-05 13:31:48 +08:00
Frank
f8a4cc0629
perf: Add total token throughput metric. (#3212)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-05 13:17:59 +08:00
Robin Kobus
e12e7a753d
refactor: Expose DecoderState via bindings and integrate in TRTLLMDecoder (#3139)
* refactor: Expose DecoderState via bindings and integrate in TRTLLMDecoder

- Introduced a new `DecoderState` class in the C++ bindings, encapsulating key functionalities for managing decoding state.
- Adjusted the Python `TRTLLMDecoder` to access properties from `decoder_state`, ensuring consistency and clarity in the decoding process.

These changes streamline the decoder's architecture and enhance maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove unused new_tokens from DecoderState bindings

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-05 07:42:35 +08:00
qixiang-99
0d4d50a745
feat: no-cache attention in PyTorch workflow (#3085)
* init trtllm attn no cache

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix the seq_len issue and attn metadata prepare for qwen reward model test

fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove unnecessary debug logs and clean up commented code

refactor: update max_seq_len documentation and remove max_seq_len from the decoder model constructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment

remove Debug code.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: Resolve comments for Python code

Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata

Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* docs: Add is_dummy_attention field to attention metadata for simulation operations

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes

Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix rebase format issue

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: extend attention mask type handling in MHARunnerFixedParams

Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams

Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

---------

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
2025-04-05 01:54:32 +08:00
Jinyang Yuan
1128dc2a5a
perf: Use pinned H2D to reduce bubbles (#3147)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-04 22:23:10 +08:00
yuanjings-nvda
5776b99b70
fix vila test (#3042)
Signed-off-by: Yuanjing Shi <yuanjings@nvidia.com>
2025-04-04 14:30:06 +08:00
shaharmor98
ee4aab72ec
feat: Support PeftCacheManager in Torch (#3186)
* Add PeftCacheManager implementation

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-04 12:38:08 +08:00
Tracin
bb6c338730
AWQ: support ModelOpt ckpts. (#3258)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-04 08:10:35 +08:00
pcastonguay
b763051ba4
chore: Refactor disaggregated serving scripts (#3073)
* chore: Refactor to reduce duplicated code in disagg server, reuse trtllm-serve

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Updating README, removing launch script

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing integration tests

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding scripts to populate urls section of disagg config based on SLURM env vars

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-03 14:55:05 -04:00
Fanrong Li
1fe64b90be
fix: fix the acceptance rate of pytorch workflow in trtllm-bench (#3240)
* fix acceptance rate of pytorch workflow.
* revert the RequestOutput API change.

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-03 15:12:24 +08:00
Frank
2d80db4c36
chore: Remove build config from Pytorch kwargs. (#3210)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-03 15:00:29 +08:00
Zongfei Jing
dcc0ebd273
Fix warning (#3254)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-03 13:30:23 +08:00
Jinyang Yuan
2fdfa39ea8
fix: Fix an error related to dummy request when MTP is used (#3146) 2025-04-03 11:08:12 +08:00
Anurag Mukkara
d998339855
Raise error for PP + MTP (#3244)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-04-03 04:45:31 +08:00
QI JUN
abcb0486dc
fix deepseek failure with pipeline parallelism (#3225)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 22:56:39 +08:00
Enwei Zhu
3cf7066350
test: Accuracy test improvement (Part 3.2): Move Qwen tests (NvBug 5135332) (#3219)
* remove test_llm_models_multi_gpu.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* qwen 2.5

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* upgrade

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-02 17:29:57 +08:00
QI JUN
bb10cdcfb8
chore: refine fetch new requests method (#3213)
* refine broadcast new requests method

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* refine fetch new requests method

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 10:46:00 +08:00
Zheng Duan
35b828ca2d
fix streaming in dist-serving (#3087)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-04-02 10:08:07 +08:00
Zongfei Jing
c7548ad72c
perf: Add optimizations for deepseek in min latency mode (#3093)
* Add optimizations for deepseek min latency

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Update internal cutlass kernel libs

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Format code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Resolve conflicts

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-02 09:05:24 +08:00
brb-nv
1fe3e30356
Add support for Phi-4-mini (#2990)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-02 08:34:39 +08:00
Zhanrui Sun
42963baacd
chore: bump version to 0.19.0.dev2025040800 (#3171)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-02 08:21:55 +08:00
QI JUN
8fe2e5865e
refine broadcast new requests method (#3198)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 08:05:20 +08:00
Enwei Zhu
b2f69db507
test: Accuracy test improvement (Part 3.1): Extend accuracy test suite with LLM API and initial implementation of trtllm-eval (#3167)
* add eval_llmapi

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

tmp commit

port to CLI tool

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

move

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

setup llmapi

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix spec_dec_algo

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

_update_from_hf_quant_config

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

migrate test_pytorch.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix fp8 block scales

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix fp8 rowwise

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

adj alpha

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

move test_pytorch.py cases

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

move

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

rename test_accuracy.py to test_cli.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

clean

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix cnn_dailymail

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* renaming to cli flow

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* rename MMLU

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* rename

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add error

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-01 22:20:29 +08:00
amirkl94
bf02b9144f
feature: Add LoRA support for gemma (#3068)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-04-01 19:15:55 +08:00
WeiHaocheng
ff35af77ea
feat: refactor scaffolding worker and support openai api worker (#3166)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-01 18:31:52 +08:00
Jinyang Yuan
992d513bc6
feat: Optionally split MoE inputs into chunks to reduce GPU memory usage (#3104)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-04-01 16:07:02 +08:00
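Schematically (illustrative names, not the actual implementation), the chunking option above amounts to splitting the token dimension and running the MoE layer sequentially per chunk, trading a little launch overhead for lower peak activation memory:

```python
import torch

def chunked_moe(hidden: torch.Tensor, moe_forward, chunk_size: int = 4096) -> torch.Tensor:
    """Run an MoE layer over a [num_tokens, hidden_dim] input in chunks."""
    chunks = torch.split(hidden, chunk_size, dim=0)
    # Each chunk's expert activations are freed before the next chunk runs.
    return torch.cat([moe_forward(chunk) for chunk in chunks], dim=0)
```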
brb-nv
727d78e785
Support prequantized fp8 ckpt for nemotron-mini-4b-instruct (#3046)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-01 14:52:09 +08:00
dongjiyingdjy
22ff81b047
fix: fix illegal memory access when mtp >= 2 (#3006)
* fix illegal memory access when mtp > 2

---------

Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-01 13:36:45 +08:00
Shunkangz
dda7354d1a
Refactor return of first gen token in PD (#2986)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-01 12:28:27 +08:00
jiahanc
c4ee14e43a
fix: Reverse cuda graph size order (#3116)
Signed-off-by: jiahanc <jiahanc@nvidia.com>
2025-04-01 11:28:36 +08:00
Aurelien Chartier
14e194433c
chore: cleanup py_executor code (#3132)
* chore: cleanup py_executor code

* Add common loop cleanup function
* Remove checks for attention DP if nothing to queue
* Remove extra return statements
* Remove extra variables
* Remove commented debug print

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>

* rename cleanup function

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>

---------

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-04-01 09:27:04 +08:00
Anurag Mukkara
435cd2983d
perf: Optimisations for PP + attention DP (#3134)
* Minor tp_rank fix

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* Delete unused function

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* PP broadcast for ADP new requests

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* Sync request finish point for intermediate and last pp ranks

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* Use local PP layers only for KV cache estimation

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

---------

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-04-01 08:59:16 +08:00
Frank
8bb3eea285
perf: Readd iteration logging for trtllm-bench. (#3039)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-01 08:13:09 +08:00
WeiHaocheng
f665f83256
feat: improve scaffolding shutdown process (#3084) 2025-03-31 20:39:20 +08:00
Zhanrui Sun
36ac5e78ed
chore: bump version to 0.19.0.dev2025040100 (#3152)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-03-31 16:36:06 +08:00
Quanfeng Li
839aad4d6e
fix: Add missing parameter for WeightOnlyQuantRowLinear module (#2768)
Signed-off-by: Quanfeng Li <liquanfeng7@foxmail.com>
2025-03-31 16:20:30 +08:00
QI JUN
9560fcd5ec
Chore: waive tests and fix multi-GPU tests (#3157)
* waive tests

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean up

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-03-31 16:05:45 +08:00
liji-nv
e0d0dde058
None - Add one-shot version for UB AR NORM FP16/BF16 (#2995)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-03-31 11:16:03 +08:00
Yan Chunwei
794f61c997
fix: fix single-node cannot quit issue on slurm (#3140)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-03-31 10:15:27 +08:00
Mike Iovine
5416966ddb
Add initial EAGLE-3 implementation (#3035)
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-03-29 22:31:24 +08:00
Erin
c75d7cd684
move BuildConfig functional args to llmargs (#3036)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-03-29 02:20:18 +08:00
Aurelien Chartier
3de82c41cd
Pytorch PP + attention DP support (#3044)
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-03-28 00:11:19 +08:00
Fanrong Li
ec03159e60
fix: Waive twoshot to fix acc issue (#3066)
* waive twoshot to fix acc issue

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-03-27 21:38:52 +08:00
Yan Chunwei
87ab794aa2
fix: fix hang in mgmn with trtllm-llmapi-launch command (#3119)
* init

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* restore

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-03-27 18:45:43 +08:00
Fanrong Li
0976360204
add support for MTP+cuda_graph_padding. (#3096)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-03-27 16:06:14 +08:00
Yan Chunwei
82edd90350
fix gpus_per_node in trtllm-bench when world_size < device_count (#3007)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-03-27 09:31:40 +08:00
Suyog Gupta
047f2b234d
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench (#3041)
* Enable AutoDeploy as a backend in trtllm-bench

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* update how caches are resized

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* fix: files permission from 100755 to 100644

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* some comments

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Fix function name

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* refactor

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Remove spurious change

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Add cursor generated doc strings

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* re-enable ad test

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* some perf cleanup

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* debug ci

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* ensure that overlap scheduler is enabled

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Reorder the tests

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

---------

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-26 14:33:14 -07:00
wili
3e035f2219
v1.2 (#3082)
Signed-off-by: wili <wili@nvidia.com>
2025-03-26 23:31:29 +08:00
Jinyang Yuan
6b583f6f83
perf: Enable CUDA graphs when attention DP is used and active requests on different GPUs are uneven (#3010)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-03-26 21:09:25 +08:00
Enwei Zhu
224469b096
test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references (#3069)
* committed APIs validation

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* clean name

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* separate

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add TODOs

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix naming

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-03-26 18:14:35 +08:00
Kaiyu Xie
ea3739ee62
Fix: fuse message not aligned on different processes (#3067)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-26 17:15:27 +08:00
Yechan Kim
3c7cb6629c
Add EXAONE-Deep (#3054)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-03-26 14:24:04 +08:00
DylanChen-NV
1ac0566a93
fix: fix for cp > kvHeadNum (#3002)
* fix for cp > kvHeadNum

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* fix for None kv_head_num

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

---------

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-03-26 12:39:02 +08:00
HuiGao-NV
25f2434495
fix: Set correct draft_token_nums to dummy requests for torch compilation with MTP (#3053)
Set correct draft_token_nums to dummy requests for torch compilation with MTP

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-03-26 11:32:57 +08:00
yuxianq
268933b5cc
Refactor imports inside tensorrt_llm._torch. (#3015)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-03-26 11:01:07 +08:00
WeiHaocheng
7ac04ada2a
doc: Add README.md for scaffolding (#3048)
* Add README.md for scaffolding

Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>

* Update tensorrt_llm/scaffolding/README.md

Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>

---------

Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
2025-03-25 13:58:01 +08:00
Aurelien Chartier
ef78518310
Only gather responses on rank 0 (#3040)
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-03-24 21:54:51 -07:00
Zhanrui Sun
c2ffce7dbd
chore: bump version to "0.19.0.dev2025032500" (#3019)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-03-25 10:04:17 +08:00
bhsueh_NV
11f9ecb2fd
chore: remove useless param (#3023)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-03-25 08:36:45 +08:00
Netanel Haber
da0b0e0ee3
fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size (#2983)
* fix variable window size reuse - disable when *min attention window* starts sliding, not max

* isPreCyclic -> isCyclic, and invert logic, for clarity

* getDecoderState()

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-03-24 22:49:52 +08:00
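The fix above boils down to gating reuse on the minimum window rather than the maximum: once any layer's window starts sliding, its cached blocks can be overwritten, so reuse is unsafe. A hedged sketch of the corrected condition, with hypothetical names:

```python
def kv_cache_reuse_allowed(seq_len: int, attention_windows: list[int]) -> bool:
    # Once the sequence outgrows the *smallest* attention window of any layer,
    # that layer's cache starts sliding and cached blocks get overwritten, so
    # reuse must be disabled from that point on, not only at the max window.
    return seq_len <= min(attention_windows)
```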
Yan Chunwei
531b98ed62
feat: Add several pure python configs to LlmArgs (#2997)
* add SchedulerConfig
* add PeftCacheConfig
2025-03-24 16:16:17 +08:00
Kaiyu Xie
2631f21089
Update (#2978)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-23 16:39:35 +08:00
Kaiyu Xie
3aa6b11d13
Update TensorRT-LLM (#2936)
* Update TensorRT-LLM

---------

Co-authored-by: changcui <cuichang147@gmail.com>
2025-03-18 21:25:19 +08:00
Kaiyu Xie
9b931c0f63
Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
Kaiyu Xie
77d7fe1eb2
Update TensorRT-LLM (#2849)
* Update TensorRT-LLM

---------

Co-authored-by: aotman <chenhangatm@gmail.com>
2025-03-04 18:44:00 +08:00
Kaiyu Xie
ab5b19e027
Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
Kaiyu Xie
2ea17cdad2
Update TensorRT-LLM (#2792)
* Update TensorRT-LLM

---------

Co-authored-by: jlee <jungmoolee@clika.io>
2025-02-18 21:27:39 +08:00
Kaiyu Xie
e88da961c5
Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
Dan Blanaru
16d2467ea8 Update TensorRT-LLM (#2755)
* Update TensorRT-LLM

---------

Co-authored-by: Denis Kayshev <topenkoff@gmail.com>
Co-authored-by: akhoroshev <arthoroshev@gmail.com>
Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com>

Update
2025-02-11 03:01:00 +00:00
Denis Kayshev
d93a2dde84
Fix kwarg name (#2691) 2025-01-20 12:18:26 +08:00
Kaiyu Xie
be17881062
Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
Kaiyu Xie
aaacc9bd68
Update TensorRT-LLM (#2562)
* Update TensorRT-LLM

---------

Co-authored-by: Starrick Liu <73152103+StarrickLiu@users.noreply.github.com>
2024-12-11 00:31:05 -08:00
石晓伟
548b5b7310
Update TensorRT-LLM (#2532)
* blossom-ci.yml: run vulnerability scan on blossom

* open source efb18c1256f8c9c3d47b7d0c740b83e5d5ebe0ec

---------

Co-authored-by: niukuo <6831097+niukuo@users.noreply.github.com>
Co-authored-by: pei0033 <59505847+pei0033@users.noreply.github.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2024-12-04 21:16:56 +08:00
Kyungmin Lee
4420547017
Fix typo (#2473) 2024-12-02 10:11:27 +08:00
Kaiyu Xie
385626572d
Update TensorRT-LLM (#2502)
* Update TensorRT-LLM

---------

Co-authored-by: 岑灿 <yunyi.hyy@alibaba-inc.com>
2024-11-26 16:51:34 +08:00
Kaiyu Xie
535c9cc673
Update TensorRT-LLM (#2460) 2024-11-19 18:30:34 +08:00
Kaiyu Xie
c629546ce4
Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
Kaiyu Xie
b7868dd1bd
Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
Kaiyu Xie
f14d1d433c
Update TensorRT-LLM (#2389)
* Update TensorRT-LLM

---------

Co-authored-by: Alessio Netti <netti.alessio@gmail.com>
2024-10-29 22:24:38 +08:00
Kaiyu Xie
1730a587d8
Update TensorRT-LLM (#2363)
* Update TensorRT-LLM

---------

Co-authored-by: tonylek <137782967+tonylek@users.noreply.github.com>
2024-10-22 20:27:35 +08:00
Kaiyu Xie
75057cd036
Update TensorRT-LLM (#2333)
* Update TensorRT-LLM

---------

Co-authored-by: Puneesh Khanna <puneesh.khanna@tii.ae>
Co-authored-by: Ethan Zhang <26497102+ethnzhng@users.noreply.github.com>
2024-10-15 15:28:40 +08:00
Kaiyu Xie
8681b3a4c0
open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297)
* Update TensorRT-LLM

---------

Co-authored-by: Bhuvanesh Sridharan <bhuvanesh.sridharan@sprinklr.com>
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
2024-10-08 12:19:19 +02:00
Dan Blanaru
48686bca3a
open source 7f370deb0090d885d7518c2b146399ba3933c004 (#2273)
* Update TensorRT-LLM

---------
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
2024-09-30 13:51:19 +02:00
Kaiyu Xie
40274aac39
Bump version to 0.14.0.dev2024092401 (#2258) 2024-09-26 10:26:16 +08:00
Kaiyu Xie
e153372759
Update TensorRT-LLM (#2253)
* Update TensorRT-LLM

---------

Co-authored-by: Ivan Sorokin <isorokin@nvidia.com>
Co-authored-by: lkm2835 <lkm2835@gmail.com>
2024-09-24 17:27:31 +02:00
Kaiyu Xie
a65dba7aaf
Bump version to 0.14.0.dev2024091700 (#2234) 2024-09-18 08:58:36 +08:00
Kaiyu Xie
fe7dc6ad4e
Update TensorRT-LLM (#2230)
* Update TensorRT-LLM

---------

Co-authored-by: Yi Wang <yi.wang.2005@gmail.com>
Co-authored-by: lkm2835 <lkm2835@gmail.com>
2024-09-17 14:39:09 +08:00
Kaiyu Xie
31ac30e928
Update TensorRT-LLM (#2215)
* Update TensorRT-LLM

---------

Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com>
2024-09-10 18:21:22 +08:00
Kaiyu Xie
78f5c2936b
Update TensorRT-LLM (#2184) 2024-09-03 12:14:23 +02:00
石晓伟
b8fc6633ba
Update TensorRT-LLM (#2156)
Co-authored-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
2024-08-27 18:20:59 +08:00
石晓伟
32ed92e449
Update TensorRT-LLM
Co-authored-by: Rong Zhou <130957722+ReginaZh@users.noreply.github.com>
Co-authored-by: Onur Galoglu <33498883+ogaloglu@users.noreply.github.com>
Co-authored-by: Fabian Joswig <fjosw@users.noreply.github.com>
2024-08-20 18:55:15 +08:00
Kaiyu Xie
74b324f667
Update TensorRT-LLM (#2110) 2024-08-13 22:34:33 +08:00
Kaiyu Xie
be9cd719f7
Update TensorRT-LLM (#2094)
* Update TensorRT-LLM

---------

Co-authored-by: akhoroshev <arthoroshev@gmail.com>
Co-authored-by: Fabian Joswig <fjosw@users.noreply.github.com>
Co-authored-by: Tayef Shah <tayefshah@gmail.com>
Co-authored-by: lfz941 <linfanzai941@gmail.com>
2024-08-07 16:44:43 +08:00
Kaiyu Xie
a681853d38
Update TensorRT-LLM (#2053) 2024-07-30 21:25:01 +08:00
Kaiyu Xie
93293aa46d
open source 315e9f5ccd286e906d4c0d402fefbf2f69a1febe (#2033) 2024-07-26 16:19:24 +08:00
Kaiyu Xie
5fa9436e17
Update TensorRT-LLM (#2016) 2024-07-24 19:50:28 +08:00
dongxuy04
5f26e44ead
open source 3706e7395b9b58994412617992727c8ff2d14c9f (#2010)
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2024-07-24 05:48:06 +08:00
Kaiyu Xie
bca9a33b02
Update TensorRT-LLM (#2008)
* Update TensorRT-LLM

---------

Co-authored-by: Timur Abishev <abishev.timur@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: Saeyoon Oh <saeyoon.oh@furiosa.ai>
Co-authored-by: hattizai <hattizai@gmail.com>
2024-07-23 23:05:09 +08:00
Kaiyu Xie
2d234357c6
Update TensorRT-LLM (#1954)
* Update TensorRT-LLM

---------

Co-authored-by: Altair-Alpha <62340011+Altair-Alpha@users.noreply.github.com>
2024-07-16 15:30:25 +08:00
Kaiyu Xie
a96cccafcf
Update TensorRT-LLM (#1918) 2024-07-09 14:42:22 +08:00
Kaiyu Xie
9dbc5b38ba
Update TensorRT-LLM (#1891)
* Update TensorRT-LLM

---------

Co-authored-by: Marks101 <markus.schnoes@gmx.de>
Co-authored-by: lkm2835 <lkm2835@gmail.com>
2024-07-04 14:37:19 +08:00
Kaiyu Xie
9691e12bce
Update TensorRT-LLM (#1835)
* Update TensorRT-LLM

---------

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2024-06-25 21:10:30 +08:00
石晓伟
2a115dae84
Update TensorRT-LLM (#1793)
Co-authored-by: DreamGenX <x@dreamgen.com>
Co-authored-by: Ace-RR <78812427+Ace-RR@users.noreply.github.com>
Co-authored-by: bprus <39293131+bprus@users.noreply.github.com>
Co-authored-by: janpetrov <janpetrov@icloud.com>
2024-06-18 18:18:23 +08:00
Kaiyu Xie
db4edea1e1
Update TensorRT-LLM (#1763)
* Update TensorRT-LLM

---------

Co-authored-by: Kota Tsuyuzaki <bloodeagle40234@gmail.com>
Co-authored-by: Pzzzzz <hello-cd.plus@hotmail.com>
Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com>
2024-06-11 16:59:02 +08:00
Kaiyu Xie
b777bd6475
Update TensorRT-LLM (#1725)
* Update TensorRT-LLM

---------

Co-authored-by: RunningLeon <mnsheng@yeah.net>
Co-authored-by: Tlntin <TlntinDeng01@Gmail.com>
Co-authored-by: ZHENG, Zhen <zhengzhen.z@qq.com>
Co-authored-by: Pham Van Ngoan <ngoanpham1196@gmail.com>
Co-authored-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Tushar Goel <tushar.goel.ml@gmail.com>
Co-authored-by: Mati <132419219+matichon-vultureprime@users.noreply.github.com>
2024-06-04 20:26:32 +08:00