Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory ( #5034 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Yechan Kim
ee6ab5be96
chore: add EXAONE4 accuracy test ( #6397 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-04 10:14:16 +08:00
Jinyang Yuan
df90202b51
[fix] Fix DeepSeek w4a8 weight loading ( #6498 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-04 10:12:06 +08:00
Chuang Zhu
542f552d0b
use cudaSetDevice to create context, fix nvbug 5394497 ( #6403 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-03 13:32:55 -04:00
Shunkangz
67a3fd858b
[None][feat] Add support of scheduling attention dp request ( #6246 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-01 20:38:01 -04:00
Richard Huo
31802de0b0
[None][fix] Serialize the window_size in the kv event ( #6526 )
...
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
2025-08-01 15:25:18 -07:00
Lucas Liebenwein
5247df6ae2
[AutoDeploy] merge feat/ad-2025-07-22 ( #6520 )
...
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Gal Agam <ghubaraagam@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: haoguo <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Gal Agam <ghubaraagam@cw-dfw-h100-004-328-012.cm.cluster>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-08-01 08:51:08 -07:00
brb-nv
7447d6ed85
[TRTLLM-6657][feat] Add LoRA support for Gemma3 ( #6371 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-01 09:19:54 -04:00
liji-nv
1daa8c3232
[ https://nvbugs/5340941 ][ https://nvbugs/5375785 ] - fix: Wrap attentio… ( #6355 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-01 07:38:06 -04:00
Yukun He
90856bf97d
[ https://nvbugs/5419069 ][fix] Fix the mismatched layer name components. ( #6417 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-01 16:32:39 +08:00
Zero Zeng
48768fd720
fix: Fix missing key ( #6471 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-01 14:25:58 +08:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler ( #6484 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
Jaedeok Kim
fbee279909
fix: remove duplicate layer multiplication in KV cache size calculation ( #6481 )
...
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-31 22:34:34 -04:00
Zongfei Jing
7bb0a78631
Deepseek R1 FP8 Support on Blackwell ( #6486 )
...
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-01 10:26:28 +08:00
Venky
8c165fd27a
[TRTLLM-6611][feat] Add warnings and stricter validation to LoraManager adapter loading ( #6453 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-07-31 22:22:51 -04:00
Yukun He
00059de380
chore: Improve the AutoTuner log information. ( #6368 )
...
* Change the fallback alert from DEBUG to WARNING level and only do it once.
* Add debug information for profiling cache right after the warmup phase.
* Change the level of exception message during tactic profiling from ERROR to WARNING level. All exception details are pushed to the DEBUG level.
* Other trivial refinements and cleanups.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-01 09:19:52 +08:00
brb-nv
2eca0d5925
fix: Fix poor generation with FP8 Gemma3 1B checkpoint ( #6499 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-31 17:18:23 -07:00
Simeng Liu
8cf3faa26a
[feat] Auto-enable ngram with concurrency <= 32. ( #6232 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <mike.iovine7@gmail.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
Co-authored-by: Mike Iovine <mike.iovine7@gmail.com>
2025-07-31 18:45:51 -04:00
Ziyi Xiong
8062e0fe7c
[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically ( #6363 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-31 15:31:39 -04:00
shaharmor98
0c42f54a39
Bugfix/fix nemotron nas lora support ( #6380 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-31 13:39:35 -04:00
amitz-nv
1ee7a08d2b
[5830][feat] Improve LoRA cache memory control ( #6220 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-31 09:26:38 +03:00
dongjiyingdjy
17e0d0fb1a
fix: fix illegal memory access ( #6437 )
...
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-07-31 10:01:34 +08:00
Enwei Zhu
4b299cb77e
feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 ( #6408 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-31 09:53:52 +08:00
Vadim Gimpelson
25cd4f215e
[PERF] Move calculation Qwen2-VL's rotary_cos_sin to LLM worker process ( #6004 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-07-31 09:35:24 +09:00
shaharmor98
f9cf683e39
add propagation of trust_remote_code to OpenAIServer ( #6446 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-30 15:25:41 -04:00
Wanli Jiang
9632dba02e
feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm ( #6353 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-30 09:20:16 -07:00
NVShreyas
e67f4da9b5
[Perf]: Add residual, norm for nemotron_nas models ( #6455 )
...
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-07-30 09:10:38 -07:00
Chang Liu
b4065d8ca6
[TRTLLM-6654][feat] Add support for external multimodal embeddings ( #6263 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-07-30 10:00:15 -04:00
pcastonguay
e7ae5e2824
feat: Add support for disaggregation with pp with pytorch backend ( #6369 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-07-30 09:42:13 -04:00
tomeras91
a2514d93fc
[nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 ( #6447 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-30 07:22:32 -04:00
QI JUN
2fe9cc0889
chore: remove draft_model_engine from init parameter list of PyExecutor ( #6325 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-30 03:31:49 -04:00
QI JUN
1f39a11af0
chore: clean code of PyExecutor ( #6445 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-30 02:11:43 -04:00
2ez4bz
d6eed1b624
[fix] Switch placement of image placeholder for mistral 3.1 ( #6435 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-30 14:10:36 +08:00
Jinyang Yuan
a427f5bece
[fix] Fix wide EP when using DeepEP with online EPLB ( #6429 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-30 00:13:18 -04:00
Zheng Duan
c9ed1ab436
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure ( #6135 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-30 10:39:40 +08:00
peaceh-nv
5b420ad267
Rename layer to comply with deepseek ( #6393 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-30 10:00:48 +08:00
Yechan Kim
d6eb8e2366
fix: support mixture of text & multimodal prompts ( #6345 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-30 08:52:31 +08:00
Yunfan Fan
1a8e28d295
[FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine ( #6344 )
...
Signed-off-by: fanyunfan <2569548856@qq.com>
Co-authored-by: fanyunfan <2569658856@qq.com>
2025-07-30 07:13:44 +08:00
Yan Chunwei
ad662ddcdd
chore: disallow arbitrary in llm_args.Configs ( #6367 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-29 16:16:52 -04:00
Michal Guzek
7efe3cb0cd
[fix] Add detokenization-based stop word logic to LLM API ( #5948 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-07-29 10:16:59 -07:00
Yukun He
0eee2e2850
[5385981] fix: Update the usage of VisionAttention init API. ( #6413 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-29 16:41:48 +08:00
QI JUN
13e24ab1cb
chore: remove unused code in PyExecutor ( #6351 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-29 16:24:26 +08:00
Frank
d2a04abb95
[fix] Fixes to parameter usage and low latency configuration. ( #6343 )
2025-07-29 01:36:13 -04:00
nv-guomingz
49044733e1
chore: delete useless gitkeep files. ( #6400 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-28 11:38:30 -04:00
QI JUN
4efc6496b7
chore: add _prepare_and_schedule_batch function in PyExecutor ( #6365 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-28 05:50:27 -04:00
Yan Chunwei
45d441e60c
[TRTLLM-5061] chore: add status tags to LLM API reference ( #5707 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-28 15:57:07 +08:00
Zero Zeng
c9b8b6180f
Add Acceptance Rate calculation to benchmark_serving ( #6240 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-07-28 14:00:58 +08:00
Jinyang Yuan
97f7e12588
[fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency ( #6288 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-28 01:37:11 -04:00
Chang Liu
dc757799e1
[nvbugs/5401156][fix] Avoid import all models when import trtllm._common ( #6266 )
2025-07-27 23:29:21 -04:00
Void
f172face98
DeepEP LL dispatch FP4 ( #6296 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-28 11:25:42 +08:00