Commit Graph

916 Commits

Author SHA1 Message Date
shaharmor98
f9cf683e39
add propagation of trust_remote_code to OpenAIServer (#6446)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-30 15:25:41 -04:00
Wanli Jiang
9632dba02e
feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-30 09:20:16 -07:00
NVShreyas
e67f4da9b5
[Perf]: Add residual, norm for nemotron_nas models (#6455)
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
2025-07-30 09:10:38 -07:00
Chang Liu
b4065d8ca6
[TRTLLM-6654][feat] Add support for external multimodal embeddings (#6263)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-07-30 10:00:15 -04:00
pcastonguay
e7ae5e2824
feat: Add support for disaggregation with pp with pytorch backend (#6369)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-07-30 09:42:13 -04:00
tomeras91
a2514d93fc
[nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 (#6447)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-30 07:22:32 -04:00
QI JUN
2fe9cc0889
chore: remove draft_model_engine from init parameter list of PyExecutor (#6325)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-30 03:31:49 -04:00
QI JUN
1f39a11af0
chore: clean code of PyExecutor (#6445)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-30 02:11:43 -04:00
2ez4bz
d6eed1b624
[fix] Switch placement of image placeholder for mistral 3.1 (#6435)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-30 14:10:36 +08:00
Jinyang Yuan
a427f5bece
[fix] Fix wide EP when using DeepEP with online EPLB (#6429)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-30 00:13:18 -04:00
Zheng Duan
c9ed1ab436
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure (#6135)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-30 10:39:40 +08:00
peaceh-nv
5b420ad267
Rename layer to comply with deepseek (#6393)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-30 10:00:48 +08:00
Yechan Kim
d6eb8e2366
fix: support mixture of text & multimodal prompts (#6345)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-30 08:52:31 +08:00
Yunfan Fan
1a8e28d295
[FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
Signed-off-by: fanyunfan <2569548856@qq.com>
Co-authored-by: fanyunfan <2569658856@qq.com>
2025-07-30 07:13:44 +08:00
Yan Chunwei
ad662ddcdd
chore: disallow arbitrary in llm_args.Configs (#6367)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-29 16:16:52 -04:00
Michal Guzek
7efe3cb0cd
[fix] Add detokenization-based stop word logic to LLM API (#5948)
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-07-29 10:16:59 -07:00
Yukun He
0eee2e2850
[5385981] fix: Update the usage of VisionAttention init API. (#6413)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-29 16:41:48 +08:00
QI JUN
13e24ab1cb
chore: remove unused code in PyExecutor (#6351)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-29 16:24:26 +08:00
Frank
d2a04abb95
[fix] Fixes to parameter usage and low latency configuration. (#6343) 2025-07-29 01:36:13 -04:00
nv-guomingz
49044733e1
chore: delete useless gitkeep files. (#6400)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-28 11:38:30 -04:00
QI JUN
4efc6496b7
chore: add _prepare_and_schedule_batch function in PyExecutor (#6365)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-28 05:50:27 -04:00
Yan Chunwei
45d441e60c
[TRTLLM-5061] chore: add status tags to LLM API reference (#5707)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-28 15:57:07 +08:00
Zero Zeng
c9b8b6180f
Add Acceptance Rate calculation to benchmark_serving (#6240)
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-07-28 14:00:58 +08:00
Jinyang Yuan
97f7e12588
[fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency (#6288)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-28 01:37:11 -04:00
Chang Liu
dc757799e1
[nvbugs/5401156][fix] Avoid import all models when import trtllm._common (#6266) 2025-07-27 23:29:21 -04:00
Void
f172face98
DeepEP LL dispatch FP4 (#6296)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-28 11:25:42 +08:00
Yukun He
93a0fd0a23
[TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-28 09:36:26 +08:00
YueWeng
2dd3186727
fix: remove cudaStreamSynchronize when using relaxed acceptance (#5262)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-07-28 09:18:41 +08:00
Ziyi Xiong
d853811190
[https://nvbugs/5402719][fix]: Add cuda graph dummy requests to the spec_resource_manager (#6258)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-26 20:32:39 -04:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
ameynaik-hub
1e5e71aa42
Mtp optimizations round1 (#5689)
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-07-25 13:48:27 -04:00
nv-guomingz
b8d4cb8beb
feat: Support JSON Schema in OpenAI-Compatible API (#6321)
Signed-off-by: noiji <52301388+noiji@users.noreply.github.com>
2025-07-25 12:55:56 -04:00
xiaoqi
a0aecf0476
[feat]: support logit_bias (#5354)
Signed-off-by: xq25478 <xq25478@qq.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-25 09:37:41 +00:00
liji-nv
e07fff4f78
[https://nvbugs/5340941] - fix: Correct custom ops used by Qwen3 Moe … (#6285)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-25 14:49:45 +08:00
Mike Iovine
0f2f11f90b
[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model (#6104)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-24 21:50:11 -04:00
Linda
9a99e6d6d7
fix: integration tests with nanobind (#6326)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-25 09:23:20 +08:00
Shiyu Li
375f74ecb2
[fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. (#6237)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-07-25 08:01:40 +08:00
Frank
f8f5ba65fc
[fix] Update to remove popping of KV cache and other args. (#6310)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-24 15:54:33 -04:00
Stefan Niebler
0df758ec9f
[TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration (#6217)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-24 18:04:41 +02:00
bhsueh_NV
7b6aadc800
[Fix][nvbug 5401163][nvbug 5404726][Qwen3] Fix bug of MoE on tp > 1 with trtllm moe backend (#6235)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-24 21:47:37 +08:00
liji-nv
14d94a3856
feat: Add non UB AR + Residual + Norm + Quant fusion (#6320)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-24 05:51:43 -04:00
Lizhi Zhou
a63a1ac7f9
[TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs (#6085)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-07-24 16:21:01 +08:00
QI JUN
428e34080f
chore: remove unused variables in pyexecutor (#6280)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-24 13:16:15 +08:00
Stefan Niebler
2486eb778e
[TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-23 12:30:50 +02:00
YueWeng
ed62a06eef
[nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue (#6136)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-07-23 14:53:37 +08:00
QI JUN
a8253b942f
chore: remove duplicate should_stop_processing check (#6242)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-23 14:11:23 +08:00
Yechan Kim
83c3ed128b
chore: set default device to cpu on Multimodal models (#5994)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-22 21:45:31 -07:00
Erin
5636c67388
fix: nvbug_5398806 (#6239) 2025-07-23 11:45:11 +08:00
Venky
9538c8d0e5
Add basic Nemo Ckpt Lora Loading in pytorch flow (#6019) 2025-07-22 19:42:45 -07:00
wili
8ecdeee300
[refactor] Simplification of Speculative decoding configs - Part 2 (#5936)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-23 09:20:27 +08:00