TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Zero Zeng	48768fd720	fix: Fix missing key (#6471 ) Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>	2025-08-01 14:25:58 +08:00
Robin Kobus	d3c14682f0	refactor: Remove unused buffers and bindings from sampler (#6484 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-01 00:43:03 -04:00
Jaedeok Kim	fbee279909	fix: remove duplicate layer multiplication in KV cache size calculation (#6481 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2025-07-31 22:34:34 -04:00
Zongfei Jing	7bb0a78631	Deepseek R1 FP8 Support on Blackwell (#6486 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-08-01 10:26:28 +08:00
Venky	8c165fd27a	[TRTLLM-6611][feat] Add warnings and stricter validation to LoraManager adapter loading (#6453 ) Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>	2025-07-31 22:22:51 -04:00
Yukun He	00059de380	chore: Improve the AutoTuner log information. (#6368 ) * Change the fallback alert from DEBUG to WARNING level and only do it once. * Add debug information for profiling cache right after the warmup phase. * Change the level of exception message during tactic profiling from ERROR to WARNING level. All exception details are pushed to the DEBUG level. * Other trivial refinements and cleanups. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-08-01 09:19:52 +08:00
brb-nv	2eca0d5925	fix: Fix poor generation with FP8 Gemma3 1B checkpoint (#6499 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-07-31 17:18:23 -07:00
Simeng Liu	8cf3faa26a	[feat] Auto-enable ngram with concurrency <= 32. (#6232 ) Signed-off-by: Simeng Liu <simengl@nvidia.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> Signed-off-by: Mike Iovine <mike.iovine7@gmail.com> Co-authored-by: Mike Iovine <miovine@nvidia.com> Co-authored-by: Mike Iovine <mike.iovine7@gmail.com>	2025-07-31 18:45:51 -04:00
Ziyi Xiong	8062e0fe7c	[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically (#6363 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-07-31 15:31:39 -04:00
shaharmor98	0c42f54a39	Bugfix/fix nemotron nas lora support (#6380 ) Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-07-31 13:39:35 -04:00
amitz-nv	1ee7a08d2b	[5830][feat] Improve LoRA cache memory control (#6220 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-07-31 09:26:38 +03:00
dongjiyingdjy	17e0d0fb1a	fix: fix illegal memory access (#6437 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-07-31 10:01:34 +08:00
Enwei Zhu	4b299cb77e	feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-31 09:53:52 +08:00
Vadim Gimpelson	25cd4f215e	[PERF] Move calculation Qwen2-VL's rotary_cos_sin to LLM worker process (#6004 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>	2025-07-31 09:35:24 +09:00
shaharmor98	f9cf683e39	add propagation of trust_remote_code to OpenAIServer (#6446 ) Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-07-30 15:25:41 -04:00
Wanli Jiang	9632dba02e	feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm (#6353 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-07-30 09:20:16 -07:00
NVShreyas	e67f4da9b5	[Perf]: Add residual, norm for nemotron_nas models (#6455 ) Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>	2025-07-30 09:10:38 -07:00
Chang Liu	b4065d8ca6	[TRTLLM-6654][feat] Add support for external multimodal embeddings (#6263 ) Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>	2025-07-30 10:00:15 -04:00
pcastonguay	e7ae5e2824	feat: Add support for disaggregation with pp with pytorch backend (#6369 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: raayandhar <rdhar@nvidia.com> Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: raayandhar <rdhar@nvidia.com> Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-07-30 09:42:13 -04:00
tomeras91	a2514d93fc	[nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 (#6447 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-07-30 07:22:32 -04:00
QI JUN	2fe9cc0889	chore: remove draft_model_engine from init parameter list of PyExecutor (#6325 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-30 03:31:49 -04:00
QI JUN	1f39a11af0	chore: clean code of PyExecutor (#6445 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-30 02:11:43 -04:00
2ez4bz	d6eed1b624	[fix] Switch placement of image placeholder for mistral 3.1 (#6435 ) Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-07-30 14:10:36 +08:00
Jinyang Yuan	a427f5bece	[fix] Fix wide EP when using DeepEP with online EPLB (#6429 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-07-30 00:13:18 -04:00
Zheng Duan	c9ed1ab436	[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure (#6135 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-07-30 10:39:40 +08:00
peaceh-nv	5b420ad267	Rename layer to comply with deepseek (#6393 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-07-30 10:00:48 +08:00
Yechan Kim	d6eb8e2366	fix: support mixture of text & multimodal prompts (#6345 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-07-30 08:52:31 +08:00
Yunfan Fan	1a8e28d295	[FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344 ) Signed-off-by: fanyunfan <2569548856@qq.com> Co-authored-by: fanyunfan <2569658856@qq.com>	2025-07-30 07:13:44 +08:00
Yan Chunwei	ad662ddcdd	chore: disallow arbitrary in llm_args.Configs (#6367 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-29 16:16:52 -04:00
Michal Guzek	7efe3cb0cd	[fix] Add detokenization-based stop word logic to LLM API (#5948 ) Signed-off-by: moraxu <mguzek@nvidia.com> Signed-off-by: Michal Guzek <mguzek@nvidia.com>	2025-07-29 10:16:59 -07:00
Yukun He	0eee2e2850	[5385981] fix: Update the usage of VisionAttention init API. (#6413 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-07-29 16:41:48 +08:00
QI JUN	13e24ab1cb	chore: remove unused code in PyExecutor (#6351 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-29 16:24:26 +08:00
Frank	d2a04abb95	[fix] Fixes to parameter usage and low latency configuration. (#6343 )	2025-07-29 01:36:13 -04:00
nv-guomingz	49044733e1	chore: delete useless gitkeep files. (#6400 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-28 11:38:30 -04:00
QI JUN	4efc6496b7	chore: add _prepare_and_schedule_batch function in PyExecutor (#6365 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-28 05:50:27 -04:00
Yan Chunwei	45d441e60c	[TRTLLM-5061] chore: add status tags to LLM API reference (#5707 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-28 15:57:07 +08:00
Zero Zeng	c9b8b6180f	Add Acceptance Rate calculation to benchmark_serving (#6240 ) Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>	2025-07-28 14:00:58 +08:00
Jinyang Yuan	97f7e12588	[fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency (#6288 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-07-28 01:37:11 -04:00
Chang Liu	dc757799e1	[nvbugs/5401156][fix] Avoid import all models when import trtllm._common (#6266 )	2025-07-27 23:29:21 -04:00
Void	f172face98	DeepEP LL dispatch FP4 (#6296 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-07-28 11:25:42 +08:00
Yukun He	93a0fd0a23	[TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. (#6205 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-07-28 09:36:26 +08:00
YueWeng	2dd3186727	fix: remove cudaStreamSynchronize when using relaxed acceptance (#5262 ) Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-07-28 09:18:41 +08:00
Ziyi Xiong	d853811190	[https://nvbugs/5402719 ][fix]: Add cuda graph dummy requests to the spec_resource_manager (#6258 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-07-26 20:32:39 -04:00
Michal Guzek	08d57123f9	[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974 ) Signed-off-by: moraxu <mguzek@nvidia.com>	2025-07-25 18:10:40 -04:00
ameynaik-hub	1e5e71aa42	Mtp optimizations round1 (#5689 ) Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-07-25 13:48:27 -04:00
nv-guomingz	b8d4cb8beb	feat: Support JSON Schema in OpenAI-Compatible API (#6321 ) Signed-off-by: noiji <52301388+noiji@users.noreply.github.com>	2025-07-25 12:55:56 -04:00
xiaoqi	a0aecf0476	[feat]: support logit_bias (#5354 ) Signed-off-by: xq25478 <xq25478@qq.com> Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com> Signed-off-by: hexiao.xq <hexiao.xq@antgroup.com> Co-authored-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com> Co-authored-by: hexiao.xq <hexiao.xq@antgroup.com> Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-07-25 09:37:41 +00:00
liji-nv	e07fff4f78	[https://nvbugs/5340941 ] - fix: Correct custom ops used by Qwen3 Moe … (#6285 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-25 14:49:45 +08:00
Mike Iovine	0f2f11f90b	[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model (#6104 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-07-24 21:50:11 -04:00
Linda	9a99e6d6d7	fix: integration tests with nanobind (#6326 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-25 09:23:20 +08:00

930 Commits