TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Stefan Niebler	0df758ec9f	[TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration (#6217) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-24 18:04:41 +02:00
Lizhi Zhou	a63a1ac7f9	[TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs (#6085) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2025-07-24 16:21:01 +08:00
QI JUN	428e34080f	chore: remove unused variables in pyexecutor (#6280) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-24 13:16:15 +08:00
Stefan Niebler	2486eb778e	[TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-23 12:30:50 +02:00
YueWeng	ed62a06eef	[nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue (#6136) Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-07-23 14:53:37 +08:00
QI JUN	a8253b942f	chore: remove duplicate should_stop_processing check (#6242) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-23 14:11:23 +08:00
Venky	9538c8d0e5	Add basic Nemo Ckpt Lora Loading in pytorch flow (#6019)	2025-07-22 19:42:45 -07:00
wili	8ecdeee300	[refactor] Simplification of Speculative decoding configs - Part 2 (#5936) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-07-23 09:20:27 +08:00
Fanrong Li	c66941036f	fix: fix index out of bounds error in spec decoding (#5954)	2025-07-22 12:48:00 +08:00
Shunkangz	ee45e0c63f	feat: Refactor the fetching request logic (#5786) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-07-22 09:16:28 +08:00
Chang Liu	7381f1dba7	[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models (#5444) Only supports qwen in this PR	2025-07-21 16:11:58 -07:00
liji-nv	3e0fb60e50	[TRTLLM-4279] feat: Multistream initial support for torch compile flow (#5847) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-21 19:10:22 +08:00
amitz-nv	98428f330e	[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-07-20 08:00:14 +03:00
Ziyi Xiong	66030ef815	[TRTLLM-6452][feat]: Two-model engine KV cache reuse support (#6133) Signed-off-by: ziyixiong-nv <fxiong@nvidia.com> Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-07-19 13:17:15 +08:00
Netanel Haber	d9a3530048	[nvbug/5393888][nvbug/5393042] Always use `py_seq_slot` (#6147) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-07-18 22:45:16 +03:00
Stefan Niebler	6d7874a467	[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-19 01:40:46 +08:00
Stefan Niebler	fd6ce7f20e	[ci] Speedup beam search unit tests with fixtures for LLM (#5843) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-18 22:54:49 +08:00
Erin	9522cde464	fix: NVBug 5385576 py_batch_idx issue (#6153) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-07-18 22:36:43 +08:00
Robin Kobus	ec2b953e7e	refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-18 12:12:08 +02:00
qixiang-99	2c90203c36	Refactor KVCacheManager: Simplify token availability calculation and … (#6134) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-17 13:33:33 -07:00
Iman Tabrizian	10dbf4f0f4	[fix] Remove duplicated KVCache transmission check (#6022) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-17 12:02:19 -04:00
Ziyi Xiong	58d22a72f1	[TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter (#6007) Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>	2025-07-17 21:15:01 +08:00
Enwei Zhu	21efb50068	[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler (#6000) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-17 17:46:10 +08:00
Chuang Zhu	44c70c88f9	chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-07-17 17:42:07 +08:00
Iman Tabrizian	d4d21a106e	[fix] Release slots with spec decode + disagg (#5975) (#6032) Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-17 12:58:18 +08:00
qixiang-99	e09e409dfb	Fix: Enhance ModelConfig for kv cache size calculations (#5868) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-16 14:41:31 -07:00
Mike Iovine	fa34cb7234	[refactor] Clean up drafter/resource manager creation logic (#5805) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-07-16 12:45:46 -07:00
shaharmor98	e0836f9ca9	[TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats (#5372) Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-07-17 00:50:30 +08:00
Fanrong Li	7a1af1c738	Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/5947 (#5989) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-07-16 01:33:12 +09:00
Jaedeok Kim	ab1c54709d	fix: adjust window sizes of VSWA at torch backend (#5880) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2025-07-15 17:41:54 +08:00
nv-guomingz	4e4d18826f	chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#6003) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-15 15:50:03 +09:00
ixlmar	f225f5cd2e	[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs (#5964) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-07-15 06:49:42 +08:00
Robin Kobus	5a61d64b5b	[nvbugs/5345391] fix: chunked prefill + overlap scheduling (#5761) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
Iman Tabrizian	c8874a7f94	[nvbug/5337601][fix] Fix disagg + speculative decoding (#5558) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Co-authored-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
WeiHaocheng	4d8920982a	fix: set allreduce strategy to model config (#5955) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-07-14 17:59:11 +09:00
dominicshanshan	c9e7f831dc	Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default (#5480) Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-07-14 16:42:23 +08:00
QI JUN	ce39409530	fix cancel request logic (#5800) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-14 10:23:20 +08:00
Thor Johnsen	041f1fa513	[TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora (#5885) Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>	2025-07-11 16:20:41 -07:00
wili	2e3cf42e03	[refactor] Simplification of Speculative decoding configs (#5639) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-07-10 11:37:30 -04:00
Yan Chunwei	07f6da763d	[TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner (#5876) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-10 11:31:35 +08:00
Wanli Jiang	3f7cedec7c	Update transformers to 4.53.0 (#5747) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com> Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-07-09 09:32:24 -07:00
DylanChen-NV	74dca0aa7b	[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-09 23:16:42 +08:00
tomeras91	5aa958a11a	[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-07-09 11:30:15 +03:00
Omer Ullman Argov	d6d2ab2c99	[fix] Catch inference failures in `trtllm-bench` (#5841) Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>	2025-07-09 03:53:03 +03:00
Kaiyu Xie	bb5b16fcb9	feat: Return context response immediately when stream_interval > 1 (#5836) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-07-09 00:19:57 +09:00
Raayan Dhar	e3268a4221	[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg (#5732) Signed-off-by: raayandhar <rdhar@nvidia.com>	2025-07-08 09:39:58 -04:00
xiweny	eaf8bec88b	fix: Disaggregate serving with attention DP (#4993) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-07-08 16:15:03 +08:00
nv-guomingz	0be41b6524	Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" (#5818)	2025-07-08 13:15:30 +09:00
Yechan Kim	5bc3a15f10	feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL (#5522) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-07-07 18:03:12 -07:00
nv-guomingz	5a8173c121	chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#5795) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-08 08:52:36 +08:00

293 Commits