TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Netanel Haber	d9a3530048	[nvbug/5393888][nvbug/5393042] Always use `py_seq_slot` (#6147) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-07-18 22:45:16 +03:00
Stefan Niebler	6d7874a467	[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-19 01:40:46 +08:00
Stefan Niebler	fd6ce7f20e	[ci] Speedup beam search unit tests with fixtures for LLM (#5843) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-18 22:54:49 +08:00
Erin	9522cde464	fix: NVBug 5385576 py_batch_idx issue (#6153) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-07-18 22:36:43 +08:00
Robin Kobus	ec2b953e7e	refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-18 12:12:08 +02:00
qixiang-99	2c90203c36	Refactor KVCacheManager: Simplify token availability calculation and … (#6134) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-17 13:33:33 -07:00
Iman Tabrizian	10dbf4f0f4	[fix] Remove duplicated KVCache transmission check (#6022) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-17 12:02:19 -04:00
Ziyi Xiong	58d22a72f1	[TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter (#6007) Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>	2025-07-17 21:15:01 +08:00
Enwei Zhu	21efb50068	[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler (#6000) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-17 17:46:10 +08:00
Chuang Zhu	44c70c88f9	chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-07-17 17:42:07 +08:00
Iman Tabrizian	d4d21a106e	[fix] Release slots with spec decode + disagg (#5975) (#6032) Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-17 12:58:18 +08:00
qixiang-99	e09e409dfb	Fix: Enhance ModelConfig for kv cache size calculations (#5868) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-16 14:41:31 -07:00
Mike Iovine	fa34cb7234	[refactor] Clean up drafter/resource manager creation logic (#5805) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-07-16 12:45:46 -07:00
shaharmor98	e0836f9ca9	[TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats (#5372) Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-07-17 00:50:30 +08:00
Fanrong Li	7a1af1c738	Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/5947 (#5989) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-07-16 01:33:12 +09:00
Jaedeok Kim	ab1c54709d	fix: adjust window sizes of VSWA at torch backend (#5880) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2025-07-15 17:41:54 +08:00
nv-guomingz	4e4d18826f	chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#6003) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-15 15:50:03 +09:00
ixlmar	f225f5cd2e	[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs (#5964) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-07-15 06:49:42 +08:00
Robin Kobus	5a61d64b5b	[nvbugs/5345391] fix: chunked prefill + overlap scheduling (#5761) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
Iman Tabrizian	c8874a7f94	[nvbug/5337601][fix] Fix disagg + speculative decoding (#5558) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Co-authored-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
WeiHaocheng	4d8920982a	fix: set allreduce strategy to model config (#5955) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-07-14 17:59:11 +09:00
dominicshanshan	c9e7f831dc	Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default (#5480) Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-07-14 16:42:23 +08:00
QI JUN	ce39409530	fix cancel request logic (#5800) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-14 10:23:20 +08:00
Thor Johnsen	041f1fa513	[TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora (#5885) Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>	2025-07-11 16:20:41 -07:00
wili	2e3cf42e03	[refactor] Simplification of Speculative decoding configs (#5639) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-07-10 11:37:30 -04:00
Yan Chunwei	07f6da763d	[TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner (#5876) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-10 11:31:35 +08:00
Wanli Jiang	3f7cedec7c	Update transformers to 4.53.0 (#5747) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com> Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-07-09 09:32:24 -07:00
DylanChen-NV	74dca0aa7b	[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-09 23:16:42 +08:00
tomeras91	5aa958a11a	[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-07-09 11:30:15 +03:00
Omer Ullman Argov	d6d2ab2c99	[fix] Catch inference failures in `trtllm-bench` (#5841) Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>	2025-07-09 03:53:03 +03:00
Kaiyu Xie	bb5b16fcb9	feat: Return context response immediately when stream_interval > 1 (#5836) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-07-09 00:19:57 +09:00
Raayan Dhar	e3268a4221	[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg (#5732) Signed-off-by: raayandhar <rdhar@nvidia.com>	2025-07-08 09:39:58 -04:00
xiweny	eaf8bec88b	fix: Disaggregate serving with attention DP (#4993) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-07-08 16:15:03 +08:00
nv-guomingz	0be41b6524	Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" (#5818)	2025-07-08 13:15:30 +09:00
Yechan Kim	5bc3a15f10	feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL (#5522) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-07-07 18:03:12 -07:00
nv-guomingz	5a8173c121	chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#5795) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-08 08:52:36 +08:00
Robin Kobus	30a19fcf7c	[TRTLLM-6291] feat: Add user-provided speculative decoding support (#5204) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-07 16:30:43 +02:00
Daniel Cámpora	1260e2f33f	feat: Optimize TRTLLM Sampler perf single beam single step (#5550) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-07-07 15:44:47 +02:00
Yan Chunwei	dfce61f4b9	[TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler (#5751) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-07 17:05:14 +08:00
Robin Kobus	ae27261094	refactor: decoding inputs (#5679) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-06 08:21:02 +02:00
Stefan Niebler	d1112aac37	[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-05 01:35:13 +09:00
Shunkangz	a79d8c9f5e	Fix none response in PD (#5422) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-07-04 14:25:10 +08:00
Netanel Haber	aa72d39b72	MTP and derivatives: Align sample state with trtllm sampler sample state (#5675) This PR moves MTPSampler and derivatives to use the universal seq_slot indexing for sampling. This is the last piece of the puzzle: After this, all of the samplers will use this format. See: `6ee94c7` Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-07-03 19:55:48 +02:00
Rashid Kaleem	2b0c87e613	[ModelLoad] Concurrent load model (#5291) Signed-off-by: Rashid K <rkaleem@nvidia.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>	2025-07-03 22:18:04 +08:00
tomeras91	7dbecf7272	[TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H (#5646) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-07-03 11:07:51 +03:00
Jhao-Ting Chen	77082cde38	[https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op (#5146) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-07-02 04:54:43 -04:00
qixiang-99	ca7b6ec8d8	Feat/pytorch vswa kvcachemanager (#5151) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-02 15:58:00 +08:00
Aurelien Chartier	fa95e402a5	feat: add LLmArgs option to force using dynamic quantization (#5346) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-07-01 12:16:09 -07:00
liji-nv	c345f5876c	[feat] Support torch compile for attention dp (#5086) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-01 13:48:52 -04:00
Wanli Jiang	3789ba1d37	feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-07-01 20:12:55 +08:00

279 Commits