TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-23 12:12:39 +08:00

Author	SHA1	Message	Date
Yuxian Qiu	04b112651b	[None][feat] Hang detection for executor loop and worker. (#10480) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2026-01-13 02:34:32 -05:00
Iman Tabrizian	48b09e5a25	[https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg (#10111) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2026-01-12 18:23:26 -05:00
Yuxian Qiu	80f261ea36	[https://nvbugs/5622938][feat] Run sample_async on extra stream. (#10215) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2026-01-09 18:15:18 +08:00
JadoTu	4c498bfe58	[TRTLLM-9676][fix] Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873) Signed-off-by: JadoTu <107457950+JadoTu@users.noreply.github.com>	2026-01-09 14:50:16 +08:00
Mike Iovine	db2614ef10	[https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case (#10385) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2026-01-05 17:20:14 +01:00
Mike Iovine	bedfff4f00	[https://nvbugs/5772521][fix] Fix draft token tree chain crash (#10386) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2026-01-05 17:18:44 +01:00
HuiGao-NV	2f768b76f8	[https://nvbugs/5715568][fix] Force release torch memory when LLM is destroyed (#10314) Signed-off-by: Hui Gao <huig@nvidia.com>	2026-01-05 15:30:18 +08:00
Faraz	8e2065b4d9	[https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model (#10206) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2026-01-05 08:42:53 +08:00
Jaedeok Kim	a4dcc6a711	[TRTLLM-10171][fix] Correct attention handling in ModelConfig and KVCacheManager (#10330) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2026-01-04 06:07:30 -05:00
Izzy Putterman	bdf6953ddc	[None][feat] Eagle: MLA Based Eagle (#9677) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2026-01-02 13:45:07 -05:00
Balaram Buddharaju	4a1b742aa0	[TRTLLM-9467][fix] Fix PP+CP combination with helix parallelism (#10312) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-01 13:42:53 -05:00
Simeng Liu	84d107b2f0	[https://nvbugs/5717993][fix] Add execution_stream across PyExecutor, KVCacheManager, PeftCacheManager to ensure proper CUDA stream synchronization between KV cache transfer operations and model forward kernels. (#10060) Signed-off-by: SimengLiu-nv <simengl@nvidia.com>	2025-12-31 09:22:54 -08:00
Jin Li	34c2fd50a9	[https://nvbugs/5707359][fix] Unwaive OOM case that should be fixed by #9446 (#10334) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-31 10:41:39 +08:00
Yuxian Qiu	1f3afb8e6f	[None][feat] Implement send_object for TorchDist. (#10213) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-31 10:40:52 +08:00
Ziyi Xiong	c59aa8bec5	[TRTLLM-9962][feat] Some optimizations for two-model spec dec (#10208) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-28 12:52:04 +08:00
Olya Kozlova	55f3cda66d	[None][fix] Fix request_id for best_of/n case (#8368) Signed-off-by: Olya Kozlova <okozlova@nvidia.com>	2025-12-26 22:20:24 +01:00
Jin Li	7e4cef9def	[None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 (#9446) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-12-25 10:23:04 -05:00
Ziyi Xiong	d8b5aeb061	[https://nvbugs/5652062][fix] Rewind kv_cache and reset draft tokens (#10160) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-25 09:13:51 -05:00
Yukun He	595daa5089	[TRTLLM-9615][feat] Support synchronization through PP ranks in the distributed tuning system (#10011) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-24 15:03:10 +08:00
Yuxian Qiu	696f754ef4	[None][fix] avoid implicit cudaStreamSynchronize in sample_async. (#10120) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-23 10:15:40 +08:00
Tailing Yuan	648196f8ae	[TRTLLM-9432][feat] Reduce synchronization and recompilation for qwen3-next (#9691) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-12-23 10:14:29 +08:00
William Zhang	a6a88985cf	[TRTLLM-9409][feat] Pass MRoPE tensors for EPD disagg (#9758) * Why? Certain VLMs like the Qwen family need more than just the multimodal embeddings in the language model, and need MRoPE position IDs and deltas. Prior to this commit, only the embeddings could be communicated from the encoder worker to the prefill worker. * What? This commit extends the `DisaggregatedParams` to include the MRoPE information. It also adjusts several pieces of code required to communicate that between E, P and D workers. Closes TRTLLM-9409. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-12-22 06:32:49 -05:00
Balaram Buddharaju	5266475014	[None][feat] Cudagraph updates for helix parallelism (#10141) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-21 15:21:52 -05:00
Yuxian Qiu	e75331480f	[None][fix] fix draft_lengths for CUDA graph capture. (#10004) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-12-20 09:04:48 +08:00
Ziyi Xiong	70b4d282c6	[TRTLLM-7736][feat] Incrementally update the inputs of target and draft models (#9708) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-19 15:11:25 +08:00
Wangjue Yao	9f283f330b	[None][feat] Support Mooncake transfer engine as a cache transceiver backend (#8309) Signed-off-by: wjueyao <wyao123@terpmail.umd.edu> Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-12-19 10:09:51 +08:00
Yuan Tong	f7e245668b	[TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performance (Core fix has been merged via #9353) (#9655) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-12-17 17:56:01 +08:00
Aurelien Chartier	7175d89b48	[None][fix] Fix iteration stats for spec-dec (#9855) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-12-16 14:11:38 -08:00
ruodil	07f307d131	[https://nvbugs/5652552][fix] cherry-pick add printing for llm args (#9206) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-12-16 13:33:20 -05:00
Yechan Kim	8ba8699f66	[TRTLLM-8310][feat] Add Qwen3-VL-MoE (#9689) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-12-15 20:05:20 -08:00
Michal Guzek	e6187d8109	[https://nvbugs/5708810][fix] Fix TRTLLMSampler (#9710) Signed-off-by: Michal Guzek <mguzek@nvidia.com>	2025-12-15 23:26:52 +01:00
Yukun He	9e7182b603	[TRTLLM-9615][feat] Implement a distributed tuning system (#9621) Four distinct strategies are implemented to accommodate different distributed tuning scenarios, including BROADCAST, INDEPENDENT, MERGE, PARALLEL. * Distributed tuning is disabled by default, with the INDEPENDENT strategy as the fallback. This conservative approach prevents unexpected behavior in standard use cases. * Only operations with significant tuning time overhead have been assigned the PARALLEL strategy, which allows the same tensor parallelism (TP) rank to tune tactics concurrently across different ranks. This targeted approach balances performance gains with stability. * Operations with nested tuning structures, such as NVFP4GemmUnifiedRunner, currently support only the INDEPENDENT strategy. This restriction exists because the synchronization mechanism is optimized only for leaf operations and doesn't yet handle nested hierarchies. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-12-15 21:08:53 +08:00
Ziyi Xiong	f2aee0db03	[TRTLLM-9854][feat] Optimize the host overhead of _sample_async (#9935) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-12-15 13:28:54 +08:00
Fanrong Li	8f144d9282	[TRTLLM-9416][feat] Skip DS-v3.2 indexer MQA and Top-K for short sequences. (#9524) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-12-15 12:42:25 +08:00
Mike Iovine	383b13e0e5	[None][feat] Implement sampling on 1-model EAGLE3 (#9885) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-12-13 07:38:22 -08:00
jellysnack	079ef8ae77	[None][feat] Graceful Error Handling for Guided Decoder (#9078) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com> Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-12-13 19:57:59 +08:00
Balaram Buddharaju	6a6e41f802	[TRTLLM-9468][chore] Update disagg benchmarking scripts to support context parallelism (#9720) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-12 22:29:41 -08:00
bhsueh_NV	e49c70f6df	[None][feat] Support Mistral Large3 LLM part (#9820) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-12-13 11:44:27 +08:00
Balaram Buddharaju	af315d8ef1	[TRTLLM-5972][chore] Load balance decode token KV cache with helix parallelism (#9757) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-12-12 22:29:05 +08:00
jthomson04	4f6d4da035	[None][perf] Fix TPOT when `min_tokens` set (#9862) Signed-off-by: jthomson04 <jwillthomson19@gmail.com>	2025-12-11 13:55:31 -08:00
Erin	89dabf5aa1	[TRTLLM-9736][feat] AsyncLLM and verl integ (#9353) Signed-off-by: Liwei Ma <liweim@nvidia.com> Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Co-authored-by: Liwei Ma <liweim@nvidia.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-12-11 09:33:25 -08:00
dhansen-nvidia	2d33ae94d5	[https://nvbugs/5508301][feat] Move D->H copies to a worker thread whe… (#8463) Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com> Signed-off-by: dhansen-nvidia <218031328+dhansen-nvidia@users.noreply.github.com> Co-authored-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>	2025-12-09 18:51:31 -05:00
Mike Iovine	07c76a5fac	[None][feat] Make 2-model spec dec use the 1-model kernels (Hopper) (#8810) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-12-09 11:06:31 -05:00
Stefan Niebler	d600b9f851	[TRTLLM-6756][feat] Update BeamSearch for TorchSampler (#9660) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-12-09 10:44:01 +01:00
Robin Kobus	76f49c903b	[None][fix] Additional model outputs for pipeline parallelism (#9794) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-12-09 10:41:22 +01:00
Jiagan Cheng	4a3a66b124	[https://nvbugs/5677746][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659) Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>	2025-12-08 18:43:52 -08:00
Thor Johnsen	f9380581c5	[https://nvbugs/5508267][fix] Proper handling of inactive canceled requests (#9280) Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>	2025-12-08 13:11:44 -08:00
Ludwig Schneider	41ce14ab04	[None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce (#9314) Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>	2025-12-07 09:43:26 -08:00
Jonas Li	2645a78f34	[TRTLLM-9660][feat] Convert cuteDSL GEMM to opt-in feature (#9682) Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-12-06 02:24:51 -08:00
Robin Kobus	eb0b426e5d	[None][refactor] Improve request processing function in sampler (#9671) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-12-05 16:41:49 +01:00

603 Commits