TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
brb-nv	2bd09ed2d4	fix: Skip rope scaling for local layers in Gemma3 VLM (#5857 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-07-09 10:10:33 +08:00
Omer Ullman Argov	d6d2ab2c99	[fix] Catch inference failures in `trtllm-bench` (#5841 ) Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>	2025-07-09 03:53:03 +03:00
Iman Tabrizian	c508b994b6	Fix lost requests for disaggregated serving (#5815 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-09 08:42:45 +09:00
Kaiyu Xie	bb5b16fcb9	feat: Return context response immediately when stream_interval > 1 (#5836 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-07-09 00:19:57 +09:00
Raayan Dhar	e3268a4221	[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg (#5732 ) Signed-off-by: raayandhar <rdhar@nvidia.com>	2025-07-08 09:39:58 -04:00
Yukun He	e104f8bbb5	[5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-07-08 19:51:05 +08:00
Yegor	b01d1c28f7	[feat] Detokenize option in /v1/completions request (#5382 ) Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com> Signed-off-by: Yegor Yershov <yegor6741@gmail.com>	2025-07-08 19:36:04 +08:00
xiweny	eaf8bec88b	fix: Disaggregate serving with attention DP (#4993 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-07-08 16:15:03 +08:00
Yiqing Yan	5203a0f6df	chore: bump version to 1.0.0rc3 (#5819 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-07-08 16:04:40 +09:00
Zhenhuan Chen	dee6644ed9	feat(scaffolding): add streaming scaffolding_llm.generate_async support (#5345 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-07-08 15:08:40 +09:00
nv-guomingz	0be41b6524	Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" (#5818 )	2025-07-08 13:15:30 +09:00
Yechan Kim	5bc3a15f10	feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL (#5522 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-07-07 18:03:12 -07:00
nv-guomingz	5a8173c121	chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#5795 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-08 08:52:36 +08:00
Robin Kobus	30a19fcf7c	[TRTLLM-6291] feat: Add user-provided speculative decoding support (#5204 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-07 16:30:43 +02:00
Tailing Yuan	85b4a6808d	Refactor: move DeepEP from Docker images to wheel building (#5534 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-07-07 22:57:03 +09:00
Daniel Cámpora	1260e2f33f	feat: Optimize TRTLLM Sampler perf single beam single step (#5550 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-07-07 15:44:47 +02:00
DylanChen-NV	5ca2b9bb15	[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615 ) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-07 18:04:57 +08:00
Yan Chunwei	dfce61f4b9	[TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler (#5751 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-07 17:05:14 +08:00
Zheng Duan	de10774c2e	chore: log stack trace on error in openai server (#5749 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-07-07 14:54:36 +08:00
Daniel Stokes	ec6c7dff1a	feat: Add support for MXFP8xMXFP4 in pytorch (#5535 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-07-06 15:32:06 -07:00
Robin Kobus	ae27261094	refactor: decoding inputs (#5679 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-06 08:21:02 +02:00
Xianjie Qiao	b1976c2add	Add wide-ep benchmarking scripts (#5760 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-05 19:29:39 +08:00
Xianjie Qiao	089fd55eda	Add dummy all_reduce for kernel breakdown (#5745 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-07-05 13:08:58 +09:00
Frank	d61893dc77	[fix] Update to properly set cuda graphs in trtllm-bench overrides. (#5634 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-07-05 05:19:16 +09:00
Stefan Niebler	d1112aac37	[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-05 01:35:13 +09:00
HuiGao-NV	3ed3bbcb5d	Fix: pass allreduce strategy to pytorchConfig (#5746 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-07-04 21:32:13 +09:00
Shunkangz	32339d1b20	Raise shut down error for each request (#4936 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-07-04 18:58:24 +09:00
Tailing Yuan	e134a52e07	Perf: reduce DeepEPLowLatency memory and time (#5712 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-07-04 14:46:28 +08:00
Shunkangz	a79d8c9f5e	Fix none response in PD (#5422 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-07-04 14:25:10 +08:00
brb-nv	cdaa6abce7	fix: Investigate Gemma3 1B decoder output discrepancy (#5564 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-07-04 13:14:13 +08:00
Frank	819ae903de	[https://nvbugspro.nvidia.com/bug/5351333 ][fix] Update to chunking calculation. (#5625 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-07-04 13:14:13 +08:00
Clay	7a319524da	feat: support more parameters in openai worker of scaffolding (#5115 ) Signed-off-by: Clay <ccs96307@gmail.com>	2025-07-04 09:35:34 +08:00
Lucas Liebenwein	24ac9b5f69	[AutoDeploy] merge feat/ad-2025-06-29 (#5737 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: Neta Zmora <nzmora@nvidia.com> Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>	2025-07-04 10:21:18 +09:00
Netanel Haber	aa72d39b72	MTP and derivatives: Align sample state with trtllm sampler sample state (#5675 ) This PR moves MTPSampler and derivatives to use the universal seq_slot indexing for sampling. This is the last piece of the puzzle: After this, all of the samplers will use this format. See: `6ee94c7` Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-07-03 19:55:48 +02:00
Zhenhuan Chen	528ff52ef4	[https://nvbugs/5365714 ] fix(scaffolding): use default LLM rather than trt backend LLM (#5705 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-07-03 23:54:20 +09:00
Rashid Kaleem	2b0c87e613	[ModelLoad] Concurrent load model (#5291 ) Signed-off-by: Rashid K <rkaleem@nvidia.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>	2025-07-03 22:18:04 +08:00
nv-guomingz	8dad22cbe7	chore: refine the default value by using pydantic default instead of … (#5695 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-07-03 22:41:29 +09:00
tomeras91	7dbecf7272	[TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H (#5646 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-07-03 11:07:51 +03:00
Yiqing Yan	3c9dd5cd66	chore: bump version to 1.0.0rc2 (#5645 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-07-03 12:35:28 +08:00
Enwei Zhu	3a46cf275b	fix: Fix missing arg to alltoall_prepare_maybe_dispatch (#5669 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-02 21:41:55 -04:00
Jhao-Ting Chen	77082cde38	[https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op (#5146 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-07-02 04:54:43 -04:00
qixiang-99	ca7b6ec8d8	Feat/pytorch vswa kvcachemanager (#5151 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-02 15:58:00 +08:00
Yan Chunwei	2d69b55fe8	chore: enhance yaml loading arbitrary options in LlmArgs (#5610 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-07-02 14:21:37 +08:00
HuiGao-NV	10c50515c2	fix: Add back allreduce_strategy parameter into TorchLlmArgs (#5637 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-07-02 09:49:20 +08:00
Perkz Zheng	ba2ab5098b	[Bug] attention DP doesn't work with embedding TP (#5642 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-07-02 08:57:46 +08:00
Aurelien Chartier	efef911f5e	fix: add missing self. from PR #5346 (#5653 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-07-01 20:38:55 -04:00
Aurelien Chartier	fa95e402a5	feat: add LLmArgs option to force using dynamic quantization (#5346 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-07-01 12:16:09 -07:00
liji-nv	c345f5876c	[feat] Support torch compile for attention dp (#5086 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-01 13:48:52 -04:00
Kaiyu Xie	f9a455651b	perf: Use tokenizers API to optimize incremental detokenization perf (#5574 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-07-01 09:35:25 -04:00
Anurag Mukkara	93edfea2b8	[nvbug/5354825] Fix nougat test image url (#5496 ) Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>	2025-07-01 20:12:55 +08:00
Wanli Jiang	3789ba1d37	feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-07-01 20:12:55 +08:00
brb-nv	4ef60d5fbb	nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 (#5453 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-07-01 20:12:55 +08:00
danielafrimi	7a617ad1fe	feat: W4A16 GEMM (#4232 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-07-01 10:36:05 +03:00
Netanel Haber	6ee94c7ac8	Reintroduce with perf fixes: feature: unify new_tokens format sample state to trtllm sampler tokens format (#5513 ) `58a8a8f` - these changes were previously merged to main here. `6aef149` - the changes were temporarily reverted in main, due to a significant perf regression in models using the TorchSampler (observed by @byshiue). This PR is meant to re-merge these changes along with a fix to prevent the regression. The first commit of this PR is actually just the reverted revert - filter it out of the changes to see previously unmerged changes. Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-06-30 11:58:59 -07:00
Wei-Ming Chen	f28cd3056e	feat: AutoDeploy fp8 quantization support for bmm (#3849 ) Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>	2025-06-30 12:36:34 -04:00
nv-guomingz	6e48ac25a6	chore: remove cuda_graph_ prefix from cuda_graph_config field members. (#5585 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-30 12:23:14 -04:00
Li Min	16fc99391f	refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code (#5557 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-30 08:48:04 -07:00
Yan Chunwei	98a7c24062	chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs (#5595 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-30 20:40:23 +08:00
Robin Kobus	9bdc5951f8	refactor: decoder state setup (#5093 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-30 11:09:43 +02:00
Fanrong Li	6cbc9a5297	[nvbug/5354946][fix] Fix mtp vanilla draft inputs (#5568 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-30 15:59:12 +08:00
WeiHaocheng	42a9385d02	[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare (#5570 ) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-06-30 13:06:09 +08:00
dongjiyingdjy	852b79053d	feat : support duplicate_kv_weight for qwen3 blockwise scale (#5459 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-06-30 11:49:22 +08:00
nv-guomingz	578430e64c	[TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) (#5014 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-30 11:05:40 +08:00
Bo Li	6000380a0c	perf: Avoid reswizzle_sf after allgather. (#5504 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-29 21:25:50 +08:00
Talor Abramovich	70e34a3291	[TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve (#5376 ) Signed-off-by: Talor Abramovich <talora@nvidia.com>	2025-06-29 12:46:30 +00:00
amirkl94	de9779900c	feat: Add support for YARN in NemotronNAS models (#4906 ) Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>	2025-06-29 09:45:49 +03:00
Lucas Liebenwein	619709fc33	[AutoDeploy] merge feat/ad-2025-06-13 (#5556 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-06-29 03:52:14 +08:00
Li Min	6021a439ab	Make moe permute and final as custom op (#5412 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-27 15:48:33 -07:00
Darragh Hanley	5437075def	ReDrafter support for Qwen (#4875 ) Signed-off-by: darraghdog <darragh.hanley@gmail.com> Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com> Co-authored-by: rakib-hasan <rhasan@nvidia.com>	2025-06-28 02:33:10 +08:00
Robin Kobus	a8141a4513	refactor: Speculative decoding buffers part 2 (#5316 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-27 17:41:48 +02:00
Aurelien Chartier	833c0dea4a	[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-27 17:03:05 +02:00
wili	56cdfe5c6c	[TRTLLM-5000][feat] NGrams V2 (#4569 ) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-06-27 23:00:17 +08:00
peaceh-nv	cb58073ab7	Fix : fix build for sm120 (#5265 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-06-27 20:42:47 +08:00
Daniel Cámpora	73b8a95049	feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538 )	2025-06-27 18:40:53 +08:00
Daniel Stokes	83a1f60556	feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-27 12:29:34 +08:00
Yuxian Qiu	dc36228f52	fix: Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-27 11:03:38 +08:00
Yibin Li	0f3bd7800e	[TRTLLM-4971]: Use safe deserialization in ParallelConfig (#4630 ) Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-06-27 09:58:41 +08:00
Frank	aa6e015ef8	Update trtllm-bench to support new Pytorch default. (#5491 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-06-26 17:05:43 -07:00
Venky	0083228d2a	fix: Mapping rank boundary check bug (#4935 ) Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>	2025-06-27 07:27:59 +08:00
jmydurant	8836990bde	[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 22:18:08 +08:00
Robin Kobus	8dfa31c71d	refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 19:45:52 +08:00
Rashid Kaleem	3a1f4d4001	[feat] Add progress bar to benchmark (#5173 ) Signed-off-by: Rashid Kaleem <rkaleem@nvidia.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>	2025-06-26 18:39:45 +08:00
Kaiyu Xie	2eb6502b1d	feat: Add support for TRTLLM CustomDataset (#5511 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-26 18:27:37 +08:00
Bo Li	1bab9000a6	perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-26 14:03:56 +08:00
dongxuy04	490d2e5819	feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 22:25:13 -07:00
amitz-nv	e0bb123ae7	[TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request (#5080 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-06-26 08:15:06 +03:00
Yukun He	9ee33605bb	[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. (#5394 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-26 13:12:03 +08:00
Netanel Haber	6aef14943c	Revert "feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401 )" (#5474 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-25 20:56:04 -07:00
jmydurant	578dbc8d9a	feat: chunked prefill for MLA (Blackwell) (#4651 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 09:01:00 +08:00
Yukun He	3fc57543e2	[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485 ) The seq_len of 4096 will cause some unknown CUDA illegal memory access issue if run with some other tests consecutively. Put a saturated upper bound for any sequence length larger than it.	2025-06-26 08:38:35 +08:00
Xianjie Qiao	1e4fa13d33	Add sleep function for disagg gen-only benchmarking (#5398 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-06-26 07:32:16 +08:00
QI JUN	3a2c4ca77b	chore: split _build_model method for TorchLlm and TrtLlm (#5418 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-06-26 04:32:46 +08:00
Mike Iovine	5bc8c894f7	[chore] Disable block reuse when draft model speculation is being used (#5448 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-26 03:51:20 +08:00
Daniel Cámpora	205c97a4ae	[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler (#5328 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-25 17:41:36 +02:00
Kaiyu Xie	c5ae3272b9	feat: Make benchmark_serving part of the library (#5428 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-25 23:13:56 +08:00
HuiGao-NV	b3a4c1f404	feat: Remove not used padding_idx in models (#5385 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-25 17:19:59 +08:00
Yiqing Yan	f3cfe86dd1	chore: bump version to 1.0.0rc1 (#5460 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-06-25 16:21:34 +08:00
Enwei Zhu	fc7a81ceb0	test: Add LLGuidance test and refine guided decoding (#5348 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 14:12:56 +08:00
Enwei Zhu	76da7fed86	fix (NvBug 5354925): Fix static EPLB (#5411 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 13:14:40 +08:00
Shunkangz	d5354897c0	feat: Dynamically remove servers in PD (#5270 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-06-25 09:50:04 +08:00
Lucas Liebenwein	5cffb7e0ec	[AutoDeploy] Merge feat/ad_2025_06_13 feature branch (#5454 ) Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com> Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com> Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-06-25 09:30:13 +08:00
bhsueh_NV	73ba4fc320	fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion (#5369 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-06-25 09:20:23 +08:00
dongxuy04	699520082b	Add MTP support for Online EPLB (#5213 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 07:58:13 +08:00
QI JUN	d93a5e04b5	Chore: remove unused variables (#5314 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-24 22:27:32 +08:00
HuiGao-NV	35a92f6bab	Add debug hook to support dump tensor data and add new debug functions easily (#5182 ) Signed-off-by: Hui Gao	2025-06-24 17:45:28 +08:00
Luis Vega	d26040e5d9	chore: delete mamba hybrid, since it is now called NemotronH (#5409 ) Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>	2025-06-24 16:27:31 +08:00
Robin Kobus	e2a8cbc80b	refactor: manage cache indirection in decoder state (#5315 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-24 09:15:59 +02:00
HuiGao-NV	e16c1bef6e	[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-24 11:43:43 +08:00
Netanel Haber	58a8a8fd37	feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-23 10:38:37 -07:00
dongxuy04	4f0f17ac8a	feat: Misc Opt for large scale EP (#5374 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-20 13:11:31 +08:00
Fanrong Li	5d4ab47d5b	fix: refactor and fix mtp vanilla (#4762 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-20 05:23:39 +08:00
Yan Chunwei	9bd42ecf9b	[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default (#5312 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-20 03:01:10 +08:00
Kaiyu Xie	7246fd75d1	feat: Support stream_interval (#5284 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-19 21:57:10 +08:00
Fanrong Li	c7af650d5a	Fix: fix the deterministic issue in the MTP Eagle path (#5285 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-19 18:08:40 +08:00
Frank	68687a9f56	[WAR][nvbug/5321947] Add an async sleep to unblock event loop. (#5342 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-06-19 17:25:18 +08:00
hlu1	b558232ce1	Refactor CutlassFusedMoE (#5344 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-06-19 00:04:07 -07:00
amitz-nv	1753202b61	[TRTLLM-5825][fix] Fix torch LoRA TP (#5338 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-06-19 09:12:00 +03:00
Yiqing Yan	dedce8ab0e	chore: bump version to 1.0.0rc0 (#5326 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-06-19 12:02:28 +08:00
nv-guomingz	6a388b105a	chore: remove torch_compile prefix for TorchCompileConfig field members (#5261 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-19 09:21:51 +08:00
Zongfei Jing	2b23cd56ce	[feat] Fusion finalize and allreduce for qwenmoe model (#5223 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-19 08:03:58 +08:00
Yan Chunwei	3946e798db	fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances (#4727 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-19 06:13:53 +08:00
jellysnack	0623ffe3bc	feat: Add LLGuidance Support for PyTorch Backend (#5214 ) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com> Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-18 19:33:34 +08:00
Zhanrui Sun	516bd4dc05	chore: bump version to 0.21.0rc3 (#5309 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-06-18 15:59:53 +08:00
Robin Kobus	38547b92f3	refactor: Introduce ResourceManagerType enum for resource management (#5246 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 09:55:59 +02:00
Yukun He	6711ad9cf3	[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-18 14:33:46 +08:00
Yan Chunwei	724e495254	chore: partition LLM class into TorchLLM and TrtLLM (#4900 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-18 14:01:25 +08:00
Yi Zhang	e44f7687af	feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-06-18 13:37:31 +08:00
QI JUN	855036d8ee	update LlmRequest.is_dummy property (#5283 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-18 10:52:13 +08:00
Robin Kobus	627062c265	refactor: Update decoder buffer and logits management (#4450 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 08:10:32 +08:00
Mike Iovine	9bf69c9fdb	[chore] Remove BaseDraftTokenManager (#5251 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-17 11:57:52 -04:00
QI JUN	f899c4d294	Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 21:28:09 +08:00
Dom Brown	44fb3c1673	[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207 ) - Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning. - Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution. - Extends the unit test to run both autotuned and non-autotuned code paths. Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-17 21:01:56 +08:00
amirkl94	8451a87742	chore: Mass integration of release/0.20 (#5082 ) Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>	2025-06-17 14:32:02 +03:00
liji-nv	13eef642e6	[feat] Piecewise cuda graph support for MLA (#4467 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-17 18:58:38 +08:00
Yilin Fan	498fadceb4	[feat] Add EAGLE3 support for Qwen3 (#5206 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-17 17:07:06 +08:00
Enwei Zhu	4b82b8b4c7	[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-17 15:23:24 +08:00
Izzy Putterman	e607768e45	Speculation: Draft Target in new FW (#4558 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2025-06-17 02:26:08 +08:00
tomeras91	cea5dd1e38	[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill (#5128 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-16 16:29:17 +03:00
Yilin Fan	dd29063538	[feat] Add llm args to tune python gc threshold (#5141 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-16 17:45:22 +08:00
Robin Kobus	b6ca677741	refactor: remove decoder request from decoder interface (#5129 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 09:12:30 +02:00
Robin Kobus	dda64166cd	refactor: Scheduling based on KV cache state (#4865 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 08:14:58 +02:00
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
Enwei Zhu	babdd9ce06	test: Add json_mode_eval for guided decoding evaluation (#5179 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-16 10:03:55 +08:00
Yilin Fan	7a5e0fd300	[fix] Fix Llama4 min-latency import error (#5209 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-16 10:03:07 +08:00
Yan Chunwei	c84e41fd9d	fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-15 17:51:56 -07:00
amitz-nv	109c426077	Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130 )	2025-06-15 18:54:04 +03:00
Fanrong Li	39bba63758	[TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-15 23:09:16 +08:00
Fanrong Li	159ffc584e	fix: fix cuda graph max batch size for spec decoding cases. (#5076 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-15 14:57:28 +08:00
Kaiyu Xie	dce1dcc4f9	feat: Support post_proc for bench (#5122 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-15 13:02:38 +08:00
Enwei Zhu	63bc62ddf4	feat: Enable EPLB to existing MoE models (#5203 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-15 11:48:06 +08:00
Yuan Tong	6bce7337a9	perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-06-15 07:45:02 +08:00
ixlmar	e055af1bc9	chore: improve disagg test failure detection (#4738 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-06-15 01:28:26 +08:00
Aurelien Chartier	1389f5a4d3	feat: Add support for fp8 rowwise quantization (#4876 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>	2025-06-14 06:37:48 -07:00
2ez4bz	dc52b67492	linting(python): Enable ruff on more files (wave 1/N) (#5140 ) Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-06-14 19:19:34 +08:00
Tailing Yuan	0b60da2c45	feat: large-scale EP(part 7: DeepEP integration) (#4792 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-14 19:12:38 +08:00
yunruis	b99c5ce8c1	Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560 ) Signed-off-by: yunruis <yunruis@nvidia.com> Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com> Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com> Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-14 17:36:22 +08:00
nv-guomingz	3b7b5a5ad5	refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3(torch_compile_config) (#5032 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-14 14:23:13 +08:00
Yilin Fan	06342ffb4d	[feat] Implement model-agnostic one-engine eagle3 (#4778 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-13 08:11:41 -07:00
Mike Iovine	25aa3881d7	[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-13 11:06:36 -04:00
brb-nv	089be8912a	feat: Basic skeleton for Gemma3 VLM (#5108 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-06-13 17:27:04 +08:00
nv-guomingz	b959618579	refactor [BREAKING CHANGE]:: remove the redundant use_kv_cache field from PytorchConfig (#5031 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-13 16:34:24 +08:00
yunruis	30c5b4183a	refactoring: port customized kernels with public cutlass version (#5027 ) Signed-off-by: yunruis Merge this to unblock others since the full CI has been run through	2025-06-13 16:19:31 +08:00
Zheng Duan	4d0a5ad384	chore: gracefully exit disagg process in tests; better startup and logging (#5109 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-13 14:03:55 +08:00
Yibin Li	b79eb34bfe	[fix]: Fall back to HMAC to Avoid IPC Serialization Churn (#5074 ) Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-06-13 11:37:50 +08:00
zhhuang-nv	a891013e3c	[feat] Optimize KV Cache Reuse for MLA (#4869 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-06-13 11:03:05 +08:00
Fanrong Li	38a907aaca	[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance (#5119 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-13 08:58:44 +08:00
pcastonguay	3a04c9fa7b	chore: Include prompt_token_ids only for context-only disagg requests (#5055 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-06-12 15:00:08 -04:00
Mike Iovine	690873ba1a	[nvbug/5334370][fix] Fix one model EAGLE3 (#5134 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-12 10:28:14 -04:00
HuiGao-NV	dfeeaf6746	Move allreduce_strategy from committed api to reference (#5147 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-12 21:00:20 +08:00
brb-nv	8cfb567182	fix: Updates to yarn implementation (#5105 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-06-12 20:45:34 +08:00
nv-guomingz	58d4ca2385	fix:remove duplicated trust_remote_code knob from trtllm-serve (#5143 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-12 19:48:24 +08:00
liji-nv	10ab9791ec	[fix] Do not reuse dummy request KVCache (#4804 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-12 15:24:50 +08:00
Daniel Cámpora	e46267765f	Fix logprobs issues. (#5136 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-12 15:07:01 +08:00
Lucas Liebenwein	49d7268acc	[nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-06-12 13:07:27 +08:00
Netanel Haber	e692779ead	Solve underallocation in VSWA+/VGQA (#4667 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-12 12:12:46 +08:00
HuiGao-NV	43192379af	Use backend to replace macro to control enablement of MNNVL all reduce (#4635 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-12 11:22:49 +08:00
Zheng Duan	c592798f64	fix: limit process pool size when prefetching (#5088 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-12 10:52:52 +08:00
liji-nv	8282d6c1a7	[fix] Fix llama4 min latency (#5117 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-11 15:44:38 +08:00
Zhanrui Sun	e2863a3159	chore: bump version to 0.21.0rc2 (#5112 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-06-11 15:08:14 +08:00
Daniel Cámpora	fdf1c47d1d	[TRTLLM-4995][feat] TRTLLM Sampler log probs support (#4836 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-11 08:18:13 +02:00
nvpohanh	7b210ae9c3	test: add unit tests for Llama4 min_latency code (#4980 ) Signed-off-by: Po-Han Huang <pohanh@nvidia.com>	2025-06-10 12:10:26 -07:00
Lucas Liebenwein	7ddc4d6282	[AutoDeploy] Merge Feature Branch Week 3 (#5054 ) Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-06-11 00:20:43 +08:00
Tracin	6c91f1c7ac	Mxfp8xmxfp4 quant mode(#4978 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-10 22:01:37 +08:00
Zongfei Jing	6d1f2d0fd7	[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-06-10 19:55:16 +08:00
Yuxian Qiu	08dc369a4d	fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict. (#4890 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-10 18:40:29 +08:00
tomeras91	f121f13ddf	[nvbug 5325284][fix] Increase Nemotron-H warmup request robustness (#4954 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-10 11:09:37 +03:00
Xiaowei Wang	ec6b1821c7	[fix] Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod (#5026 ) Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>	2025-06-10 15:09:06 +08:00
Daniel Cámpora	d68b8180d3	feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-10 07:28:34 +08:00
Chang Liu	f70815c945	[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145 ) Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-06-10 01:59:56 +08:00
Yuxian Qiu	e79527d195	chore: Refine weight prefetching. (#4893 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-09 21:24:16 +08:00
Mike Iovine	f4d9c87c51	[nvbug/5314469][feat] Include the executor's max batch size in CUDA g… (#4843 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-09 08:31:35 -04:00
Yukun He	137fe35539	fix: Fix warmup phase batch size out of range. (#4986 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-09 19:19:16 +08:00
Yuxian Qiu	88480197da	ci: [nvbugs/5280806] Unwaive unittests/_torch. (#4951 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-09 19:04:11 +08:00
Dom Brown	9c012d5bf8	[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-09 11:02:48 +01:00
ChristinaZ	f45aff2b7d	Add customized renormalized moe routing kernel for moe cutlass backend (#4955 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-09 17:38:50 +08:00
Bo Li	c104388d37	chore: Refactor apply_rope. (#4918 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-06-09 16:51:59 +08:00
Daniel Stokes	3a4851b7c3	feat: Add Mixture of Experts FP8xMXFP4 support (#4750 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-09 13:25:04 +08:00
amitz-nv	77e8d739f1	[TRTLLM-4987][feat] Support generation logits in TRTLLMSampler (#4819 )	2025-06-09 06:30:01 +03:00
Yechan Kim	8b4104d34a	feat: add HyperCLOVAX-SEED-Vision support in refactored way (#4799 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-06-09 11:04:04 +08:00
Omer Ullman Argov	8731f5f14f	chore: Mass integration of release/0.20 (#4898 ) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com> Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com> Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: moraxu <mguzek@nvidia.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: HuiGao-NV <huig@nvidia.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>	2025-06-08 23:26:26 +08:00
Mike Iovine	ec0d984656	[nvbug/5280806][fix] Fix 2 model spec decode flow (#4807 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-08 07:40:02 -04:00
dongxuy04	1e369658f1	feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-06-08 10:25:18 +08:00
QI JUN	5ee0de7f2a	Resubmit #4894 (#4969 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-08 04:42:15 +08:00
Bo Li	f414a079ad	chore: Change the type annotations of input_ids and position_ids to int32. (#4632 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-07 16:10:47 +08:00
nv-guomingz	0c7dd660d8	fix:https://nvbugs/5324248 (#4973 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-07 04:14:07 +08:00
Fanrong Li	75d020cf07	fix: fix cuda graph padding for spec decoding (#4853 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-06 22:21:42 +08:00
Anthony Chang	eeb555e37b	chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-06 16:13:54 +08:00
QI JUN	ec50684d80	Revert "fix a bug of global cuda graph dummy request" (#4970 )	2025-06-06 08:54:45 +08:00
QI JUN	bfa877a22e	Fix: fix autodeploy (#4957 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-05 21:06:55 +08:00
QI JUN	154f7cc40a	fix a bug of global cuda graph dummy request (#4894 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-05 19:47:40 +08:00
Shunkangz	3eae58ca36	Add disaggregated unittest (#4899 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-06-05 19:14:31 +08:00
QI JUN	b8c5e3892b	Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-05 17:43:30 +08:00
Lucas Liebenwein	743fb0a159	[AutoDeploy] _AutoDeployLlmArgs as primary config object (#4891 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-06-05 17:20:55 +08:00
ixlmar	6437756da8	fix: handle OOMs during KV cache estimation (#4690 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-06-05 10:02:26 +02:00
Shiyu Li	b0d287c9b7	[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-06-04 18:26:13 -07:00
Yuxian Qiu	6b3242654e	fix: Fix broken vanilla moe since FusedMoE refactor. (#4897 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-05 03:56:41 +08:00
Yi Zhang	1fca654bfd	tests: Update gb200 test case (#4754 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-06-04 18:49:20 +08:00
ixlmar	2bbb6b5976	chore: introduce KvCacheCreator (#4581 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-06-04 11:03:17 +02:00
Xianjie Qiao	325ccaae3d	Fix trtllm-bench iter_stats and cuda_graph_batch_sizes error errors. (#4827 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>	2025-06-04 16:36:07 +08:00
Zhanrui Sun	35e87b99f3	chore: bump version to 0.21.0rc1 (#4896 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-06-04 14:31:18 +08:00
tomeras91	8d31e16877	[TRTLLM-4923][feat] Paged mamba cache (#4822 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-04 09:27:08 +03:00
Omer Ullman Argov	e71de2a13e	chore: Mass integration of release/0.20. (#4871 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com> Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-06-04 14:12:27 +08:00
Yan Chunwei	ac20159d32	fix: build_config in TorchLlmArgs and avoid invalid args (#4600 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-04 13:17:29 +08:00
QI JUN	e2eea80c1d	Chore: refine comments of prepare inputs method of model engine (#4837 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-04 12:14:13 +08:00
Yukun He	5fa6fbd989	feat: Enhance AutoTuner inference path and code readability (#4466 ) Fix AutoTuner warmup request generating. * The current warmup phase creates one request, which is insufficient for the warmup to cover the max_num_tokens. Revise the warmup phase to a batch of requests to cover the max_num_tokens to eliminate potential fallback cases. Refactor AutoTuner API and reduce host overhead. Refine (min, opt, max) values of optimization profile setup for get_valid_tactics to achieve the correct canImplement definition. * Refine cache key assembly process to reduce host overhead and simplify API. * Fix lru_cache usage to reduce host overhead. * Move tuning config initialization as a one-time object in tunable runner to reduce host overhead. Improve tuning config readability. * Use dataclass to define tuning config. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-04 10:53:11 +08:00
Shi Xiaowei	b13f8c9cba	Fix: NVBug 5302895 (#4835 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-04 09:31:39 +08:00
Shunkangz	c835f06371	Refactor the first token response in PD (#4692 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-06-04 09:11:23 +08:00
Mike Iovine	73389d6531	[fix] Fix llama 4 long context (#4809 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-04 07:48:08 +08:00
Nikita Korobov	8043d7a03c	feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643 ) Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-06-03 14:07:54 -07:00
rakib-hasan	d0eb47d33a	[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation (#4506 ) * refactoring the multimodal input prep Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * adding out-of-tree override option Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * adding exceptional case for llava-next Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * fixing typo Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * addressing review comments, adding placement option, handling tokenizer variations Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * addressing pytest-asyncio behavior change Signed-off-by: Rakib Hasan <rhasan@nvidia.com> --------- Signed-off-by: Rakib Hasan <rhasan@nvidia.com>	2025-06-03 12:02:07 -07:00
hlu1	b4ed4b22f3	[Arch] Freeze model_config (#4814 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-06-04 02:51:35 +08:00
Yan Chunwei	80b4026775	chore: remove request_error ipc in LLM.submit (#4763 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-03 20:55:59 +08:00
pcastonguay	01f29ce38b	[nvbug 5294316] fix: Fix queued request stats (#4714 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-06-03 08:33:08 -04:00
Shunkangz	ae9a6cf24f	feat: Add integration of etcd (#3738 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Batsheva Black <bblack@login-eos01.eos.clusters.nvidia.com> Co-authored-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>	2025-06-03 20:01:44 +08:00
Enwei Zhu	3fe4a1842a	fix: Register MoeLoadBalancerConfig to serialization.py (#4864 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-03 19:22:36 +08:00
Frank	80f9989a1e	[enhanchment] Add beam width to low latency. (#4812 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-06-03 17:24:55 +08:00
Robin Kobus	3de02582dd	refactor: Separate DecoderState from GptDecoderBatched (#4700 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-03 09:42:01 +02:00
Robin Kobus	b9263a8e10	fix: max_num_sequences calculation with overlap scheduling (#4532 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-03 09:31:22 +02:00
hlu1	320195dc0d	[Architecture] Refactor FusedMoE (#4790 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-06-03 14:02:19 +08:00
Yuxian Qiu	ec796e44e4	feat: add heuristics for checkpoint files prefetching. (#4765 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-03 12:10:37 +08:00
Yan Chunwei	e013c8cbc2	fix [nvbug5256044]: bench hang due to llmapi ipc (#4798 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-03 10:10:53 +08:00
Fanrong Li	380a5d1690	[https://nvbugs/5271281 ][fix] fix a pd+mtp accuracy issue (#4536 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-03 10:03:34 +08:00
Tian Zheng	9832787050	[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737 ) Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-06-03 10:00:17 +08:00
Yilin Fan	eb2d51a429	[fix] Fix llama4 min-latency mode (#4810 )	2025-06-02 08:50:01 +08:00
Enwei Zhu	5b4852b7b5	feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-02 01:25:02 +08:00
Fanrong Li	7d356efc7d	fix: fix accuracy and illegal memory access issues when using mtp + attention dp (#4379 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-02 00:35:52 +08:00
tomeras91	bf9cd11fd4	[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-01 13:56:44 +03:00
Lucas Liebenwein	491a09b0c6	[AutoDeploy] Increased Model Coverage Mass Migration Week 2 (#4817 ) Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com> Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-06-01 14:40:29 +08:00
Enwei Zhu	0087bd27ba	[fix] Fix SamplingParams check on n and best_of (#4655 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-01 09:11:55 +08:00
Daniel Cámpora	69c7fe8905	[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-01 03:32:43 +08:00

961 Commits