Commit Graph

732 Commits

Author SHA1 Message Date
brb-nv
cdaa6abce7 fix: Investigate Gemma3 1B decoder output discrepancy (#5564)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Frank
819ae903de [https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. (#5625)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Clay
7a319524da
feat: support more parameters in openai worker of scaffolding (#5115)
Signed-off-by: Clay <ccs96307@gmail.com>
2025-07-04 09:35:34 +08:00
Lucas Liebenwein
24ac9b5f69
[AutoDeploy] merge feat/ad-2025-06-29 (#5737)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-07-04 10:21:18 +09:00
Netanel Haber
aa72d39b72
MTP and derivatives: Align sample state with trtllm sampler sample state (#5675)
This PR moves MTPSampler and derivatives to use the universal seq_slot indexing for sampling.
This is the last piece of the puzzle: After this, all of the samplers will use this format.
See: 6ee94c7
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-03 19:55:48 +02:00
Zhenhuan Chen
528ff52ef4
[https://nvbugs/5365714] fix(scaffolding): use default LLM rather than trt backend LLM (#5705)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-03 23:54:20 +09:00
Rashid Kaleem
2b0c87e613
[ModelLoad] Concurrent load model (#5291)
Signed-off-by: Rashid K <rkaleem@nvidia.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
2025-07-03 22:18:04 +08:00
nv-guomingz
8dad22cbe7
chore: refine the default value by using pydantic default instead of … (#5695)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-03 22:41:29 +09:00
tomeras91
7dbecf7272
[TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H (#5646)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-03 11:07:51 +03:00
Yiqing Yan
3c9dd5cd66
chore: bump version to 1.0.0rc2 (#5645)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-03 12:35:28 +08:00
Enwei Zhu
3a46cf275b
fix: Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-02 21:41:55 -04:00
Jhao-Ting Chen
77082cde38
[https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op (#5146)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
qixiang-99
ca7b6ec8d8
Feat/pytorch vswa kvcachemanager (#5151)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Yan Chunwei
2d69b55fe8
chore: enhance yaml loading arbitrary options in LlmArgs (#5610)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-02 14:21:37 +08:00
HuiGao-NV
10c50515c2
fix: Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-07-02 09:49:20 +08:00
Perkz Zheng
ba2ab5098b
[Bug] attention DP doesn't work with embedding TP (#5642)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-07-02 08:57:46 +08:00
Aurelien Chartier
efef911f5e
fix: add missing self. from PR #5346 (#5653)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-01 20:38:55 -04:00
Aurelien Chartier
fa95e402a5
feat: add LLmArgs option to force using dynamic quantization (#5346)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-01 12:16:09 -07:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp (#5086)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
Kaiyu Xie
f9a455651b
perf: Use tokenizers API to optimize incremental detokenization perf (#5574)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-01 09:35:25 -04:00
Anurag Mukkara
93edfea2b8 [nvbug/5354825] Fix nougat test image url (#5496)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Wanli Jiang
3789ba1d37 feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 (#5364)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
brb-nv
4ef60d5fbb nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 (#5453)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM (#4232)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
Netanel Haber
6ee94c7ac8
Reintroduce with perf fixes: feature: unify new_tokens format sample state to trtllm samper tokens format (#5513)
58a8a8f - these changes were previously merged to main here.
6aef149 - the changes were temporarily reverted in main, due to a significant perf regression in models using the TorchSampler (observed by @byshiue).
This PR is meant to re-merge these changes along with a fix to prevent the regression.

The first commit of this PR is actually just the reverted revert - filter it out of the changes to see previously unmerged changes.

Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-06-30 11:58:59 -07:00
Wei-Ming Chen
f28cd3056e
feat: AutoDeploy fp8 quantization support for bmm (#3849)
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
2025-06-30 12:36:34 -04:00
nv-guomingz
6e48ac25a6
chore: remove cuda_graph_ prefix from cuda_graph_config filed members. (#5585)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 12:23:14 -04:00
Li Min
16fc99391f
refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code (#5557)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-30 08:48:04 -07:00
Yan Chunwei
98a7c24062
chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs (#5595)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-30 20:40:23 +08:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup (#5093)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
Fanrong Li
6cbc9a5297
[nvbug/5354946][fix] Fix mtp vanilla draft inputs (#5568)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-30 15:59:12 +08:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgaher with AllToAllPrepare (#5570)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
dongjiyingdjy
852b79053d
feat : support duplicate_kv_weight for qwen3 blockwise scale (#5459)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-06-30 11:49:22 +08:00
nv-guomingz
578430e64c
[TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) (#5014)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 11:05:40 +08:00
Bo Li
6000380a0c
perf: Avoid reswizzle_sf after allgather. (#5504)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-29 21:25:50 +08:00
Talor Abramovich
70e34a3291
[TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve (#5376)
Signed-off-by: Talor Abramovich <talora@nvidia.com>
2025-06-29 12:46:30 +00:00
amirkl94
de9779900c
feat: Add support for YARN in NemotronNAS models (#4906)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-06-29 09:45:49 +03:00
Lucas Liebenwein
619709fc33
[AutoDeploy] merge feat/ad-2025-06-13 (#5556)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-29 03:52:14 +08:00
Li Min
6021a439ab
Make moe permute and final as custom op (#5412)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Darragh Hanley
5437075def
ReDrafter support for Qwen (#4875)
Signed-off-by: darraghdog <darragh.hanley@gmail.com>
Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
2025-06-28 02:33:10 +08:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 (#4569)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
peaceh-nv
cb58073ab7
Fix : fix build for sm120 (#5265)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-06-27 20:42:47 +08:00
Daniel Cámpora
73b8a95049
feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538) 2025-06-27 18:40:53 +08:00
Daniel Stokes
83a1f60556
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-27 12:29:34 +08:00
Yuxian Qiu
dc36228f52
fix: Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-27 11:03:38 +08:00
Yibin Li
0f3bd7800e
[TRTLLM-4971]: Use safe deserialization in ParallelConfig (#4630)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-27 09:58:41 +08:00
Frank
aa6e015ef8
Update trtllm-bench to support new Pytorch default. (#5491)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-26 17:05:43 -07:00
Venky
0083228d2a
fix: Mapping rank boundary check bug (#4935)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-27 07:27:59 +08:00