DylanChen-NV | 5ca2b9bb15 | 2025-07-07 18:04:57 +08:00
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

ChristinaZ | 12d8c7d129 | 2025-07-07 15:53:25 +08:00
Refactor the topk parallelization part for the routing kernels (#5567)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>

Daniel Stokes | ec6c7dff1a | 2025-07-06 15:32:06 -07:00
feat: Add support for MXFP8xMXFP4 in pytorch (#5535)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>

Robin Kobus | ae27261094 | 2025-07-06 08:21:02 +02:00
refactor: decoding inputs (#5679)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Julien Debache | 6bddaf6df6 | 2025-07-05 22:25:27 +02:00
chore: Improve documentation of Kv_block_array (#5765)
Signed-off-by: Julien Debache <julien.debache@hotmail.com>

jthomson04 | 1b588f8390 | 2025-07-05 06:05:20 +08:00
feat: KV events for sliding window attention (#5580)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

Stefan Niebler | d1112aac37 | 2025-07-05 01:35:13 +09:00
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>

Chuang Zhu | ffc0b8f5da | 2025-07-05 01:18:42 +09:00
Cache transceiver: support VSWA (#5505)
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Faraz | 81c0764012 | 2025-07-04 16:53:20 +09:00
Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120" (#5724)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>

Robin Kobus | 07f9cf1519 | 2025-07-04 09:08:15 +02:00
fix: Improve chunking test and skip empty kernel calls (#5710)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Yuan Tong | 32b244af38 | 2025-07-04 14:37:49 +08:00
feat: reduce unnecessary kernel generation (#5476)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Netanel Haber | 134b2383ff | 2025-07-04 08:16:25 +02:00
[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window (#5720)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>

Robin Kobus | 1a3bd140ed | 2025-07-03 15:08:09 +02:00
chore: Remove unused isFullContextRequest method (#5666)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

WeiHaocheng | dccbfc8b1e | 2025-07-03 07:05:31 -04:00
fix: Set init value for moe expert id (#5660)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>

Jhao-Ting Chen | 77082cde38 | 2025-07-02 04:54:43 -04:00
[https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op (#5146)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>

Robin Kobus | 4cd8543d8c | 2025-07-02 10:13:31 +02:00
[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest (#5489)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

qixiang-99 | ca7b6ec8d8 | 2025-07-02 15:58:00 +08:00
feat: PyTorch VSWA KVCacheManager (#5151)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

Xiaowei Wang | 32dfdfba30 | 2025-07-01 23:02:41 -04:00
feat: fuse w4a8 moe pre-quant scale on Hopper (#5613)
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>

Void | 7992869798 | 2025-07-01 22:56:06 -04:00
perf: better heuristic for allreduce (#5432)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>

liji-nv | c345f5876c | 2025-07-01 13:48:52 -04:00
[feat] Support torch compile for attention dp (#5086)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Robin Kobus | d68fa728d8 | 2025-07-01 14:31:42 +02:00
refactor: Clean up DecodingInput and DecodingOutput (#5617)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Yan Chunwei | a5eff139f1 | 2025-07-01 19:06:41 +08:00
[TRTLLM-5277] chore: refine llmapi examples for 1.0 (part 1) (#5431)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>

杨凯旋 | 61c5a53642 | 2025-07-01 18:32:37 +08:00
[#5403][perf] Conditionally enable SWAP AB for speculative decoding (#5404)
Signed-off-by: zoheth <z0heth@outlook.com>
Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com>

Robin Kobus | 5f77d212ef | 2025-07-01 09:40:49 +02:00
test: Reduce number of C++ test cases (#5437)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

danielafrimi | 7a617ad1fe | 2025-07-01 10:36:05 +03:00
feat: W4A16 GEMM (#4232)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>

Li Min | 16fc99391f | 2025-06-30 08:48:04 -07:00
refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code (#5557)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>

Robin Kobus | 9bdc5951f8 | 2025-06-30 11:09:43 +02:00
refactor: decoder state setup (#5093)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

WeiHaocheng | 42a9385d02 | 2025-06-30 13:06:09 +08:00
[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare (#5570)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>

Cheng Hang | 64db7d27f6 | 2025-06-30 10:20:16 +08:00
[feat] Optimizations on weight-only batched gemv kernel (#5420)
Signed-off-by: Cheng Hang <chang@nvidia.com>

Enwei Zhu | b4dab23e7b | 2025-06-30 01:02:07 +08:00
[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP (#5435)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

Li Min | 6021a439ab | 2025-06-27 15:48:33 -07:00
Make moe permute and finalize a custom op (#5412)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>

Daniel Stokes | 5773cfdcf2 | 2025-06-28 09:10:35 +12:00
feat: Add support for per expert activation scaling factors (#5013)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>

Darragh Hanley | 5437075def | 2025-06-28 02:33:10 +08:00
ReDrafter support for Qwen (#4875)
Signed-off-by: darraghdog <darragh.hanley@gmail.com>
Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>

Robin Kobus | a8141a4513 | 2025-06-27 17:41:48 +02:00
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Aurelien Chartier | 833c0dea4a | 2025-06-27 17:03:05 +02:00
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

wili | 56cdfe5c6c | 2025-06-27 23:00:17 +08:00
[TRTLLM-5000][feat] NGrams V2 (#4569)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>

peaceh-nv | cb58073ab7 | 2025-06-27 20:42:47 +08:00
fix: fix build for sm120 (#5265)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>

ChristinaZ | a608b00d38 | 2025-06-27 20:17:40 +08:00
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>

Daniel Stokes | 83a1f60556 | 2025-06-27 12:29:34 +08:00
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>

Tailing Yuan | ef43b95aa1 | 2025-06-27 11:57:04 +08:00
Fix execute_process: check results using EQUAL (#5481)

Anthony Chang | de7cd0de05 | 2025-06-26 17:28:48 +01:00
fix: MoE autotune fallback failed to query default heuristic (#5520)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

jmydurant | 8836990bde | 2025-06-26 22:18:08 +08:00
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

Robin Kobus | 8dfa31c71d | 2025-06-26 19:45:52 +08:00
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Yao Yao | 0788c5d0d6 | 2025-06-26 18:09:13 +08:00
[perf] improve XQA-MLA perf (#5468)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

Bo Li | 1bab9000a6 | 2025-06-26 14:03:56 +08:00
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Alessio Netti | 7e681fbe52 | 2025-06-26 07:26:10 +02:00
[chore] Allow configuring linking of NVRTC wrapper (#5189)
Signed-off-by: Alessio Netti <netti.alessio@gmail.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

dongxuy04 | 490d2e5819 | 2025-06-25 22:25:13 -07:00
feat: large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) (#5226)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

Daniel Stokes | 942841417e | 2025-06-26 12:18:19 +08:00
opensource: Opensource MOE MXFP8-MXFP4 implementation (#5222)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>

qsang-nv | e9cd810071 | 2025-06-26 12:14:01 +08:00
keep sm90 headsize 128 cubins (#5320)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

ChristinaZ | d135f5993d | 2025-06-26 09:49:11 +08:00
Add unit test for routing kernels (#5405)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>