zhhuang-nv | 7e135d2ea7 | 2025-08-19 22:04:48 +08:00
[None][feat] Use Separate QKV Input Layout for Context MLA (#6538)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

Shunkangz | 54ec2c1af1 | 2025-08-19 03:50:08 -04:00
[None][opt] Add batch wait timeout in fetching requests (#6923)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

Yi Zhang | a15af879ec | 2025-08-19 09:58:44 +08:00
[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic (#6615)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

Kaiyu Xie | e88cb92f24 | 2025-08-18 13:47:14 +08:00
[None] [feat] Support accurate device iter time (#6906)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

Izzy Putterman | f6ff0e3311 | 2025-08-16 02:17:36 -04:00
[None][fix] Skip Topk if 0 (#6934)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>

Daniel Cámpora | 53312eeebd | 2025-08-16 00:27:24 -04:00
[TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options (#6831)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

yifeizhang-c | 4127d77678 | 2025-08-15 09:52:06 -07:00
[https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 (#6537)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>

tomeras91 | f7dbc1435a | 2025-08-15 13:42:51 +03:00
[None] [chore] Mamba cache in separate file (#6796)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

qianbiao | 5c2f0fd03d | 2025-08-15 06:56:44 +08:00
[None] [feat] Add Tencent HunYuanMoEV1 model support (#5521)
Signed-off-by: sorenwu <sorenwu@tencent.com>
Co-authored-by: sorenwu <sorenwu@tencent.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>

Matthias Jouanneaux | 69574ad730 | 2025-08-14 09:00:02 -07:00
[TRTLLM-5966][feat] Helix: extend mapping to support different CP types (#6816)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>

jmydurant | 4200fa46d1 | 2025-08-14 10:39:26 +08:00
[None][feat] Add support for Hopper MLA chunked prefill (#6655)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

Izzy Putterman | ef53de8eef | 2025-08-13 22:09:35 -04:00
[None][feat] Add test for speculative rejection sampler (2-model) (#6542)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>

Robin Kobus | 45c7518032 | 2025-08-12 21:44:41 +02:00
[None][refactor] Simplify decoder state initialization (#6559)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Shunkangz | ab0d768acf | 2025-08-12 04:53:09 -04:00
[None][fix] Fix attention dp log (#6570)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

Sergey Klevtsov | 27fc35175e | 2025-08-12 15:56:48 +08:00
[None][feat] CUTLASS MoE FC2+Finalize fusion (#3294)
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>

Enwei Zhu | 7c686ba8de | 2025-08-12 09:30:06 +08:00
[TRTLLM-2285][feat] Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

Ziyi Xiong | b4fcd5f592 | 2025-08-12 09:28:47 +08:00
[https://nvbugs/5441438][fix] Set correct draft length for the cuda graph dummy request (#6701)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>

rakib-hasan | 7ab8112450 | 2025-08-11 18:00:42 -04:00
[None][fix] Refactoring to avoid circular import when importing torch models (#6720)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

bhsueh_NV | 83dbc6c75d | 2025-08-11 16:14:52 +08:00
[TRTLLM-5532][feat] store the block of context request into kv cache (#6683)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

Yechan Kim | 60073a7ad9 | 2025-08-10 17:48:24 -07:00
[None][feat] Support SharedTensor on MultimodalParams (#6254)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

shaharmor98 | b6baa9ed9b | 2025-08-10 19:47:39 -04:00
[TRTLLM-6823][doc] Add checkpoint refactor docs (#6592)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>

shaharmor98 | 14b36e07d7 | 2025-08-10 16:27:51 -04:00
[TRTLLM-6174][feat] Enable FP32 mamba ssm cache (#6574)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>

Ziyi Xiong | de472828b9 | 2025-08-09 23:15:04 +08:00
[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>

Ye Zhang | bcf5ec0c9a | 2025-08-09 02:48:53 -04:00
[None][feat] Core Metrics Implementation (#5785)
Signed-off-by: Ye Zhang <zhysishu@gmail.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

Mike Iovine | 90145cf557 | 2025-08-08 13:56:53 -04:00
[None][feat] Optimize CUDA graph memory usage for spec decode cases (#6718)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

Stefan Niebler | b8f036f264 | 2025-08-08 14:00:33 +02:00
[TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding (#6665)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>

Liao Lanyu | 32ad7f3c12 | 2025-08-08 17:48:57 +08:00
[None][fix] Remove lock related typo in py_executor (#6653)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>

Enwei Zhu | aee828d98a | 2025-08-08 12:10:36 +08:00
[TRTLLM-6854][feat] Enable guided decoding with disagg serving (#6704)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

zhanghaotong | 1cf669496a | 2025-08-07 23:44:47 -04:00
[None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference (#6626)
Signed-off-by: 皓聪 <zhanghaotong.zht@alibaba-inc.com>
Co-authored-by: 皓聪 <zhanghaotong.zht@alibaba-inc.com>

Daniel Cámpora | efca359b66 | 2025-08-07 22:19:37 -04:00
[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

Iman Tabrizian | 82276167e6 | 2025-08-07 17:28:14 -07:00
[None][feat] Add NCCL Symmetric Integration for All Reduce (#4500)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Yuan Tong | db8dc97b7b | 2025-08-07 16:29:55 -04:00
[None][fix] Migrate to new cuda binding package name (#6700)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Mike Iovine | e968f98b43 | 2025-08-07 12:51:47 -04:00
[None][feat] Clean up ngram auto mode, add max_concurrency to configs (#6676)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

Emma Qiao | 3c44b44e45 | 2025-08-07 21:06:47 +08:00
[None][infra] Fix guardwords (#6711)
Signed-off-by: qqiao <qqiao@nvidia.com>

pcastonguay | 453a06e6ab | 2025-08-07 14:17:07 +02:00
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

Enwei Zhu | 1b9781e8e7 | 2025-08-07 05:53:48 -04:00
[TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

hlu1 | 8207d5fd39 | 2025-08-07 03:04:18 -04:00
[None] [feat] Add model gpt-oss (#6645)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>

amitz-nv | 85af62184b | 2025-08-07 09:05:36 +03:00
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>

Netanel Haber | 83ee91e17b | 2025-08-06 14:18:09 +08:00
[None][fix] Fix 6522 mpi.pkl5.intracomm.Request has wait not Wait (#6646)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>

yunruis | 3ff4f503ad | 2025-08-06 09:38:02 +08:00
[None][opt] ADP schedule balance optimization (#6061)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

ixlmar | 1ebceb790d | 2025-08-05 18:27:43 +01:00
[TRTLLM-5508][feat] check input tokens + improve error handling (#5170)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>

liji-nv | dcbfa7e509 | 2025-08-05 10:31:29 -04:00
[https://nvbugs/5252313][fix] Fix torch compile + MTP (#6554)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Venky | 61da2daeb4 | 2025-08-05 07:14:24 -07:00
[TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system (#6464)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

danielafrimi | ed801ff74b | 2025-08-05 04:18:25 -04:00
[None][fix] Remove expand configuration from mamba2 mixer (#6521)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>

Olya Kozlova | 13cc1c4878 | 2025-08-04 14:08:06 +02:00
[TRTLLM-5271][feat] best_of/n for pytorch workflow (#5997)
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>

Chuang Zhu | 542f552d0b | 2025-08-03 13:32:55 -04:00
use cudaSetDevice to create context, fix nvbug 5394497 (#6403)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

Shunkangz | 67a3fd858b | 2025-08-01 20:38:01 -04:00
[None][feat] Add support of scheduling attention dp request (#6246)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

liji-nv | 1daa8c3232 | 2025-08-01 07:38:06 -04:00
[https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… (#6355)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Robin Kobus | d3c14682f0 | 2025-08-01 00:43:03 -04:00
refactor: Remove unused buffers and bindings from sampler (#6484)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Jaedeok Kim | fbee279909 | 2025-07-31 22:34:34 -04:00
fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>