Commit Graph

511 Commits

danielafrimi
5300a99bd8
W4A8 GEMM (#6005)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
amitz-nv
98428f330e
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-20 08:00:14 +03:00
Martin Marciniszyn Mehringer
943fd418dd
fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings (#6189)
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-07-20 10:38:51 +08:00
bhsueh_NV
2e14c8f443
[Fix][Chore][Qwen3] fix bug of using fp4 on sm120 (#6065)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-20 10:25:25 +08:00
Void
118307c224
DeepEP LL support variable hidden size and tokens num (#6141)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-20 09:32:41 +08:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support (#6133)
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
Bo Deng
0388ff9083
[https://nvbugs/5393961][fix] record kv-cache size in MLACacheFormatter (#6181)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-19 05:06:45 +08:00
Stefan Niebler
d475c97c82
[nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound (#5926)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:54:51 +08:00
Stefan Niebler
6d7874a467
[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:40:46 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
QI JUN
a95f31e72a
chore: add more log in FmhaDispatcher (#6170)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-18 16:53:02 +08:00
xavier-nvidia
200ea9ee81
fix TMA error with GEMM+AR on TP=2 (#6075)
Signed-off-by: Xavier Simmons <xsimmons@nvidia.com>
2025-07-18 10:26:08 +08:00
yifeizhang-c
0155e7a3a1
[TRTLLM-6368] Update deepep dispatch API (#6037)
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-07-18 10:13:31 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Daniel Stokes
ae28b3a664
feat: Add support for benchmarking individual gemms in MOE benchmark (#6080)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-18 09:00:12 +12:00
Linda
5bff317abf
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler (#6000)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
ChristinaZ
7e033c392e
Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-17 12:38:29 +08:00
Shiyu Li
6e1aee6fd6
[fix] Performance Optimization for MNNVL TwoShot Kernel (#5934)
Signed-off-by: Shiyu Li <shili@nvidia.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-07-17 10:49:51 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations (#5868)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
qsang-nv
8ef8e73002
update spec_dec (#6079)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-07-16 17:50:43 +08:00
Tomer Shmilovich
0552a02943
BlockManager copy constructor fix (#5982)
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
2025-07-16 17:33:17 +08:00
Bo Deng
ec3ebae43e
[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 (#5991)
Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-16 13:54:42 +08:00
Zheng Duan
38db4bc7fb
feat: use session abstraction in data transceiver and cache formatter (#5611)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-16 13:52:44 +08:00
Jinyang Yuan
e761231c0b
[fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-16 00:25:32 +09:00
Daniel Stokes
dd2491f47d
fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-15 13:40:42 +12:00
Daniel Stokes
f277afdd93
perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-14 14:04:15 -07:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots (#3502)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
Perkz Zheng
4a0b7a0cf1
[https://nvbugspro.nvidia.com/bug/5355054] fallback to cubins for fp8 fmha kernels on Ada. (#5779)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yi Zhang
9cc4e5d50e
[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation (#5463)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access (#5676)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
dongxuy04
c04570a506
Use huge page mapping for host accessible memory on GB200 (#5963)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-07-14 16:11:04 +08:00
Enwei Zhu
ed77ef2ff4
fix: Fix MoE benchmark (#5966)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-14 15:17:26 +09:00
Yuan Tong
a36ac45c4d
fix: fast redux detection in trtllm gen routing kernel (#5941)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-13 16:35:07 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
ChristinaZ
c5fb692a7d
Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
Zhihan Jiang
682acd40da
[nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-11 07:51:43 +08:00
Linda
4d071eb2d1
feat: binding type build argument (pybind, nanobind) (#5802)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-11 00:48:50 +09:00
narutolhy
41ef1ade19
feat:enable kvcache to be reused during request generation (#4028)
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
Jinyang Yuan
8b9a030a5c
[fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-10 20:07:32 +09:00
CarstyYou
dc32f9ae73
[fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
QI JUN
e289a98d5a
avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-10 12:32:59 +09:00
peaceh-nv
76c3a12bcb
[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 (#5636)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-10 09:20:30 +08:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
peaceh-nv
52684d79f7
Fix : fix moe regression for sm120 (#5823)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-09 21:25:11 +08:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
Jhao-Ting Chen
e4c777df7d
Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-09 09:26:27 +08:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell (#5563)
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00