xavier-nvidia
200ea9ee81
fix TMA error with GEMM+AR on TP=2 ( #6075 )
...
Signed-off-by: Xavier Simmons <xsimmons@nvidia.com>
2025-07-18 10:26:08 +08:00
yifeizhang-c
0155e7a3a1
[TRTLLM-6368] Update DeepEP dispatch API ( #6037 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-07-18 10:13:31 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Daniel Stokes
ae28b3a664
feat: Add support for benchmarking individual gemms in MOE benchmark ( #6080 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-18 09:00:12 +12:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler ( #6000 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore: [BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
ChristinaZ
7e033c392e
Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend ( #5919 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-17 12:38:29 +08:00
Shiyu Li
6e1aee6fd6
[fix] Performance Optimization for MNNVL TwoShot Kernel ( #5934 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-07-17 10:49:51 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations ( #5868 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
qsang-nv
8ef8e73002
update spec_dec ( #6079 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-07-16 17:50:43 +08:00
Tomer Shmilovich
0552a02943
BlockManager copy constructor fix ( #5982 )
...
Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>
2025-07-16 17:33:17 +08:00
Bo Deng
ec3ebae43e
[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 ( #5991 )
...
Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-16 13:54:42 +08:00
Zheng Duan
38db4bc7fb
feat: use session abstraction in data transceiver and cache formatter ( #5611 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-16 13:52:44 +08:00
Jinyang Yuan
e761231c0b
[fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop ( #6053 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-16 00:25:32 +09:00
Daniel Stokes
dd2491f47d
fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse ( #4135 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-15 13:40:42 +12:00
Daniel Stokes
f277afdd93
perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend ( #5986 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-14 14:04:15 -07:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
Perkz Zheng
4a0b7a0cf1
[ https://nvbugspro.nvidia.com/bug/5355054 ] fall back to cubins for fp8 fmha kernels on Ada ( #5779 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yi Zhang
9cc4e5d50e
[nvbugs/5336321][fix] Enable attention dp = False test case; fix TRTLLM Gen Moe workspace allocation ( #5463 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access ( #5676 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
dongxuy04
c04570a506
Use huge-page mapping for host-accessible memory on GB200 ( #5963 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-07-14 16:11:04 +08:00
Enwei Zhu
ed77ef2ff4
fix: Fix MoE benchmark ( #5966 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-14 15:17:26 +09:00
Yuan Tong
a36ac45c4d
fix: fast redux detection in trtllm gen routing kernel ( #5941 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-13 16:35:07 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
ChristinaZ
c5fb692a7d
Refactor the remaining routing logic for the routing kernels in the MoE TRT-LLM backend ( #5771 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
Zhihan Jiang
682acd40da
[nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. ( #5698 ) ( #5925 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-11 07:51:43 +08:00
Linda
4d071eb2d1
feat: binding type build argument (pybind, nanobind) ( #5802 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-11 00:48:50 +09:00
narutolhy
41ef1ade19
feat: enable KV cache reuse during request generation ( #4028 )
...
Signed-off-by: narutolhy <582909902@qq.com>
2025-07-10 22:18:01 +09:00
Jinyang Yuan
8b9a030a5c
[fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr ( #5900 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-10 20:07:32 +09:00
CarstyYou
dc32f9ae73
[fix] Fix cases where tileN % 16 != 0 & support sm89 deepgemm bmm ( #5531 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
QI JUN
e289a98d5a
avoid nesting NCCL groups in all-gather and reduce-scatter OPs ( #5866 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-10 12:32:59 +09:00
peaceh-nv
76c3a12bcb
[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 ( #5636 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-10 09:20:30 +08:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741] Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
peaceh-nv
52684d79f7
Fix: fix moe regression for sm120 ( #5823 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-09 21:25:11 +08:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner ( #5764 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
Jhao-Ting Chen
e4c777df7d
Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) ( #5813 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-09 09:26:27 +08:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on Blackwell ( #5563 )
...
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat] Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
The initial CI run failed a single step (A30-CPP-3) due to a timeout; rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
Tailing Yuan
ba0aea1da6
Fix a quote error introduced in #5534 ( #5816 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-08 18:48:32 +08:00
xiweny
eaf8bec88b
fix: Disaggregate serving with attention DP ( #4993 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-07-08 16:15:03 +08:00
JieXin Liang
664bf95892
[fix] improve fp4_block_scale_moe_runner type check ( #5681 )
...
Signed-off-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
2025-07-08 14:32:14 +09:00
davidclark-nv
a1235ee978
[feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces ( #5743 )
...
Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-07-07 13:34:55 -07:00
Tailing Yuan
85b4a6808d
Refactor: move DeepEP from Docker images to wheel building ( #5534 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-07 22:57:03 +09:00
Daniel Cámpora
1260e2f33f
feat: Optimize TRTLLM Sampler perf for single beam, single step ( #5550 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-07-07 15:44:47 +02:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow ( #5615 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
ChristinaZ
12d8c7d129
Refactor the topk parallelization part for the routing kernels ( #5567 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-07 15:53:25 +08:00
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch ( #5535 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
Robin Kobus
ae27261094
refactor: decoding inputs ( #5679 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-06 08:21:02 +02:00
Julien Debache
6bddaf6df6
chore: Improve documentation of Kv_block_array ( #5765 )
...
Signed-off-by: Julien Debache <julien.debache@hotmail.com>
2025-07-05 22:25:27 +02:00
jthomson04
1b588f8390
feat: KV events for sliding window attention ( #5580 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-07-05 06:05:20 +08:00
Stefan Niebler
d1112aac37
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow ( #5333 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-05 01:35:13 +09:00
Chuang Zhu
ffc0b8f5da
Cache transceiver: support VSWA ( #5505 )
...
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-05 01:18:42 +09:00
Faraz
81c0764012
Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120" ( #5724 )
...
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
2025-07-04 16:53:20 +09:00
Robin Kobus
07f9cf1519
fix: Improve chunking test and skip empty kernel calls ( #5710 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-04 09:08:15 +02:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation ( #5476 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Netanel Haber
134b2383ff
[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window ( #5720 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-04 08:16:25 +02:00
Robin Kobus
1a3bd140ed
chore: Remove unused isFullContextRequest method ( #5666 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-03 15:08:09 +02:00
WeiHaocheng
dccbfc8b1e
fix: Set init value for moe expert id ( #5660 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-07-03 07:05:31 -04:00
Jhao-Ting Chen
77082cde38
[ https://nvbugspro.nvidia.com/bug/5329655 ] [feat] PyTorch path: add spec dec param to attention op ( #5146 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
Robin Kobus
4cd8543d8c
[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest ( #5489 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-02 10:13:31 +02:00
qixiang-99
ca7b6ec8d8
feat: PyTorch VSWA KVCacheManager ( #5151 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Xiaowei Wang
32dfdfba30
feat: fuse w4a8 moe pre-quant scale on Hopper ( #5613 )
...
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2025-07-01 23:02:41 -04:00
Void
7992869798
perf: better heuristic for allreduce ( #5432 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-01 22:56:06 -04:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp ( #5086 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
Robin Kobus
d68fa728d8
refactor: Clean up DecodingInput and DecodingOutput ( #5617 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 14:31:42 +02:00
Yan Chunwei
a5eff139f1
[TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) ( #5431 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-01 19:06:41 +08:00
杨凯旋
61c5a53642
[ #5403 ][perf] Conditionally enable SWAP AB for speculative decoding ( #5404 )
...
Signed-off-by: zoheth <z0heth@outlook.com>
Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-07-01 18:32:37 +08:00
Robin Kobus
5f77d212ef
test: Reduce number of C++ test cases ( #5437 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 09:40:49 +02:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM ( #4232 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
Li Min
16fc99391f
refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code ( #5557 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-30 08:48:04 -07:00
Robin Kobus
9bdc5951f8
refactor: decoder state setup ( #5093 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-30 11:09:43 +02:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare ( #5570 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
Cheng Hang
64db7d27f6
[feat] Optimizations on weight-only batched gemv kernel ( #5420 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-06-30 10:20:16 +08:00
Enwei Zhu
b4dab23e7b
[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP ( #5435 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-30 01:02:07 +08:00
Li Min
6021a439ab
Make moe permute and finalize custom ops ( #5412 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Daniel Stokes
5773cfdcf2
feat: Add support for per expert activation scaling factors ( #5013 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-28 09:10:35 +12:00
Darragh Hanley
5437075def
ReDrafter support for Qwen ( #4875 )
...
Signed-off-by: darraghdog <darragh.hanley@gmail.com>
Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
2025-06-28 02:33:10 +08:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 ( #5316 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
peaceh-nv
cb58073ab7
Fix: fix build for sm120 ( #5265 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-06-27 20:42:47 +08:00
ChristinaZ
a608b00d38
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) ( #5519 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-27 20:17:40 +08:00
Daniel Stokes
83a1f60556
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch ( #5410 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-27 12:29:34 +08:00
Tailing Yuan
ef43b95aa1
Fix execute_process: check results using EQUAL ( #5481 )
2025-06-27 11:57:04 +08:00
Anthony Chang
de7cd0de05
fix: MoE autotune fallback failed to query default heuristic ( #5520 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-26 17:28:48 +01:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) ( #5475 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Yao Yao
0788c5d0d6
[perf] improve XQA-MLA perf ( #5468 )
...
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-06-26 18:09:13 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf ( #5318 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
Alessio Netti
7e681fbe52
[chore] Allow configuring linking of NVRTC wrapper ( #5189 )
...
Signed-off-by: Alessio Netti <netti.alessio@gmail.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 07:26:10 +02:00
dongxuy04
490d2e5819
feat: large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) ( #5226 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation ( #5222 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
qsang-nv
e9cd810071
keep sm90 headsize 128 cubins ( #5320 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-26 12:14:01 +08:00
ChristinaZ
d135f5993d
Add unit test for routing kernels ( #5405 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-26 09:49:11 +08:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) ( #4651 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Perkz Zheng
1f292ff2a0
[ https://jirasw.nvidia.com/browse/TRTLLM-4645 ] support multiCtasKvMode for high-throughput MLA kernels ( #5426 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-25 16:31:10 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding ( #5348 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state ( #5315 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams ( #5165 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
dongxuy04
4f0f17ac8a
feat: Misc optimizations for large-scale EP ( #5374 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla ( #4762 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Kaiyu Xie
113f6fbadd
Fix: missing clientId when serializing and deserializing response ( #5231 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 23:05:11 +08:00
Fanrong Li
c7af650d5a
Fix: fix the deterministic issue in the MTP Eagle path ( #5285 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-19 18:08:40 +08:00
yunruis
b3e886074e
Fix CI build time increase ( #5337 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-06-19 13:49:42 +08:00
jellysnack
0623ffe3bc
feat: Add LLGuidance Support for PyTorch Backend ( #5214 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-18 19:33:34 +08:00
Bo Li
d76bda7f2c
chore: Refine printed info of CHECK_TYPE. ( #5295 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-18 15:35:41 +08:00
Yukun He
6711ad9cf3
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. ( #5139 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-18 14:33:46 +08:00
Yao Yao
908463a5f5
[feat]: improve performance of XQA-MLA for sm120 ( #5087 )
...
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-06-18 14:19:22 +08:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management ( #4450 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
Emma Qiao
ff32caf4d7
[Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 ( #4885 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-17 23:48:34 +08:00
qsang-nv
5236bb9084
delete cubins ( #5274 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-17 22:10:49 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner ( #5207 )
...
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
amirkl94
8451a87742
chore: Mass integration of release/0.20 ( #5082 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 14:32:02 +03:00
liji-nv
13eef642e6
[feat] Piecewise cuda graph support for MLA ( #4467 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-17 18:58:38 +08:00
Robin Kobus
dc3861b4aa
refactor: Unify decoder test with e2e workflow ( #5239 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-17 12:04:58 +02:00
qsang-nv
faca19c2f0
update setup.py for special cases ( #5227 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-17 16:41:07 +08:00
qsang-nv
134cb66a53
fix mla test ( #5240 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-17 15:26:25 +08:00
Enwei Zhu
4b82b8b4c7
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP ( #5215 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-17 15:23:24 +08:00
Tracin
a2e8ae1120
Update internal cutlass commit. ( #5228 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-17 10:47:45 +08:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface ( #5129 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
Anthony Chang
4f9fa9f21d
feat: MoE trtllm backend kernel update ( #5183 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-16 14:46:13 +08:00
Chuang Zhu
1d2b0d3d80
use a file lock to avoid port conflicts ( #5123 )
2025-06-16 14:15:37 +08:00
Robin Kobus
dda64166cd
refactor: Scheduling based on KV cache state ( #4865 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 08:14:58 +02:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. ( #4867 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
qsang-nv
5a01ba5260
use cu for fmha_v2 ( #4694 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-15 18:40:44 +08:00
Aurelien Chartier
1389f5a4d3
feat: Add support for fp8 rowwise quantization ( #4876 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>
2025-06-14 06:37:48 -07:00
Robin Kobus
443b2eb51f
refactor: Speculative decoding buffers ( #5091 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-14 11:39:32 +02:00
yunruis
b99c5ce8c1
Feat: DS-R1 min-latency opt round 3: add router GEMM, fused-A GEMM, PDL ( #4560 )
...
Signed-off-by: yunruis <yunruis@nvidia.com>
Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-14 17:36:22 +08:00
dongxuy04
97657bfda2
optimize memset before alltoall communication ( #5188 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-14 10:49:47 +08:00
Perkz Zheng
3d87770e15
[ https://nvbugspro.nvidia.com/bug/5295470 ] support headDim 256 for Blackwell fmha kernels ( #5164 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-13 23:01:01 +08:00
Chuang Zhu
8e9937081d
ucxx: only use ucp_feature_tag to avoid issues on some platforms ( #4994 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-13 19:14:25 +08:00
yunruis
e5be3a95b3
fix: fix license bug ( #5200 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-06-13 18:58:15 +08:00
yunruis
30c5b4183a
refactoring: port customized kernels to the public cutlass version ( #5027 )
...
Signed-off-by: yunruis
Merged to unblock others since the full CI has run through.
2025-06-13 16:19:31 +08:00
Yao Yao
12e075eb70
[nvbug 5333996][fix] Unload XQA cubins early to avoid static lifetime issues ( #5133 )
...
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-06-13 15:53:29 +08:00
Matthias Jouanneaux
514baf1287
[fix] Fix comment to pass guardwords check ( #5191 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-06-13 15:49:59 +08:00
zhhuang-nv
a891013e3c
[feat] Optimize KV Cache Reuse for MLA ( #4869 )
...
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-06-13 11:03:05 +08:00
Matthias Jouanneaux
a0b6c635b1
[feat] trtllmGen MoE routing: added support for top groups and top K bounds ( #4063 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-06-13 06:00:02 +08:00
Xiaodong (Vincent) Huang
cc2a1344be
None: fix OOM because of unnecessary mha workspace ( #5056 )
...
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
2025-06-12 21:56:05 +02:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache ( #4804 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA ( #4667 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
HuiGao-NV
43192379af
Use a backend option instead of a macro to control enablement of MNNVL all-reduce ( #4635 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
Zheng Duan
ee44fa00f8
chore: rename IOFormatter to BaseCacheFormatter ( #5068 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-12 10:50:14 +08:00
Bo Li
1b79041f5d
fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. ( #4264 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-11 09:38:10 +08:00
Tracin
6c91f1c7ac
Mxfp8xmxfp4 quant mode ( #4978 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
Zongfei Jing
6d1f2d0fd7
[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion ( #4756 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-06-10 19:55:16 +08:00
Aurelien Chartier
dcf72c6ad3
chore: clean up GDS CMake interface ( #4928 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-10 17:25:43 +08:00
dongxuy04
7137cc8f67
fix cuda driver link issue with driver version less than 12.3 ( #5025 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-10 15:27:39 +08:00