Stefan Niebler
d475c97c82
[nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound ( #5926 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:54:51 +08:00
QI JUN
a95f31e72a
chore: add more logging in FmhaDispatcher ( #6170 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-18 16:53:02 +08:00
xavier-nvidia
200ea9ee81
fix TMA error with GEMM+AR on TP=2 ( #6075 )
...
Signed-off-by: Xavier Simmons <xsimmons@nvidia.com>
2025-07-18 10:26:08 +08:00
Daniel Stokes
ae28b3a664
feat: Add support for benchmarking individual gemms in MOE benchmark ( #6080 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-18 09:00:12 +12:00
ChristinaZ
7e033c392e
Feat: Add vectorized loading for finalize kernel in MoE TRT-LLM backend ( #5919 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-17 12:38:29 +08:00
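The finalize kernel accumulates each token's top-k expert outputs into the final hidden state, so its runtime is dominated by global loads; vectorized loads (half2/float4 instead of scalars) cut the transaction count. A minimal sketch of the idea, assuming a row-major [numTokens*topK, hidden] expert-output layout (names and layout are illustrative, not the actual TRT-LLM kernel):

```cuda
#include <cuda_fp16.h>

// Hypothetical finalize: accumulate topK expert outputs per token.
// Loading half2 instead of scalar half halves the number of
// global memory transactions per element pair.
__global__ void finalizeVectorized(const half2* expertOut, // [numTokens*topK, hidden/2]
                                   const float* scales,    // [numTokens, topK]
                                   half2* out,             // [numTokens, hidden/2]
                                   int topK, int hidden2)
{
    int token = blockIdx.x;
    for (int i = threadIdx.x; i < hidden2; i += blockDim.x)
    {
        float2 acc = make_float2(0.f, 0.f);
        for (int k = 0; k < topK; ++k)
        {
            half2 v = expertOut[(token * topK + k) * hidden2 + i]; // one 32-bit load
            float s = scales[token * topK + k];
            acc.x += s * __half2float(v.x);
            acc.y += s * __half2float(v.y);
        }
        out[token * hidden2 + i] = __floats2half2_rn(acc.x, acc.y);
    }
}
```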
Shiyu Li
6e1aee6fd6
[fix] Performance Optimization for MNNVL TwoShot Kernel ( #5934 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-07-17 10:49:51 +08:00
Daniel Stokes
f277afdd93
perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend ( #5986 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-14 14:04:15 -07:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots ( #3502 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
Perkz Zheng
4a0b7a0cf1
[https://nvbugspro.nvidia.com/bug/5355054] fallback to cubins for fp8 fmha kernels on Ada ( #5779 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yuan Tong
a36ac45c4d
fix: fast redux detection in trtllm gen routing kernel ( #5941 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-13 16:35:07 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
ChristinaZ
c5fb692a7d
Refactor the remaining routing logic for the routing kernels in the MoE TRT-LLM backend ( #5771 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-11 16:37:56 +08:00
CarstyYou
dc32f9ae73
[fix] fix tileN values not divisible by 16 & support sm89 deepgemm bmm ( #5531 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
peaceh-nv
76c3a12bcb
[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 ( #5636 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-10 09:20:30 +08:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741] Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
peaceh-nv
52684d79f7
Fix: fix moe regression for sm120 ( #5823 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-09 21:25:11 +08:00
Jhao-Ting Chen
e4c777df7d
Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) ( #5813 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-09 09:26:27 +08:00
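The failure mode here is worth spelling out: if a property that selects between kernel variants is missing from the cubin cache key, two distinct configurations collide and the wrong cubin can be returned, which is how an fp8-output Eagle3 engine could receive an fp16-output kernel. A hedged sketch of the pattern, with illustrative field names:

```cuda
#include <cstddef>
#include <functional>

// Illustrative kernel-selection key. If isFp8Output were omitted, the
// fp16-output and fp8-output variants would hash to the same cache entry.
struct XQAKernelKey
{
    int headSize;
    int numHeads;
    bool isFp8Output; // the discriminating field added by this fix (name assumed)

    bool operator==(const XQAKernelKey& o) const
    {
        return headSize == o.headSize && numHeads == o.numHeads && isFp8Output == o.isFp8Output;
    }
};

struct XQAKernelKeyHash
{
    size_t operator()(const XQAKernelKey& k) const
    {
        size_t h = std::hash<int>()(k.headSize);
        h = h * 31 + std::hash<int>()(k.numHeads);
        h = h * 31 + std::hash<bool>()(k.isFp8Output); // must participate in the hash
        return h;
    }
};
```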
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell ( #5563 )
...
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat] Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
The initial CI run failed a single step (A30-CPP-3) due to a timeout; rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
davidclark-nv
a1235ee978
[feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces ( #5743 )
...
Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-07-07 13:34:55 -07:00
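Loading a cubin into a CUmodule on every launch is costly, so a cache keyed by the kernel blob lets repeated GEMM calls reuse the already-loaded module. A minimal sketch with the CUDA driver API (the interface actually added in the PR may differ):

```cuda
#include <cuda.h>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical module cache: maps a cubin identifier to a loaded CUmodule
// so repeated launches skip cuModuleLoadData.
class ModuleCache
{
public:
    CUmodule getOrLoad(const std::string& key, const void* cubin)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        auto it = mModules.find(key);
        if (it != mModules.end())
            return it->second;         // cache hit: reuse loaded module
        CUmodule mod = nullptr;
        cuModuleLoadData(&mod, cubin); // cache miss: load once, keep for reuse
        mModules.emplace(key, mod);
        return mod;
    }

private:
    std::mutex mMutex;
    std::unordered_map<std::string, CUmodule> mModules;
};
```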
ChristinaZ
12d8c7d129
Refactor the topk parallelization part for the routing kernels ( #5567 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-07-07 15:53:25 +08:00
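Routing kernels select the top-k experts per token; parallelizing the selection across a warp replaces a serial scan with k rounds of warp-wide argmax. A simplified sketch of that pattern (one lane per expert id, ids assumed distinct), not the refactored TRT-LLM code:

```cuda
#include <cmath>

// One warp selects the top-k of 32 scores by repeating a warp-wide
// argmax (butterfly shuffle reduction) and masking out each winner.
__device__ void warpTopK(float score, int expertId, int k, int* outIds)
{
    for (int round = 0; round < k; ++round)
    {
        float best = score;
        int bestId = expertId;
        for (int offset = 16; offset > 0; offset /= 2)
        {
            float otherScore = __shfl_xor_sync(0xffffffff, best, offset);
            int otherId = __shfl_xor_sync(0xffffffff, bestId, offset);
            // Lexicographic max with a deterministic tie-break on id.
            if (otherScore > best || (otherScore == best && otherId < bestId))
            {
                best = otherScore;
                bestId = otherId;
            }
        }
        if (threadIdx.x % 32 == 0)
            outIds[round] = bestId; // lane 0 records this round's winner
        if (expertId == bestId)
            score = -INFINITY;      // winner drops out of later rounds
    }
}
```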
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch ( #5535 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
Julien Debache
6bddaf6df6
chore: Improve documentation of Kv_block_array ( #5765 )
...
Signed-off-by: Julien Debache <julien.debache@hotmail.com>
2025-07-05 22:25:27 +02:00
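Kv_block_array is TRT-LLM's handle to the paged KV cache: instead of one contiguous K/V buffer per sequence, each sequence owns fixed-size blocks reached through an indirection table, so sequences can grow without reallocation. A conceptual sketch with illustrative field names, not the documented layout:

```cuda
// Conceptual layout of a paged KV cache (illustrative, not the real struct).
struct KvBlockArraySketch
{
    int   maxSeqs;         // number of sequences addressable
    int   tokensPerBlock;  // tokens stored per block
    int   maxBlocksPerSeq; // width of the indirection table
    void* pool;            // block pool holding K and V data
    int*  blockOffsets;    // [maxSeqs, maxBlocksPerSeq] block indices

    // Logical token position -> index of the block holding it.
    __host__ __device__ int blockIndex(int seq, int token) const
    {
        return blockOffsets[seq * maxBlocksPerSeq + token / tokensPerBlock];
    }
};
```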
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation ( #5476 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
WeiHaocheng
dccbfc8b1e
fix: Set init value for moe expert id ( #5660 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-07-03 07:05:31 -04:00
Xiaowei Wang
32dfdfba30
feat: fuse w4a8 moe pre-quant scale on Hopper ( #5613 )
...
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2025-07-01 23:02:41 -04:00
Void
7992869798
perf: better heuristic for allreduce ( #5432 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-01 22:56:06 -04:00
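Custom allreduce kernels beat NCCL only in part of the (message size, world size) space, so a heuristic picks a strategy per call. A hedged sketch of the kind of decision rule involved; the thresholds below are invented for illustration:

```cuda
#include <cstddef>

enum class AllReduceStrategy { OneShot, TwoShot, NcclFallback };

// Illustrative dispatch: small messages favor one-shot (each rank reads all
// peers once), medium messages favor two-shot (reduce-scatter + allgather),
// and large ones fall back to NCCL. Thresholds are made up for the sketch.
AllReduceStrategy pickAllReduce(size_t messageBytes, int worldSize)
{
    if (worldSize <= 2 && messageBytes <= (1u << 20))
        return AllReduceStrategy::OneShot;
    if (messageBytes <= (8u << 20))
        return AllReduceStrategy::TwoShot;
    return AllReduceStrategy::NcclFallback;
}
```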
Yan Chunwei
a5eff139f1
[TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) ( #5431 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-01 19:06:41 +08:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM ( #4232 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
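W4A16 keeps weights in int4 and activations in fp16, dequantizing weights on the fly inside the GEMM main loop. A minimal sketch of the per-element dequantization step, assuming two weights packed per byte and a zero point of 8:

```cuda
#include <cuda_fp16.h>

// Illustrative W4A16 inner step: unpack two 4-bit weights from a byte,
// dequantize with a per-group scale, and multiply by fp16 activations.
__device__ float dot2W4A16(unsigned char packed, half2 act, half scale)
{
    int w0 = (packed & 0x0F) - 8;        // low nibble, zero point 8 assumed
    int w1 = ((packed >> 4) & 0x0F) - 8; // high nibble
    float s = __half2float(scale);
    return s * (w0 * __half2float(act.x) + w1 * __half2float(act.y));
}
```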
Li Min
16fc99391f
refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code ( #5557 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-30 08:48:04 -07:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare ( #5570 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
Cheng Hang
64db7d27f6
[feat] Optimizations on weight-only batched gemv kernel ( #5420 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-06-30 10:20:16 +08:00
Enwei Zhu
b4dab23e7b
[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP ( #5435 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-30 01:02:07 +08:00
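MoE dispatch groups tokens by routed expert, which is essentially a counting sort: histogram the expert ids, prefix-sum the counts into segment offsets, then scatter token indices. At large-scale EP the histogram and scan dominate, which is what such optimizations target. A simplified single-block sketch of the pattern, with counts and offsets assumed zero-initialized by the caller:

```cuda
// Counting-sort skeleton for MoE dispatch (simplified, single block).
// Phase 1: histogram of expert ids. Phase 2: exclusive prefix sum.
// Phase 3: scatter token indices into per-expert segments.
__global__ void groupTokensByExpert(const int* expertIds, int numTokens,
                                    int numExperts, int* counts,
                                    int* offsets, int* permutedTokens)
{
    for (int t = threadIdx.x; t < numTokens; t += blockDim.x)
        atomicAdd(&counts[expertIds[t]], 1);
    __syncthreads();

    if (threadIdx.x == 0) // serial scan for clarity; real kernels scan in parallel
    {
        int running = 0;
        for (int e = 0; e < numExperts; ++e)
        {
            offsets[e] = running;
            running += counts[e];
        }
    }
    __syncthreads();

    // offsets double as running cursors here, so segment starts are consumed.
    for (int t = threadIdx.x; t < numTokens; t += blockDim.x)
    {
        int slot = atomicAdd(&offsets[expertIds[t]], 1);
        permutedTokens[slot] = t;
    }
}
```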
Li Min
6021a439ab
Make moe permute and finalize custom ops ( #5412 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Daniel Stokes
5773cfdcf2
feat: Add support for per expert activation scaling factors ( #5013 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-28 09:10:35 +12:00
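Per-expert activation scaling factors let each expert quantize with its own dynamic range instead of sharing one tensor-wide scale. A hedged sketch of where such a scale would be applied; the layout and names are assumptions:

```cuda
#include <cuda_fp8.h>

// Illustrative quantization of routed activations: each token is scaled
// by the factor of the expert it was dispatched to, not a global scale.
__global__ void quantizePerExpert(const float* act,         // [numTokens, hidden]
                                  const int* tokenExpert,   // [numTokens]
                                  const float* expertScale, // [numExperts]
                                  __nv_fp8_e4m3* out, int hidden)
{
    int token = blockIdx.x;
    float s = expertScale[tokenExpert[token]];
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
        out[token * hidden + i] = __nv_fp8_e4m3(act[token * hidden + i] * s);
}
```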
peaceh-nv
cb58073ab7
Fix: fix build for sm120 ( #5265 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-06-27 20:42:47 +08:00
ChristinaZ
a608b00d38
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) ( #5519 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-27 20:17:40 +08:00
Tailing Yuan
ef43b95aa1
Fix execute_process: check results using EQUAL ( #5481 )
2025-06-27 11:57:04 +08:00
Anthony Chang
de7cd0de05
fix: MoE autotune fallback failed to query default heuristic ( #5520 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-26 17:28:48 +01:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) ( #5475 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf ( #5318 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
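Block-scaled formats such as NVFP4 carry one scale factor per small group of elements, and the tensor cores consume those factors in a hardware-specific tiled (swizzled) layout rather than row-major order; swizzle_sf/unswizzle_sf convert between the two and reswizzle_sf re-tiles. A generic sketch of a row-major-to-tiled transform (the real NVFP4 layout interleaves further), assuming dimensions divisible by the tile shape:

```cuda
// Generic scale-factor swizzle: gather a [rows, cols] row-major matrix
// into [tileR, tileC] tiles stored contiguously, tile by tile.
__global__ void swizzleSf(const unsigned char* in, unsigned char* out,
                          int rows, int cols, int tileR, int tileC)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    int r = idx / cols, c = idx % cols;
    int tile = (r / tileR) * (cols / tileC) + (c / tileC); // which tile
    int inTile = (r % tileR) * tileC + (c % tileC);        // position inside it
    out[tile * tileR * tileC + inTile] = in[idx];
}
```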
Alessio Netti
7e681fbe52
[chore] Allow configuring linking of NVRTC wrapper ( #5189 )
...
Signed-off-by: Alessio Netti <netti.alessio@gmail.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 07:26:10 +02:00
Daniel Stokes
942841417e
opensource: Opensource MOE MXFP8-MXFP4 implementation ( #5222 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-26 12:18:19 +08:00
qsang-nv
e9cd810071
keep sm90 headsize 128 cubins ( #5320 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-26 12:14:01 +08:00
ChristinaZ
d135f5993d
Add unit test for routing kernels ( #5405 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-26 09:49:11 +08:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) ( #4651 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Perkz Zheng
1f292ff2a0
[https://jirasw.nvidia.com/browse/TRTLLM-4645] support multiCtasKvMode for high-throughput MLA kernels ( #5426 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-25 16:31:10 +08:00
dongxuy04
4f0f17ac8a
feat: Miscellaneous optimizations for large-scale EP ( #5374 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix vanilla MTP ( #4762 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00