Anthony Chang
86cfb3ea7e
[None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound ( #9025 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-11-17 10:04:29 +08:00
sunnyqgg
7862b15a65
[TRTLLM-8778][feat] Add tree attention support for Blackwell arch ( #8975 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-11-17 09:01:53 +08:00
heyuhhh
f07e9977c6
[None] [feat] Use triton kernels for RocketKV prediction module ( #8682 )
...
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-11-13 18:51:09 -08:00
Neta Zmora
34dc6869f3
[ #8732 ][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 ( #9011 )
...
Update the TRTLLM Cutlass MoE kernels with ReLU2 activation.
Nemotron-6 requires the ReLU2 (i.e. squared ReLU) MoE activation function.
The PR adds this and, more generally, adds an API for setting the activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954 .
The PR also switches the Auto Deploy MoE backend for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).
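For reference, a minimal PyTorch sketch of the ReLU2 (squared ReLU) activation described above; the `relu2` helper here is illustrative only, not the TRTLLM/Cutlass kernel or its API.

```python
# Illustrative sketch of ReLU2 (squared ReLU), assuming relu2(x) = relu(x)**2.
# Not the TRTLLM/Cutlass MoE kernel; just the activation's reference semantics.
import torch
import torch.nn.functional as F


def relu2(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: 0 for x <= 0, x**2 for x > 0."""
    return F.relu(x) ** 2


if __name__ == "__main__":
    x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu2(x))  # tensor([0.0000, 0.0000, 0.0000, 0.2500, 4.0000])
```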
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
Jiagan Cheng
1a56722697
[None][fix] Remove unnecessary attention workspace memory check ( #9064 )
...
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-11-12 11:18:50 +08:00
ChristinaZ
2e7769d1e8
[None][feat] Add customized topk and related unit tests for DSA ( #8882 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-11-10 03:35:35 -08:00
Chang Liu
7081f254cf
[None][perf] Add custom indexer k cache scatter op ( #8960 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-11-07 11:24:26 -08:00
yunruis
51545560da
[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation ( #8495 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-11-06 17:39:57 +08:00
Shiyu Li
eeb56c2848
[None][feat] MNNVLAllreduce Kernel Refactor ( #8018 )
...
Signed-off-by: Shiyu Li <timlee0212@outlook.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-11-05 08:49:47 +08:00
Bo Li
e4bf29bc66
[None][feat] Integrate MnnvlThroughput into TRTLLM MoE. ( #8728 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-04 21:36:29 +08:00
CarstyYou
4296c9553d
[TRTLLM-1234][feat] Add fp8 blockscaled Gemm for sm120 ( #8844 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-11-04 18:10:36 +08:00
Yukun He
2225745782
[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising ( #7870 )
...
We have seen performance regressions on A100/H100 from using a one-shot kernel instead of NCCL, so a solid allreduce-op benchmark, together with analysis of the data it collects, is beneficial.
Implemented a new AllreduceOp heuristic (see the sketch after this list):
- Added a linear-programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code-generation script.
Minor AllreduceOp fixes:
- Fixed an issue where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel performance issue.
- Cleaned up the dispatching code in AllreduceOp.
This PR fixes the performance gaps reported in:
https://nvbugspro.nvidia.com/bug/5517023
For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512.
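To make the LUT-based heuristic concrete, here is a minimal hypothetical sketch of picking an allreduce strategy from message size and world size. The thresholds, the `_LUT` table, and `pick_strategy` are assumptions for illustration, not the actual TensorRT-LLM implementation or its generated lookup table.

```python
# Hypothetical sketch of a LUT-based allreduce strategy heuristic.
# Thresholds and names are illustrative assumptions, not TensorRT-LLM code.
from enum import Enum


class AllReduceStrategy(Enum):
    ONESHOT = "oneshot"
    TWOSHOT = "twoshot"
    NCCL = "nccl"


# Per world size: (message-size threshold in bytes, strategy to use below it).
_LUT = {
    2: [(512 * 1024, AllReduceStrategy.ONESHOT), (8 * 1024 * 1024, AllReduceStrategy.TWOSHOT)],
    4: [(256 * 1024, AllReduceStrategy.ONESHOT), (4 * 1024 * 1024, AllReduceStrategy.TWOSHOT)],
    8: [(128 * 1024, AllReduceStrategy.ONESHOT), (2 * 1024 * 1024, AllReduceStrategy.TWOSHOT)],
}


def pick_strategy(message_bytes: int, world_size: int) -> AllReduceStrategy:
    """Return the first LUT entry whose threshold exceeds the message size;
    fall back to NCCL for large messages or unknown world sizes."""
    for threshold, strategy in _LUT.get(world_size, []):
        if message_bytes < threshold:
            return strategy
    return AllReduceStrategy.NCCL


if __name__ == "__main__":
    print(pick_strategy(64 * 1024, 8))        # AllReduceStrategy.ONESHOT
    print(pick_strategy(1024 * 1024, 8))      # AllReduceStrategy.TWOSHOT
    print(pick_strategy(64 * 1024 * 1024, 8)) # AllReduceStrategy.NCCL
```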
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-04 16:42:31 +08:00
Matthias Jouanneaux
d0f107e4dd
[TRTLLM-5966][feat] Helix: add full MLA support for Helix ( #8104 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-11-04 09:06:58 +08:00
Bo Li
4c5a8f4ec6
[None][fix] Rename: slot_count -> invalid_expert_id ( #8783 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-11-01 21:36:59 +08:00
Fanrong Li
f0dc746738
[TRTLLM-8541][feat] Add trtllm-gen sparse MLA kernels to support per-Tensor FP8 KV Cache ( #8692 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-31 14:38:31 -07:00
Anthony Chang
f666ad2f6b
[None][feat] Autotuner can iterate through all tactics for test purposes ( #8663 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-30 13:11:25 +01:00
Chang Liu
5f737b8dbe
[None][perf] Use fp8 quant kernel in DS3.2 indexer module ( #8701 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-10-29 12:45:09 +08:00
Cheng Hang
15c293a90b
[None][feat] Enable nvfp4 cuda core for sm120 ( #8620 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-10-29 12:39:03 +08:00
Bo Li
9c4432f8a4
[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. ( #7499 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-27 13:23:06 -04:00
nvxuanyuc
d1398c05e6
[None][feat] Support ignored prompt length for penalties via new sampling config parameter ( #8127 )
...
Signed-off-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-10-27 13:12:31 -04:00
Jinyang Yuan
0a0f93d4a8
[None][fix] Fix the performance issue of FP8 blockwise grouped GEMM when using attention DP ( #8501 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-10-27 10:18:19 +08:00
Shijie
928247a3f9
[ https://nvbugs/5451205 ][feat] Add cuBLASLt NVFP4 GEMM backend support ( #7943 )
...
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
2025-10-23 15:55:10 +08:00
Anthony Chang
8a3b870e09
[None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN ( #8156 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-10-23 09:14:18 +08:00
ChristinaZ
c8b9998acb
[TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) ( #7761 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-10-20 10:08:31 +08:00
Fanrong Li
0d20a8fd61
[TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support ( #8086 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
Co-authored-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
2025-10-14 08:23:16 -07:00
Jonas Li
76a47c7bef
[None][fix] Enable FP8 ContextMLA on GB300 ( #8080 )
...
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
2025-10-10 10:20:46 +08:00
dongfengy
9f2a3ae88c
[None][fix] Restrict tinygemm use to certain SMs ( #8182 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
2025-10-08 17:55:57 -07:00
Faraz
27a5091fcb
[None][feat] GPT-OSS Sm120/Sm121 Support ( #7937 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Vincent Huang <vincenth@nvidia.com>
2025-10-06 16:59:06 -04:00
Jonas Yang CN
88ea2c4ee9
[TRTLLM-7349][feat] Adding a new orchestrator type -- Ray ( #7520 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-10-04 08:12:24 +08:00
Nikita Korobov
9b3d7cc3e6
[None][feat] Update TRT-LLM Gen MoE kernels ( #7970 )
...
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-10-03 09:22:45 +08:00
dongfengy
6568e565db
[TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss ( #7916 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-10-02 10:47:04 -07:00
Jhao-Ting Chen
c33f43e13a
[ https://nvbugs/5518713 ][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) ( #7856 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-09-26 14:29:32 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op ( #6815 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine ( #7927 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels ( #7821 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
HuiGao-NV
29e63d3bc2
[ https://nvbugs/5532248 ][fix] Fix fused_moe OOM ( #7931 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 02:22:38 -07:00
xiweny
276d83c898
[ https://nvbugs/5532225 ] [fix] MoE use stream-dependent workspace ( #7940 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-24 14:44:27 +08:00
ChristinaZ
be576a3152
[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend ( #6794 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 08:24:21 +08:00
Matthias Jouanneaux
1be7faef37
[TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels ( #6904 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-19 20:55:32 +08:00
Yuxian Qiu
d6ebcf7c4a
[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) ( #7610 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 09:40:49 +08:00
Matthias Jouanneaux
022d77807d
[TRTLLM-5966][feat] Helix: make softmax stats pointer available to attention gen ( #6865 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
2025-09-18 05:01:24 +08:00
xiweny
c076a02b38
[TRTLLM-4629] [feat] Add support for CUDA 13 and sm103 devices ( #7568 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
jmydurant
7deefb3d2b
[TRTLLM-7192][feat] optimize MLA chunked prefill and support FP8 MLA chunked prefill ( #7477 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-09-15 21:43:49 +08:00
Perkz Zheng
1b29c2e731
[None][feat] support gpt-oss with fp8 kv cache ( #7612 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-09-15 02:17:37 +08:00
NVJiangShao
cc7593987b
[ https://nvbugs/5434424 ][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. ( #7615 )
...
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-09-09 08:58:15 -04:00
xiweny
0fdc6c7278
[TRTLLM-4629] [feat] trtllm-gen kernels support sm103 ( #7570 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-07 10:04:10 +08:00
sychen52
98a1bffb7c
[OMNIML-2336][feat] Add NVFP4 x FP8 ( #6809 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-04 09:03:38 -07:00
Enwei Zhu
1745102e72
[TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec ( #7481 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-04 23:30:14 +08:00
Yilin Fan
261ffacfa4
[ https://nvbugs/5412562 ][feat] Allocate MoE workspace only when necessary (release/1.0 retargeted) ( #6955 )
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Tian Zheng
e257cb3533
[None][feat] Support NVFP4 KV Cache ( #6244 )
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 09:24:52 +08:00
Zongfei Jing
53163bf1df
[TRTLLM-6876][feat] Add low precision all2all for mnnvl ( #7155 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-28 18:26:16 +08:00
Jin Li
028235404b
[TRTLLM-6633][feat] Padding for piecewise cudagraph ( #6750 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-26 18:31:33 -04:00
Void
040f4c70d3
[None][perf] Accelerate global scale calculations for deepEP fp4 combine ( #7126 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-08-27 00:13:13 +08:00
Bo Li
bf1b958f1a
[TRTLLM-7319][perf] Fuse slicing into MoE. ( #6728 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-25 16:52:30 -04:00
dongxuy04
19a0ea363b
[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP ( #6973 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Dongxu Yang <dongxuy@nvidia.com>
Co-authored-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-08-24 08:15:29 -04:00
Daniel Stokes
f7c597ec40
[None][perf] Make finalize fusion part of the tactic selection logic ( #6915 )
...
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
2025-08-21 14:08:03 -07:00
Yuan Tong
90bfc8cc29
[ https://nvbugs/5453827 ][fix] Fix RPATH of th_common shared library to find pip-installed NCCL ( #6984 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-21 17:58:30 +08:00
ChristinaZ
c7269ea93a
[ https://nvbugs/5392414 ] [fix] Add customized default routing method ( #6818 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-08-21 16:58:41 +08:00
zhhuang-nv
7e135d2ea7
[None][feat] Use Separate QKV Input Layout for Context MLA ( #6538 )
...
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-08-19 22:04:48 +08:00
amitz-nv
a54c53652b
[TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call ( #6968 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-19 15:39:56 +03:00
bhsueh_NV
85cbd0263b
[None][feat] Support Yarn on Qwen3 ( #6785 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-17 07:21:29 +08:00
Yuening Li
1f8ae2b2db
[TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow ( #6629 )
...
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-08-15 17:15:49 -04:00
Liao Lanyu
f7c13a4aa7
[TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp ( #6745 )
...
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-08-12 16:45:16 +08:00
Sergey Klevtsov
27fc35175e
[None][feat] CUTLASS MoE FC2+Finalize fusion ( #3294 )
...
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-12 15:56:48 +08:00
NVJiangShao
2f2f5cc72c
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE ( #6231 )
...
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-08 11:13:42 +08:00
Iman Tabrizian
82276167e6
[None][feat] Add NCCL Symmetric Integration for All Reduce ( #4500 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-07 17:28:14 -07:00
peaceh-nv
8ec3b1de10
[None][feat] Add FP8 context MLA support for SM120 ( #6059 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-08-07 16:16:34 +08:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss ( #6645 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
Haohang Huang
c9eebcb454
[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec ( #6379 )
...
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Signed-off-by: symphonylyh <31998628+symphonylyh@users.noreply.github.com>
2025-08-05 07:47:41 +00:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory ( #5034 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Jhao-Ting Chen
6edaa23c1c
[None][feat] Multi-block mode for Hopper spec dec XQA kernel ( #4416 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-08-03 14:31:33 -07:00
Yang Li
ac23f4a80d
[TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops ( #6515 )
...
Signed-off-by: Yang Li <56944310+yali-arch@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-08-01 15:59:09 +08:00
liji-nv
e07fff4f78
[ https://nvbugs/5340941 ] - fix: Correct custom ops used by Qwen3 Moe … ( #6285 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-25 14:49:45 +08:00
Zhenhua Wang
62298bc473
perf: customize cuBLASLt algo for Llama 3.3 70B TP4 ( #6315 )
...
Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
2025-07-24 23:01:15 +08:00
Perkz Zheng
2193ad3aac
[ https://nvbugs/5387771 ] fix deadlocks due to insufficient numSemaphores ( #6262 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-07-23 11:20:55 +08:00
Yuening Li
e8c068b4b1
[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow ( #5850 )
...
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-07-21 15:17:35 +08:00
danielafrimi
5300a99bd8
W4A8 GEMM ( #6005 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
bhsueh_NV
2e14c8f443
[Fix][Chore][Qwen3] fix a bug when using fp4 on sm120 ( #6065 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-20 10:25:25 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler ( #6000 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Jinyang Yuan
e761231c0b
[fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop ( #6053 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-16 00:25:32 +09:00
Yi Zhang
9cc4e5d50e
[nvbugs/5336321][fix] Enable attention dp = False test case, fix TRTLLM Gen MoE workspace allocation ( #5463 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access ( #5676 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
Jinyang Yuan
8b9a030a5c
[fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr ( #5900 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-10 20:07:32 +09:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
QI JUN
e289a98d5a
avoid nesting NCCL group in allgather and reduce scatter OPs ( #5866 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-10 12:32:59 +09:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741] Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner ( #5764 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
JieXin Liang
664bf95892
[fix] improve fp4_block_scale_moe_runner type check ( #5681 )
...
Signed-off-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
2025-07-08 14:32:14 +09:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow ( #5615 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch ( #5535 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
Jhao-Ting Chen
77082cde38
[ https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op ( #5146 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
Void
7992869798
perf: better heuristic for allreduce ( #5432 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-01 22:56:06 -04:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp ( #5086 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM ( #4232 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
Li Min
16fc99391f
refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code ( #5557 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-30 08:48:04 -07:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare ( #5570 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
Enwei Zhu
b4dab23e7b
[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP ( #5435 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-30 01:02:07 +08:00
Li Min
6021a439ab
Make moe permute and finalize custom ops ( #5412 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Daniel Stokes
5773cfdcf2
feat: Add support for per expert activation scaling factors ( #5013 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-28 09:10:35 +12:00