Yukun He
bed5bc9f2e
[None][chore] Wrap the swiglu into custom op to avoid redundant device copy. ( #7021 )
A redundant D2D copy is observed when enabling torch.compile for the Llama model due to the swiglu Triton kernel, which adds perf overhead. Wrap the swiglu op in a custom op to avoid this overhead.
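For illustration, a minimal sketch of the pattern under assumed names (the "sketch::swiglu" op and its body are hypothetical, not the actual TRT-LLM registration): wrapping the kernel in a torch custom op with a registered fake lets torch.compile treat the call as opaque with known output metadata, so it can plan the output buffer directly instead of inserting an extra device-to-device copy around the Triton launch.

```python
import torch
import torch.nn.functional as F

# Hypothetical op name for illustration; not the TRT-LLM registration.
@torch.library.custom_op("sketch::swiglu", mutates_args=())
def swiglu(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the Triton kernel: SwiGLU halves the last dimension
    # and computes silu(gate) * up.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

@swiglu.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype-only meta function; lets the compiler allocate the
    # output without running the real kernel.
    return x.new_empty(*x.shape[:-1], x.shape[-1] // 2)
```

Under torch.compile, swiglu(x) then lowers as a single opaque op rather than an inlined graph plus copy.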
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-27 13:02:10 +08:00
Void
040f4c70d3
[None][perf] Accelerate global scale calculations for deepEP fp4 combine ( #7126 )
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-08-27 00:13:13 +08:00
Bo Li
bf1b958f1a
[TRTLLM-7319][perf] Fuse slicing into MoE. ( #6728 )
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-25 16:52:30 -04:00
Yukun He
9c5b464fe0
[None][feat] Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time. ( #7113 )
Because deep_gemm.fp8_gemm_nt triggers many JIT compilations during the inference phase, we need to sweep these shapes ahead of time. Apply the AutoTuner framework to achieve this and retain the potential capability to tune the swap_ab flag.
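A hedged sketch of the ahead-of-time sweep (the helper and its power-of-two bucketing are assumptions; the real shape selection lives in the AutoTuner integration): launch the kernel once per tuning bucket during warmup so every JIT specialization is compiled before serving.

```python
# `run_kernel(num_tokens)` is a hypothetical stand-in for one
# fp8_block_scale_deep_gemm launch at a given M dimension.
def sweep_jit_shapes(run_kernel, max_num_tokens: int) -> list[int]:
    buckets, m = [], 1
    # Power-of-two token counts up to max_num_tokens; each launch
    # triggers (and caches) the corresponding JIT compilation.
    while m <= max_num_tokens:
        run_kernel(m)
        buckets.append(m)
        m *= 2
    return buckets
```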
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-25 10:48:31 +08:00
dongxuy04
19a0ea363b
[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP ( #6973 )
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Dongxu Yang <dongxuy@nvidia.com>
Co-authored-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-08-24 08:15:29 -04:00
Daniel Stokes
f7c597ec40
[None][perf] Make finalize fusion part of the tactic selection logic ( #6915 )
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
2025-08-21 14:08:03 -07:00
ChristinaZ
c7269ea93a
[https://nvbugs/5392414][fix] Add customized default routing method ( #6818 )
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-08-21 16:58:41 +08:00
Robin Kobus
b95cab2a7c
[None][ci] move unittests to sub-directories ( #6635 )
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-20 05:42:22 -04:00
Yi Zhang
a15af879ec
[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic ( #6615 )
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-08-19 09:58:44 +08:00
Yuening Li
1f8ae2b2db
[TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow ( #6629 )
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-08-15 17:15:49 -04:00
Yukun He
bc5f766e0e
[TRTLLM-4501][feat] AutoTuner tuning config refactor and valid tactic generalization. ( #6545 )
* Generalize the definition of tactics so that users can implement more customizable tactic types (see the sketch below), making the configurations clearer for each kernel run.
* Allow the user not to specify the `gen_tuning_buckets` or the `map_to_tuning_buckets` function.
* Other code refactoring.
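A sketch of the generalized-tactic idea under assumed names (GemmTactic and the TuningConfig fields here are hypothetical simplifications of the TRT-LLM classes): tactics become arbitrary user-defined objects rather than bare indices, and both bucketing hooks default to None.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class GemmTactic:
    # A custom tactic type: any hashable description of one kernel
    # configuration, instead of a bare integer index.
    tile_m: int
    swap_ab: bool

def get_valid_tactics(m: int) -> List[GemmTactic]:
    # Enumerate the configurations that could implement this shape.
    return [GemmTactic(tile, swap)
            for tile in (64, 128) for swap in (False, True)]

@dataclass
class TuningConfig:
    # Both hooks may be omitted; the tuner then profiles the exact
    # shapes it sees instead of mapping them into buckets.
    gen_tuning_buckets: Optional[Tuple[int, ...]] = None
    map_to_tuning_buckets: Optional[Callable[[int], int]] = None
```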
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-13 16:25:22 +08:00
Sergey Klevtsov
27fc35175e
[None][feat] CUTLASS MoE FC2+Finalize fusion ( #3294 )
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-12 15:56:48 +08:00
NVJiangShao
2f2f5cc72c
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE ( #6231 )
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-08 11:13:42 +08:00
hlu1
8207d5fd39
[None][feat] Add model gpt-oss ( #6645 )
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
liji-nv
1daa8c3232
[https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… ( #6355 )
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-01 07:38:06 -04:00
Jinyang Yuan
97f7e12588
[fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency ( #6288 )
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-07-28 01:37:11 -04:00
liji-nv
e07fff4f78
[https://nvbugs/5340941] - fix: Correct custom ops used by Qwen3 Moe … ( #6285 )
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-25 14:49:45 +08:00
danielafrimi
ff9963978a
Add register_fake for finegrained_mixed_dtype_gemm torch_op ( #6255 )
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-22 16:59:55 +03:00
liji-nv
3e0fb60e50
[TRTLLM-4279] feat: Multistream initial support for torch compile flow ( #5847 )
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-21 19:10:22 +08:00
Yuening Li
e8c068b4b1
[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow ( #5850 )
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-07-21 15:17:35 +08:00
danielafrimi
5300a99bd8
W4A8 GEMM ( #6005 )
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
Dom Brown
afaa388bee
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access ( #5676 )
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner ( #5764 )
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow ( #5615 )
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
Daniel Stokes
ec6c7dff1a
feat: Add support for MXFP8xMXFP4 in pytorch ( #5535 )
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-07-06 15:32:06 -07:00
Jhao-Ting Chen
77082cde38
[https://nvbugspro.nvidia.com/bug/5329655][feat] Pytorch path add spec dec param to attention op ( #5146 )
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp ( #5086 )
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM ( #4232 )
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
Li Min
6021a439ab
Make moe permute and final as custom op ( #5412 )
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Daniel Stokes
83a1f60556
feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch ( #5410 )
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-27 12:29:34 +08:00
Yukun He
9ee33605bb
[TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. ( #5394 )
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-26 13:12:03 +08:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) ( #4651 )
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
Yukun He
3fc57543e2
[5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. ( #5485 )
A seq_len of 4096 causes an unknown CUDA illegal memory access issue when run consecutively with certain other tests.
Put a saturated upper bound in place instead: any sequence length larger than the remaining maximum maps to that bound.
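A minimal sketch of the saturation (the bucket values here are hypothetical; only the removal of 4096 comes from this change):

```python
TUNED_SEQ_LENS = (8, 64, 256, 1024, 2048)  # hypothetical buckets, 4096 removed

def to_tuning_bucket(seq_len: int) -> int:
    # Round up to the nearest tuned config; saturate at the upper bound.
    for s in TUNED_SEQ_LENS:
        if seq_len <= s:
            return s
    return TUNED_SEQ_LENS[-1]
```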
2025-06-26 08:38:35 +08:00
dongxuy04
4f0f17ac8a
feat: Misc Opt for large scale EP ( #5374 )
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-20 13:11:31 +08:00
Yukun He
6711ad9cf3
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. ( #5139 )
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-18 14:33:46 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner ( #5207 )
- Adds a new Python custom op (fp8_block_scale_moe_runner) and an FP8BlockScaleMoERunner class for autotuning (see the sketch below).
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
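A hedged, Python-level sketch of the split (class and method names are simplified assumptions; the real op and workspace sizing are implemented in C++): the tuner times each configIndex once, and the chosen index is then passed to every subsequent run.

```python
import torch

class FP8BlockScaleMoERunnerSketch:
    """Hypothetical stand-in for the autotuning runner class."""

    def __init__(self, num_configs: int):
        self.num_configs = num_configs

    def get_valid_tactics(self) -> list:
        # Each tactic is a configIndex; the backend uses the same index
        # both to size the workspace and to select the kernel config.
        return list(range(self.num_configs))

    def run(self, hidden_states: torch.Tensor, config_index: int) -> torch.Tensor:
        # Placeholder for the real fp8_block_scale_moe_runner launch
        # with the selected configIndex.
        return hidden_states
```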
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
Enwei Zhu
4b82b8b4c7
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP ( #5215 )
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-17 15:23:24 +08:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. ( #4867 )
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
yunruis
b99c5ce8c1
Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL ( #4560 )
Signed-off-by: yunruis <yunruis@nvidia.com>
Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-14 17:36:22 +08:00
Dom Brown
9c012d5bf8
[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner ( #4872 )
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-09 11:02:48 +01:00
Daniel Stokes
3a4851b7c3
feat: Add Mixture of Experts FP8xMXFP4 support ( #4750 )
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-09 13:25:04 +08:00
Shiyu Li
b0d287c9b7
[TRTLLM-4647][fix] Fix the no fusion allreduce hanging ( #4594 )
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-06-04 18:26:13 -07:00
Yukun He
5fa6fbd989
feat: Enhance AutoTuner inference path and code readability ( #4466 )
Fix AutoTuner warmup request generation.
* The current warmup phase creates only one request, which is insufficient to cover max_num_tokens. Revise the warmup phase to issue a batch of requests that covers max_num_tokens, eliminating potential fallback cases (see the sketch after this list).
Refactor the AutoTuner API and reduce host overhead.
* Refine the (min, opt, max) values of the optimization profile setup for get_valid_tactics to achieve the correct canImplement definition.
* Refine the cache key assembly process to reduce host overhead and simplify the API.
* Fix lru_cache usage to reduce host overhead.
* Move tuning config initialization to a one-time object in the tunable runner to reduce host overhead.
Improve tuning config readability.
* Use a dataclass to define the tuning config.
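A sketch of the warmup batching idea (the helper name and splitting policy are assumptions): build enough requests for their token counts to sum to max_num_tokens, rather than issuing a single request.

```python
def build_warmup_requests(max_num_tokens: int, tokens_per_request: int) -> list[int]:
    # Enough full requests to reach max_num_tokens, plus a remainder
    # request, so tuned shapes cover the whole serving range.
    full, rem = divmod(max_num_tokens, tokens_per_request)
    requests = [tokens_per_request] * full
    if rem:
        requests.append(rem)
    return requests

# e.g. build_warmup_requests(8192, 2048) -> [2048, 2048, 2048, 2048]
```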
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-04 10:53:11 +08:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels ( #4737 )
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H ( #4494 )
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank ( #4767 )
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE ( #4303 )
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main ( #4739 )
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels ( #4330 )
* Integrate chunked attention kernels
* Fix cache key
* Fix lint
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00