TensorRT-LLM/tensorrt_llm/_torch/auto_deploy
Neta Zmora c4b36d31ff
[#10137][feat] AutoDeploy FP8 MoE refactor (#10138)
The trtllm (Cutlass) FP8 MoE operator performs the W3+W1 fusion (concatenation) during inference; we want to move this fusion to model-optimization time.

The Cutlass MoE kernel is invoked through a trtllm torch operator.
Its implementation uses two FC operations (fc1 and fc2), while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3). So when we switch from the torch.moe op to the trtllm.moe op, we also change terminology from w1, w2, w3 to fc1, fc2: fc1 corresponds to the fused (concatenated) W3 and W1, and fc2 corresponds to W2.
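To make the refactor concrete, here is a minimal sketch of what fusing W3 and W1 at model-optimization time rather than at inference time can look like. It is an illustration only: the helper name fuse_w3_w1, the tensor shapes, and the concat order are assumptions, not the actual TensorRT-LLM implementation.

```python
# Minimal sketch only: fuse_w3_w1, the shapes, and the concat order are
# illustrative assumptions, not the TensorRT-LLM implementation.
import torch
import torch.nn.functional as F

def fuse_w3_w1(w1: torch.Tensor, w3: torch.Tensor) -> torch.Tensor:
    """Concatenate one expert's up (w3) and gate (w1) projection weights into a
    single fc1 weight, so the kernel can run one GEMM instead of two.
    w1, w3: [intermediate, hidden] -> fc1: [2 * intermediate, hidden]"""
    return torch.cat([w3, w1], dim=0)

hidden, inter = 64, 128
w1 = torch.randn(inter, hidden)   # gate projection (canonical W1)
w3 = torch.randn(inter, hidden)   # up projection   (canonical W3)
w2 = torch.randn(hidden, inter)   # down projection (canonical W2 -> kernel "fc2")

# Done once at model-optimization time instead of on every forward pass:
fc1_weight = fuse_w3_w1(w1, w3)   # kernel "fc1"
fc2_weight = w2                   # kernel "fc2"

# At inference the fused fc1 produces both halves in one matmul; the op splits them.
x = torch.randn(4, hidden)
up, gate = (x @ fc1_weight.t()).split(inter, dim=-1)   # W3·x and W1·x
y = (F.silu(gate) * up) @ fc2_weight.t()               # standard SwiGLU expert MLP
```

Doing the concat once in a graph transform, rather than inside the operator on every call, is presumably the point of moving the fusion out of inference.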

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-24 18:58:10 +02:00
Name    Last commit    Last commit date
compile [None][feat] AutoDeploy: prepare_metadata revisited (#9764) 2025-12-12 20:14:14 +08:00
config [TRTLLM-9847][fix] WAR fix hanging fused allreduce. (#10087) 2025-12-23 00:03:32 +01:00
custom_ops [#10137][feat] AutoDeploy FP8 MoE refactor (#10138) 2025-12-24 18:58:10 +02:00
distributed [#9198][feat] Refactor dist ops in AutoDeploy (#9301) 2025-12-02 02:36:32 +08:00
export [#9230][feat] Slimmed down implementation of nemotron H (#9235) 2025-11-23 03:13:32 -08:00
models [TRTLLM-9565][fix] Fix deepseek sharding (#9984) 2025-12-23 10:28:14 -05:00
shim [#10052][feat] AutoDeploy enable cudagraphs for flashinfer BatchDecode (#10193) 2025-12-24 05:55:09 -08:00
transform [#10137][feat] AutoDeploy FP8 MoE refactor (#10138) 2025-12-24 18:58:10 +02:00
utils [TRTLLM-9565][fix] Fix deepseek sharding (#9984) 2025-12-23 10:28:14 -05:00
__init__.py [AutoDeploy] merge feat/ad-2025-07-07 (#6196) 2025-07-23 05:11:04 +08:00
llm_args.py [None][feat] AutoDeploy: prepare_metadata revisited (#9764) 2025-12-12 20:14:14 +08:00
llm.py [TRTLLM-9065][chore] remove PyTorchConfig completely (#8856) 2025-11-06 22:37:03 -08:00