Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-14 06:27:45 +08:00)
The trtllm (CUTLASS) FP8 MoE operator performs W3+W1 fusion (concatenation) at inference time; this change moves the fusion to model-optimization time. The CUTLASS MoE kernel is invoked through a trtllm torch operator. Its implementation uses two FC operations (fc1 and fc2), while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3). So when we switch from the torch.moe op to the trtllm.moe op, we also change terminology from w1, w2, w3 to fc1, fc2.

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
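The fusion described above amounts to concatenating each expert's gate (w1) and up (w3) projection weights into a single fc1 tensor, so the kernel issues one GEMM instead of two; fc2 simply takes over the role of w2. Below is a minimal sketch of that ahead-of-time concat, using numpy for illustration. The function name, the assumed per-expert weight layout `[num_experts, intermediate_size, hidden_size]`, and the w3-before-w1 concat order are all assumptions for this example, not the actual TensorRT-LLM code.

```python
import numpy as np

def fuse_w3_w1_to_fc1(w1: np.ndarray, w3: np.ndarray) -> np.ndarray:
    """Fuse the gate (w1) and up (w3) MoE projection weights into one fc1
    weight by concatenating along the output-feature dimension, so the
    kernel can run a single GEMM per expert instead of two.

    Assumed shapes (illustrative): [num_experts, intermediate_size, hidden_size].
    The w3-first ordering is a hypothetical convention for this sketch.
    """
    if w1.shape != w3.shape:
        raise ValueError("w1 and w3 must have matching per-expert shapes")
    return np.concatenate([w3, w1], axis=1)

# Toy example: 2 experts, hidden_size=4, intermediate_size=3.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((2, 3, 4))
w3 = rng.standard_normal((2, 3, 4))
fc1 = fuse_w3_w1_to_fc1(w1, w3)
assert fc1.shape == (2, 6, 4)  # intermediate dim doubles: concat of w3 and w1
# fc2 reuses w2 (the down projection) unchanged; no fusion is needed there.
```

Doing this concat once at model-optimization time removes a per-inference copy and lets the checkpoint carry the fused fc1 layout the kernel expects.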
Directory listing:

- attention
- auto_deploy
- compilation
- debugger
- executor
- misc
- modeling
- models/checkpoints/hf
- modules
- multi_gpu
- multi_gpu_modeling
- multimodal
- ray_orchestrator
- sampler
- speculative
- thop
- helpers.py
- pattern_watcher.py
- test_connector.py