TensorRT-LLM/tests/unittest/_torch
Neta Zmora c4b36d31ff
[#10137][feat] AutoDeploy FP8 MoE refactor (#10138)
The trtllm (cutlass) FP8 MoE operator performs the W3+W1 fusion (concat) during inference; we want to move this fusion to model-optimization time.

The Cutlass MoE kernel is used through a trtllm torch operator. Its implementation uses two FC operations (fc1 and fc2), while the canonical MoE API defines three GEMM operations and their associated weights (W1, W2, W3). Therefore, when we switch from the torch.moe op to the trtllm.moe op, we also change terminology from w1, w2, w3 to fc1, fc2.
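
For context, a minimal sketch (not the actual TensorRT-LLM code) of performing that concat once at model-optimization time; the tensor shapes and the `fuse_moe_weights` helper are assumptions for illustration, with the w3-before-w1 ordering following the "W3+W1" wording above:

```python
import torch

def fuse_moe_weights(w1, w2, w3):
    """Fold the per-expert W3+W1 concat into a single fc1 weight.

    Assumed shapes (hypothetical, for illustration only):
      w1, w3: [num_experts, intermediate_size, hidden_size]
      w2:     [num_experts, hidden_size, intermediate_size]
    """
    # fc1 stacks the up (w3) and gate (w1) projections so the kernel can
    # read one contiguous buffer instead of concatenating on every forward.
    fc1 = torch.cat([w3, w1], dim=1)  # [num_experts, 2*intermediate_size, hidden_size]
    fc2 = w2  # the down projection maps 1:1 to fc2
    return fc1, fc2
```

The resulting fc1/fc2 pair can then be passed to the fused kernel directly, keeping the concat off the inference hot path.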

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-12-24 18:58:10 +02:00
attention [TRTLLM-9798][feat] Change to use new DeepGEMM MQA sm100 kernel for MTP-3 (#10226) 2025-12-24 14:39:12 +08:00
auto_deploy [#10137][feat] AutoDeploy FP8 MoE refactor (#10138) 2025-12-24 18:58:10 +02:00
compilation [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
debugger Fix: fix nvbug 5356427 (#5464) 2025-06-25 22:24:26 +08:00
executor [TRTLLM-5972][chore] Load balance decode token KV cache with helix parallelism (#9757) 2025-12-12 22:29:05 +08:00
misc [TRTLLM-9615][feat] Implement a distributed tuning system (#9621) 2025-12-15 21:08:53 +08:00
modeling [https://nvbugs/5721644][fix] Update tests for nemotron_h (#9993) 2025-12-18 12:38:02 +08:00
models/checkpoints/hf [TRTLLM-7136][feat] Update load_weights method to include mapping parameter in checkpoint loaders (#9583) 2025-12-05 16:07:20 +01:00
modules [TRTLLM-9493][feat] Custom AllToAll for helix parallelism (#9986) 2025-12-23 18:14:30 -08:00
multi_gpu [https://nvbugs/5729697][fix] MNNVL Allreduce: use CUDA runtime instead of Macro to get SM version. (#10062) 2025-12-23 16:07:07 +08:00
multi_gpu_modeling [https://nvbugs/5515753][ci] Add NCCL_DEBUG=INFO flag to collect more info with CI failure. (#8440) 2025-11-20 12:43:13 -05:00
multimodal [TRTLLM-9409][feat] Pass MRoPE tensors for EPD disagg (#9758) 2025-12-22 06:32:49 -05:00
ray_orchestrator [TRTLLM-9737][chore] Add rl perf reproduce script and enhance the robustness of Ray tests (#9939) 2025-12-24 15:27:01 +08:00
sampler [None][fix] avoid implicit cudaStreamSynchronize in sample_async. (#10120) 2025-12-23 10:15:40 +08:00
speculative [None][fix] Fix iteration stats for spec-dec (#9855) 2025-12-16 14:11:38 -08:00
thop [https://nvbugs/5720357][fix] Fix indice offset overflow in custom Top-K kernel and corresponding UT case (#10027) 2025-12-19 14:58:01 -05:00
helpers.py [#8733][feat] Add Llama4 MoE handling to AutoDeploy (#9556) 2025-12-04 08:03:33 +02:00
pattern_watcher.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
test_connector.py [None][feat] KV Cache Connector API (#7228) 2025-08-28 23:09:27 -04:00