
Launch Scripts for CuTe DSL Kernels

MoE Workload Generator

# Generate a workload with the balanced random method.
# 128 tokens per rank, EP size 32 (typical of the generation phase with large EP)
python moe_workload_generator.py --num_tokens 128 --ep_size 32 --tile_size 128
# 8192 tokens per rank, EP size 4 (typical of the context phase)
python moe_workload_generator.py --num_tokens 8192 --ep_size 4 --tile_size 256
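
To give a feel for what a "balanced random" workload can look like, here is a minimal sketch. It is not the code in moe_workload_generator.py: the function names, the `num_experts` and `top_k` parameters, and the reading of the tile size as a per-expert padding granularity are all assumptions made for illustration. Each token is routed to `top_k` distinct experts chosen in a random order, while per-expert token counts stay within one of each other.

```python
import random
from collections import Counter


def balanced_random_routing(num_tokens, num_experts, top_k, seed=0):
    """Hypothetical helper: one list of top_k distinct expert ids per token,
    with per-expert loads that differ by at most one."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be in [1, num_experts]")
    rng = random.Random(seed)
    experts = list(range(num_experts))
    rng.shuffle(experts)  # random expert order
    # Deal routing slots round-robin over the shuffled experts. Any top_k
    # consecutive slots hit distinct experts because top_k <= num_experts.
    routing = [
        [experts[(t * top_k + j) % num_experts] for j in range(top_k)]
        for t in range(num_tokens)
    ]
    rng.shuffle(routing)  # randomize which token receives which expert group
    return routing


def padded_tokens_per_expert(routing, num_experts, tile_size):
    """Assumed meaning of tile size: round each expert's token count up to a
    multiple of tile_size, as a grouped GEMM with fixed tiles would need."""
    counts = Counter(e for token in routing for e in token)
    return [-(-counts[e] // tile_size) * tile_size for e in range(num_experts)]


if __name__ == "__main__":
    routing = balanced_random_routing(num_tokens=128, num_experts=8, top_k=2)
    print(padded_tokens_per_expert(routing, num_experts=8, tile_size=128))
```

The round-robin deal is only one simple way to keep loads even. The actual generator may balance differently, for example across EP ranks rather than only across experts.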