# Micro Benchmarks

This folder contains benchmarks for specific components in TRT-LLM, using [google-benchmark](https://github.com/google/benchmark).

## Building

To build, add the `--micro_benchmark` flag to `build_wheel.py`, or pass `-DBUILD_MICRO_BENCHMARKS=ON` to CMake directly.
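
For example (a sketch only: the script path and the extra CMake configuration options are assumptions, so check your checkout and `build_wheel.py --help` for the exact interface):

```bash
# Option 1: build the benchmarks as part of the wheel build
# (script path assumed to be scripts/build_wheel.py in the repo root).
python3 scripts/build_wheel.py --micro_benchmark

# Option 2: enable them when configuring the C++ build directly with CMake
# (other required options, e.g. TensorRT/CUDA paths, omitted for brevity).
cmake -S cpp -B cpp/build -DBUILD_MICRO_BENCHMARKS=ON
cmake --build cpp/build --target mixtureOfExpertsBackendBenchmark
```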

## Benchmark Documentation

### Mixture Of Experts Backend Benchmark

> [!CAUTION]
> This benchmark is intended to help developers evaluate the impact of new optimisations. It does not meet the same quality standards as other parts of TRT-LLM, so please use it with caution.

Target: `mixtureOfExpertsBackendBenchmark`

This benchmark covers the backend used by the MixtureOfExperts plugin. It allows you to benchmark different MOE configurations without building a TRT engine.

Usage:

```bash
./mixtureOfExpertsBackendBenchmark
# or
./mixtureOfExpertsBackendBenchmark --input_file <JSON benchmark definition>
```

For more information see:

```bash
./mixtureOfExpertsBackendBenchmark --help
```
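
Because the harness is built on google-benchmark, the standard google-benchmark flags should also be accepted. A hedged example (these flags come from google-benchmark itself, not from TRT-LLM):

```bash
# Run only the cases whose names match a regex and write the results as JSON.
./mixtureOfExpertsBackendBenchmark \
    --benchmark_filter=<regex> \
    --benchmark_out=results.json \
    --benchmark_out_format=json
```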

The `gen-moe-benchmark-file.py` script is a helper that can generate workload files for MOE benchmarks. This is useful for sharing or comparing configurations, for example when producing a reproduction case for a performance bug.
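
A possible workflow (a sketch only: the generator's command-line interface and output handling are assumptions, so check `python3 gen-moe-benchmark-file.py --help` for the real options):

```bash
# Generate a JSON benchmark definition (output to stdout is assumed here).
python3 gen-moe-benchmark-file.py > moe_workload.json

# Replay the generated workload in the benchmark.
./mixtureOfExpertsBackendBenchmark --input_file moe_workload.json
```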