# Expert Parallelism Load Balancer (EPLB)
Effective load balancing is crucial when leveraging large-scale expert parallelism. As described in the DeepSeek-V3 paper, redundant experts can be introduced to rebalance the workload across GPUs. This mechanism is known as the Expert Parallelism Load Balancer (EPLB).
Note: Currently, only the offline EP load balancer is supported.
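To make the mechanism concrete, here is a minimal sketch of the redundant-expert idea. It is an illustration only, not the TensorRT-LLM implementation: given per-expert request counts, each spare slot is greedily handed to the expert with the highest per-replica load.

```python
# Illustration only (not the TensorRT-LLM implementation): greedily assign
# spare "redundant" slots to the experts with the highest per-replica load.
from collections import Counter

def replicate_hot_experts(expert_load: dict[int, int], num_slots: int) -> list[int]:
    """Return a slot -> expert-ID assignment of length num_slots."""
    replicas = Counter({e: 1 for e in expert_load})   # one slot per expert to start
    for _ in range(num_slots - len(expert_load)):     # hand out the spare slots
        hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1
    return [e for e, n in sorted(replicas.items()) for _ in range(n)]

# Expert 0 receives 90% of the traffic, so it gets both spare slots:
print(replicate_hot_experts({0: 900, 1: 50, 2: 50}, num_slots=5))  # [0, 0, 0, 1, 2]
```

A real balancer must also place the resulting slots on GPUs so that per-GPU load stays even; that placement is what the generated EPLB configuration describes.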
## Offline EP Load Balancer
### Step 1: Run Inference and Collect Statistics
To generate the necessary statistics for load balancing, run your model on a target dataset (e.g., GSM8K) while counting the routed expert IDs during inference. Once counting is complete, the statistics will be saved for further processing.
Set up some environment variables:
```bash
export MODEL_PATH=<YOUR_MODEL_PATH>

# Set the expert statistic data path
export EXPERT_STATISTIC_PATH=./expert_statistic

# Enable counting of routed expert IDs from iteration 100 to iteration 200
export EXPERT_STATISTIC_ITER_RANGE=100-200
```
Prepare a configuration file and run inference on GSM8K:
```bash
cat > ./extra_llm_api_options.yaml <<EOF
enable_attention_dp: true
use_cuda_graph: true
EOF

trtllm-eval --model $MODEL_PATH \
    --tp_size 8 \
    --ep_size 8 \
    --extra_llm_api_options ./extra_llm_api_options.yaml \
    --backend pytorch gsm8k
```
After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`.
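For a quick sanity check of the collected data, a short script like the following can aggregate the counts. The file layout is an assumption here (one JSON file of expert-ID counts per rank); adapt the parsing to the files you actually find in `$EXPERT_STATISTIC_PATH`.

```python
# Hypothetical inspection of the dumped statistics; the exact file layout is
# an assumption -- adapt the parsing to what you find in $EXPERT_STATISTIC_PATH.
import glob, json, os
from collections import Counter

stats_dir = os.environ.get("EXPERT_STATISTIC_PATH", "./expert_statistic")
totals = Counter()
for path in glob.glob(os.path.join(stats_dir, "*.json")):
    with open(path) as f:
        for expert_id, count in json.load(f).items():
            totals[expert_id] += count

# A max/mean load ratio well above 1 indicates hot experts that EPLB can replicate.
counts = list(totals.values())
if counts:
    print(f"max/mean load ratio: {max(counts) / (sum(counts) / len(counts)):.2f}")
```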
### Step 2: Generate the EPLB Configuration
Use the provided `generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment:
```bash
python generate_eplb_config.py \
    --ep_size 8 \
    --num_slots 320 \
    --expert_statistic_path $EXPERT_STATISTIC_PATH \
    --output_path ./moe_load_balancer.yaml
```
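`--num_slots` must cover at least one slot per routed expert; the surplus becomes redundant replicas. For example, for a model with 256 routed experts per MoE layer (such as DeepSeek-V3), `--num_slots 320` yields 64 redundant slots, or 40 slots per rank at `--ep_size 8`. You can sanity-check the generated file before deploying it; the key name below is an assumption, so inspect `./moe_load_balancer.yaml` for the authoritative schema.

```python
# Quick sanity check of the generated EPLB config (requires PyYAML).
# The "num_slots" key is an assumption -- check the file for the real schema.
import yaml

with open("./moe_load_balancer.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg.get("num_slots"))  # expected: 320 (40 slots per rank at ep_size 8)
```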
### Step 3: Run Inference with the EPLB Configuration
Disable expert ID counting by unsetting the environment variable:
```bash
unset EXPERT_STATISTIC_ITER_RANGE
```
Prepare a new configuration file that incorporates the generated EPLB configuration, then run inference on GSM8K:
```bash
cat > ./extra_llm_api_options_eplb.yaml <<EOF
enable_attention_dp: true
use_cuda_graph: true
moe_load_balancer: ./moe_load_balancer.yaml
EOF

trtllm-eval --model $MODEL_PATH \
    --tp_size 8 \
    --ep_size 8 \
    --extra_llm_api_options ./extra_llm_api_options_eplb.yaml \
    --backend pytorch gsm8k
```
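Note that EPLB does not change which logical experts tokens are routed to, only how expert replicas are placed across GPUs, so the most direct way to confirm the benefit is to compare end-to-end throughput or latency between this run and the Step 1 run.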