Model Training Guide

Learn how to train MiniMind language models from scratch using pure PyTorch.

📊 Training Overview

MiniMind implements a complete training pipeline:

Tokenizer Training
        ↓
   Pretraining (Learn knowledge)
        ↓
   SFT (Learn conversation)
        ↓
    ┌───────────────────┬─────────────────────┬──────────────┐
    ↓                   ↓                     ↓              ↓
  LoRA              DPO/RLHF         RLAIF (PPO/GRPO/SPO)  Distillation
(Domain adapt)    (Preference)     (Reinforcement Learn)   (Reasoning)

💰 Training Costs (Single NVIDIA 3090)

| Model | Dataset | Duration | Cost (RMB) | Quality |
|---|---|---|---|---|
| MiniMind2-Small | pretrain_hq + sft_mini_512 | 2.1h | ≈3 | 😊😊 |
| MiniMind2-Small | Full dataset | 38h | ≈50 | 😊😊😊😊😊😊 |
| MiniMind2 | pretrain_hq + sft_mini_512 | 3.3h | ≈5 | 😊😊 |
| MiniMind2 | Full dataset | 122h | ≈160 | 😊😊😊😊😊😊😊 |

!!! success "Ultra-Fast Training"
    Just 2.1 hours and roughly ¥3 of GPU time yield a functional ChatBot!

    Use `pretrain_hq.jsonl` + `sft_mini_512.jsonl` for the fastest reproduction.

📋 Data Preparation

1. Download Datasets

Download from ModelScope or HuggingFace:

mkdir -p dataset
cd dataset
# Download required files
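A minimal download sketch using `huggingface_hub` (run from the project root). The repo id shown is an assumption — confirm the dataset repository name in the project README, or use the ModelScope SDK instead:

# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Repo id is an assumption -- replace with the dataset repo listed in the project README.
snapshot_download(
    repo_id="jingyaogong/minimind_dataset",
    repo_type="dataset",
    local_dir="./dataset",
    allow_patterns=["pretrain_hq.jsonl", "sft_mini_512.jsonl"],  # grab only what you need
)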

2. Dataset Directory Structure

./dataset/
├── pretrain_hq.jsonl ✨ (1.6GB, required for pretraining)
├── sft_mini_512.jsonl ✨ (1.2GB, fastest SFT)
├── sft_512.jsonl (7.5GB, standard SFT)
├── sft_1024.jsonl (5.6GB, longer SFT)
├── sft_2048.jsonl (9GB, very long SFT)
├── dpo.jsonl (909MB, DPO training)
├── r1_mix_1024.jsonl (340MB, reasoning distillation)
├── rlaif-mini.jsonl (1MB, RLAIF algorithms)
├── lora_identity.jsonl (22.8KB, identity LoRA)
└── lora_medical.jsonl (34MB, medical domain LoRA)

3. Data Formats

Pretraining Data (pretrain_hq.jsonl):

{"text": "How to overcome procrastination? Overcoming procrastination is not easy, but these suggestions may help..."}

SFT Data (sft_*.jsonl):

{
  "conversations": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the scarecrow win an award? Because he was outstanding in his field!"}
  ]
}

DPO Data (dpo.jsonl):

{
  "chosen": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."}
  ],
  "rejected": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 5."}
  ]
}

LoRA Domain Data (lora_*.jsonl):

{
  "conversations": [
    {"role": "user", "content": "What's the treatment for cervical spondylosis?"},
    {"role": "assistant", "content": "Cervical spondylosis treatment typically includes..."}
  ]
}
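To sanity-check a dataset file before training, a small validation sketch based on the JSONL layouts shown above:

import json

def check_jsonl(path, max_lines=5):
    """Verify each line parses as JSON with an expected key; print the first few records."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            assert "text" in record or "conversations" in record or "chosen" in record, \
                f"line {i}: unrecognized record keys {list(record)}"
            if i < max_lines:
                print(record)

check_jsonl("./dataset/pretrain_hq.jsonl")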

🎯 Complete Training Pipeline

All training scripts are in the ./trainer directory.

cd trainer

Stage 1: Pretraining

Purpose: Learn foundational knowledge (word continuation)

# Single GPU
python train_pretrain.py

# Multi-GPU (DDP)
torchrun --nproc_per_node 2 train_pretrain.py

# Multi-GPU (DeepSpeed)
deepspeed --master_port 29500 --num_gpus=2 train_pretrain.py

Key Parameters:

  • max_seq_len: 512 (adjust based on GPU memory)
  • learning_rate: 1e-4
  • epochs: Adjust based on dataset size

Output: ./out/pretrain_*.pth

Training Duration:

  • MiniMind2-Small (26M): ~1.1h
  • MiniMind2 (104M): ~3.9h

!!! tip "Pretraining Tips"
    - Start with `pretrain_hq.jsonl` for best results
    - Quality > quantity for pretraining data
    - Monitor the loss curve to detect overfitting
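Conceptually, pretraining optimizes plain next-token prediction over raw text. A minimal sketch of that objective, assuming a HuggingFace-style forward pass that returns `.logits` (the real train_pretrain.py additionally handles batching, loss masking, mixed precision, and checkpointing):

import torch
import torch.nn.functional as F

def pretrain_loss(model, input_ids):
    """Next-token cross-entropy: predict token t+1 from tokens 0..t."""
    logits = model(input_ids).logits             # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )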

Stage 2: Supervised Fine-Tuning (SFT)

Purpose: Teach conversation patterns and chat templates

# Single GPU
python train_full_sft.py

# Multi-GPU
torchrun --nproc_per_node 2 train_full_sft.py

Configuration:

  • Load pretrained model from Stage 1
  • Use SFT dataset (sft_mini_512.jsonl or sft_512.jsonl)
  • Adjust max_seq_len to match training data

Output: ./out/full_sft_*.pth

Training Duration:

  • With sft_mini_512: 1-3 hours
  • With full sft_512: 20-25 hours

!!! warning "SFT Data Selection"
    - `sft_mini_512.jsonl`: fastest, ~1.2GB, 512 tokens max
    - `sft_512.jsonl`: standard, ~7.5GB, 512 tokens max
    - `sft_1024.jsonl`: longer, ~5.6GB, 1024 tokens max
    - `sft_2048.jsonl`: extended, ~9GB, 2048 tokens max
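Compared with pretraining, the SFT loss is computed only on assistant tokens: user turns are rendered through the chat template but masked out of the loss. A simplified sketch, assuming masked positions carry the label -100 (PyTorch's conventional ignore index):

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label contribute no loss

def sft_loss(model, input_ids, labels):
    """labels == input_ids on assistant tokens, IGNORE_INDEX everywhere else."""
    logits = model(input_ids).logits
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=IGNORE_INDEX,
    )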

Stage 3: LoRA Fine-Tuning (Optional)

Purpose: Parameter-efficient domain adaptation

Use Cases:

  • Medical Q&A knowledge
  • Personal identity/self-awareness
  • Proprietary domain knowledge

# Edit train_lora.py to set correct dataset and base model
python train_lora.py

# Multi-GPU
torchrun --nproc_per_node 2 train_lora.py

Output: ./out/lora/lora_*.pth

Example 1: Medical Domain

Prepare dataset/lora_medical.jsonl:

{
  "conversations": [
    {"role": "user", "content": "What's the correct pillow height for cervical spondylosis?"},
    {"role": "assistant", "content": "For cervical spondylosis, pillow height should be..."}
  ]
}

Train:

# Modify train_lora.py: lora_name = 'medical'
python train_lora.py

Example 2: Identity/Self-Awareness

Prepare dataset/lora_identity.jsonl:

{
  "conversations": [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am MiniMind..."}
  ]
}
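Under the hood, LoRA freezes the base weights and learns a low-rank update W + (alpha/r)·B·A for selected linear layers, so only a tiny fraction of parameters is trained. A minimal illustrative sketch (the project's train_lora.py has its own implementation; this only shows the idea):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling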

Stage 4: Direct Preference Optimization (DPO)

Purpose: Align model responses with human preferences

DPO eliminates the need for separate reward models by directly optimizing preference pairs.

python train_dpo.py

# Multi-GPU
torchrun --nproc_per_node 2 train_dpo.py

Output: ./out/rlhf_*.pth

Key Features:

  • Off-policy training (reuse data across epochs)
  • No separate reward model needed
  • Better sample efficiency than PPO
  • Stable training convergence

Training Duration: ~1-3 hours
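For intuition, DPO maximizes the policy's log-probability margin between chosen and rejected responses relative to a frozen reference model. A compact sketch of the loss, assuming per-sequence log-probabilities have already been summed over response tokens:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on per-sequence log-probabilities."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()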

Stage 5: Reinforcement Learning from AI Feedback (RLAIF)

RLAIF is an advanced training approach using AI-generated rewards instead of human annotations. MiniMind implements three modern algorithms:

5.1 PPO (Proximal Policy Optimization)

Classical on-policy RL algorithm with proven stability.

python train_ppo.py

# Multi-GPU
torchrun --nproc_per_node 2 train_ppo.py

Algorithm:

\mathcal{L}_{PPO} = -\mathbb{E}\left[\min(r_t \cdot A_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) \cdot A_t)\right] + \beta \cdot \mathbb{E}[\text{KL}]
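In code, the clipped surrogate term looks roughly like this (per-token advantages A_t and old-policy log-probs are assumed to be computed elsewhere; the KL penalty is added separately):

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate; minimize the negative of the clipped objective."""
    ratio = torch.exp(log_probs - old_log_probs)                     # r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()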

Characteristics:

  • Stable but slower reward improvement
  • Requires both Actor and Critic networks
  • High memory usage (1.5-2× single network)
  • Good for exploration

Output: ./out/ppo_actor_*.pth

Training Duration: ~1-3 hours

5.2 GRPO (Group Relative Policy Optimization)

Modern algorithm used in DeepSeek-R1, with faster convergence.

python train_grpo.py

# Multi-GPU
torchrun --nproc_per_node 2 train_grpo.py

Algorithm:

\mathcal{L}_{GRPO} = -\mathbb{E}\left[r_t \cdot A_t - \beta \cdot \text{KL}_t\right]

Where advantage is computed as:

A_t = \frac{R - \mu_{group}}{\sigma_{group}}
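The group-relative advantage can be computed directly from the rewards of the G completions sampled for the same prompt, for example:

import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size) scores from the reward model."""
    mean = rewards.mean(dim=1, keepdim=True)    # mu_group per prompt
    std = rewards.std(dim=1, keepdim=True)      # sigma_group per prompt
    return (rewards - mean) / (std + eps)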

Characteristics:

  • Single-network design (memory efficient)
  • Faster reward improvement
  • Group normalization removes bias
  • Better convergence stability

Output: ./out/grpo_*.pth

Training Duration: ~1-3 hours

5.3 SPO (Single-stream Policy Optimization)

Newest algorithm (2025) addressing GRPO's degenerate group problem.

python train_spo.py

# Multi-GPU
torchrun --nproc_per_node 2 train_spo.py

Algorithm:

\mathcal{L}_{SPO} = -\mathbb{E}\left[\log \pi_\theta(a_t|s) \cdot A_t - \beta \cdot \text{KL}_t\right]

With adaptive baseline: B_t^{adaptive}
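One simple way to picture an adaptive baseline is an exponential moving average of rewards observed for a prompt; the advantage is then reward minus baseline. This is an illustrative assumption, not the exact SPO estimator:

def update_baseline(baseline: float, reward: float, momentum: float = 0.9) -> float:
    """EMA value tracker: the advantage is (reward - baseline)."""
    return momentum * baseline + (1 - momentum) * reward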

Characteristics:

  • No group dependency (1 input → 1 training sample)
  • Adaptive value tracking
  • Better handling of difficult examples
  • Experimental on small models

Output: ./out/spo_*.pth

Training Duration: ~1-3 hours

RLAIF Dataset Preparation

All RLAIF algorithms use rlaif-mini.jsonl (1MB, 10k examples):

# Download dataset
# Format: Same as SFT, but assistant content is "无" (none)
{
  "conversations": [
    {"role": "user", "content": "Explain photosynthesis briefly."},
    {"role": "assistant", "content": "无"}
  ]
}

The model generates completions during training, which are scored by a Reward Model (e.g., InternLM2-1.8B-Reward).
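A rough sketch of the scoring step with transformers is shown below. The local path matches the directory layout described under Reward Model Setup, and the get_score call follows the reward model's published usage — treat both as assumptions and verify them against the checkpoint you download:

from transformers import AutoModel, AutoTokenizer

path = "../internlm2-1_8b-reward"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
reward_model = AutoModel.from_pretrained(path, trust_remote_code=True)

chat = [
    {"role": "user", "content": "Explain photosynthesis briefly."},
    {"role": "assistant", "content": "<model-generated completion goes here>"},
]
# get_score is the scoring helper described on the reward model's card;
# confirm the exact method name for the version you are using.
score = reward_model.get_score(tokenizer, chat)
print(score)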

Reward Model Setup:

# Download reward model to parent directory
cd ../
git clone https://huggingface.co/internlm/internlm2-1_8b-reward

# Directory structure should be:
# project/
# ├── minimind/
# └── internlm2-1_8b-reward/

RLAIF vs DPO Comparison

| Aspect | DPO | RLAIF (PPO/GRPO/SPO) |
|---|---|---|
| Training type | Off-policy | On-policy |
| Data freshness | Static pairs | Dynamic (generated) |
| Reward source | Implicit | Explicit reward model |
| Convergence | Fast | Slower |
| Memory usage | Lower | Higher |
| Best for | Preference refinement | Capability improvement |

Stage 6: Reasoning Model Distillation

Purpose: Distill DeepSeek-R1-style reasoning into MiniMind

python train_distill_reason.py

# Multi-GPU
torchrun --nproc_per_node 2 train_distill_reason.py

Data Format (r1_mix_1024.jsonl):

{
  "conversations": [
    {
      "role": "user",
      "content": "Solve: 5 + 3 = ?"
    },
    {
      "role": "assistant",
      "content": "<think>\nI need to add 5 and 3.\n5 + 3 = 8\n</think>\n<answer>\n5 + 3 = 8\n</answer>"
    }
  ]
}

Output: ./out/reason_*.pth

Training Features:

  • Enforces <think> and <answer> tags
  • Penalty loss for format violations (see the sketch below)
  • Mixed data (reasoning + multi-turn + English)
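A rough sketch of such a format penalty, added on top of the regular SFT loss (an illustrative assumption; train_distill_reason.py implements its own variant):

import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_penalty(generated_text: str, penalty: float = 1.0) -> float:
    """Return an extra loss term when the reasoning format is violated."""
    return 0.0 if THINK_ANSWER.search(generated_text) else penalty

# total_loss = sft_loss + format_penalty(decoded_output)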

🔧 Multi-GPU Training

DDP (Distributed Data Parallel)

Best for single-machine multi-GPU:

torchrun --nproc_per_node N train_xxx.py
# N = number of GPUs

DeepSpeed

For advanced optimization:

deepspeed --master_port 29500 --num_gpus=N train_xxx.py

Wandb Monitoring

Track training progress:

# Login first
wandb login

# Enable wandb logging
torchrun --nproc_per_node N train_xxx.py --use_wandb

# Or SwanLab (China-friendly alternative)
python train_xxx.py --use_wandb  # Automatically uses SwanLab if available

🧪 Model Testing

Evaluate Pretrain Model

python eval_model.py --model_mode 0

Evaluate Chat Model

python eval_model.py --model_mode 1

Evaluate with LoRA

python eval_model.py --lora_name 'lora_medical' --model_mode 1

Evaluate Reasoning Model

python eval_model.py --model_mode 3

Evaluate RLAIF Models

# PPO model
python eval_model.py --model_mode 4

# GRPO model
python eval_model.py --model_mode 4

RoPE Length Extrapolation

Test with extended context:

python eval_model.py --model_mode 1 --inference_rope_scaling True

📐 Model Architecture

MiniMind Structure

Decoder-Only Transformer (similar to Llama3):

Input Tokens
    ↓
Token Embedding (6400 vocab)
    ↓
Rotary Embeddings (RoPE) [with YaRN for length extrapolation]
    ↓
[Transformer Blocks] ×N
  ├─ Attention (grouped-query: 8 query heads, 2 KV heads)
  ├─ RMSNorm
  ├─ SwiGLU FFN [or MoE for MoE variant]
  └─ Residual Connections
    ↓
RMSNorm
    ↓
LM Head (→ 6400 vocab logits)
    ↓
Output Probabilities

Model Configurations

| Config | MiniMind2-Small | MiniMind2 | MiniMind2-MoE |
|---|---|---|---|
| Parameters | 26M | 104M | 145M |
| Hidden dim | 512 | 768 | 640 |
| Layers | 8 | 16 | 8 |
| KV heads | 2 | 2 | 2 |
| Q heads | 8 | 8 | 8 |
| Vocab size | 6,400 | 6,400 | 6,400 |
| Context length | 2,048 | 2,048 | 2,048 |

Modify Architecture

Edit ./model/LMConfig.py:

class LMConfig:
    hidden_size: int = 768
    num_layers: int = 16
    num_heads: int = 8
    num_kv_heads: int = 2
    # ... other configs

🔍 Training Tips & Best Practices

Data Quality > Quantity

  • High-quality pretraining data accelerates convergence
  • pretrain_hq.jsonl is carefully curated for quality
  • Consider data deduplication and cleaning

Learning Rate Scheduling

# Recommended schedule (see the helper sketched below)
- Linear warmup then decay
- Initial: 1e-4 to 5e-4
- Warmup steps: 10% of total
- Final: 10% of initial LR
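A small helper matching those recommendations, with linear warmup to the peak LR and a cosine decay (one common choice) down to 10% of it:

import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.10, final_frac=0.10):
    """Linear warmup to peak_lr, then cosine decay to final_frac * peak_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak_lr * final_frac
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))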

Batch Size & Sequence Length

# Balance between GPU memory and convergence
- Pretraining: max_seq_len=512, batch_size=32
- SFT: max_seq_len=512, batch_size=16
- LoRA: max_seq_len=512, batch_size=16

Memory Optimization

# Reduce batch size if OOM
python train_xxx.py --batch_size 8

# Or use gradient accumulation
python train_xxx.py --gradient_accumulation_steps 4
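Gradient accumulation trades steps for memory: gradients from several micro-batches are summed before a single optimizer update, giving the same effective batch size as one large batch. A self-contained toy sketch:

import torch
import torch.nn as nn

# Toy setup just to make the loop runnable
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

accumulation_steps = 4
optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale per micro-batch
    loss.backward()                                                  # gradients accumulate across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                                             # one update per effective batch
        optimizer.zero_grad()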

Checkpoint Management

  • Saves every 100 steps by default
  • Each new save overwrites the old one
  • Automatic backup before training

🚨 Common Issues & Solutions

Issue: CUDA Out of Memory

# Solution 1: Reduce batch size
python train_xxx.py --batch_size 4

# Solution 2: Use gradient accumulation
python train_xxx.py --batch_size 16 --gradient_accumulation_steps 2

# Solution 3: Use smaller model
# Edit trainer script to use MiniMind2-Small instead

Issue: Training Not Converging

# Possible causes:
1. Learning rate too high/low
2. Data quality issues
3. Model capacity mismatch

# Solutions:
- Reduce learning rate: --learning_rate 1e-5
- Check data format and quality
- Try smaller model first

Issue: Multi-GPU Sync Errors

# Ensure:
1. All GPUs visible: nvidia-smi
2. Same CUDA version across all GPUs
3. Network connectivity for distributed training

# Debug:
torchrun --nproc_per_node 2 train_xxx.py --debug

Issue: Different Results Than Expected

# Check:
1. Random seed set (reproducibility)
2. Correct model checkpoint loaded
3. Correct dataset being used
4. Same hyperparameters as reference

📈 Training Progression

Typical training curves:

Pretraining Loss: ↘↘↘ (steep decline, then plateau)
SFT Loss:         ↘ (steady decline)
PPO Reward:       ↗ (rising, may plateau)
GRPO Reward:      ↗↗ (faster rise, more stable)

🎓 Advanced Topics

Custom Datasets

Create your own dataset:

# Format: JSONL with conversations list
# Each line is one training example
# Ensure consistent quality and format
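For example, a short script that writes conversations into the expected JSONL layout (the file name my_custom_sft.jsonl is just a placeholder):

import json

examples = [
    {"conversations": [
        {"role": "user", "content": "What does your product do?"},
        {"role": "assistant", "content": "It helps teams track shipments in real time."},
    ]},
]

with open("./dataset/my_custom_sft.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")  # one JSON object per line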

Model Quantization (Post-training)

# 4-bit quantization for inference
# Use tools like:
# - llama.cpp (GGUF format)
# - bitsandbytes (load-time 8-bit/4-bit quantization)
# - AutoGPTQ (post-training GPTQ quantization)

Model Merging

# Merge base model + LoRA weights
# Use tools like: peft, llama.cpp
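Conceptually, merging folds each low-rank update back into its base weight, W_merged = W + (alpha/r)·B·A, so inference needs no adapter. A hand-rolled sketch for a single weight matrix (illustrative only; peft provides merge utilities for HuggingFace-format models):

import torch

@torch.no_grad()
def merge_lora(base_weight, lora_A, lora_B, alpha=16, r=8):
    """Fold the low-rank update into the frozen weight."""
    return base_weight + (alpha / r) * (lora_B @ lora_A)   # (out, in) += (out, r) @ (r, in)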


Next: Deploy your trained model or explore advanced inference options