diff --git a/docs/images/train_grpo_512.png b/docs/images/train_grpo_512.png
new file mode 100644
index 0000000..008a598
Binary files /dev/null and b/docs/images/train_grpo_512.png differ
diff --git a/docs/images/train_grpo_768.png b/docs/images/train_grpo_768.png
new file mode 100644
index 0000000..339f393
Binary files /dev/null and b/docs/images/train_grpo_768.png differ
diff --git a/docs/images/train_ppo_512.png b/docs/images/train_ppo_512.png
new file mode 100644
index 0000000..3be1cca
Binary files /dev/null and b/docs/images/train_ppo_512.png differ
diff --git a/docs/images/train_ppo_768.png b/docs/images/train_ppo_768.png
new file mode 100644
index 0000000..399dfdf
Binary files /dev/null and b/docs/images/train_ppo_768.png differ
diff --git a/docs/images/train_spo_768.png b/docs/images/train_spo_768.png
new file mode 100644
index 0000000..0ac98e8
Binary files /dev/null and b/docs/images/train_spo_768.png differ
diff --git a/docs/index.md b/docs/index.md
index 9e41e7d..f0e49c9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,4 +1,4 @@
-# Welcome to MiniMind!
+# Welcome to MiniMind!

@@ -7,47 +7,118 @@
## 📌 Introduction
-MiniMind is a super-small language model project trained completely from scratch, requiring **only $0.5 + 2 hours** to train a **26M** language model!
+**MiniMind** is a complete, open-source project for training ultra-small language models from scratch with minimal cost. Train a **26M** ChatBot in just **2 hours** with only **$3** on a single 3090 GPU!
- **MiniMind** series is extremely lightweight, the smallest version is **1/7000** the size of GPT-3
-- The project open-sources the minimalist structure of large models, including:
- - Mixture of Experts (MoE)
- - Dataset cleaning
- - Pretraining
- - Supervised Fine-Tuning (SFT)
- - LoRA fine-tuning
- - Direct Preference Optimization (DPO)
- - Model distillation
-- All core algorithm code is reconstructed from scratch using native PyTorch, without relying on third-party abstract interfaces
-- This is not only a full-stage open-source reproduction of large language models, but also a tutorial for getting started with LLMs
+- Complete implementation covering:
+ - **Tokenizer training** with custom vocabulary
+ - **Pretraining** (knowledge learning)
+ - **Supervised Fine-Tuning (SFT)** (conversation patterns)
+ - **LoRA fine-tuning** (parameter-efficient adaptation)
+ - **Direct Preference Optimization (DPO)** (human preference alignment)
+ - **RLAIF algorithms** (PPO/GRPO/SPO - reinforcement learning)
+ - **Knowledge distillation** (compress large model knowledge)
+ - **Model reasoning distillation** (DeepSeek-R1 style)
+ - **YaRN algorithm** (context length extrapolation)
+- **Pure PyTorch implementation**: All core algorithms are implemented from scratch using native PyTorch, without relying on third-party abstract interfaces
+- **Educational value**: This is not only a full-stage open-source reproduction of large language models, but also a comprehensive tutorial for getting started with LLMs
+- **Extended capabilities**: MiniMind now supports [MiniMind-V](https://github.com/jingyaogong/minimind-v) for vision multimodal tasks
-!!! note "Training Cost"
- "2 hours" is based on NVIDIA 3090 hardware (single card) testing, "$0.5" refers to GPU server rental cost
+!!! note "Training Cost & Time"
+ "2 hours" is based on **NVIDIA 3090** hardware (single card) testing
+
+ "$3" refers to GPU server rental cost
+
+ With 8× RTX 4090 GPUs, training time can be compressed to **under 10 minutes**
-## ✨ Key Features
+## ✨ Key Highlights
-- **Ultra-low cost**: Single 3090, 2 hours, $0.5 to train a ChatBot from scratch
-- **Complete pipeline**: Covers Tokenizer, pretraining, SFT, LoRA, DPO, distillation full process
-- **Education-friendly**: Clean code, suitable for learning LLM principles
-- **Ecosystem compatible**: Supports `transformers`, `llama.cpp`, `vllm`, `ollama` and other mainstream frameworks
+- **Ultra-low cost**: Single 3090, 2 hours, $3 to train a fully functional ChatBot from scratch
+- **Complete pipeline**: Tokenizer → Pretraining → SFT → LoRA → DPO/RLAIF → Distillation → Reasoning
+- **Latest algorithms**: Implements cutting-edge techniques including GRPO, SPO, and YaRN
+- **Education-friendly**: Clean, well-documented code suitable for learning LLM principles
+- **Ecosystem compatible**: Seamless support for `transformers`, `trl`, `peft`, `llama.cpp`, `vllm`, `ollama`, and `Llama-Factory`
+- **Full capabilities**: Supports multi-GPU training (DDP/DeepSpeed), model visualization (Wandb/SwanLab), and dynamic checkpoint management
+- **Production-ready**: OpenAI API protocol support for easy integration with third-party UIs (FastGPT, Open-WebUI, etc.)
+- **Multimodal extension**: Extended to vision with [MiniMind-V](https://github.com/jingyaogong/minimind-v)
-## 📋 Model List
+## 📋 Model Series
-| Model (Size) | Inference Memory (Approx.) | Release |
-|------------|----------|---------|
-| MiniMind2-small (26M) | 0.5 GB | 2025.04.26 |
-| MiniMind2-MoE (145M) | 1.0 GB | 2025.04.26 |
-| MiniMind2 (104M) | 1.0 GB | 2025.04.26 |
+### MiniMind2 Series (Latest - 2025.04.26)
+
+| Model | Parameters | Vocabulary | Layers | Hidden Dim | Context | Inference Memory |
+|-------|-----------|------------|--------|-----------|---------|-----------------|
+| MiniMind2-small | 26M | 6,400 | 8 | 512 | 2K | ~0.5 GB |
+| MiniMind2-MoE | 145M | 6,400 | 8 | 640 | 2K | ~1.0 GB |
+| MiniMind2 | 104M | 6,400 | 16 | 768 | 2K | ~1.0 GB |
+
+### MiniMind-V1 Series (Legacy - 2024.09.01)
+
+| Model | Parameters | Vocabulary | Layers | Hidden Dim | Context |
+|-------|-----------|------------|--------|-----------|---------|
+| minimind-v1-small | 26M | 6,400 | 8 | 512 | 2K |
+| minimind-v1-moe | 104M | 6,400 | 8 | 512 | 2K |
+| minimind-v1 | 108M | 6,400 | 16 | 768 | 2K |
+
+## 📅 Latest Updates (2025-10-24)
+
+🔥 **RLAIF Training Algorithms**: Native implementation of PPO, GRPO, and SPO
+
+- **YaRN Algorithm**: RoPE length extrapolation for improved long-sequence handling
+- **Adaptive Thinking**: Reasoning models support optional thinking chains
+- **Full template support**: Tool calling and reasoning tags (`<think>`, `<answer>`, etc.)
+- **Visualization**: Switched from WandB to [SwanLab](https://swanlab.cn/) (China-friendly)
+- **Reasoning models**: Complete MiniMind-Reason series based on DeepSeek-R1 distillation
+
+## 🎯 Project Contents
+
+- Complete MiniMind-LLM architecture code (Dense + MoE models)
+- Detailed Tokenizer training code
+- Full training pipeline: Pretrain → SFT → LoRA → RLHF/RLAIF → Distillation
+- High-quality, curated and deduplicated datasets at all stages
+- Native PyTorch implementation of key algorithms, minimal third-party dependencies
+- Multi-GPU training support (single-machine multi-card DDP, DeepSpeed, distributed clusters)
+- Visualization with Wandb/SwanLab
+- Model evaluation on third-party benchmarks (C-Eval, C-MMLU, OpenBookQA)
+- YaRN algorithm for RoPE context length extrapolation
+- OpenAI API protocol server for easy integration
+- Streamlit web UI for chat
+- Full compatibility with community tools: llama.cpp, vllm, ollama, Llama-Factory
+- MiniMind-Reason models: Complete open-source data + weights for reasoning distillation
## 📖 Quick Navigation
-- [Quick Start](quickstart.md) - Environment setup, model download, quick testing
-- [Model Training](training.md) - Pretraining, SFT, LoRA, DPO training process
+- **[Quick Start](quickstart.md)** - Environment setup, model download, quick testing
+- **[Model Training](training.md)** - Pretraining, SFT, LoRA, RLHF, RLAIF, and reasoning training
-## 🔗 Related Links
+## 🔗 Links & Resources
+**Project Repositories**:
- **GitHub**: [https://github.com/jingyaogong/minimind](https://github.com/jingyaogong/minimind)
- **HuggingFace**: [MiniMind Collection](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
-- **ModelScope**: [MiniMind Models](https://www.modelscope.cn/profile/gongjy)
-- **Online Demo**: [ModelScope Studio](https://www.modelscope.cn/studios/gongjy/MiniMind)
+- **ModelScope**: [MiniMind Profile](https://www.modelscope.cn/profile/gongjy)
+
+**Online Demos**:
+- [ModelScope Studio - Standard Chat](https://www.modelscope.cn/studios/gongjy/MiniMind)
+- [ModelScope Studio - Reasoning Model](https://www.modelscope.cn/studios/gongjy/MiniMind-Reasoning)
+- [Bilibili Video Introduction](https://www.bilibili.com/video/BV12dHPeqE72/)
+
+**Vision Extension**:
+- [MiniMind-V](https://github.com/jingyaogong/minimind-v) - Multimodal vision language models
+
+## 💡 Why MiniMind?
+
+The AI community is flooded with high-cost, complex frameworks that abstract away the fundamentals. MiniMind aims to democratize LLM learning by:
+
+1. **Lowering the barrier**: No need for expensive GPUs or cloud services
+2. **Understanding, not just using**: Learn every detail from tokenization to inference
+3. **End-to-end learning**: Train from scratch, not just fine-tune existing models
+4. **Code clarity**: Pure PyTorch implementations you can read and understand
+5. **Practical results**: Get a working ChatBot with minimal resources
+
+As we say: **"Building a Lego airplane is far more exciting than flying first class!"**
+
+---
+
+Next: [Get Started β](quickstart.md)
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 7eaa845..266540f 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -1,114 +1,279 @@
# Quick Start
-This page will help you quickly get started with the MiniMind project.
+Get MiniMind up and running in minutes!
## 📋 Requirements
+### Hardware
+
+- **GPU Memory**: 8GB minimum (24GB recommended for comfortable development)
+- **Recommended GPU**: NVIDIA RTX 3090 (24GB)
+
+### Software
+
- **Python**: 3.10+
-- **PyTorch**: 1.12+
+- **PyTorch**: 2.0+ (with CUDA 12.2+ for GPU support)
- **CUDA**: 12.2+ (optional, for GPU acceleration)
-- **VRAM**: At least 8GB (24GB recommended)
!!! tip "Hardware Configuration Reference"
- - CPU: Intel i9-10980XE @ 3.00GHz
- - RAM: 128 GB
- - GPU: NVIDIA GeForce RTX 3090 (24GB)
+ - **CPU**: Intel i9-10980XE @ 3.00GHz
+ - **RAM**: 128 GB
+ - **GPU**: NVIDIA GeForce RTX 3090 (24GB) × 8
+ - **OS**: Ubuntu 20.04
+ - **CUDA**: 12.2
+ - **Python**: 3.10.16
-## 🚀 Testing Existing Models
-
-### 1. Clone the Project
+## 🚀 Step 0: Clone the Repository
```bash
git clone https://github.com/jingyaogong/minimind.git
cd minimind
```
-### 2. Install Dependencies
+## 🎯 Section I: Testing Existing Models
+
+### 1. Environment Setup
```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
-!!! warning "Torch CUDA Check"
- After installation, test if Torch can use CUDA:
+!!! warning "Verify CUDA Support"
+ After installation, verify PyTorch can access CUDA:
```python
import torch
print(torch.cuda.is_available())
+ print(torch.cuda.get_device_name(0))
```
+ If `False`, download the correct PyTorch version from [PyTorch Official](https://download.pytorch.org/whl/torch_stable.html)
-### 3. Download Model
+### 2. Download Pretrained Models
-Download pretrained models from HuggingFace or ModelScope:
+Choose one option:
+**From HuggingFace** (recommended for international users):
```bash
-# From HuggingFace
git clone https://huggingface.co/jingyaogong/MiniMind2
+```
-# Or from ModelScope
+**From ModelScope** (recommended for China users):
+```bash
git clone https://www.modelscope.cn/models/gongjy/MiniMind2.git
```
-### 4. Command Line Q&A
+### 3. Command-Line Chat
```bash
-# load=0: load PyTorch model, load=1: load Transformers model
+# load=0: load PyTorch model, load=1: load transformers model
python eval_model.py --load 1 --model_mode 2
```
-### 5. Start WebUI (Optional)
+**Model Modes**:
+- `model_mode 0`: Pretrain model (word continuation)
+- `model_mode 1`: SFT Chat model (conversation)
+- `model_mode 2`: RLHF model (refined responses, currently same as SFT for small models)
+- `model_mode 3`: Reasoning model (with thinking chains)
+- `model_mode 4/5`: RLAIF models (PPO/GRPO trained)
+
+**Example Session**:
+```text
+👶: Hello, please introduce yourself.
+🤖️: I am MiniMind, an AI assistant developed by Jingyao Gong.
+   I use natural language processing and machine learning algorithms to interact with users.
+
+👶: What's the capital of France?
+🤖️: The capital of France is Paris, which is located in the northern central part of France.
+   It is the largest city in France and serves as its political, economic, and cultural center.
+```
+
+### 4. Web UI Demo (Optional)
```bash
# Requires Python >= 3.10
pip install streamlit
+
cd scripts
streamlit run web_demo.py
```
-Visit `http://localhost:8501` to use the web interface.
+Visit `http://localhost:8501` to use the interactive web interface.
-## 🔧 Third-party Inference Frameworks
+### 5. RoPE Length Extrapolation with YaRN
-MiniMind supports multiple mainstream inference frameworks:
+Extend context length beyond training with RoPE extrapolation:
-### Ollama
+```bash
+python eval_model.py --inference_rope_scaling True
+```
+
+This enables the YaRN algorithm to handle sequences longer than the 2K training context, useful for processing documents and long conversations.
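
As a rough intuition, the simplest form of RoPE extrapolation is linear position interpolation, which YaRN refines with NTK-aware frequency scaling. A minimal sketch of the linear idea (function names are illustrative, not from the MiniMind codebase):

```python
import math

def rope_inv_freq(head_dim: int, base: float = 10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def scaled_positions(seq_len: int, train_len: int = 2048):
    """Linear position interpolation: squeeze longer sequences back
    into the position range the model saw during training."""
    scale = max(1.0, seq_len / train_len)
    return [pos / scale for pos in range(seq_len)]

# A 4K sequence is mapped back into the 2K training range,
# so rotation angles stay within what the model has seen.
positions = scaled_positions(4096, train_len=2048)
print(positions[-1])  # largest effective position stays below 2048
```

YaRN improves on this by scaling high and low frequencies differently, which preserves short-range precision while extending long-range reach.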
+
+## 🔧 Third-Party Inference Frameworks
+
+MiniMind is compatible with popular inference engines:
+
+### Ollama (Easiest)
```bash
ollama run jingyaogong/minimind2
```
-### vLLM
+### vLLM (Fastest)
```bash
-vllm serve ./MiniMind2/ --served-model-name "minimind"
+vllm serve ./MiniMind2/ --served-model-name "minimind" --port 8000
+
+# Test with curl
+curl http://localhost:8000/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "minimind",
+ "messages": [{"role": "user", "content": "Hello!"}],
+ "temperature": 0.7,
+ "max_tokens": 512
+ }'
```
-### llama.cpp
+### llama.cpp (CPU-Friendly)
```bash
-# Convert model
-python convert_hf_to_gguf.py ./MiniMind2/
+# Convert to GGUF format
+python scripts/convert_model.py ./MiniMind2/ --output ./MiniMind2.gguf
-# Quantize model
-./build/bin/llama-quantize ./MiniMind2/MiniMind2-109M-F16.gguf ./Q4-MiniMind2.gguf Q4_K_M
+# Quantize for size reduction
+./llama-quantize ./MiniMind2.gguf ./MiniMind2-Q4.gguf Q4_K_M
-# Inference
-./build/bin/llama-cli -m ./Q4-MiniMind2.gguf --chat-template chatml
+# Run inference
+./llama-cli -m ./MiniMind2-Q4.gguf -p "Hello" -n 128
```
-## 📊 Effect Testing
+## 🌐 OpenAI API Server (For Integration)
+
+Run MiniMind as an OpenAI API-compatible service:
+
+```bash
+python scripts/serve_openai_api.py
+```
+
+Test the API:
+
+```bash
+# In another terminal
+python scripts/chat_openai_api.py
+```
+
+**cURL Example**:
+```bash
+curl http://localhost:8000/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "minimind",
+ "messages": [
+ {"role": "user", "content": "Explain machine learning in one sentence."}
+ ],
+ "temperature": 0.7,
+ "max_tokens": 256,
+ "stream": true
+ }'
+```
+
+This enables integration with:
+- [FastGPT](https://fastgpt.run/)
+- [Open-WebUI](https://github.com/open-webui/open-webui)
+- [Dify](https://dify.ai/)
+- Any OpenAI API-compatible client
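
For a quick programmatic test, a minimal stdlib-only Python client can be sketched against the same endpoint the cURL example uses (URL, port, and model name are assumed from the examples above):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "minimind",
                       temperature: float = 0.7, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload and extract the first choice's message text."""
    payload = build_chat_request(prompt)
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server from scripts/serve_openai_api.py to be running):
# reply = chat("Explain machine learning in one sentence.")
```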
+
+## 📊 Model Selection Guide
+
+| Use Case | Recommended Model | Memory | Speed |
+|----------|------------------|--------|-------|
+| Learning/Testing | MiniMind2-small (26M) | ~0.5 GB | Fastest |
+| Balanced | MiniMind2 (104M) | ~1.0 GB | Fast |
+| Expert System (MoE) | MiniMind2-MoE (145M) | ~1.0 GB | Dynamic |
+| Reasoning/Complex | MiniMind-Reason (104M) | ~1.0 GB | Standard |
+
+## ⚡ Quick Test Results
+
+**Model**: MiniMind2 (104M parameters)
```text
-👶: Hello, please introduce yourself.
-🤖️: Hello! I'm MiniMind, an AI assistant developed by Jingyao Gong.
-   I interact with users through natural language processing and algorithm training.
+Q: What is photosynthesis?
+A: Photosynthesis is a process in which plants convert light energy from the sun
+ into chemical energy to produce glucose. This process occurs mainly in leaves
+ and is essential for plant growth and survival.
-👶: What is the highest mountain in the world?
-🤖️: Mount Everest is the highest mountain in the world, located in the Himalayas,
-   with an elevation of 8,848.86 meters (29,031.7 feet).
+Q: Write a Python function to calculate Fibonacci numbers.
+A: def fibonacci(n):
+ if n <= 1:
+ return n
+ return fibonacci(n-1) + fibonacci(n-2)
+
+   # For better performance, use dynamic programming:
+   def fibonacci_dp(n):
+       if n <= 1:
+           return n
+       dp = [0] * (n + 1)
+       dp[1] = 1
+       for i in range(2, n + 1):
+           dp[i] = dp[i-1] + dp[i-2]
+       return dp[n]
+
+Q: 世界上最高的山峰是什么? (What is the highest mountain?)
+A: 珠穆朗玛峰（Mount Everest）是世界上最高的山峰，位于喜马拉雅山脉...
+   (Mount Everest is the world's highest mountain, located in the Himalayas...)
```
-## 🎯 Next Steps
+## 🔍 Troubleshooting
-- Check [Model Training](training.md) to learn how to train your own model from scratch
-- Read the source code to understand LLM implementation principles
+### Issue: CUDA Out of Memory
+
+**Solution**:
+```bash
+# Reduce batch size
+python eval_model.py --batch_size 1
+
+# Or use CPU (slow but works)
+python eval_model.py --device cpu
+```
+
+### Issue: Slow Inference
+
+**Solutions**:
+- Use vLLM or llama.cpp for faster inference
+- Enable quantization (4-bit, 8-bit)
+- Use GPU instead of CPU
+- Reduce `max_tokens` parameter
+
+### Issue: Model Responses Are Poor Quality
+
+**Possible Causes**:
+- Using pretrain model (`model_mode 0`) instead of SFT (`model_mode 1`)
+- Model is undertrained - download the full checkpoint instead
+- Input prompt is too short - provide more context
+
+### Issue: Python/PyTorch Version Mismatch
+
+**Solution**:
+```bash
+# Use conda for clean environment
+conda create -n minimind python=3.10
+conda activate minimind
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+pip install -r requirements.txt
+```
+
+## 🎯 Next Steps
+
+- **[Model Training Guide](training.md)** - Train your own MiniMind from scratch
+- **[Source Code](https://github.com/jingyaogong/minimind)** - Explore and learn LLM implementation
+- **[Inference Benchmarks](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)** - See model performance comparisons
+
+## 💡 Pro Tips
+
+1. **GPU Memory Optimization**: Use `torch.cuda.empty_cache()` periodically
+2. **Batch Processing**: For efficiency, process multiple prompts in batches
+3. **Temperature Tuning**: Lower (0.3-0.7) = more consistent, Higher (0.8-1.0) = more creative
+4. **Prompt Engineering**: Better prompts β better results, even for small models
+5. **Model Quantization**: Use 4-bit quantization to run on smaller GPUs
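
The temperature tip above can be seen directly in a toy softmax: dividing logits by the temperature sharpens or flattens the output distribution. A minimal illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T before softmax: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.3)  # near-greedy
hot = softmax_with_temperature(logits, 1.0)   # more spread out
print(cold[0], hot[0])  # the top token gets more mass at low temperature
```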
+
+---
+
+Done! Now you're ready to use MiniMind. Start with the Quick Start, then move to [Model Training](training.md) to learn how to train your own models.
diff --git a/docs/training.md b/docs/training.md
index dd6624d..8fdf672 100644
--- a/docs/training.md
+++ b/docs/training.md
@@ -1,155 +1,419 @@
-# Model Training
+# Model Training Guide
-This page introduces how to train MiniMind language models from scratch.
+Learn how to train MiniMind language models from scratch using pure PyTorch.
-## 📊 Data Preparation
+## 📋 Training Overview
-### 1. Download Dataset
+MiniMind implements a complete training pipeline:
-Download datasets from [ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files) or [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset).
-
-Create `./dataset` directory and place data files:
-
-```bash
-./dataset/
-├── pretrain_hq.jsonl (1.6GB, ✨Recommended)
-├── sft_mini_512.jsonl (1.2GB, ✨Recommended)
-├── sft_512.jsonl (7.5GB)
-├── sft_1024.jsonl (5.6GB)
-├── sft_2048.jsonl (9GB)
-├── dpo.jsonl (909MB)
-├── r1_mix_1024.jsonl (340MB)
-└── lora_*.jsonl
+```
+Tokenizer Training
+          ↓
+   Pretraining (Learn knowledge)
+          ↓
+   SFT (Learn conversation)
+          ↓
+   ┌──────────────┬─────────────────────────┬──────────────┐
+   │              │                         │              │
+  LoRA        DPO/RLHF        RLAIF (PPO/GRPO/SPO)    Distillation
+(Domain adapt) (Preference)  (Reinforcement learning)  (Reasoning)
```
-!!! tip "Recommended Combination"
- Fastest reproduction: `pretrain_hq.jsonl` + `sft_mini_512.jsonl`
+## 💰 Training Costs (Single NVIDIA 3090)
+
+| Model | Dataset | Duration | Cost (RMB) | Quality |
+|-------|---------|----------|-----------|---------|
+| MiniMind2-Small | pretrain_hq + sft_mini_512 | 2.1h | ≈3 | 😊😊 |
+| MiniMind2-Small | Full dataset | 38h | ≈50 | 😊😊😊😊😊😊 |
+| MiniMind2 | pretrain_hq + sft_mini_512 | 3.3h | ≈5 | 😊😊 |
+| MiniMind2 | Full dataset | 122h | ≈160 | 😊😊😊😊😊😊😊 |
+
+!!! success "Ultra-Fast Training"
+ **Just 2.1 hours + $3 = Functional ChatBot!**
- **Single 3090 only needs 2 hours + $0.5!**
+ Use `pretrain_hq.jsonl` + `sft_mini_512.jsonl` for fastest reproduction
-### 2. Data Format
+## 📊 Data Preparation
-**Pretrain Data** (`pretrain_hq.jsonl`):
+### 1. Download Datasets
+
+Download from [ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind_dataset) or [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset):
+
+```bash
+mkdir -p dataset
+cd dataset
+# Download required files
+```
+
+### 2. Dataset Directory Structure
+
+```
+./dataset/
+├── pretrain_hq.jsonl ✨ (1.6GB, required for pretraining)
+├── sft_mini_512.jsonl ✨ (1.2GB, fastest SFT)
+├── sft_512.jsonl (7.5GB, standard SFT)
+├── sft_1024.jsonl (5.6GB, longer SFT)
+├── sft_2048.jsonl (9GB, very long SFT)
+├── dpo.jsonl (909MB, DPO training)
+├── r1_mix_1024.jsonl (340MB, reasoning distillation)
+├── rlaif-mini.jsonl (1MB, RLAIF algorithms)
+├── lora_identity.jsonl (22.8KB, identity LoRA)
+└── lora_medical.jsonl (34MB, medical domain LoRA)
+
+### 3. Data Formats
+
+**Pretraining Data** (`pretrain_hq.jsonl`):
```json
-{"text": "How to overcome procrastination? Overcoming procrastination is not easy..."}
+{"text": "How to overcome procrastination? Overcoming procrastination is not easy, but these suggestions may help..."}
```
**SFT Data** (`sft_*.jsonl`):
```json
{
"conversations": [
- {"role": "user", "content": "Hello"},
- {"role": "assistant", "content": "Hello!"}
+ {"role": "user", "content": "Hello!"},
+ {"role": "assistant", "content": "Hello! How can I help?"},
+ {"role": "user", "content": "Tell me a joke."},
+ {"role": "assistant", "content": "Why did the scarecrow win an award? Because he was outstanding in his field!"}
]
}
```
-## 🎯 Training Pipeline
+**DPO Data** (`dpo.jsonl`):
+```json
+{
+ "chosen": [
+ {"role": "user", "content": "What is 2+2?"},
+ {"role": "assistant", "content": "2+2 equals 4."}
+ ],
+ "rejected": [
+ {"role": "user", "content": "What is 2+2?"},
+ {"role": "assistant", "content": "2+2 equals 5."}
+ ]
+}
+```
-All training scripts are located in the `./trainer` directory.
+**LoRA Domain Data** (`lora_*.jsonl`):
+```json
+{
+ "conversations": [
+ {"role": "user", "content": "What's the treatment for cervical spondylosis?"},
+ {"role": "assistant", "content": "Cervical spondylosis treatment typically includes..."}
+ ]
+}
+```
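
A quick sanity check for the conversation-style formats above can be sketched in a few lines (the alternating user/assistant rule is inferred from the examples, not taken from an official validator):

```python
import json

def valid_conversation(record: dict) -> bool:
    """Check the SFT/LoRA format: alternating user/assistant turns, starting with user."""
    turns = record.get("conversations", [])
    if not turns or turns[0]["role"] != "user":
        return False
    expected = ["user", "assistant"]
    return all(t["role"] == expected[i % 2] for i, t in enumerate(turns))

line = '{"conversations": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi!"}]}'
print(valid_conversation(json.loads(line)))  # True
```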
-### 1. Pretraining
+## 🎯 Complete Training Pipeline
-The pretraining stage lets the model learn basic knowledge, the goal is to **learn word continuation**.
+All training scripts are in the `./trainer` directory.
```bash
cd trainer
-python train_pretrain.py
-
-# Multi-GPU training
-torchrun --nproc_per_node 2 train_pretrain.py
```
-Output weights: `./out/pretrain_*.pth`
+### Stage 1: Pretraining
-!!! info "Training Duration"
- - MiniMind2-Small (26M): ~1.1h (single 3090)
- - MiniMind2 (104M): ~3.9h (single 3090)
-
-### 2. Supervised Fine-Tuning (SFT)
-
-The SFT stage teaches the model conversation patterns and adapts to chat templates.
+**Purpose**: Learn foundational knowledge (word continuation)
```bash
+# Single GPU
+python train_pretrain.py
+
+# Multi-GPU (DDP)
+torchrun --nproc_per_node 2 train_pretrain.py
+
+# Multi-GPU (DeepSpeed)
+deepspeed --master_port 29500 --num_gpus=2 train_pretrain.py
+```
+
+**Key Parameters**:
+- `max_seq_len`: 512 (adjust based on GPU memory)
+- `learning_rate`: 1e-4
+- `epochs`: Adjust based on dataset size
+
+**Output**: `./out/pretrain_*.pth`
+
+**Training Duration**:
+- MiniMind2-Small (26M): ~1.1h
+- MiniMind2 (104M): ~3.9h
+
+!!! tip "Pretraining Tips"
+ - Start with `pretrain_hq.jsonl` for best results
+ - Quality > Quantity for pretraining data
+ - Monitor loss curve to detect overfitting
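
The "word continuation" objective boils down to next-token prediction: the target sequence is simply the input shifted left by one position. A minimal sketch with hypothetical token ids:

```python
def next_token_pairs(token_ids):
    """Pretraining is next-token prediction: input is the sequence,
    target is the same sequence shifted left by one."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

# Hypothetical token ids for a short pretraining sample
ids = [101, 7, 42, 9, 88]
x, y = next_token_pairs(ids)
print(x, y)  # [101, 7, 42, 9] [7, 42, 9, 88]
```

In training, cross-entropy is computed between the model's logits at each position and the corresponding target token.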
+
+### Stage 2: Supervised Fine-Tuning (SFT)
+
+**Purpose**: Teach conversation patterns and chat templates
+
+```bash
+# Single GPU
python train_full_sft.py
-# Multi-GPU training
+# Multi-GPU
torchrun --nproc_per_node 2 train_full_sft.py
```
-Output weights: `./out/full_sft_*.pth`
+**Configuration**:
+- Load pretrained model from Stage 1
+- Use SFT dataset (`sft_mini_512.jsonl` or `sft_512.jsonl`)
+- Adjust `max_seq_len` to match training data
-!!! info "Training Duration"
- - MiniMind2-Small: ~1h (using sft_mini_512)
- - MiniMind2: ~3.3h (using sft_mini_512)
+**Output**: `./out/full_sft_*.pth`
-### 3. LoRA Fine-tuning (Optional)
+**Training Duration**:
+- With sft_mini_512: 1-3 hours
+- With full sft_512: 20-25 hours
-LoRA is a parameter-efficient fine-tuning method, suitable for domain adaptation.
+!!! warning "SFT Data Selection"
+ - `sft_mini_512.jsonl`: Fastest, ~1.2GB, 512 tokens max
+ - `sft_512.jsonl`: Standard, ~7.5GB, 512 tokens max
+ - `sft_1024.jsonl`: Longer, ~5.6GB, 1024 tokens max
+ - `sft_2048.jsonl`: Extended, ~9GB, 2048 tokens max
+
+### Stage 3: LoRA Fine-Tuning (Optional)
+
+**Purpose**: Parameter-efficient domain adaptation
+
+**Use Cases**:
+- Medical Q&A knowledge
+- Personal identity/self-awareness
+- Proprietary domain knowledge
```bash
+# Edit train_lora.py to set correct dataset and base model
+python train_lora.py
+
+# Multi-GPU
+torchrun --nproc_per_node 2 train_lora.py
+```
+
+**Output**: `./out/lora/lora_*.pth`
+
+**Example 1: Medical Domain**
+
+Prepare `dataset/lora_medical.jsonl`:
+```json
+{
+ "conversations": [
+ {"role": "user", "content": "What's the correct pillow height for cervical spondylosis?"},
+ {"role": "assistant", "content": "For cervical spondylosis, pillow height should be..."}
+ ]
+}
+```
+
+Train:
+```bash
+# Modify train_lora.py: lora_name = 'medical'
python train_lora.py
```
-**Use Cases**:
-- Medical Q&A: use `lora_medical.jsonl`
-- Self-awareness: use `lora_identity.jsonl`
+**Example 2: Identity/Self-Awareness**
-Output weights: `./out/lora/lora_*.pth`
+Prepare `dataset/lora_identity.jsonl`:
+```json
+{
+ "conversations": [
+ {"role": "user", "content": "Who are you?"},
+ {"role": "assistant", "content": "I am MiniMind..."}
+ ]
+}
+```
-### 4. DPO Reinforcement Learning (Optional)
+### Stage 4: Direct Preference Optimization (DPO)
-DPO is used to optimize model response quality to better align with human preferences.
+**Purpose**: Align model responses with human preferences
+
+DPO eliminates the need for separate reward models by directly optimizing preference pairs.
```bash
python train_dpo.py
+
+# Multi-GPU
+torchrun --nproc_per_node 2 train_dpo.py
```
-Output weights: `./out/rlhf_*.pth`
+**Output**: `./out/rlhf_*.pth`
-### 5. Reasoning Model Distillation (Optional)
+**Key Features**:
+- Off-policy training (reuse data across epochs)
+- No separate reward model needed
+- Better sample efficiency than PPO
+- Stable training convergence
-Distill reasoning capabilities from DeepSeek-R1.
+**Training Duration**: ~1-3 hours
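
The core of DPO is a single log-sigmoid loss over the policy-vs-reference log-probability margins of the chosen and rejected answers. A scalar sketch with hypothetical log-probabilities (not the `train_dpo.py` implementation):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))"""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# The loss falls as the policy prefers the chosen answer more than the reference does.
small_margin = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # no preference learned yet
big_margin = dpo_loss(-4.0, -6.0, -5.0, -5.0)    # chosen up, rejected down
print(small_margin, big_margin)
```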
+
+### Stage 5: Reinforcement Learning from AI Feedback (RLAIF)
+
+RLAIF is an advanced training approach using AI-generated rewards instead of human annotations. MiniMind implements three modern algorithms:
+
+#### 5.1 PPO (Proximal Policy Optimization)
+
+Classical on-policy RL algorithm with proven stability.
+
+```bash
+python train_ppo.py
+
+# Multi-GPU
+torchrun --nproc_per_node 2 train_ppo.py
+```
+
+**Algorithm**:
+$$\mathcal{L}_{PPO} = -\mathbb{E}\left[\min(r_t \cdot A_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) \cdot A_t)\right] + \beta \cdot \mathbb{E}[\text{KL}]$$
+
+**Characteristics**:
+- Stable but slower reward improvement
+- Requires both Actor and Critic networks
+- High memory usage (1.5-2Γ single network)
+- Good for exploration
+
+**Output**: `./out/ppo_actor_*.pth`
+
+**Training Duration**: ~1-3 hours
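
The clipped surrogate in the objective above is easy to illustrate on scalars; `ppo_clip_term` is a hypothetical helper, not code from `train_ppo.py`:

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A): caps how far one update can push the policy."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# When the new policy over-weights an action (ratio 1.5) with positive advantage,
# the clipped term limits the objective to 1.2 * A.
print(ppo_clip_term(1.5, 2.0))   # 2.4, not 3.0
print(ppo_clip_term(0.5, -1.0))  # the pessimistic min also bounds negative-advantage cases
```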
+
+#### 5.2 GRPO (Group Relative Policy Optimization)
+
+Modern algorithm used in DeepSeek-R1, with faster convergence.
+
+```bash
+python train_grpo.py
+
+# Multi-GPU
+torchrun --nproc_per_node 2 train_grpo.py
+```
+
+**Algorithm**:
+$$\mathcal{L}_{GRPO} = -\mathbb{E}\left[r_t \cdot A_t - \beta \cdot \text{KL}_t\right]$$
+
+Where advantage is computed as:
+$$A_t = \frac{R - \mu_{group}}{\sigma_{group}}$$
+
+**Characteristics**:
+- Single-network design (memory efficient)
+- Faster reward improvement
+- Group normalization removes bias
+- Better convergence stability
+
+**Output**: `./out/grpo_*.pth`
+
+**Training Duration**: ~1-3 hours
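
The group-normalized advantage above is simple to sketch: score several completions of the same prompt, then standardize rewards within the group (illustrative only, with hypothetical reward values):

```python
import statistics

def group_advantages(rewards):
    """A_i = (R_i - mean) / std over the group of completions for one prompt."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero on identical rewards
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for the same prompt, scored by the reward model:
advs = group_advantages([0.1, 0.4, 0.7, 0.2])
print([round(a, 2) for a in advs])  # zero-mean, unit-variance within the group
```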
+
+#### 5.3 SPO (Single-stream Policy Optimization)
+
+Newest algorithm (2025) addressing GRPO's degenerate group problem.
+
+```bash
+python train_spo.py
+
+# Multi-GPU
+torchrun --nproc_per_node 2 train_spo.py
+```
+
+**Algorithm**:
+$$\mathcal{L}_{SPO} = -\mathbb{E}\left[\log \pi_\theta(a_t|s) \cdot A_t - \beta \cdot \text{KL}_t\right]$$
+
+With adaptive baseline: $B_t^{adaptive}$
+
+**Characteristics**:
+- No group dependency (1 input → 1 training sample)
+- Adaptive value tracking
+- Better handling of difficult examples
+- Experimental on small models
+
+**Output**: `./out/spo_*.pth`
+
+**Training Duration**: ~1-3 hours
+
+#### RLAIF Dataset Preparation
+
+All RLAIF algorithms use `rlaif-mini.jsonl` (1MB, 10k examples):
+
+The format is the same as SFT data, except the assistant content is the placeholder "无" ("none"):
+
+```json
+{
+  "conversations": [
+    {"role": "user", "content": "Explain photosynthesis briefly."},
+    {"role": "assistant", "content": "无"}
+  ]
+}
+```
+
+The model generates completions during training, which are scored by a **Reward Model** (e.g., InternLM2-1.8B-Reward).
+
+**Reward Model Setup**:
+
+```bash
+# Download reward model to parent directory
+cd ../
+git clone https://huggingface.co/internlm/internlm2-1_8b-reward
+
+# Directory structure should be:
+# project/
+# βββ minimind/
+# βββ internlm2-1_8b-reward/
+```
+
+#### RLAIF vs DPO Comparison
+
+| Aspect | DPO | RLAIF (PPO/GRPO/SPO) |
+|--------|-----|---------------------|
+| Training Type | Off-policy | On-policy |
+| Data Freshness | Static pairs | Dynamic (generated) |
+| Reward Source | Implicit | Explicit model |
+| Convergence | Fast | Slower |
+| Memory Usage | Lower | Higher |
+| Best For | Preference refinement | Capability improvement |
+
+### Stage 6: Reasoning Model Distillation
+
+**Purpose**: Distill DeepSeek-R1-style reasoning into MiniMind
```bash
python train_distill_reason.py
+
+# Multi-GPU
+torchrun --nproc_per_node 2 train_distill_reason.py
```
-Output weights: `./out/reason_*.pth`
-
-## 📐 Model Architecture
-
-MiniMind uses Transformer Decoder-Only architecture (similar to Llama3):
-
-
-
-### Model Parameter Configuration
-
-| Model Name | params | d_model | n_layers | kv_heads | q_heads |
-|------------|--------|---------|----------|----------|---------|
-| MiniMind2-Small | 26M | 512 | 8 | 2 | 8 |
-| MiniMind2-MoE | 145M | 640 | 8 | 2 | 8 |
-| MiniMind2 | 104M | 768 | 16 | 2 | 8 |
-
-## 🧪 Test Model
-
-```bash
-# model_mode: 0=pretrain, 1=sft, 2=rlhf, 3=reason
-python eval_model.py --model_mode 1
-
-# Test LoRA model
-python eval_model.py --lora_name 'lora_medical' --model_mode 2
+**Data Format** (`r1_mix_1024.jsonl`):
+```json
+{
+ "conversations": [
+ {
+ "role": "user",
+ "content": "Solve: 5 + 3 = ?"
+ },
+ {
+ "role": "assistant",
+      "content": "<think>\nI need to add 5 and 3.\n5 + 3 = 8\n</think>\n<answer>\n5 + 3 = 8\n</answer>"
+ }
+ ]
+}
```
+**Output**: `./out/reason_*.pth`
+
+**Training Features**:
+- Enforces `<think></think>` and `<answer></answer>` tags
+- Penalty loss for format violations
+- Mixed data (reasoning + multi-turn + English)
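
Assuming DeepSeek-R1-style `<think>`/`<answer>` tags, a format check of the kind a penalty loss would rely on can be sketched as follows (illustrative, not the `train_distill_reason.py` implementation):

```python
import re

def format_ok(response: str) -> bool:
    """Accept only responses shaped as <think>...</think> followed by <answer>...</answer>."""
    return re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
        response, flags=re.DOTALL,
    ) is not None

good = "<think>\n5 + 3 = 8\n</think>\n<answer>\n8\n</answer>"
print(format_ok(good), format_ok("8"))  # True False
```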
+
## 🔧 Multi-GPU Training
-### DDP Method
+### DDP (Distributed Data Parallel)
+
+Best for single-machine multi-GPU:
```bash
torchrun --nproc_per_node N train_xxx.py
+# N = number of GPUs
```
-### DeepSpeed Method
+### DeepSpeed
+
+For advanced optimization:
```bash
deepspeed --master_port 29500 --num_gpus=N train_xxx.py
@@ -157,30 +421,259 @@ deepspeed --master_port 29500 --num_gpus=N train_xxx.py
### Wandb Monitoring
+Track training progress:
+
```bash
# Login first
wandb login
-# Enable wandb
+# Enable wandb logging
torchrun --nproc_per_node N train_xxx.py --use_wandb
+
+# Or SwanLab (China-friendly alternative)
+python train_xxx.py --use_wandb # Automatically uses SwanLab if available
```
-## π° Training Cost
+## 🧪 Model Testing
-Based on single NVIDIA 3090:
+### Evaluate Pretrain Model
-| Dataset Combination | Duration | Cost | Effect |
-|-----------|------|------|------|
-| pretrain_hq + sft_mini_512 | 2.1h | β$0.35 | ππ Basic chat |
-| Full dataset (MiniMind2-Small) | 38h | β$6.50 | ππππππ Complete capabilities |
-| Full dataset (MiniMind2) | 122h | β$20.80 | ππππππππ Best performance |
+```bash
+python eval_model.py --model_mode 0
+```
-!!! success "Quick Reproduction"
- Using `pretrain_hq` + `sft_mini_512`, single 3090 only needs **2 hours + $0.5** to train a ChatBot!
+### Evaluate Chat Model
-## π Common Issues
+```bash
+python eval_model.py --model_mode 1
+```
-- **Out of memory**: Reduce `batch_size` or use DeepSpeed
-- **Training not converging**: Adjust learning rate or check data quality
-- **Multi-GPU training error**: Ensure all GPUs are visible and CUDA versions are consistent
+### Evaluate with LoRA
+
+```bash
+python eval_model.py --lora_name 'lora_medical' --model_mode 1
+```
+
+### Evaluate Reasoning Model
+
+```bash
+python eval_model.py --model_mode 3
+```
+
+### Evaluate RLAIF Models
+
+```bash
+# RLAIF-trained models (PPO/GRPO)
+python eval_model.py --model_mode 4
+```
+
+### RoPE Length Extrapolation
+
+Test with extended context:
+
+```bash
+python eval_model.py --model_mode 1 --inference_rope_scaling True
+```
+
+## Model Architecture
+
+### MiniMind Structure
+
+**Decoder-Only Transformer** (similar to Llama3):
+
+```
+Input Tokens
+     ↓
+Token Embedding (6400 vocab)
+     ↓
+Rotary Embeddings (RoPE) [with YaRN for length extrapolation]
+     ↓
+[Transformer Blocks] ×N
+  ├─ Attention (Multi-Head)
+  ├─ RMSNorm
+  ├─ SwiGLU FFN [or MoE for MoE variant]
+  └─ Residual Connections
+     ↓
+RMSNorm
+     ↓
+LM Head (→ 6400 vocab logits)
+     ↓
+Output Probabilities
+```
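+
+The RMSNorm steps in the diagram normalize by the root-mean-square of the activations instead of by mean and variance. A minimal pure-Python sketch of the math (MiniMind's real implementation operates on PyTorch tensors):
+
+```python
+import math
+
+def rms_norm(x, weight, eps=1e-5):
+    # scale each element by its learned weight over the RMS of the vector
+    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
+    return [w * v / rms for w, v in zip(weight, x)]
+
+print(rms_norm([1.0, 2.0, 3.0, 4.0], weight=[1.0] * 4))
+```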
+
+### Model Configurations
+
+| Config | MiniMind2-Small | MiniMind2 | MiniMind2-MoE |
+|--------|-----------------|----------|---------------|
+| Parameters | 26M | 104M | 145M |
+| Hidden Dim | 512 | 768 | 640 |
+| Layers | 8 | 16 | 8 |
+| KV Heads | 2 | 2 | 2 |
+| Q Heads | 8 | 8 | 8 |
+| Vocab Size | 6,400 | 6,400 | 6,400 |
+| Context Length | 2,048 | 2,048 | 2,048 |
+
+### Modify Architecture
+
+Edit `./model/LMConfig.py`:
+
+```python
+class LMConfig:
+ hidden_size: int = 768
+ num_layers: int = 16
+ num_heads: int = 8
+ num_kv_heads: int = 2
+ # ... other configs
+```
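+
+A quick sanity check when changing these fields is to estimate the parameter count. The sketch below is a back-of-envelope formula assuming GQA attention, a SwiGLU FFN with an 8/3 intermediate multiplier, and a tied embedding/LM head; treat the assumptions as illustrative, not a readout of MiniMind's code:
+
+```python
+def estimate_params(hidden, layers, heads, kv_heads, vocab=6400, ffn_mult=8 / 3):
+    head_dim = hidden // heads
+    inter = int(hidden * ffn_mult)              # assumed SwiGLU intermediate size
+    attn = 2 * hidden * hidden                  # Q and output projections
+    attn += 2 * hidden * (kv_heads * head_dim)  # K and V projections (GQA)
+    ffn = 3 * hidden * inter                    # SwiGLU: gate, up, down
+    block = attn + ffn + 2 * hidden             # plus two RMSNorm weight vectors
+    embed = vocab * hidden                      # assumed tied with the LM head
+    return embed + layers * block + hidden      # plus the final RMSNorm
+
+print(f"{estimate_params(768, 16, 8, 2) / 1e6:.0f}M")  # close to MiniMind2's 104M
+print(f"{estimate_params(512, 8, 8, 2) / 1e6:.0f}M")   # close to MiniMind2-Small's 26M
+```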
+
+## Training Tips & Best Practices
+
+### Data Quality > Quantity
+
+- High-quality pretraining data accelerates convergence
+- `pretrain_hq.jsonl` is carefully curated for quality
+- Consider data deduplication and cleaning
+
+### Learning Rate Scheduling
+
+Recommended schedule:
+
+- Linear warmup, then decay
+- Initial LR: 1e-4 to 5e-4
+- Warmup steps: 10% of total steps
+- Final LR: 10% of the initial LR
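+
+The schedule above can be written as a plain function of the step (hooked into training via `torch.optim.lr_scheduler.LambdaLR` if desired). This is a sketch of the recommendation, not the exact scheduler the trainer scripts use:
+
+```python
+def lr_at(step, total_steps, base_lr=5e-4, warmup_frac=0.1, final_frac=0.1):
+    warmup = int(total_steps * warmup_frac)
+    if step < warmup:
+        return base_lr * (step + 1) / warmup             # linear warmup
+    frac = (step - warmup) / max(1, total_steps - warmup)
+    return base_lr * (1 - (1 - final_frac) * frac)       # linear decay to 10%
+
+print(lr_at(0, 1000), lr_at(100, 1000), lr_at(999, 1000))
+```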
+
+### Batch Size & Sequence Length
+
+Balance GPU memory against convergence:
+
+- Pretraining: `max_seq_len=512`, `batch_size=32`
+- SFT: `max_seq_len=512`, `batch_size=16`
+- LoRA: `max_seq_len=512`, `batch_size=16`
+
+### Memory Optimization
+
+```bash
+# Reduce batch size if OOM
+python train_xxx.py --batch_size 8
+
+# Or use gradient accumulation
+python train_xxx.py --gradient_accumulation_steps 4
+```
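+
+Gradient accumulation trades memory for extra forward/backward passes: gradients from several micro-batches are averaged before a single optimizer step, so the effective batch size stays the same. A toy scalar illustration (no PyTorch):
+
+```python
+def grad(w, batch):
+    # gradient of the mean squared error 0.5 * (w*x - y)^2 over a batch
+    return sum((w * x - y) * x for x, y in batch) / len(batch)
+
+data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
+w, lr, accum_steps = 0.0, 0.05, 2
+
+acc = 0.0
+for micro_batch in (data[:2], data[2:]):   # batch_size 2, accumulated twice
+    acc += grad(w, micro_batch) / accum_steps
+w -= lr * acc                              # one step with effective batch size 4
+print(w)                                   # identical to a full-batch step
+```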
+
+### Checkpoint Management
+
+- Saves every 100 steps by default
+- Each new save overwrites the old one
+- Automatic backup before training
+
+## 🚨 Common Issues & Solutions
+
+### Issue: CUDA Out of Memory
+
+```bash
+# Solution 1: Reduce batch size
+python train_xxx.py --batch_size 4
+
+# Solution 2: Use gradient accumulation
+python train_xxx.py --batch_size 16 --gradient_accumulation_steps 2
+
+# Solution 3: Use smaller model
+# Edit trainer script to use MiniMind2-Small instead
+```
+
+### Issue: Training Not Converging
+
+Possible causes:
+
+1. Learning rate too high or too low
+2. Data quality issues
+3. Model capacity mismatch
+
+Solutions:
+
+- Reduce the learning rate: `--learning_rate 1e-5`
+- Check the data format and quality
+- Try a smaller model first
+
+### Issue: Multi-GPU Sync Errors
+
+Ensure that:
+
+1. All GPUs are visible: `nvidia-smi`
+2. The CUDA version is consistent across all GPUs
+3. Nodes have network connectivity for distributed training
+
+```bash
+# Debug communication issues with verbose NCCL logging
+NCCL_DEBUG=INFO torchrun --nproc_per_node 2 train_xxx.py
+```
+
+### Issue: Different Results Than Expected
+
+Check that:
+
+1. The random seed is set (for reproducibility)
+2. The correct model checkpoint is loaded
+3. The correct dataset is being used
+4. The hyperparameters match the reference run
+
+## Training Progression
+
+Typical training curves:
+
+```
+Pretraining Loss: ↘↘↘ (steep decline, then plateau)
+SFT Loss: ↘ (steady decline)
+PPO Reward: ↗ (rising, may plateau)
+GRPO Reward: ↗↗ (faster rise, more stable)
+```
+
+## Advanced Topics
+
+### Custom Datasets
+
+Create your own dataset:
+
+- Format: JSONL with a `conversations` list per example
+- Each line is one training example
+- Ensure consistent quality and format across examples
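+
+A minimal sketch of writing and validating such a file (the filename and the alternating-roles check are illustrative choices):
+
+```python
+import json
+
+examples = [
+    {"conversations": [
+        {"role": "user", "content": "What is 2 + 2?"},
+        {"role": "assistant", "content": "2 + 2 = 4"},
+    ]},
+]
+
+with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
+    for ex in examples:
+        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
+
+# Validate: every line parses, and roles alternate user/assistant
+with open("my_dataset.jsonl", encoding="utf-8") as f:
+    for line in f:
+        roles = [turn["role"] for turn in json.loads(line)["conversations"]]
+        assert roles == ["user", "assistant"] * (len(roles) // 2)
+```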
+
+### Model Quantization (Post-training)
+
+Quantize to 4-bit for lighter-weight inference using tools such as:
+
+- llama.cpp (GGUF format)
+- bitsandbytes (dynamic quantization)
+- AutoGPTQ (static quantization)
+
+### Model Merging
+
+Merge the base model with LoRA weights using tools such as `peft` or llama.cpp.
+
+## References
+
+- [Scaling Laws](https://arxiv.org/pdf/2001.08361.pdf)
+- [RoPE Position Embeddings](https://arxiv.org/abs/2104.09864)
+- [YaRN Length Extrapolation](https://arxiv.org/abs/2309.00071)
+- [PPO Algorithm](https://arxiv.org/abs/1707.06347)
+- [GRPO (DeepSeek)](https://arxiv.org/pdf/2402.03300)
+- [SPO Algorithm](https://arxiv.org/abs/2509.13232)
+- [DPO](https://arxiv.org/abs/2305.18290)
+
+---
+
+**Next**: Deploy your trained model or explore [advanced inference options](quickstart.md#third-party-inference-frameworks)