+
+
+不同的**xxPO算法**本质上只是对这三个组件的不同设计的实例化!
---
-## Ⅴ 训练结果
+### **6. 基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF)**
-MiniMind2 模型训练损失走势(由于数据集在训练后又更新清洗多次,因此Loss仅供参考)
+在前面的训练步骤中,模型已经具备了基本的对话能力,但这种能力完全基于单词接龙,缺少正反样例的激励。
+模型此时尚不知道什么回答是好的、什么是差的。我们希望它能更符合人类偏好,降低产生让人不满意回答的概率。
+这个过程就像是让模型参加新的培训:以优秀员工的回答作为正例、消极员工的回答作为反例,学习如何更好地回复。
-| models | pretrain (length-512) | sft (length-512) |
-|-----------------|----------------------------------------------------|----------------------------------------------------|
-| MiniMind2-Small |
|
|
-| MiniMind2 |
|
|
+#### 6.1 Direct Preference Optimization
+直接偏好优化(DPO)算法的损失函数为:
+
+$$\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \left[\log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right]\right)\right]$$
+
+其中:
+- **策略项**: $f(r_t) = \log r_w - \log r_l$ (对比chosen vs rejected的概率比)
+- **优势项**: $g(A_t)$ = 隐式 (通过偏好对比,无需显式计算优势)
+- **正则项**: $h(\text{KL}_t)$ = 隐含在 $\beta$ 中 (控制偏离参考模型程度)
+
+特别地,
+- DPO从PPO带KL约束的目标推导出对偏好对的解析训练目标,直接最大化"chosen优于rejected"的对数几率;无需同步训练Reward/Value模型。DPO只需跑`actor`与`ref`两个模型,显存占用低、收敛稳定、实现简单。
+- 训练范式:off‑policy,使用静态偏好数据集,可反复多轮epoch;Ref模型固定(预先缓存输出)。
+- DPO的局限在于不做在线探索,更多用于"偏好/安全"的人类价值对齐;对"能不能做对题"的智力能力提升有限(当然这也取决于数据集,大规模收集正反样本并人类评估很困难)。
+
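+上面的DPO损失可以用原生PyTorch写成几行代码。以下只是帮助理解的极简示意(并非minimind的官方实现),假设四个输入均为对应回答的序列对数概率:
+
+```python
+# 极简DPO损失示意(假设性代码,非minimind官方实现)
+# 输入分别为 actor / ref 模型对 chosen(y_w)与 rejected(y_l)回答的序列对数概率
+import torch.nn.functional as F
+
+def dpo_loss(policy_chosen_logps, policy_rejected_logps,
+             ref_chosen_logps, ref_rejected_logps, beta=0.1):
+    # log[pi_theta(y_w|x)/pi_ref(y_w|x)] - log[pi_theta(y_l|x)/pi_ref(y_l|x)]
+    logits = (policy_chosen_logps - ref_chosen_logps) \
+           - (policy_rejected_logps - ref_rejected_logps)
+    # -log sigmoid(beta * logits):最大化 chosen 优于 rejected 的对数几率
+    return -F.logsigmoid(beta * logits).mean()
+```
+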
+```bash
+torchrun --nproc_per_node 1 train_dpo.py
+# or
+python train_dpo.py
+```
+
+> 训练后的模型权重文件默认每隔`100步`保存为: `rlhf_*.pth`(*为模型具体dimension,每次保存时新文件会覆盖旧文件)
+
+### **7. 基于AI反馈的强化学习 (Reinforcement Learning from AI Feedback, RLAIF)**
+
+相比RLHF依赖人类标注chosen/rejected偏好对,RLAIF则完全由AI来充当"裁判"。
+所谓AI“裁判”可以是model-based的奖励大模型(Reward Model),也可以像R1一样用规则函数进行校验,还可以是工具调用等环境反馈。
+例如:数学题答案是否正确、工具调用执行代码能否通过测试用例、推理过程是否符合格式...都可以自动化判断。
+RLAIF的最大优势在于**可扩展性**和**On-Policy**的特点——不需要昂贵的人工标注,可以生成海量的训练样本,让模型在在线大量试错中快速进化。
+
+MiniMind 着手实现**2+N**种基本+前沿的RLAIF方法:
+* **PPO**、**GRPO**:被大规模验证的经典RL算法;
+* N种前沿RL算法(不定期以Exp性质更新)。
+
+#### 1️⃣ 数据集准备 (需要)
+
+为了快速验证RLAIF的效果,这里从SFT数据集中随机采样了1万条高质量对话,构建约1MB大小的`rlaif-mini.jsonl`([Huggingface](https://huggingface.co/datasets/jingyaogong/minimind_dataset/blob/main/rlaif-mini.jsonl))
+
+数据格式与SFT一致,但assistant字段不需要真实内容,因为训练过程中回答完全由 $\Pi$ 策略模型实时采样生成。因此形如:
+
+```json
+{
+ "conversations": [
+ {"role": "user", "content": "请解释一下什么是光合作用?"},
+ {"role": "assistant", "content": "无"}
+ ]
+}
+```
+
+RLAIF的训练过程中,模型会基于user的问题生成1或多个候选回答,然后由奖励函数/模型对回答打分,
+分数高的回答会被鼓励(增加 $\Pi$ 策略概率),分数低的回答会被抑制(降低 $\Pi$ 策略概率)。这个"打分->调整"的循环就是强化学习的核心。
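+
+下面用一段玩具级代码示意这个“采样->打分->调整”的思路(假设性代码,与minimind实现无关,省略了反向传播等细节):
+
+```python
+# “采样->打分->调整”循环的玩具级示意(假设性代码,非minimind实现)
+import torch
+
+def toy_reward(answer: str) -> float:
+    return 1.0 if "叶绿体" in answer else 0.0   # 玩具奖励:提到关键词即记1分
+
+prompt = "请解释一下什么是光合作用?"
+answers = ["光合作用发生在叶绿体中……", "我不知道。"]  # 实际中由策略模型实时采样生成
+rewards = torch.tensor([toy_reward(a) for a in answers])
+advantages = rewards - rewards.mean()        # 高于均值 -> 鼓励;低于均值 -> 抑制
+print(advantages)                            # tensor([ 0.5000, -0.5000])
+```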
+
+#### 2️⃣ 奖励模型准备 (需要)
+
+RLAIF训练需要“奖励模型 (Reward Model)”对生成的回答进行打分。
+
+此处选取小型且高质量的InternLM2-1.8B-Reward
+([ModelScope](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward) | [HuggingFace](https://huggingface.co/internlm/internlm2-1_8b-reward))
+作为基础奖励模型。
+
+下载奖励模型后需要放置在minimind项目的**同级目录**下,推荐结构如下:
+
+```
+project/
+├── minimind/ # MiniMind项目
+│ ├── model/
+│ └── ...
+└── internlm2-1_8b-reward/ # 奖励模型(与minimind同级)
+ ├── config.json
+ ├── model.safetensors
+ └── ...
+```
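+
+下载完成后,可以先单独验证奖励模型能否正常打分。以下为参考其模型卡写法的假设性示意(具体接口以InternLM2-Reward官方说明为准,此处假定其远程代码提供`get_score`方法):
+
+```python
+# 奖励模型打分示意(假设性用法,接口以官方模型卡为准)
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+path = "../internlm2-1_8b-reward"  # 与minimind同级目录
+tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+model = AutoModel.from_pretrained(path, torch_dtype=torch.float16,
+                                  trust_remote_code=True).cuda().eval()
+
+chat = [
+    {"role": "user", "content": "请解释一下什么是光合作用?"},
+    {"role": "assistant", "content": "光合作用是植物利用光能合成有机物的过程……"},
+]
+score = model.get_score(tokenizer, chat)  # 返回连续分数,越高代表回答越好
+print(score)
+```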
+
+
+**奖励机制选择与MiniMind限制说明**
+
+**1. 奖励机制的多样性**
+
+RLAIF中的"奖励信号"来源可以非常灵活:
+
+- **Model-based奖励**:可使用专门的Reward Model(如InternLM2-Reward),也可使用通用LLM+提示词进行打分(如Qwen3-as-a-Judge)。奖励模型规模和架构均可自由选择。
+
+- **Rule-based奖励**:可以基于规则函数构造奖励信号,例如:
+ - 数学题答案正确性验证(Ground Truth对比)
+ - SQL执行成功率与结果准确性
+ - 代码解释器运行结果(pass@k)
+ - 工具调用返回状态(API成功/失败)
+ - 格式合规性检查(JSON/XML解析)
+ - 推理链完整性评估(CoT步骤数)
+
+- **Environment-based奖励**:在Agent场景中,环境反馈本身即为天然奖励(如游戏得分、Research完整度、任务完成度)。
+
+任何能够量化"回答质量"的机制都可作为RL的奖励来源。DeepSeek R1就是典型案例:使用规则函数验证数学答案正确性作为奖励,无需额外的Reward Model。
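+
+以数学题为例,一个最朴素的rule-based奖励函数示意如下(仅作说明的假设性代码,`extract_final_answer`为假设的答案抽取逻辑,非minimind实现):
+
+```python
+# rule-based奖励示意(假设性代码):数学答案正确性 + 格式合规性
+import re
+
+def extract_final_answer(text: str):
+    # 假设答案写在 \boxed{...} 中,取最后一个匹配
+    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
+    return matches[-1].strip() if matches else None
+
+def rule_reward(response: str, ground_truth: str) -> float:
+    reward = 0.0
+    if "<think>" in response and "</think>" in response:
+        reward += 0.1                      # 格式分:包含完整思考标签
+    pred = extract_final_answer(response)
+    if pred is not None and pred == ground_truth:
+        reward += 1.0                      # 正确性分:答案与Ground Truth一致
+    return reward
+```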
+
+**2. MiniMind限制:奖励稀疏问题**
+
+RLAIF训练既可以针对推理模型也可以针对非推理模型,区别仅在于格式。
+
+然而对于MiniMind这种0.1B参数量极小能力弱的模型,在通用任务(如R1风格的数学数据集)上会遇到严重的奖励稀疏(Reward Sparsity)问题:
+
+- **现象**:模型生成的候选回答几乎全部错误,导致所有奖励分数 $r(x,y) \approx 0$
+- **后果**:优势函数 $A(x,y) = r(x,y) - b(x) \approx 0$,策略梯度信号消失,无法有效更新参数 $\theta$
+
+如同让小学生做高考数学题,无论尝试多少次都得零分,无法通过分数差异学习改进策略。这是RL算法在原理层面的根本限制。
+
+为缓解此问题,MiniMind的实现选择了**model-based的连续性奖励信号**:
+
+- Reward Model输出连续分数(如-2.5到+3.0),而非二元的0/1
+- 即使回答质量都差,也仍能区分“更差”(-3.0)和“稍差”(-2.8)的细微差异。这种**稠密且连续**的奖励信号能够为优势函数 $A(x,y)$ 提供非零梯度,使策略网络得以渐进式优化
+- 也可以混合多种奖励源: $r_{\text{total}} = \alpha \cdot r_{\text{model}} + \beta \cdot r_{\text{rule}}$ (例如既可以检测think标签格式reward,又可以综合回答本身质量的reward分数)
+- minimind实践中应避免直接使用rule-based二元奖励 + 超纲难度数据(如MATH500),否则极易导致奖励全零;
+- 训练时应监控奖励分数的方差 $\text{Var}(r)$,若持续接近0则需调整数据或奖励机制(混合奖励与方差监控的极简示意见下方代码)
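+
+下面给出混合奖励与方差监控的极简示意(假设性代码,α、β为可调超参,非minimind官方实现):
+
+```python
+# 混合奖励与方差监控示意(假设性代码)
+import torch
+
+def total_reward(r_model: float, r_rule: float, alpha=0.8, beta=0.2) -> float:
+    return alpha * r_model + beta * r_rule   # r_total = α·r_model + β·r_rule
+
+def reward_variance_ok(rewards: list[float], threshold=1e-3) -> bool:
+    var = torch.tensor(rewards).var().item() # Var(r)持续接近0说明奖励稀疏
+    return var > threshold                   # 返回False时应调整数据或奖励机制
+```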
+
+**对于生产级大模型的Agentic RL场景**:
+
+在真实Agent系统(代码生成、工具调用、检索-规划-执行的多轮链路)中,奖励遵循“延迟到整轮结算”的不同范式:
+
+- LLM需要逐token生成工具调用指令(tool_call),经历解析(tool_parse)、工具执行(tool_exec),再把结果拼接回上下文继续下一步;循环往复直到完成。
+- 一次完整的任务链路包含多次调用+思考,直到终止条件满足时计算一次总reward(如任务是否完成、测试是否通过、目标是否命中)。
+
+因此,Agentic RL更接近稀疏/延迟奖励设定:梯度回传在“整轮结束后”才发生,这与非Agentic RL任务在对话单轮上“即时评分、即时更新”有很大不同。
+这也解释了为什么Agent任务更偏向环境反馈(environment-based reward),而非由Reward Model进行静态打分。
+
+- **环境交互反馈**:最终以执行结果为准(代码是否跑通、API是否返回成功、子目标是否完成);
+- **Model-based奖励局限**:对长链路、可执行语义的全貌捕捉有限,且大概率和真实环境反馈不一致(reward hacking)。
+
+
+
+
+---
+
+#### 7.1 [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347)
+
+PPO 是 OpenAI 于2017年提出的经典强化学习算法,也是LLM RL领域通用的基线方法,甚至称得上“没有之一”。
+
+**PPO损失**:
+$$\mathcal{L}_{PPO} = -\mathbb{E}\left[\min(r_t \cdot A_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) \cdot A_t)\right] + \beta \cdot \mathbb{E}[\text{KL}]$$
+
+其中:
+- **策略项**: $f(r_t) = \min(r_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon))$ (裁剪概率比防止更新过激)
+- **优势项**: $g(A_t) = R - V(s)$ (通过Critic网络估计价值函数)
+- **正则项**: $h(\text{KL}_t) = \beta \cdot \mathbb{E}[\text{KL}]$ (全局KL散度约束)
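+
+上式的裁剪式策略项可以直接写成几行PyTorch,以下为极简示意(假设性代码,非minimind官方实现):
+
+```python
+# PPO裁剪目标的极简示意(假设性代码,非minimind官方实现)
+import torch
+
+def ppo_policy_loss(logprobs, old_logprobs, advantages, eps=0.2):
+    ratio = torch.exp(logprobs - old_logprobs)             # r_t = pi_theta / pi_old
+    surr1 = ratio * advantages
+    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
+    return -torch.min(surr1, surr2).mean()                 # 取min后取负作为损失
+```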
+
+对比DPO而言,
+- DPO (Off-Policy):训练数据是静态的偏好数据集(chosen vs rejected),可以反复使用同一批数据训练多个epoch,就像传统监督学习一样。数据效率高,训练成本低。它直接优化偏好对的对数似然,无需Reward Model。
+- PPO (On-Policy):必须用当前策略实时采样生成新数据,旧策略采集的数据不能用(会有distribution shift问题)。虽然通过importance sampling和clip机制允许轻微的分布偏移,但本质上要求数据来自相对新鲜的策略。数据效率低,但适合探索式学习。
+
+简单来说:
+
+- 前者教模型按照离线预定的「好/坏标准」学习,尽管这些样本并非当前模型自身所能输出(好比参考世界冠/亚军的录像练习打球);
+- 后者实时地教模型把事情做对做好,训练数据在线采样自最新的模型policy(好比教练手把手指导,为每个动作实时打分)。
+
+MiniMind的PPO实现包含了Actor模型(生成回答)和Critic模型(评估回答价值),以及完整的GAE(Generalized Advantage Estimation)优势函数计算。
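+
+其中GAE的计算可以用几行代码示意(极简版本,忽略done掩码等细节,非minimind官方实现):
+
+```python
+# GAE(lambda)优势估计的极简示意(假设性代码,非minimind官方实现)
+import torch
+
+def compute_gae(rewards, values, gamma=0.99, lam=0.95):
+    # rewards: [T];values: [T+1],最后一项为末状态价值 V(s_T)
+    T = rewards.size(0)
+    advantages = torch.zeros_like(rewards)
+    gae = 0.0
+    for t in reversed(range(T)):
+        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD残差
+        gae = delta + gamma * lam * gae
+        advantages[t] = gae
+    returns = advantages + values[:T]    # 供Critic回归的目标
+    return advantages, returns
+```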
+
+**训练方式**:
+
+```bash
+torchrun --nproc_per_node N train_ppo.py
+# or
+python train_ppo.py
+```
+
+> 训练后的模型权重文件默认每隔`100步`保存为: `ppo_actor_*.pth`(*为模型具体dimension)
+
+
+| MiniMind2-Small (512dim) | MiniMind2 (768dim) |
+|---|---|
+| (训练曲线图) | (训练曲线图) |
+
+从训练曲线可以看出,PPO存在**reward提升缓慢**的问题。私以为这主要源于**PPO双网络联合优化**方法:Critic需要逐步收敛以准确估计价值函数,而Actor的策略更新依赖Critic提供的优势估计,两者相互依赖形成复杂的优化过程。训练初期Critic估计不准会影响Actor梯度方向,导致整体收敛缓慢。此外,PPO需要同时维护两个网络,显存占用约为单网络方法的1.5-2倍。
+
+#### 7.2 [Group Relative Policy Optimization](https://arxiv.org/pdf/2402.03300)
+
+2025年初,DeepSeek-R1火爆出圈,随之走红的还有源自DeepSeekMath论文的GRPO算法,它一跃成为最先进的RL算法之一。
+然而AI半年=人间半个世纪,时至今日GRPO已经成为各类xxPO大战(后续演变出的DAPO、GSPO、CISPO等)的基线算法。
+一句话总结它的核心创新:“分组相对价值估计”。
+
+**GRPO损失**:
+
+$$\mathcal{L}_{GRPO} = -\mathbb{E}\left[r_t \cdot A_t - \beta \cdot \text{KL}_t\right]$$
+
+其中:
+- **策略项**: $f(r_t) = r_t$ (直接使用概率比,无clip裁剪)
+- **优势项**: $g(A_t) = \frac{R - \mu_{group}}{\sigma_{group}}$ (组内归一化,消除Critic网络)
+- **正则项**: $h(\text{KL}_t) = \beta \cdot \text{KL}_t$ (token级KL散度约束)
+
+对于同一个问题,模型生成N个不同的回答(例如N=4),然后计算这N个回答的奖励分数。
+接着把这N个回答的平均奖励作为baseline,高于baseline的回答被鼓励,低于baseline的回答被抑制。
+用这种方式巧妙地避免了训练额外的critic网络。
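+
+组内相对优势的计算本身非常简单,示意如下(假设性代码,非minimind官方实现):
+
+```python
+# GRPO组内相对优势的极简示意(假设性代码,非minimind官方实现)
+import torch
+
+def group_relative_advantage(rewards, eps=1e-6):
+    mean = rewards.mean()          # 组内均值作为baseline,替代Critic
+    std = rewards.std()            # 组内标准差用于归一化
+    return (rewards - mean) / (std + eps)
+
+# 例如同一问题采样出N=4个回答的奖励
+adv = group_relative_advantage(torch.tensor([0.2, 1.5, -0.3, 0.6]))
+```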
+
+凡是RL都必须面对正反样本奖励差异这一原理性限制,GRPO也不例外,而它更显著的问题是:退化组(Degenerate Groups)。
+假设某个问题偏难,导致N个回答的奖励分数几乎一样(大部分情况是一样烂而不是一样好),那么这一组的学习信号就无限接近0。
+在MiniMind这种超小模型上,这个问题尤为明显:求解数学问题时,99.99%的情况下整组回答质量都很差,模型将无从学习。
+因此必须为模型指定合理的domain,即必须限制在能力边界内。
+
+
+**训练方式**:
+
+```bash
+torchrun --nproc_per_node N train_grpo.py
+# or
+python train_grpo.py
+```
+
+> 训练后的模型权重文件默认每隔`100步`保存为: `grpo_*.pth`
+
+
+| MiniMind2-Small (512dim) | MiniMind2 (768dim) |
+|---|---|
+| (训练曲线图) | (训练曲线图) |
+
+从训练曲线可以看出,GRPO的**reward呈现更加稳定的上升趋势**,达到4左右,说明GRPO本身能更好地利用RLAIF信号。Policy Loss整体下降平稳,相比PPO的双网络优化,GRPO单网络架构训练更稳定且收敛上限更高。
+
+#### 7.3 ⏳⌛️🔥 更多RL拓展 (Exp)
+
+##### 7.3.1 [Single-stream Policy Optimization](https://arxiv.org/abs/2509.13232)
+
+SPO是2025年9月腾讯提出的RL算法,针对GRPO的退化组问题进行改进。
+论文认为,GRPO等算法"一个样本要依赖一组采样"显得别扭而不优雅:太容易或太难的题目,整组几乎学不到东西,学习效率先天受限。
+SPO的动机就是回到RL的本质:**1个输入,1个输出,就是1个训练样本**,回到policy gradient的基本公式去思考。
+不依赖group mean也能得到稳定的baseline:把价值估计 V 铺开在时序上,训练前先做粗略的价值预估,训练中一边采样一边更新对 V 的估计,从而为每个样本提供一个跨 batch 持久化、可自适应的基线参照。这种“单流”设计不再依赖同组样本,天然避免了退化组。
+
+**SPO损失**:
+
+$$\mathcal{L}_{SPO} = -\mathbb{E}\left[\log \pi_\theta(a_t|s) \cdot A_t - \beta \cdot \text{KL}_t\right]$$
+
+其中:
+- **策略项**: $f(r_t) = \log \pi_\theta(a_t|s)$ (直接使用log概率,不计算ratio)
+- **优势项**: $g(A_t) = R - B_t^{adaptive}$ (自适应baseline,Beta分布动态跟踪)
+- **正则项**: $h(\text{KL}_t) = \beta \cdot \text{KL}_t$ (token级KL + 动态 $\rho$ 调整)
+
+落到实现层面:SPO采用无分组设计,用持久化的KL自适应value tracker替代组内baseline,优势函数在整个batch上全局归一化。这样每个样本独立处理,无需等待同组其他样本,且能为每个样本提供稳定的学习信号。
+论文在Qwen3-8B的5个困难数学数据集上,SPO平均比GRPO高出3.4个百分点,其中BRUMO 25数据集+7.3pp、AIME 25数据集+4.4pp。
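+
+作为理解性示意,下面用一个极简的指数滑动平均(EMA)tracker模拟“跨batch持久化的自适应baseline”这一思想(注意:这只是简化示意,并非论文中基于Beta分布的实现,也不是minimind的官方代码):
+
+```python
+# 跨batch持久化的自适应baseline示意(EMA简化版,非SPO论文的Beta分布实现)
+class ValueTracker:
+    def __init__(self, momentum=0.9, init_value=0.0):
+        self.values = {}                 # prompt_id -> 该问题的价值估计 V
+        self.momentum = momentum
+        self.init_value = init_value
+
+    def baseline(self, prompt_id):
+        return self.values.get(prompt_id, self.init_value)
+
+    def update(self, prompt_id, reward):
+        v = self.baseline(prompt_id)
+        # 一边采样一边更新对 V 的估计,跨batch持久化
+        self.values[prompt_id] = self.momentum * v + (1 - self.momentum) * reward
+
+# 优势 = 当前奖励 - 持久化baseline,再在整个batch上做全局归一化
+```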
+
+
+> 注:SPO是实验性前沿算法,MiniMind的实现用于探索学习。由于模型参数量极小,无法完全复现论文的8B模型效果。
+
+**训练方式**:
+
+```bash
+torchrun --nproc_per_node N train_spo.py
+# or
+python train_spo.py
+```
+
+> 训练后的模型权重文件默认每隔`100步`保存为: `spo_*.pth`
+
+
+
+(图:MiniMind2 (768dim) SPO训练曲线)
+
+从训练曲线来看,SPO的reward波动与PPO表现接近,弱于GRPO。实际推理测试发现模型输出质量不高,存在逻辑混乱和格式错误问题。
+
+**实验性说明**:当前SPO手搓实现可能在value_tracker配置、reward归一化策略上还存在问题,尚需进一步排查是算法本身在小模型上适应性不足,还是实现上存在差异。
+
+
+### RL算法小结
+
+我们收束回“**统一框架**”,重新整理一张表格:所有不同的xxPO算法,都只是对三个核心组件的不同实例化:
+
+| 算法 | 策略项 $f(r_t)$ | 优势项 $g(A_t)$ | 正则项 $h(\text{KL}_t)$ | 所需模型数 |
+|------|----------------|----------------|----------------------|----------|
+| **DPO** | $\log r_w - \log r_l$ | 隐式(偏好对比) | 隐含在 $\beta$ 中 | 2 |
+| **PPO** | $\min(r, \text{clip}(r))$ | $R - V(s)$ | $\beta \cdot \mathbb{E}[\text{KL}]$ | 4 |
+| **GRPO** | $r$ | $\frac{R - \mu}{\sigma}$ | $\beta \cdot \text{KL}_t$ | 2 |
+| **SPO** | $\log \pi_\theta$ | $R - B_t^{adaptive}$ | $\beta \cdot \text{KL}_t$ | 2 |
+
+**RL是优美且自洽的**
+
+> 以上纯属个人视角理解,如有偏差请随时指正
+
+---
+
+## Ⅴ 训练结果
### 训练完成-模型合集
@@ -955,22 +1304,15 @@ MiniMind2 模型训练损失走势(由于数据集在训练后又更新清洗
MiniMind2模型权重 ([ModelScope](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch) | [HuggingFace](https://huggingface.co/jingyaogong/MiniMind2-Pytorch))
-MiniMind-V1模型权重 ([百度网盘](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666))
Torch文件命名对照
-| Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
-|-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
-| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
-| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
-| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |
-
-| Model Name | params | pretrain_model | 单轮对话sft | 多轮对话sft | rl_model |
-|-------------------|--------|------------------------|------------------------------------|-----------------------------------|--------------|
-| minimind-v1-small | 26M | `pretrain_512.pth` | `single_chat/full_sft_512.pth` | `multi_chat/full_sft_512.pth` | `rl_512.pth` |
-| minimind-v1-moe | 4×26M | `pretrain_512_moe.pth` | `single_chat/full_sft_512_moe.pth` | `multi_chat/full_sft_512_moe.pth` | - |
-| minimind-v1 | 108M | `pretrain_768.pth` | `single_chat/full_sft_768.pth` | `multi_chat/full_sft_768.pth` | `rl_768.pth` |
+| Model Name | params | pretrain_model | sft_model | rlhf_model (DPO) | reason_model | rlaif_model (PPO/GRPO/SPO) | lora_model |
+|-----------------|--------|------------------------|------------------------|--------------------|------------------|----------------------------------------------|--------------------|
+| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `xxpo_512.pth` | `lora_xxx_512.pth` |
+| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - | - |
+| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `xxpo_768.pth` | `lora_xxx_768.pth` |
@@ -983,37 +1325,31 @@ MiniMind系列 ([ModelScope](https://www.modelscope.cn/collections/MiniMind-b72f
# 📌 Eval
-## Ⅰ RLHF对比SFT篇
+## Ⅰ RLHF对比
测试基于`full_sft`和`rlhf`的`MiniMind2系列`模型对比,测试随机种子均固定
-
-问答实测
-
-[A0] MiniMind2:full_sft_640
-
-[A1] MiniMind2:rlhf_640
+[A] MiniMind2:full_sft_640
+[B] MiniMind2:rlhf_640
```text
[Q]: 你叫什么名字?
- [A0]: 我是一个大型语言模型,可以回答各种问题,提供信息和帮助。我可以回答各种问题,提供信息和帮助,帮助你解决各种问题。
- [A1]: 我是一种基于文本的AI模型,我并没有个人经历或情感,所以无法感受和体验情感。我的设计目的是为了帮助用户解决问题、提供信息和进行对话。如果您有任何问题或需要帮助,请随时告诉我!
+[A]: 我是一个大型语言模型,可以回答各种问题,提供信息和帮助。我可以回答各种问题,提供信息和帮助,帮助你解决各种问题。
+[B]: 我是一种基于文本的AI模型,我并没有个人经历或情感,所以无法感受和体验情感。我的设计目的是为了帮助用户解决问题、提供信息和进行对话。如果您有任何问题或需要帮助,请随时告诉我!
[Q]: 你更擅长哪一个学科?
- [A0]: 作为一个AI模型,我并没有真正的“擅长”这个学科。我被设计成能够提供信息、解答问题、进行对话、执行任务和与用户进行对话。不过,我可以提供信息、解答问题、进行对话、提供建议、处理问题、进行对话、提供建议、学习新知识、编写代码、回答问题、提供建议等多种学科的功能。如果你有具体的学科需求或主题,欢迎提供详细信息。
- [A1]: 作为AI,我并没有个人能力或兴趣,但我可以提供一些关于数学、科学、计算机科学、语言学、心理学等学科的知识。如果你对数学感兴趣,我可以提供一些基础知识,例如代数、几何、概率、统计学等。如果你需要更深入的学习,我也可以提供相关的在线课程、书籍、论文等资源。请告诉我你具体需要什么帮助,我会尽力提供帮助。
+[A]: 作为一个AI模型,我并没有真正的“擅长”这个学科。我被设计成能够提供信息、解答问题、进行对话、执行任务和与用户进行对话。不过,我可以提供信息、解答问题、进行对话、提供建议、处理问题、进行对话、提供建议、学习新知识、编写代码、回答问题、提供建议等多种学科的功能。如果你有具体的学科需求或主题,欢迎提供详细信息。
+[B]: 作为AI,我并没有个人能力或兴趣,但我可以提供一些关于数学、科学、计算机科学、语言学、心理学等学科的知识。如果你对数学感兴趣,我可以提供一些基础知识,例如代数、几何、概率、统计学等。如果你需要更深入的学习,我也可以提供相关的在线课程、书籍、论文等资源。请告诉我你具体需要什么帮助,我会尽力提供帮助。
[Q]: 鲁迅的《狂人日记》是如何批判封建礼教的?
- [A0]: 鲁迅的《狂人日记》是一部讲述了鲁迅对人类价值和道德责任的倡导。在这部作品中,鲁迅通过细腻的笔触,探讨了人性、道德、正义、牺牲与责任等深刻的主题。以下是鲁迅在《狂人日记》中如何批判封建礼教的几个方面:1. **社会现实与道德观念**:鲁迅通过讲述鲁迅的故事,探讨了社会的不平等、权力的流动、人性的演变等社会问题。他认为,社会不平等不仅仅是不公正的,更是每个人在社会中的不公正。他呼吁社会各界共同努力,通过教育、宣传、法律等手段,为构建和谐社会做出贡献。2. **道德与责任**:鲁迅通过讲述鲁迅的故事,展现了道德和责任的重要性。他提倡通过道德教育来培养人们的道德品质,同时也强调了道德和责任的重要性。鲁迅通过讲述鲁迅的故事,揭示了社会的不平等和人性的复杂性,体现了对人性的深刻理解。3. **道德与责任**:鲁迅通过讲述鲁迅的故事,强调了道德责任的重要性。他提倡通过自己的行为来体现对社会的尊重和对人性的深刻理解。鲁迅通过讲述鲁迅的故事,强调了道德和责任的重要性。4. **道德与道德**:鲁迅通过讲述鲁迅的故事,揭示了道德和责任的真正意义。鲁迅通过讲述鲁迅的故事,揭示了道德和责任的本质,强调了道德和责任的重要性。5. **个人与社会**:鲁迅通过讲述鲁迅的故事,揭示了个人与社会之间的复杂关系。他呼吁社会各界共同努力,通过教育和宣传,为构建和谐社会做出贡献。鲁迅的《狂人日记》通过其丰富的故事和深刻的道德寓言,传达了关于人性、道德、正义、道德以及个人与社会的复杂关系。这部作品不仅在文学价值上取得了巨大成就,更在思想层面上启发了后来的文学创作,对后世产生了深远的影响。
- [A1]: 鲁迅的《狂人日记》是一部描绘战争和自由精神的作品,它批判了封建社会的种种问题和冲突,反映了社会的不平等和人性本质。在这部作品中,鲁迅通过对封建社会的诸多现象进行批判,如“弗雷德里克·约翰逊的“鹰迪伦”、“纳扎特的”、“贝克汉姆”、“乔治·威尔逊”等,对封建社会的种种问题进行了批判,强调了封建社会的不平等、人性本质以及人性本质。鲁迅通过他的“鹰迪伦”、“纳扎特的”、“贝克汉姆”、“乔治·威尔逊”等,对封建社会的种种问题进行了批判。在小说中,鲁迅通过对封建社会的诸多现象进行批判,揭示了封建社会的腐败、人性本质以及人性本质的复杂性。这种批判体现了对封建社会的尊重和对人性的复杂性,同时也反映了现代社会中普遍存在的不平等和人性本质。鲁迅的《狂人日记》不仅是一部关于封建社会的小说,也是一部具有深刻社会意义的作品,对后世的审判和批判起到了关键作用。
+[A]: 鲁迅的《狂人日记》是一部讲述了鲁迅对人类价值和道德责任的倡导。在这部作品中,鲁迅通过细腻的笔触,探讨了人性、道德、正义、牺牲与责任等深刻的主题。以下是鲁迅在《狂人日记》中如何批判封建礼教的几个方面:1. **社会现实与道德观念**:鲁迅通过讲述鲁迅的故事,探讨了社会的不平等、权力的流动、人性的演变等社会问题。他认为,社会不平等不仅仅是不公正的,更是每个人在社会中的不公正。他呼吁社会各界共同努力,通过教育、宣传、法律等手段,为构建和谐社会做出贡献。2. **道德与责任**:鲁迅通过讲述鲁迅的故事,展现了道德和责任的重要性。他提倡通过道德教育来培养人们的道德品质,同时也强调了道德和责任的重要性。鲁迅通过讲述鲁迅的故事,揭示了社会的不平等和人性的复杂性,体现了对人性的深刻理解。3. **道德与责任**:鲁迅通过讲述鲁迅的故事,强调了道德责任的重要性。他提倡通过自己的行为来体现对社会的尊重和对人性的深刻理解。鲁迅通过讲述鲁迅的故事,强调了道德和责任的重要性。4. **道德与道德**:鲁迅通过讲述鲁迅的故事,揭示了道德和责任的真正意义。鲁迅通过讲述鲁迅的故事,揭示了道德和责任的本质,强调了道德和责任的重要性。5. **个人与社会**:鲁迅通过讲述鲁迅的故事,揭示了个人与社会之间的复杂关系。他呼吁社会各界共同努力,通过教育和宣传,为构建和谐社会做出贡献。鲁迅的《狂人日记》通过其丰富的故事和深刻的道德寓言,传达了关于人性、道德、正义、道德以及个人与社会的复杂关系。这部作品不仅在文学价值上取得了巨大成就,更在思想层面上启发了后来的文学创作,对后世产生了深远的影响。
+[B]: 鲁迅的《狂人日记》是一部描绘战争和自由精神的作品,它批判了封建社会的种种问题和冲突,反映了社会的不平等和人性本质。在这部作品中,鲁迅通过对封建社会的诸多现象进行批判,如“弗雷德里克·约翰逊的“鹰迪伦”、“纳扎特的”、“贝克汉姆”、“乔治·威尔逊”等,对封建社会的种种问题进行了批判,强调了封建社会的不平等、人性本质以及人性本质。鲁迅通过他的“鹰迪伦”、“纳扎特的”、“贝克汉姆”、“乔治·威尔逊”等,对封建社会的种种问题进行了批判。在小说中,鲁迅通过对封建社会的诸多现象进行批判,揭示了封建社会的腐败、人性本质以及人性本质的复杂性。这种批判体现了对封建社会的尊重和对人性的复杂性,同时也反映了现代社会中普遍存在的不平等和人性本质。鲁迅的《狂人日记》不仅是一部关于封建社会的小说,也是一部具有深刻社会意义的作品,对后世的审判和批判起到了关键作用。
[Q]: 我咳嗽已经持续了两周,需要去医院检查吗?
- [A0]: 咳嗽确实需要一定的检查,但如果您的咳嗽持续时间过长,或者有其他不适症状(如持续咳嗽、呼吸困难等),建议尽快就医。医生会根据您的症状、体检结果和可能的影像学检查(如X光、MRI或CT扫描)来诊断您的症状。在等待就医期间,您可以尝试以下方法来帮助缓解症状:1. **休息**:确保您有足够的睡眠。2. **保持水分**:脱水可能会加重咳嗽。3. **热敷**:使用热敷袋或热毛巾敷在疼痛的部位,有助于缓解喉咙痛。4. **避免刺激**:避免使用可能刺激喉咙的化学物质,如某些化学物质。5. **保持室内湿润**:确保您的卧室有良好的通风,可以使用加湿器帮助保持空气湿度。6. **保持室内湿度**:使用加湿器可以帮助缓解喉咙干燥和咳嗽。如果症状持续不缓解或有其他不适症状,请立即就医。希望您早日康复!
- [A1]: 如果您的咳嗽持续了两周,通常建议您尽快就医,因为这可能会影响您的呼吸和心跳。您可以通过以下步骤来检查咳嗽:1. **休息**:用纸巾或肘部轻轻按压咳嗽的位置。2. **使用药物**:在可能的情况下,使用药物(如抗过敏药、抗过敏药等)来缓解咳嗽。请在医生的指导下使用药物。3. **热敷**:在咳嗽期间,可以使用热敷来缓解疼痛和肿胀。4. **药物治疗**:如果咳嗽持续了两周,可以使用一些消炎药或抗过敏药来缓解症状。请务必在医生的指导下使用。5. **保持健康的生活方式**:保持充足的睡眠、均衡的饮食、适量的运动和避免吸烟和限酒,这些都有助于提高咳嗽的几率。请记住,这些建议不能替代专业医疗意见。如果您的症状持续或加重,强烈建议您尽快就医。
+[A]: 咳嗽确实需要一定的检查,但如果您的咳嗽持续时间过长,或者有其他不适症状(如持续咳嗽、呼吸困难等),建议尽快就医。医生会根据您的症状、体检结果和可能的影像学检查(如X光、MRI或CT扫描)来诊断您的症状。在等待就医期间,您可以尝试以下方法来帮助缓解症状:1. **休息**:确保您有足够的睡眠。2. **保持水分**:脱水可能会加重咳嗽。3. **热敷**:使用热敷袋或热毛巾敷在疼痛的部位,有助于缓解喉咙痛。4. **避免刺激**:避免使用可能刺激喉咙的化学物质,如某些化学物质。5. **保持室内湿润**:确保您的卧室有良好的通风,可以使用加湿器帮助保持空气湿度。6. **保持室内湿度**:使用加湿器可以帮助缓解喉咙干燥和咳嗽。如果症状持续不缓解或有其他不适症状,请立即就医。希望您早日康复!
+[B]: 如果您的咳嗽持续了两周,通常建议您尽快就医,因为这可能会影响您的呼吸和心跳。您可以通过以下步骤来检查咳嗽:1. **休息**:用纸巾或肘部轻轻按压咳嗽的位置。2. **使用药物**:在可能的情况下,使用药物(如抗过敏药、抗过敏药等)来缓解咳嗽。请在医生的指导下使用药物。3. **热敷**:在咳嗽期间,可以使用热敷来缓解疼痛和肿胀。4. **药物治疗**:如果咳嗽持续了两周,可以使用一些消炎药或抗过敏药来缓解症状。请务必在医生的指导下使用。5. **保持健康的生活方式**:保持充足的睡眠、均衡的饮食、适量的运动和避免吸烟和限酒,这些都有助于提高咳嗽的几率。请记住,这些建议不能替代专业医疗意见。如果您的症状持续或加重,强烈建议您尽快就医。
```
-
-
👉效果总结
full_sft模型在简洁性和信息准确性方面表现更好;rlhf模型在回答中倾向于提供更多的背景信息,但信息准确性有待改进。
@@ -1022,9 +1358,34 @@ full_sft模型在简洁性和信息准确性方面表现更好;rlhf模型在
DPO和在线PPO的区别在于:DPO的rejected和chosen都是离线准备的,和minimind模型本身的输出必然存在很大的分布差异。
通俗地说DPO算法使模型观看乒乓球世界冠军的打法「录像」进行RL,而不是像PPO一样请reward模型做「教练」纠正自己的打法进行RL。
-## Ⅱ 主观样例测评
+## Ⅱ RLAIF对比
-🏃以下测试于2025-02-09完成,此日期后发布的新模型,无特殊需要时将不加入测试。
+基于`MiniMind2 (768dim)`在相同数据与训练步数下的主观观感对比,测试随机种子均固定为2026,仅供参考:
+
+[A] MiniMind-Reason-0.1B (蒸馏)
+[B] MiniMind-Reason-0.1B (PPO)
+[C] MiniMind-Reason-0.1B (GRPO)
+
+```text
+[Q]: 请介绍一下自己。
+[A]: 我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。我被设计用来理解和生成人类语言,能够理解复杂的自然语言问题,提供有用的信息和解决方案。我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。
+[B]: 嗯,用户让我介绍一下自己。首先,我需要明确用户的需求是什么。他们可能是在做某个项目,想要了解自己的一些基本情况。不过,用户没有提供具体的信息,所以我得先假设一些可能的情况,比如他是一个学生,正在学习编程,或者是某个项目的开发者。接下来,我要考虑用户的身份。可能是一个学生、教师或者普通用户,他们需要了解自己的特点和能力。也有可能他们正在寻找灵感,或者在写论文时需要帮助。所以,我需要保持回答的简洁和实用,同时提供一些参考资料,比如教学视频、课程记录等。然后,我要考虑用户的身份。可能是一个学生,需要一些学习资源或帮助的对象。也有可能是一个学生,需要在课后作业或学习计划中进行调整。因此,我需要先询问用户的具体需求,这样才能提供更精准的帮助。接下来,我要确保语言简洁明了,适合不同场景。比如,如果用户是学生,可以介绍自己的学习特点;如果是教师,可以说明自己的角色和贡献。同时,提供一些实用的建议,比如如何高效学习,如何进行项目,以及如何与团队成员协作。最后,我要确保回复清晰,有条理,能够满足用户的需求。这样,用户就能轻松地了解自己,并且有实际的帮助。当然!请告诉我你想了解哪方面的内容,我可以为您提供详细的介绍!
+[C]: 嗯,用户问的是"请介绍一下自己"。首先,我需要理解用户的具体需求是什么。他们可能想知道自己是谁,或者有什么特定的背景。接下来,我应该考虑用户可能没有明确表达的需求。他们可能想知道如何描述自己的外观,或者如何与他人交流。然后,我应该考虑如何组织内容,从头到尾都要有条理。可能需要分步骤说明,比如从视觉、听觉到整体感受,再到情感共鸣。这样用户能更清晰地理解自己,也能更好地回应他们的需求。还要注意语言的自然和流畅,避免过于学术化的术语,让用户感觉顺畅。同时,保持回答的结构清晰,每句话都有明确的目的,这样用户能有明确的理解和操作。最后,检查是否有遗漏的重要信息,确保回答准确且全面。这样,用户不仅能得到答案,还能得到满足他们的需求。您好!我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。关于您的描述,似乎涉及多个领域,包括知识图谱构建、复杂系统的推理与分析。无论是理论模型、实际应用还是伦理讨论,您都在寻求如何将这些概念有机地结合起来,为人类社会带来福祉的解决方案。在"我"的身份下,您提到的"个人体验"和"情感共鸣",都是您个人独特魅力的重要组成部分。通过精准的描述和深度的分析,您可以更深入地理解您的独特之处。如果您对特定领域有特别的兴趣或需要建议,随时可以分享更多细节,我们将为您提供更加精准和个性化的服务。如果您有特定的主题或问题,欢迎随时提问!让我们一起探索这个神奇的AI世界,共同创造更美好的未来。
+
+
+[Q]: 你更擅长哪一个学科?
+[A]: 我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。
+[B]: 嗯,用户问的是"你更擅长哪一个学科?"这个问题看起来有点复杂,但我可以一步步分析。首先,我需要理解用户的需求。他可能是在准备考试,或者需要了解某个学科的特点,或者是想在学习过程中应用某个学科的知识。接下来,我得考虑用户可能的背景。他可能是一个学生,或者是在准备考试,需要应用数学知识来解决具体的问题。也有可能他是一个学生,正在学习相关的课程,或者是对某个学科感兴趣。然后,我需要分析用户可能没有明确表达出来的需求。他可能在思考自己在学习中的困难,或者在寻找一个合适的教材或学习资源。考虑到这些,我应该提供几个不同的角度来帮助他。首先,分析"我更擅长哪一个学科"可以拆解为以下几个方面:兴趣、学习能力、学习风格、学习方法、学习资源、学习环境等等。每个方面都有不同的优势和特点,比如兴趣是最好的资源,学习能力是最好的方式,学习方式是根据个人情况选择。接下来,我需要考虑用户的身份。可能是学生,也可能是家长,或者对学习有一定兴趣的人。不同的身份可能有不同的偏好,比如家长可能更注重孩子的兴趣,而学生可能更关注学习方法和资源。然后,我需要考虑用户的身份是否有特殊需求。比如,他可能需要更专业的学习材料,或者希望找到最适合他的学习方法。因此,在分析时,我应该提供一个全面的回答,涵盖各个方面,并解释每个方面的重要性。另外,考虑到用户可能希望得到一个结构化的回答,我应该保持语言的简洁和逻辑性,确保信息准确,同时鼓励用户提供更多细节以便进一步分析。最后,我需要确保我的回答既有知识性,又符合用户的要求,帮助用户更好地理解和应用这个学科。因此,在回答中不仅要给出一个全面的分析,还要提到每个部分的作用和如何利用。你擅长哪一个学科?这是一个明确的问题,可以从以下几个方面进行分析:1. **兴趣与学习能力**:你喜欢什么样的学习方式,比如自主学习、实验、项目等,这些都是很好的学习资源。2. **学习风格**:你喜欢什么类型的学习方法,比如笔记、复习、阅读、讨论等,这些都能影响学习效果。3. **学习资源**:你是否在寻找学习资源,比如书籍、在线课程、教育软件等,这些都能提供学习帮助。4. **学习环境**:你是否在学习环境中感到舒适,比如舒适的学习空间、安静的学习空间等,这些都会影响学习效果。5. **学习资源**:你是否有学习资源的丰富,如在线课程、学习平台、书籍等,这些都能提供丰富的学习资源。6. **学习方法**:你是否有具体的学习方法,比如如何理解、如何组织学习内容、如何应用知识等,这些都是提高学习效率的途径。综合考虑,这是一个非常全面的分析,涵盖了多个方面,帮助你更好地理解你的学科。
+[C]: 嗯,用户问的是"你更擅长哪一个学科?"这个问题看起来有点复杂,但我可以一步步分析。首先,我需要理解用户的需求。他可能是在准备考试,或者需要了解某个学科的特点,或者是想在生活中找到自己的兴趣点。接下来,我要考虑用户可能的背景。他可能在工作或学习中遇到了类似的问题,想要了解自己擅长哪个领域,或者是否需要一些资源或指导来提升自己。然后,我要分析用户可能的深层需求。也许他想在职场上提升自己,或者寻找方向。他可能需要更具体的例子,帮助他更好地理解不同学科的特点和应用场景。还要考虑用户可能的背景。他可能不仅仅是在学习,而是希望了解在特定领域中如何应用这些学科。这需要我提供全面的分析,帮助他找到适合自己的方法。最后,我应该保持开放的态度,鼓励用户继续探索和学习。这可能包括推荐一些资源,或者讨论如何将这些知识应用到实际生活中。你更擅长哪一个学科?这个问题本质上取决于你的个人兴趣和目标。以下是一些关键点,帮助你全面理解并应用这一点:1. **兴趣与专长**:了解自己擅长哪个领域,有助于做出更明智的选择。2. **全面发展**:考虑学科的多样性和深度,以全面发展自己的生活和职业。3. **适应与调整**:选择与自己技能、兴趣和目标相匹配的学科,有助于应对挑战和实现目标。4. **职业规划**:探索适合自己的职业路径,提升效率和竞争力。5. **社会贡献**:评估自己在社会中的角色,为社会做出贡献。为了更好地理解并应用这一点,你可以:- **了解行业动态**:关注特定领域的专业发展。- **实践与反思**:通过项目或实习积累经验。- **职业发展**:评估自己的长期职业规划。- **持续学习**:不断更新知识,适应职场变化。通过这些方向,你可以更全面地认识自己,找到适合自己的发展路径。
+
+
+[Q]: 鲁迅的《狂人日记》是如何批判封建礼教的?
+[A]: 嗯,鲁迅的《狂人日记》是如何批判封建礼教的呢?这个问题听起来挺有意思的,尤其是他在《狂人日记》中的表达方式。首先,我需要确认一下《狂人日记》具体是谁写的,是否有特定的人物或者事件。《狂人日记》是鲁迅在1924年创作的,主要讨论了社会问题,尤其是封建礼教的问题。作者曹雪芹是个非常著名的作家,他的作品中有许多关于封建礼教和阶级斗争的观点。这类作品在文学、哲学和政治上都有很深远的影响,尤其是在社会结构和人性的方面。接下来,我需要了解鲁迅的写作风格。他的写作风格比较直接,有严肃的哲学观点,强调人性的悲剧和道德的虚无。他的语言风格比较富有诗意,富有象征意义,善于运用象征手法,同时又不失真实和现实主义。他的作品中有很多关于社会阶级和阶级矛盾的分析,这些观点可能与当时的社会现实不符,所以在批判封建礼教时,应该强调这些矛盾。然后,我想到《狂人日记》中对封建礼教的讨论。鲁迅的作品中,封建礼教被视为不人道的,但通过文字和行动,表达了他对封建礼教的不满和反抗。这不仅反映了当时的社会现实,也揭示了封建礼教的黑暗面,体现了对封建社会的不满和对人性的深刻反思。接下来,我需要思考鲁迅对封建礼教的理解和批判。他在《狂人日记》中,对封建礼教的质疑和批判,体现了他对传统封建礼教的否定和批判。这不仅仅是对封建礼教的质疑,更是对当时社会的反思,以及对人性的关注。最后,我需要总结一下鲁迅对封建礼教的批判和批判。他通过《狂人日记》中的文字和行为,批判了封建礼教的虚假和不合理,体现了对社会现实的深刻反思。同时,他也关注了封建礼教的黑暗面,反映了对封建社会的不满和对人性的否定。总的来说,鲁迅的《狂人日记》通过文字和行动,批判了封建礼教的虚假和不合理,体现了当时的社会现实和人性的复杂。他的作品对后来的封建礼教有了启示,对当时的社会现象有了更深的反思。鲁迅的《狂人日记》是一部反映封建礼教的文学作品,主要通过文字和行动反映了当时社会的现实。鲁迅的文字风格独特,富有象征意义,但通过文字和行动,批判了封建礼教的虚假和不合理。他的作品中,封建礼教被视为不人道的,但通过文字和行动,反映了封建社会的不公和人性的深刻反思。鲁迅的《狂人日记》通过文字和行动,批判了封建礼教的虚假和不合理,体现了当时社会的不满和对人性的反思。他的作品不仅批判了封建礼教的虚假和不合理,还深刻揭示了封建社会的黑暗面。
+[B]: 嗯,用户问的是鲁迅的《狂人日记》如何批判封建礼教。首先,我需要理解鲁迅的生平和作品。《狂人日记》是中国现代文学的经典,作者是毛泽东,作品涉及社会、家庭、伦理等多个层面。我需要找出鲁迅的写作风格和主题,以及它在当时的社会背景。接下来,我应该回顾《狂人日记》的主要内容。毛泽东在《青春无悔》中提出了"狂人日记"这一概念,强调青春活泼、勇敢、正义,这在当时社会中仍然具有重要意义。他通过描绘主人公狂人来表达对社会不公的批判,但同时也指出他的创作对社会的负面影响,比如社会动荡和个人主义的失败。然后,我应该分析鲁迅的创作理念和主题。《狂人日记》强调了社会的黑暗面,主张"以人为本"、"反对封建礼教"等。他可能也在探讨社会不平等的问题,比如"人性弱点"、"权力滥用"等问题。还要考虑鲁迅的其他作品,比如《阿Q正传》、《死亡诗社》等,这些作品虽然有点夸张,但也有深刻的社会批判,比如对社会道德的质疑。同时,作者可能还涉及其他作品,如《庆余年》等,但这些可能更侧重于文学创作,而不是社会批判。另外,用户可能想知道鲁迅在文学创作中的具体经历,比如他是否有作品被广泛批评,或者他对社会的批判是否具有批判性。所以,回答时需要指出鲁迅的创作经历,以及他对社会现实的反思。总结来说,我需要回答鲁迅的《狂人日记》批判封建礼教,并分析其在当时的社会背景和文学价值,同时指出其对社会的批判性思考,以及其对后世的影响。鲁迅的《狂人日记》批判封建礼教,以其独特的叙事风格和深刻的社会批判精神著称,主要体现在以下几个方面:1. **社会批判与个人成长**:小说通过描绘主人公狂人的活泼、勇敢、正义,揭示社会的不公和弱势。2. **个人主义与社会批判**:小说反映了封建礼教对个人自由和权利的侵犯,强调个人尊严和正义的价值。3. **对社会的批判性思考**:通过对社会不公的刻画,揭示社会问题的根源,并对当时社会的批判态度进行批判。鲁迅的作品在当时的社会背景和文学价值使其成为当时文学的重要代表。他的作品不仅在当时广受欢迎,也对后世产生了深远影响,被广泛引用和传颂。
+[C]: 嗯,用户问的是鲁迅的《狂人日记》如何批判封建礼教。首先,我需要理解鲁迅的背景和核心思想。他可能是一位社会学家、思想家,或者是对传统文化有深厚兴趣的读者。狂人日记是他在《狂人日记》中所写的,可能对传统文化有一定影响。接下来,我要考虑用户的需求是什么。用户可能想知道这本书在现代社会中的反思意义,或者是否有相关研究支持这一点。也有可能他们对书中的情节有疑问,想了解如何更深刻地理解社会现象。然后,我需要思考如何从不同的角度解读鲁迅的文本。例如,是否有直接的历史叙事,或者是否存在更深层次的哲学思考?这需要我先了解鲁迅的不同观点,然后分析其批判性。还要考虑用户可能的深层需求,比如是否需要了解他作为文学批判家的背景,或者是否有对传统价值观的反思。因此,回答时要全面,涵盖核心问题,并引用相关资料来支撑观点。最后,总结鲁迅的观点,强调其对现代社会的批判性思考,以及对传统文化的尊重。这样既全面又有深度,帮助用户全面理解鲁迅的作品。鲁迅的《狂人日记》通过其独特的叙事手法和深刻的社会批判精神,对封建礼教进行了深刻的批判。以下是其批判性分析:1. **社会批判与历史真相**:小说通过对历史人物的刻画,揭示封建礼教的弊端,以及其对传统社会结构的冲击。这种批判不仅限于对现实的深刻批判,也反映了现代社会对传统价值观的重新审视。2. **个人命运与社会批判**:小说中,主人公通过"狂人日记"揭示了封建礼教对个人命运的严重剥夺。这种叙事方式既批判了封建礼教,也对后世产生了深远影响。3. **对后世的影响**:从文学创作到思想解读,鲁迅的作品深刻影响了后世的思想体系。例如,《狂人日记》通过人物命运的反映,对现代社会的价值观念提出了新的思考。4. **对现代文明的反思**:这一批判性思考不仅反映了当代社会的复杂性,也为现代人提供了反思社会现实、维护人权的视角。5. **个人视角与文化视角**:作者通过这一批判性思考,展现了对传统价值观的重新解读,同时也对当代社会的多元化视角产生了启发。综上所述,鲁迅的《狂人日记》不仅是对封建礼教的批判,更是对传统文化的一种深刻反思。它通过对封建礼教的批判,提出了对现代社会结构、权力关系的深刻反思,具有重要的社会价值。
+```
+
+## Ⅲ 其他模型对比
[A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
[B] [MiniMind2-MoE (0.15B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
@@ -1193,7 +1554,6 @@ DPO和在线PPO的区别在于reject和chosen都是离线准备的,和minimind
| 7 | E | 22 | 23 | 15 | 14 | 74 |
| 8 | G | 10 | 12 | 10 | 10 | 42 |
-
### 👉主观效果总结
个人主观评价与DeepSeek-R1基本相符,其中:
@@ -1208,10 +1568,27 @@ DPO和在线PPO的区别在于reject和chosen都是离线准备的,和minimind
---
-## Ⅲ Objective Benchmark
+## Ⅳ RoPE长度外推
-下面就到喜闻乐见的benchmark刷榜测试环节,就不找乐子和qwen、glm级别的中文模型做对比了。
-这里选取了一些<1B的微型模型进行横评比较,
+MiniMind支持通过YaRN算法进行RoPE位置编码的长度外推,使模型能够处理超出训练长度的文本序列。
+在使用`eval_model.py`进行推理时,只需添加`--inference_rope_scaling True`参数即可启用RoPE外推:
+
+```bash
+python eval_model.py --inference_rope_scaling True
+```
+
+下图展示了以「西游记」白话文小说为输入、在不同文本长度下,使用RoPE scaling前后的困惑度(PPL)对比。可以看出,启用RoPE scaling后,模型在长文本上的表现显著提升:
+
+
+

+
+
+---
+
+## Ⅴ Objective Benchmark
+
+下面就到喜闻乐见的benchmark测试环节,就不找乐子和Qwen、GLM级别的模型做对比了。
+这里选取了一些微型模型进行横评比较,
测试集选择C-Eval、CMMLU、A-CLUE、TMMLU+这几个纯中文语言榜单。
@@ -1232,9 +1609,9 @@ lm_eval --model hf --model_args pretrained=<填写模型路径>,device=cuda,dtyp
PS: 在这种全是选择题的测评集中,为了避免回复格式的难以固定的特点,
所以常用做法是直接把`A`,`B`,`C`,`D`四个字母对应token的预测概率取出来,将其中概率最大的字母与标准答案计算正确率。
选择题1/4乱选的正确率是25%,然而这个量级的所有模型都集中在25附近,甚至很多时候不如瞎选,是不是像极了高中完形填空的滑铁卢正确率...
-MiniMind模型本身预训练数据集小的可怜,也没有针对性的对测试集做刷榜微调,因此结果图一乐即可:
+MiniMind模型本身预训练数据集小的可怜,也没有针对性的对测试集做刷榜微调,因此结果纯娱乐:
-| models | from | params↓ | ceval↑ | cm mlu↑ | aclue↑ | tmmlu+↑ |
+| models | from | params↓ | ceval↑ | cmmlu↑ | aclue↑ | tmmlu+↑ |
|-------------------------------------------------------------------------------|---------------|---------|--------|---------|--------|---------|
| MiniMind2 | JingyaoGong | 104M | 26.52 | 24.42 | 24.97 | 25.27 |
| MiniMind2-Small | JingyaoGong | 26M | 26.37 | 24.97 | 25.39 | 24.63 |
@@ -1247,7 +1624,7 @@ MiniMind模型本身预训练数据集小的可怜,也没有针对性的对测

-# 📌 其它 (Others)
+# 📌 Others
## 模型转换
@@ -1297,23 +1674,21 @@ MiniMind模型本身预训练数据集小的可怜,也没有针对性的对测
}'
```
-## VLLM模型推理(服务)
+## [vllm](https://github.com/vllm-project/vllm)
-[vLLM](https://github.com/vllm-project/vllm)是极其流行的高效推理框架,支持大模型快速部署,优化显存利用与吞吐量。
+vLLM是极其流行的高效推理框架,支持大模型快速部署,优化显存利用与吞吐量。
```bash
-vllm serve ./MiniMind2/ --model-impl transformers --served-model-name "minimind"
+vllm serve ./MiniMind2 --model-impl transformers --served-model-name "minimind" --port 8998
```
-服务将以openai api协议启动,端口默认为8000。
+## [llama.cpp](https://github.com/ggerganov/llama.cpp)
-更多用法请参考官方说明~
-
-## llama.cpp
-[llama.cpp](https://github.com/ggerganov/llama.cpp)是一个C++库,
+llama.cpp是一个C++库,
可以在命令行下直接使用,支持多线程推理,支持GPU加速。
参考官方仓库安装后,在`convert_hf_to_gguf.py` ~760行插入
+
```text
# 添加MiniMind2 tokenizer支持
if res is None:
@@ -1321,27 +1696,29 @@ if res is None:
```
转换自定义训练的minimind模型 -> gguf
+
```bash
python convert_hf_to_gguf.py ../minimind/MiniMind2/
```
量化模型
+
```bash
./build/bin/llama-quantize ../minimind/MiniMind2/MiniMind2-109M-F16.gguf ../minimind/MiniMind2/Q4-MiniMind2.gguf Q4_K_M
```
命令行推理
+
```bash
./build/bin/llama-cli -m ../minimind/MiniMind2/MiniMind2-109M-F16.gguf --chat-template chatml
```
-更多用法请参考官方说明~
+## [ollama](https://ollama.ai)
-## ollama
-
-[ollama](https://ollama.ai/)是本地运行大模型的工具,支持多种开源LLM,简单易用。
+ollama是本地运行大模型的工具,支持多种开源LLM,简单易用。
通过ollama加载自定义的gguf模型,新建minimind.modelfile:
+
```text
FROM ./MiniMind2-109M-F16.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
@@ -1353,17 +1730,19 @@ TEMPLATE """{{ if .System }}<|im_start|>system
```
加载模型并命名为`minimind2`
+
```bash
ollama create -f minimind.modelfile minimind2
```
启动推理
+
```text
ollama run minimind2
> 你好,我是MiniMind2,一个基于xxxxxxxx
```
-更多用法请参考官方说明~
+> 以上三方框架的更多用法请参考对应官方文档😊
# 📌 Acknowledge
diff --git a/README_en.md b/README_en.md
index fb1ea19..026659f 100644
--- a/README_en.md
+++ b/README_en.md
@@ -17,7 +17,13 @@
-
"The Greatest Path is the Simplest"
+
+
+
+
+
+
+
"The Simplest Path is the Greatest"
@@ -32,7 +38,8 @@
making it possible to train quickly on even the most ordinary personal GPUs.
* The project also open-sources the minimalist structure of the large model, including extensions for shared mixed
experts (MoE), dataset cleaning, pretraining, supervised fine-tuning (SFT), LoRA fine-tuning, direct preference
- optimization (DPO) algorithms, and model distillation algorithms, along with the full code of the process.
+ optimization (DPO) algorithms, reinforcement learning from AI feedback (RLAIF: PPO/GRPO/SPO), and model distillation
+ algorithms, along with the full code of the entire process.
* **MiniMind** also expands into vision multimodal VLM: [MiniMind-V](https://github.com/jingyaogong/minimind-v).
* All core algorithm code is reconstructed from scratch using native PyTorch! It does not rely on abstract interfaces
provided by third-party libraries.
@@ -41,8 +48,7 @@
* We hope this project will serve as an inspiring example for everyone, helping to enjoy the fun of creation and
promoting the progress of the wider AI community!
- > To avoid misunderstanding, the "2 hours" test is based on NVIDIA 3090 hardware (single GPU), and the "3 RMB" refers
- to the GPU server rental cost. Details of the specifications can be found below.
+> To avoid misunderstanding, the "2 hours" test is based on NVIDIA 3090 hardware (single GPU), and the "3 RMB" refers to the GPU server rental cost. Details of the specifications can be found below.
---
@@ -67,136 +73,129 @@
-
+
-
-
# 📌 Introduction
-The emergence of Large Language Models (LLMs) has sparked unprecedented global attention on AI. Whether it's ChatGPT,
-DeepSeek, or Qwen, their stunning performance leaves people in awe. However, the massive scale of hundreds of billions
-of parameters makes it not only difficult to train them on personal devices, but also almost impossible to deploy them.
-Opening the "black box" of large models and exploring their inner workings is exhilarating! Sadly, 99% of explorations
-can only stop at fine-tuning existing large models with techniques like LoRA, learning a few new commands or tasks. It's
-like teaching Newton how to use a 21st-century smartphone—though interesting, it completely deviates from the original
-goal of understanding the essence of physics. Meanwhile, third-party large model frameworks and toolkits, such as
-transformers+trl, almost only expose highly abstract interfaces. With just 10 lines of code, you can complete the entire
-training process of "loading model + loading dataset + inference + reinforcement learning". While this efficient
-encapsulation is convenient, it's like a high-speed spaceship, isolating us from the underlying implementation and
-hindering our opportunity to dive deep into the core code of LLMs. However, "building a plane with Legos is far more
-exciting than flying first-class!" What's worse, the internet is flooded with paid courses and marketing accounts,
-selling AI tutorials with flawed and half-understood content. Therefore, the goal of this project is to lower the
-learning threshold for LLMs, allowing everyone to start by understanding each line of code, and to train a very small
-language model from scratch, not just performing **inference**! With server costs of less than 3 RMB, you can experience
-the entire process of building a language model from 0 to 1. Let's enjoy the fun of creation together!
+The emergence of Large Language Models (LLMs) has sparked unprecedented global attention to AI.
+Whether it's ChatGPT, DeepSeek, or Qwen, they all demonstrate stunning performance that is awe-inspiring.
+However, with their massive scale of tens of billions of parameters, they are not only difficult to train on personal devices but nearly impossible to deploy.
+Opening the "black box" of large models to explore their internal mechanisms is truly thrilling!
+Unfortunately, 99% of exploration can only stop at using techniques like LoRA to perform minor fine-tuning on existing large models to learn new instructions or tasks.
+This is like teaching Newton how to use a 21st-century smartphone—while interesting, it completely deviates from the original intent of understanding the essence of physics.
+Meanwhile, third-party large model frameworks and toolkits, such as transformers+trl, expose only highly abstract interfaces.
+With just 10 lines of code, you can complete the entire workflow of "loading model + loading dataset + inference + reinforcement learning."
+While such efficient packaging is convenient, it also acts like a high-speed spacecraft, isolating developers from underlying implementations and hindering deep exploration of LLM core code.
+Yet, "building a plane with Lego is far more exciting than flying in first class!"
+What's worse, the internet is flooded with expensive courses and marketing accounts selling AI tutorials with countless flaws and superficial understanding.
+For this reason, this project's original intention is to lower the barrier to entry for LLM learning, allowing everyone to start by understanding every line of code,
+to personally train an extremely small language model from scratch. Yes, **training from scratch**, not just **inference**!
+With less than 3 RMB in server costs, you can personally experience the entire process of building a language model from 0 to 1.
+Let's enjoy the fun of creation together!
-> [!NOTE]
-> (As of 2025-02-07) The MiniMind series has completed pretraining for multiple models, with the smallest one being only
-> 25.8M (0.02B) and capable of smooth conversation!
+> [!NOTE]
+> (As of 2025-02-07) The MiniMind series has completed pretraining of multiple model variants, with the smallest being only 25.8M (0.02B), capable of fluent conversation!
Models List
-| Model (Size) | Inference Usage (Approx.) | Release |
-|-------------------------|---------------------------|------------|
-| MiniMind2-small (26M) | 0.5 GB | 2025.04.26 |
-| MiniMind2-MoE (145M) | 1.0 GB | 2025.04.26 |
-| MiniMind2 (104M) | 1.0 GB | 2025.04.26 |
-| minimind-v1-small (26M) | 0.5 GB | 2024.08.28 |
-| minimind-v1-moe (4×26M) | 1.0 GB | 2024.09.17 |
-| minimind-v1 (108M) | 1.0 GB | 2024.09.01 |
+| Model (Size) | Inference Memory (Approx) | Release |
+|------------------------|---------------------------|------------|
+| MiniMind2-small (26M) | 0.5 GB | 2025.04.26 |
+| MiniMind2-MoE (145M) | 1.0 GB | 2025.04.26 |
+| MiniMind2 (104M) | 1.0 GB | 2025.04.26 |
+| minimind-v1-small (26M)| 0.5 GB | 2024.08.28 |
+| minimind-v1-moe (4×26M)| 1.0 GB | 2024.09.17 |
+| minimind-v1 (108M) | 1.0 GB | 2024.09.01 |
**Project Includes**
-- All code for the MiniMind-LLM structure (Dense+MoE models).
-- Detailed training code for the Tokenizer.
-- Full training code for Pretrain, SFT, LoRA, RLHF-DPO, and model distillation.
-- High-quality datasets collected, distilled, cleaned, and deduplicated at all stages, all open-source.
-- From scratch implementation of pretraining, instruction fine-tuning, LoRA, DPO reinforcement learning, and white-box
- model distillation. Most key algorithms do not rely on third-party encapsulated frameworks and are all open-source.
-- Compatible with third-party frameworks like `transformers`, `trl`, `peft`, etc.
-- Training supports single machine single GPU, single machine multi-GPU (DDP, DeepSpeed), and wandb visualized training
- processes. Supports dynamic start/stop of training.
-- Model testing on third-party evaluation benchmarks (C-Eval, C-MMLU, OpenBookQA, etc.).
-- A minimal server implementing the Openai-Api protocol, easy to integrate into third-party ChatUI applications (
- FastGPT, Open-WebUI, etc.).
-- Fully compatible with popular community inference engines like llama.cpp, vllm, ollama, or training frameworks such as Llama-Factory.
-- A simple chat WebUI front-end implemented using streamlit.
-- Reproduction (distillation/RL) of the large inference model DeepSeek-R1 as the MiniMind-Reason model, **data + model**
- all open-source!
+- Complete code for MiniMind-LLM structure (Dense + MoE models).
+- Detailed training code for Tokenizer.
+- Complete training code for Pretrain, SFT, LoRA, RLHF-DPO, RLAIF (PPO/GRPO/SPO), and model distillation.
+- Collected, distilled, organized and cleaned high-quality datasets for all stages, all open-sourced.
+- Implemented from scratch: pretraining, instruction fine-tuning, LoRA, DPO/PPO/GRPO/SPO reinforcement learning, and white-box model distillation. Core algorithms barely depend on third-party framework encapsulation, all open-sourced.
+- Compatible with mainstream third-party frameworks like `transformers`, `trl`, `peft`.
+- Training supports single GPU, multiple GPUs on a single machine (DDP, DeepSpeed), supports wandb/swanlab visualization of training process. Supports dynamic training start/stop.
+- Model testing on third-party evaluation leaderboards (C-Eval, C-MMLU, OpenBookQA, etc.), supports YaRN algorithm for RoPE long-text extrapolation.
+- Implements an extremely simple OpenAI API-compliant server, convenient for integration with third-party ChatUI (FastGPT, Open-WebUI, etc.).
+- Implements the simplest chat WebUI frontend based on streamlit.
+- Fully compatible with popular community inference engines `llama.cpp`, `vllm`, `ollama` or training framework `Llama-Factory`.
+- Reproduced (distilled/RL) DeepSeek-R1 reasoning model as MiniMind-Reason model, with **data + models** fully open-sourced!
-We hope this open-source project can help LLM beginners quickly get started!
+We hope this open-source project can help LLM beginners get started quickly!
-### 👉**Update log**
+### 👉**Update Log**
-
- 2025-04-26 (newest 🎉🎉🎉)
+
+ 2025-10-24 (newest🎉)
-• Major Updates
+- 🔥 Added RLAIF training algorithms: PPO, GRPO, SPO (native implementation from scratch)
+- Added RLAIF dataset: rlaif-mini.jsonl (randomly sampled 10,000 entries from SFT data)
+- Added YaRN algorithm: supports RoPE long-text extrapolation, improving long sequence handling capability
+- Adaptive Thinking: Reason model can optionally enable thinking chain
+- chat_template fully supports Tool Calling and Reasoning tags (``, ``, etc.)
+- Added complete RLAIF chapter, training curve comparison, algorithm principle explanations
+- [SwanLab](https://swanlab.cn/) replaces WandB (friendly for domestic access, fully compatible API)
+- Fixed some known bugs
-• For compatibility needs, visit [🔗Legacy Repository Content🔗](https://github.com/jingyaogong/minimind/tree/7da201a944a90ed49daef8a0265c959288dff83a).
+
-• MiniMind model parameters have been fully renamed to align with Transformers library models (unified naming).
+
+ 2025-04-26
-• The `generate` method has been refactored, now inheriting from the `GenerationMixin` class.
+- Important update
+- For compatibility needs, you can visit [🔗old repository content🔗](https://github.com/jingyaogong/minimind/tree/7da201a944a90ed49daef8a0265c959288dff83a).
+- MiniMind model parameters completely renamed, aligned with Transformers library models (unified naming).
+- generate method refactored, inheriting from GenerationMixin class.
+- 🔥 Supports popular third-party ecosystems like llama.cpp, vllm, ollama.
+- Standardized code and directory structure.
+- Modified vocabulary ``->`<|im_start|><|im_end|>`
-• 🔥 Support for popular third-party ecosystems like llama.cpp, vllm, and ollama.
-
-• Standardized code and directory structure.
-
-• Updated vocabulary tokens: `` → `<|im_start|><|im_end|>`.
-
-
-```text
-To ensure compatibility with third-party inference frameworks (llama.cpp, vllm), this update comes at a non-trivial cost.
-Models saved before 2025-04-26 can no longer be **directly** loaded for inference.
-Due to differences in positional encoding between Llama and MiniMind, QK values diverge after weight mapping.
-MiniMind2 legacy models have been restored via weight mapping + (fine-tuning) QKVO linear layer calibration.
-After this update, maintenance for the entire `minimind-v1` series will be discontinued, and the models will be removed from the repository.
-```
-
+```text
+To be compatible with the third-party inference frameworks llama.cpp and vllm, this update comes at a non-trivial cost.
+Old models saved before 2025-04-26 can no longer be loaded "directly" for inference.
+Due to differences in positional encoding between Llama and MiniMind, QK values diverge after weight mapping.
+MiniMind2 legacy models have been restored via weight mapping plus (fine-tuning) QKVO linear-layer calibration.
+After this update, maintenance of the entire minimind-v1 series is discontinued, and those models are removed from the repository.
+```
+
2025-02-09
-- Major update since the release, with the release of MiniMind2 Series.
-- Almost all code has been refactored, using a more streamlined and unified structure.
- For compatibility with old code, please refer to
- the [🔗Old Repository Contents🔗](https://github.com/jingyaogong/minimind/tree/6e9cd28ef9b34a0a10afbdf6f59e65cb6e628efb).
-- Removed the data preprocessing step. Unified dataset format, switched to `jsonl` format to eliminate issues with
- dataset downloads.
-- MiniMind2 series shows a significant improvement over MiniMind-V1.
-- Minor issues: {kv-cache syntax is more standard, MoE load balancing loss is considered, etc.}
-- Provided a training solution for transferring the model to private datasets (e.g., medical models, self-awareness
- examples).
-- Streamlined the pretraining dataset and significantly improved the quality of the pretraining data, greatly reducing
- the time needed for personal rapid training, with a single 3090 GPU achieving reproduction in just 2 hours!
-- Updated: LoRA fine-tuning now operates outside of the `peft` wrapper, implemented LoRA process from scratch; DPO
- algorithm is implemented using native PyTorch; native model white-box distillation.
-- MiniMind2-DeepSeek-R1 series distilled models have been created!
-- MiniMind2 now has some English proficiency!
-- Updated MiniMind2 performance results based on additional large model benchmark tests.
+- Major update since release, Release MiniMind2 Series.
+- Code almost completely refactored, using cleaner and more unified structure.
+ For compatibility with old code, you can visit [🔗old repository content🔗](https://github.com/jingyaogong/minimind/tree/6e9cd28ef9b34a0a10afbdf6f59e65cb6e628efb).
+- Eliminated data preprocessing steps. Unified dataset format, switched to `jsonl` format to avoid dataset download confusion.
+- MiniMind2 series shows significant improvement compared to MiniMind-V1.
+- Minor improvements: {more standard kv-cache writing, MoE load balancing loss considered, etc.}
+- Provides training solutions for model migration to private datasets (medical models, self-awareness examples).
+- Streamlined pretraining dataset and significantly improved pretraining data quality, greatly reducing time for quick personal training, single 3090 GPU can reproduce in 2 hours!
+- Updates: LoRA fine-tuning separated from peft packaging, implemented from scratch; DPO algorithm implemented from scratch using native PyTorch; white-box model distillation native implementation.
+- MiniMind2-DeepSeek-R1 series distilled models born!
+- MiniMind2 now has some English ability!
+- Updated MiniMind2 and third-party model performance results based on more large model leaderboard tests.
2024-10-05
-- Expanded MiniMind to include multimodal capabilities—visual.
-- Check out the twin project [minimind-v](https://github.com/jingyaogong/minimind-v) for more details!
+- Extended MiniMind with multimodal capabilities---Vision
+- Check out the twin project [minimind-v](https://github.com/jingyaogong/minimind-v) for details!
2024-09-27
-- Updated preprocessing method for the pretrain dataset on 09-27 to ensure text integrity. The method of converting to
- .bin for training has been abandoned (slightly sacrificing training speed).
-- The preprocessed pretrain file is now named: pretrain_data.csv.
+- 09-27 updated the preprocessing method for the pretrain dataset, abandoned preprocessing into .bin format for training to ensure text integrity (slightly sacrificing training speed).
+- Current pretrain preprocessing file is named: pretrain_data.csv.
- Removed some redundant code.
@@ -204,18 +203,16 @@ After this update, maintenance for the entire `minimind-v1` series will be disco
2024-09-17
-- Updated minimind-v1-moe model.
-- To avoid ambiguity, the mistral_tokenizer is no longer used, and all tokenization is done with the custom
- minimind_tokenizer.
+- Updated minimind-v1-moe model
+- To avoid ambiguity, no longer using mistral_tokenizer for tokenization, completely using custom minimind_tokenizer as the tokenizer.
2024-09-01
-- Updated minimind-v1 (108M) model, using minimind_tokenizer, with 3 pretraining rounds + 10 SFT rounds, allowing for
- more comprehensive training and improved performance.
-- The project has been deployed on ModelScope Creative Space and can be experienced on the site:
+- Updated minimind-v1 (108M) model, using minimind_tokenizer, 3 pretraining rounds + 10 SFT rounds, more thorough training, stronger performance.
+- Project has been deployed to ModelScope creation space, you can experience it on this website:
- [🔗ModelScope Online Experience🔗](https://www.modelscope.cn/studios/gongjy/minimind)
@@ -223,14 +220,14 @@ After this update, maintenance for the entire `minimind-v1` series will be disco
2024-08-27
-- Initial open-source release of the project.
+- Project first open-sourced
# 📌 Quick Start
-Sharing My Hardware and Software Configuration (For Reference Only)
+Share my hardware and software configuration (for reference only)
* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
* RAM: 128 GB
@@ -248,8 +245,7 @@ After this update, maintenance for the entire `minimind-v1` series will be disco
git clone https://github.com/jingyaogong/minimind.git
```
-## Ⅰ Test Pre-trained Model
-
+## Ⅰ Testing Existing Model Performance
### 1. Environment Setup
@@ -257,14 +253,15 @@ git clone https://github.com/jingyaogong/minimind.git
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
+### 2. Download Model
-### 2. Download the Model
+Go to the project root directory
```bash
git clone https://huggingface.co/jingyaogong/MiniMind2
```
-### (Optional) Command-line Q&A
+### (Optional) Command Line Q&A
```bash
# load=0: load from pytorch model, load=1: load from transformers-hf model
@@ -279,7 +276,7 @@ python eval_model.py --load 1 --model_mode 2
streamlit run web_demo.py
```
-### (Optional) Third-party inference frameworks
+### (Optional) Third-party Inference Frameworks
```bash
# ollama
@@ -288,7 +285,7 @@ ollama run jingyaogong/minimind2
vllm serve ./MiniMind2/ --served-model-name "minimind"
```
-## Ⅱ Training from Scratch
+## Ⅱ Train from Scratch Yourself
### 1. Environment Setup
@@ -297,40 +294,33 @@ pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
-Note: Test if Torch can use CUDA
+Note: Test Torch CUDA availability in advance
```bash
import torch
print(torch.cuda.is_available())
```
-If CUDA is not available, please download the `.whl` file
-from [torch_stable](https://download.pytorch.org/whl/torch_stable.html) and install it. Refer to
-this [link](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
-for guidance.
+If not available, please download and install the whl file from [torch_stable](https://download.pytorch.org/whl/torch_stable.html). Reference [link](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
-### 2. Data Download
+### 2. Download Data
-Download the required data files from
-the [dataset download link](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files)
-(please `mkdir dataset`) and place them in the `./dataset` directory.
+Download the required data files from the [dataset download link](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files) provided below (create the `./dataset` directory) and place them in `./dataset`
-Note: Dataset Information
+Note: Dataset Notes
-By default, it is recommended to download `pretrain_hq.jsonl` + `sft_mini_512.jsonl` for the fastest Zero-chat model
-reproduction.
+By default, it is recommended to download `pretrain_hq.jsonl` + `sft_mini_512.jsonl` for the fastest reproduction of the Zero chat model.
-You can freely choose data files. Various combinations are provided below, and you can select according to your training
-needs and GPU resources.
+You can freely choose data files. The section below provides multiple combination schemes that can be appropriately combined based on your training needs and GPU resources.
### 3. Start Training
-The directory is located at `trainer`
+Directory is located in `trainer`
**3.1 Pretraining (Learning Knowledge)**
@@ -338,186 +328,173 @@ The directory is located at `trainer`
python train_pretrain.py
```
-> Execute pretraining to get `pretrain_*.pth` as the output weights for pretraining (where * represents the model
-> dimension, default is 512).
+> Execute pretraining to get `pretrain_*.pth` as the output weights for pretraining (where * is the model's dimension, default is 512)
-
-**3.2 Supervised Fine-Tuning (Learning Dialogue Style)**
+**3.2 Supervised Fine-tuning (Learning Conversation Style)**
```bash
python train_full_sft.py
```
-> Execute supervised fine-tuning to get `full_sft_*.pth` as the output weights for instruction fine-tuning (where `full`
-> represents full parameter fine-tuning).
-
+> Execute supervised fine-tuning to get `full_sft_*.pth` as the output weights for instruction fine-tuning (where `full` means full-parameter fine-tuning)
-Note: Training Information
+Note: Training Notes
-By default, during training, the model parameters are saved every 100 steps to `./out/***.pth` (each time overwriting
-the old weight file).
+By default, all training processes save parameters to the file `./out/***.pth` every 100 steps (each save overwrites the old weights).
-For simplicity, only the two training stages are listed here. For other training methods (LoRA, distillation,
-reinforcement learning, fine-tuning inference, etc.), refer to the detailed explanation in the [Experiments] section
-below.
+For simplicity, only the two-stage training process is described here. For other training (LoRA, distillation, reinforcement learning, inference fine-tuning, etc.), refer to the detailed description in the [Experiment](#-experiment) section below.
---
-### 4. Testing Model Performance
+### 4. Test Your Trained Model
-Ensure that the model `*.pth` file you want to test is located in the `./out/` directory.
-Alternatively, you can download and use the `*.pth` files I trained
-from [here](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files).
+Ensure the model `*.pth` files to be tested are in the `./out/` directory.
+You can also directly download and use the `*.pth` files I trained from [here](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files).
```bash
-python eval_model.py --model_mode 1 # Default is 0: Test pretrain model, set to 1: Test full_sft model
+python eval_model.py --model_mode 1 # Default 0: test pretrain model, set to 1: test full_sft model
```
-Note: Testing Information
+Note: Testing Notes
-For more details, you can check the `eval_model.py` script code. The model_mode options are 0: Pretraining model, 1:
-SFT-Chat model, 2: RLHF-Chat model, 3: Reason model.
+For details, check the `eval_model.py` script code.
+
+model_mode options: [required] 0: pretrained model, 1: SFT-Chat model | [optional] 2: RLHF model, 3: Reason model, 4/5: RLAIF model
---
> [!TIP]
-> All training scripts are built using PyTorch's native framework and support multi-GPU acceleration. If your device has
-> N (N>1) GPUs:
+> All training scripts are native PyTorch framework, supporting multi-GPU acceleration. Assume your device has N (N > 1) GPUs:
-Start training with N GPUs on a single machine (DDP, supports multi-node, multi-GPU clusters):
+Single machine N GPU training startup (DDP, supports multi-machine multi-GPU cluster)
```bash
torchrun --nproc_per_node N train_xxx.py
```
-Note: Others
+Note: Other Notes
-Start training with N GPUs on a single machine (DeepSpeed):
+Single machine N GPU training (DeepSpeed)
```bash
deepspeed --master_port 29500 --num_gpus=N train_xxx.py
```
-Enable wandb to record the training process if needed:
+You can optionally enable wandb to record the training process (requires direct internet connection)
```bash
-# Need to log in: wandb login
+# Requires login: wandb login
torchrun --nproc_per_node N train_xxx.py --use_wandb
# and
python train_xxx.py --use_wandb
```
-By adding the `--use_wandb` parameter, the training process will be recorded, and after training, you can view the
-process on the wandb website. Modify the `wandb_project` and `wandb_run_name` parameters to specify project and run
-names.
+By adding the `--use_wandb` parameter, you can record the training process. After training is complete, you can view the training process on the wandb website. By modifying the `wandb_project` and `wandb_run_name` parameters, you can specify the project name and run name.
+
+[Note]: Since June 2025, networks in mainland China can no longer connect to WandB directly. MiniMind therefore switches to [SwanLab](https://swanlab.cn/) as the default training visualization tool (fully compatible with the WandB API): simply change `import wandb` to `import swanlab as wandb`, and no other changes are needed.
-# 📌 Data Overview
+# 📌 Data Introduction
## Ⅰ Tokenizer
-A tokenizer maps words from natural language into numbers like `0, 1, 36`, which can be understood as the page numbers
-in a "dictionary". You can either construct your own vocabulary to train a tokenizer (code is available
-in `./scripts/train_tokenizer.py`, for educational purposes; MiniMind comes with a built-in tokenizer, so training one
-is unnecessary unless absolutely needed), or you can choose from well-known open-source tokenizers.
-
-The advantage of using a popular dictionary, like the Xinhua or Oxford dictionary, is that the token encoding has good
-compression efficiency, but the downside is that the vocabulary can be very large, with hundreds of thousands of words
-or phrases. On the other hand, a custom tokenizer allows flexibility in controlling the vocabulary's length and content,
-but the trade-off is lower compression efficiency (e.g., "hello" might be split into five independent tokens like "h", "
-e", "l", "l", "o"), and it may miss rare words.
-
-The choice of vocabulary is crucial. The output of an LLM is essentially a multi-class classification problem over the
-vocabulary, with the model decoding the final output into natural language. Since MiniMind's model size needs to be
-strictly controlled, the vocabulary length should be kept short to avoid the embedding layer dominating the model's
-overall parameters. Thus, a smaller vocabulary size is beneficial.
+Tokenizer maps words from natural language to numbers like `0, 1, 36` through a "dictionary," which can be understood as numbers representing the page number of the word in the "dictionary."
+You can choose to construct your own vocabulary table to train a "dictionary." The code can be found in `./scripts/train_tokenizer.py` (for learning reference only. It's not necessary to train one yourself unless required. MiniMind comes with a built-in tokenizer).
+Or you can choose tokenizers from well-known open-source large models.
+Using a ready-made "dictionary" such as Xinhua or Oxford has the advantage of good token-encoding compression, but the disadvantage of a very large vocabulary of hundreds of thousands of words and phrases;
+a self-trained tokenizer allows free control over vocabulary length and content, but its compression ratio is lower (for example, "hello" might be split into the
+five independent tokens "h", "e", "l", "l", "o"), and rare words are hard to cover.
+The choice of "dictionary" matters: the output of an LLM is essentially a SoftMax multi-class classification over the N words in the dictionary, which is then decoded back into natural language through that "dictionary."
+Because MiniMind's size must be strictly controlled to avoid a top-heavy model (the embedding layer taking too large a share of total parameters), a shorter vocabulary is preferable.
-Tokenizer Details
+Tokenizer Introduction
-Here are the vocabulary sizes of several popular open-source models:
+The tokenizer vocabulary sizes of powerful open-source models from third parties such as Yi, qwen, chatglm, mistral, and Llama3 are as follows:
-| Tokenizer Model | Vocabulary Size | Source |
-|--------------------|-----------------|-----------------------|
-| yi tokenizer | 64,000 | 01万物 (China) |
-| qwen2 tokenizer | 151,643 | Alibaba Cloud (China) |
-| glm tokenizer | 151,329 | Zhipu AI (China) |
-| mistral tokenizer | 32,000 | Mistral AI (France) |
-| llama3 tokenizer | 128,000 | Meta (USA) |
-| minimind tokenizer | 6,400 | Custom |
+
+| Tokenizer Model    | Vocabulary Size | Source                |
+|--------------------|-----------------|-----------------------|
+| yi tokenizer       | 64,000          | 01AI (China)          |
+| qwen2 tokenizer    | 151,643         | Alibaba Cloud (China) |
+| glm tokenizer      | 151,329         | Zhipu AI (China)      |
+| mistral tokenizer  | 32,000          | Mistral AI (France)   |
+| llama3 tokenizer   | 128,000         | Meta (USA)            |
+| minimind tokenizer | 6,400           | Custom                |
+
-> 👉 **2024-09-17 Update**: To avoid ambiguity in previous versions and control model size, all MiniMind models now use
-> the `minimind_tokenizer`. All previous versions using the `mistral_tokenizer` have been deprecated.
+> 👉 Update (2024-09-17): To avoid ambiguity with previous versions and to control the model size, all MiniMind models now use minimind_tokenizer for tokenization; all earlier mistral_tokenizer versions have been deprecated.
```
-# Some personal thoughts
-> Although the `minimind_tokenizer` has a smaller vocabulary size and the encoding/decoding efficiency is weaker than other Chinese-friendly tokenizers like `qwen2` or `glm`, MiniMind has chosen to use this custom tokenizer to maintain a lightweight model overall and avoid an imbalance between the embedding and computation layers.
-> The `minimind_tokenizer` vocabulary size is only 6400, which ensures that the total parameters of the LLM are kept to a minimum (around 25.8M).
-> The training data for this tokenizer (`tokenizer_train.jsonl`) is sourced from the "Jiangshu Large Model Dataset". This part of the data is relatively less important, but you can freely choose any data for training if needed.
+# Some thoughts
+> Although minimind_tokenizer has a small vocabulary and its encoding/decoding efficiency is weaker than Chinese-friendly tokenizers such as qwen2 and glm,
+> MiniMind chose the self-trained minimind_tokenizer to keep the overall parameter count lightweight and avoid an imbalance between the embedding layer and the computation layers (a top-heavy model), since the vocabulary size is only 6400.
+> In practice, MiniMind has never hit rare-word decoding failures in testing, and the results are good.
+> Because the custom vocabulary is compressed to 6400 entries, the total LLM parameter count can be as low as 25.8M.
+> The tokenizer training data `tokenizer_train.jsonl` all comes from the `JiangShu large model dataset`; this part of the data is relatively secondary. If you need to train a tokenizer, you can freely choose your own data.
```
-## Ⅱ Pretrain Data
+## Ⅱ Pretraining Data
-After learning from the poor-quality pretraining data of MiniMind-V1, which resulted in nonsensical outputs, I decided
-not to use large-scale unsupervised datasets for pretraining post-`2025-02-05`. Instead, I extracted the Chinese portion
-of the [Jiangshu Large Model Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data), cleaned the
-content to include only characters of length `<512`, resulting in around 1.6GB of high-quality pretraining data, saved
-as `pretrain_hq.jsonl`.
+Having learned from MiniMind-V1, whose low-quality pretraining data caused the model to talk nonsense, after `2025-02-05` we decided to stop using large-scale unsupervised datasets for pretraining.
+Instead, we extracted the Chinese portion of the [JiangShu Large Model Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data),
+cleaned out about 1.6GB of corpus with character length `<512`, and concatenated it directly into the pretraining data `pretrain_hq.jsonl`, where hq stands for high
+quality (of course it is not truly "high" yet; improving data quality is a never-ending task).
-The data format for `pretrain_hq.jsonl` is:
+The data format of `pretrain_hq.jsonl` is:
-```bash
-{"text": "如何才能摆脱拖延症? 治愈拖延症并不容易,但以下建议可能有所帮助..."}
+```json
+{"text": "How can I get rid of procrastination? Curing procrastination is not easy, but the following suggestions may help..."}
```
## Ⅲ SFT Data
-The [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data) is a complete,
-well-formatted dataset for large model training and research. It includes approximately 10M Chinese sentences and 2M
-English sentences. However, the provided format is messy, and using the entire dataset for SFT would be too costly.
+The [JiangShu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)
+"is a complete, uniformly formatted, and safe resource for large model training and research.
+It collects and curates a large number of open-source datasets from public sources on the internet, unifies their format, and cleans the data;
+it contains a Chinese dataset with 10M entries and an English dataset with 2M entries."
+The above is the official introduction. After downloading, the total data volume is about 4B tokens, which certainly makes it suitable as SFT data for a Chinese large language model.
+However, the official data format is messy, and using all of it for SFT would be too expensive.
+I performed a second round of cleaning on the official dataset, removing entries polluted by symbols and noise; in addition, again only content with a total length `<512` was kept,
+in the hope that a large amount of dialogue at this stage can supplement knowledge missing from the pretraining phase.
+The exported file is `sft_512.jsonl` (~7.5GB).
-I have cleaned this dataset, removing noisy entries with special characters and symbols, and only kept content with a
-length `<512`. This cleaned dataset is exported as `sft_512.jsonl` (~7.5GB).
+The [Magpie-SFT Dataset](https://www.modelscope.cn/organization/Magpie-Align)
+collects ~1M high-quality conversations from Qwen2/2.5. I cleaned this data further and exported the portion with total length `<2048` as `sft_2048.jsonl` (~9GB)
+and the portion with length `<1024` as `sft_1024.jsonl` (~5.5GB). Using a large model's dialogue data directly for SFT falls into the "black-box distillation" category.
-Additionally, I have collected around 1M high-quality dialogue data from Qwen2/2.5, cleaned and exported the content
-with lengths `<2048` into `sft_2048.jsonl` (~9GB) and those with lengths `<1024` into `sft_1024.jsonl` (~5.5GB).
+The SFT data from the two steps above was cleaned once more (keeping only content with a high ratio of Chinese characters), and conversations with length `<512` were filtered into `sft_mini_512.jsonl` (~1.2GB).
-Further cleaning of these SFT datasets (only keeping content with a higher ratio of Chinese characters) resulted
-in `sft_mini_512.jsonl` (~1.2GB).
-
-The data format for all SFT files `sft_X.jsonl` is as follows:
+The data format of all SFT files `sft_X.jsonl` is:
```text
{
"conversations": [
- {"role": "user", "content": "你好"},
- {"role": "assistant", "content": "你好!"},
- {"role": "user", "content": "再见"},
- {"role": "assistant", "content": "再见!"}
+ {"role": "user", "content": "Hello"},
+ {"role": "assistant", "content": "Hello!"},
+ {"role": "user", "content": "Goodbye"},
+ {"role": "assistant", "content": "Goodbye!"}
]
}
```
## Ⅳ RLHF Data
-The [Magpie-DPO Dataset](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1) contains around 200k
-preference data generated from Llama3.1-70B/8B and can be used for training reward models to optimize response quality
-according to human preferences.
+The [Magpie-DPO Dataset](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1)
+contains approximately 200k preference pairs (all in English) generated from Llama3.1-70B/8B. It can be used to train reward models and to optimize reply quality so that it better matches human preferences.
+Here, content with a total length `<3000` was reorganized into `dpo.jsonl` (~0.9GB), which contains the two fields `chosen` and `rejected`, where `chosen`
+is the preferred reply and `rejected` is the rejected one.
-I have cleaned this dataset by combining data with a total length `<3000` into `dpo.jsonl` (~0.9GB), which
-contains `chosen` (preferred) and `rejected` (rejected) replies.
-
-The data format for `dpo.jsonl` is:
+The data format of `dpo.jsonl` is:
```text
{
@@ -532,43 +509,34 @@ The data format for `dpo.jsonl` is:
}
```
-## Ⅴ Reasoning Dataset
+## Ⅴ Reasoning Dataset
-The excitement over **DeepSeek** in February 2025 has greatly influenced my interest in RL-guided reasoning models. I
-have already replicated **R1-Zero** using Qwen2.5. If time allows and if it works, I plan to update MiniMind with a
-reasoning model trained with RL, rather than a distilled model.
+It is undeniable that in February 2025 nothing was hotter than DeepSeek...
+It also sparked my strong interest in RL-guided reasoning models. I have already reproduced R1-Zero with Qwen2.5.
+If time allows and the results are good (though 99% of base models this small lack the ability), I will later update MiniMind with a reasoning model trained via RL rather than by distillation.
+With limited time, the fastest and lowest-cost route is still direct distillation (the black-box approach).
+Unable to resist R1's popularity, within just a few days various R1 distillation datasets appeared, such as [R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B), [R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT),
+[Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH),
+[deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh), and so on. Purely Chinese data remains relatively scarce.
+I finally merged them and exported the file `r1_mix_1024.jsonl`; its data format is consistent with `sft_X.jsonl`.
-Currently, the quickest and cost-effective approach is still distillation (black-box style). But due to the popularity
-of **R1**, I’ve combined several distilled datasets related to **R1**, including:
+## Ⅵ More Datasets
-- [R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B)
-- [R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT)
-- [Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH)
-- [deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh)
-
-After combining these, I exported the file as `r1_mix_1024.jsonl`. The format of this file is the same as `sft_X.jsonl`.
-
-## Ⅵ Additional Datasets
-
-For more datasets related to Chinese LLMs, you can refer
-to [HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM), which collects and organizes
-open-source models, applications, datasets, and tutorials for Chinese LLMs. It's comprehensive and regularly updated.
-Big respect!
+[HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM)
+collects and organizes materials related to Chinese LLMs, including open-source models, applications, datasets, and tutorials, and continuously tracks the latest progress in this field. It is comprehensive and professional. Respect!
---
-## Ⅷ MiniMind Dataset Download
+## Ⅷ MiniMind Training Datasets
> [!NOTE]
-> After `2025-02-05`, MiniMind’s open-source datasets for final training are provided, so there is no need for
-> you to preprocess large datasets by yourself anymore. This helps avoid redundant work.
+> After 2025-02-05, all datasets used for final MiniMind training are open-sourced. Therefore, you don't need to preprocess large-scale datasets yourself, avoiding repetitive data processing work.
-Available for download from:
-[ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main)
+MiniMind Training Dataset Download: [ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main)
-> You don’t need to clone everything, just download the necessary files.
+> No need to clone everything, you can download the files you need individually
-Place the downloaded dataset files in the `./dataset/` directory (✨ required files are marked):
+Place the downloaded dataset files in the `./dataset/` directory (files marked ✨ are the recommended, required ones):
```bash
./dataset/
@@ -577,6 +545,7 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ required f
├── lora_medical.jsonl (34MB)
├── pretrain_hq.jsonl (1.6GB, ✨)
├── r1_mix_1024.jsonl (340MB)
+├── rlaif-mini.jsonl (1MB)
├── sft_1024.jsonl (5.6GB)
├── sft_2048.jsonl (9GB)
├── sft_512.jsonl (7.5GB)
@@ -585,95 +554,112 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ required f
```
- Dataset Descriptions
+Note: Brief Description of Each Dataset
-* `dpo.jsonl` -- RLHF dataset
-* `lora_identity.jsonl` -- Self-identity dataset (e.g., "Who are you? I'm MiniMind..."), recommended for LoRA training (
- also applicable for full parameter SFT)
-* `lora_medical.jsonl` -- Medical Q&A dataset, recommended for LoRA training (also applicable for full parameter SFT)
-* `pretrain_hq.jsonl`✨ -- Pretraining dataset from Jiangshu Technology
-* `r1_mix_1024.jsonl` -- DeepSeek-R1-1.5B distilled dataset (max length 1024 characters)
-* `sft_1024.jsonl` -- Qwen2.5 distilled data (subset of sft_2048, max length 1024)
-* `sft_2048.jsonl` -- Qwen2.5 distilled data (max length 2048)
-* `sft_512.jsonl` -- Jiangshu SFT dataset (max length 512)
-* `sft_mini_512.jsonl`✨ -- Minimal Jiangshu + Qwen2.5 distilled dataset (max length 512)
-* `tokenizer_train.jsonl` -- From Jiangshu Large Model Dataset (not recommended for custom tokenizer training)
+* `dpo.jsonl` -- RLHF-stage dataset
+* `lora_identity.jsonl` -- Self-identity dataset (e.g., "Who are you?" "I am minimind..."), recommended for LoRA training (also usable for full-parameter SFT; don't be limited by the name)
+* `lora_medical.jsonl` -- Medical Q&A dataset, recommended for LoRA training (also usable for full-parameter SFT; don't be limited by the name)
+* `pretrain_hq.jsonl`✨ -- Pretraining dataset, integrated from JiangShu Technology
+* `r1_mix_1024.jsonl` -- DeepSeek-R1-1.5B distilled data; the maximum character length per entry is 1024 (so set max_seq_len=1024 for training)
+* `rlaif-mini.jsonl` -- RLAIF training dataset: 10,000 high-quality conversations randomly sampled from the SFT data, used for training reinforcement learning algorithms such as PPO/GRPO/SPO
+* `sft_1024.jsonl` -- Integrated from Qwen2.5 distilled data (a subset of sft_2048); the maximum character length per entry is 1024 (so set max_seq_len=1024 for training)
+* `sft_2048.jsonl` -- Integrated from Qwen2.5 distilled data; the maximum character length per entry is 2048 (so set max_seq_len=2048 for training)
+* `sft_512.jsonl` -- Integrated from JiangShu SFT data; the maximum character length per entry is 512 (so set max_seq_len=512 for training)
+* `sft_mini_512.jsonl`✨ -- Minimal integration of JiangShu SFT data + Qwen2.5 distilled data (for quickly training a Zero model); the maximum character length per entry is 512 (so set max_seq_len=512 for training)
+* `tokenizer_train.jsonl` -- All from the `JiangShu large model dataset`; this part of the data is relatively secondary. (Retraining the tokenizer yourself is not recommended; see the reasons above.) If you do want to train one, you can freely choose your own datasets.
-

-Explanation & Recommended Training Plans
+Instructions & Recommended Training Schemes
-* The MiniMind2 Series has been trained on approximately 20GB of corpus, or about 4B tokens, corresponding to the data
- combination results above (Cost: 💰💰💰💰💰💰💰💰, Effect: 😊😊😊😊😊😊).
+* The MiniMind2 series was trained on roughly 20GB of corpus in total, about 4B tokens, corresponding to the data combination above (cost: 💰💰💰💰💰💰💰💰, results: 😊😊😊😊😊😊)
-* For the fastest Zero-model implementation from scratch, it is recommended to use the data combination
- of `pretrain_hq.jsonl` + `sft_mini_512.jsonl`. The specific costs and effects can be seen in the table below (Cost:
- 💰, Effect: 😊😊).
+* To build a Zero model from scratch as fast as possible, we recommend the data combination `pretrain_hq.jsonl` + `sft_mini_512.jsonl`; the specific cost and results are shown in the table below (cost: 💰, results: 😊😊)
-* For those with sufficient computational resources or more focus on results, it is advisable to fully reproduce
- MiniMind2 with the first option; if you only have a single GPU or prefer a quick reproduction within a short time, the
- second option is strongly recommended.
+* If you have sufficient computing resources or care more about results, consider the former to fully reproduce MiniMind2; if you only have a single GPU or want a quick reproduction in a short time, we strongly recommend the latter;
-* [Compromise Plan] You can also freely combine medium-sized data like `sft_mini_512.jsonl`, `sft_1024.jsonl` for
- training (Cost: 💰💰💰, Effect: 😊😊😊😊).
+* [Compromise] You can also freely combine medium-sized data such as `sft_mini_512.jsonl` and `sft_1024.jsonl` for training (cost: 💰💰💰, results: 😊😊😊😊).
-# 📌 Model Structure
+# 📌 Model
-MiniMind-Dense (like [Llama3.1](https://ai.meta.com/blog/meta-llama-3-1/)) uses the Transformer Decoder-Only structure,
-which differs from GPT-3 in the following aspects:
+## Structure
-* It adopts GPT-3's pre-normalization method, meaning normalization is done on the input of each Transformer sub-layer
- instead of on the output. Specifically, RMSNorm normalization function is used.
-* The SwiGLU activation function is used instead of ReLU to improve performance.
-* Like GPT-Neo, absolute position embeddings are removed and replaced with rotary position embeddings (RoPE), which
- perform better when handling inference beyond the training length.
+MiniMind-Dense (like [Llama3.1](https://ai.meta.com/blog/meta-llama-3-1/)) uses a Transformer Decoder-Only structure. Its differences from GPT-3 are:
+
+* It adopts GPT-3's pre-normalization: normalization is applied at the input of each Transformer sub-layer rather than at its output, specifically using the RMSNorm normalization function.
+* ReLU is replaced with the SwiGLU activation function to improve performance.
+* Like GPT-Neo, absolute position embeddings are removed in favor of rotary position embeddings (RoPE), which behave better for inference beyond the training length (a sketch of the first two components follows this list).
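+
+A minimal PyTorch sketch of the two components named above (shapes and the 1408 hidden size are illustrative assumptions, not MiniMind's exact code):
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class RMSNorm(nn.Module):
+    """Pre-norm: rescale by the root-mean-square of the features (no mean subtraction, no bias)."""
+    def __init__(self, dim: int, eps: float = 1e-5):
+        super().__init__()
+        self.eps = eps
+        self.weight = nn.Parameter(torch.ones(dim))
+
+    def forward(self, x):
+        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
+
+class SwiGLUFFN(nn.Module):
+    """Gated FFN: silu(x W_gate) * (x W_up), projected back down by W_down."""
+    def __init__(self, dim: int, hidden: int):
+        super().__init__()
+        self.gate = nn.Linear(dim, hidden, bias=False)
+        self.up = nn.Linear(dim, hidden, bias=False)
+        self.down = nn.Linear(hidden, dim, bias=False)
+
+    def forward(self, x):
+        return self.down(F.silu(self.gate(x)) * self.up(x))
+
+x = torch.randn(2, 16, 512)                # (batch, seq, d_model)
+y = SwiGLUFFN(512, 1408)(RMSNorm(512)(x))  # pre-norm, then the FFN sub-layer
+print(y.shape)                             # torch.Size([2, 16, 512])
+```
+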
---
-The MiniMind-MoE model is based on the MixFFN mixture of experts module from Llama3
-and [Deepseek-V2/3](https://arxiv.org/pdf/2405.04434).
+The MiniMind-MoE model structure is based on Llama3 and the MixFFN mixture-of-experts module of [Deepseek-V2/3](https://arxiv.org/pdf/2405.04434).
-* DeepSeek-V2, in terms of feedforward networks (FFN), adopts finer-grained expert splitting and shared expert isolation
- techniques to improve the performance of Experts.
+* In its feed-forward networks (FFN), DeepSeek-V2 adopts finer-grained expert splitting and shared-expert isolation to improve the effectiveness of the experts (a generic sketch follows).
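+
+For intuition, here is a generic top-k-routing sketch with one shared expert. The 1 shared + 4 routed split mirrors the "share+route 1+4" column in the table further below, but the dimensions, top_k, stand-in experts, and dense routing loop are purely illustrative and are not the project's implementation:
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+def ffn(dim, hidden):
+    # Stand-in expert; the real module would be a SwiGLU FFN
+    return nn.Sequential(nn.Linear(dim, hidden, bias=False), nn.SiLU(), nn.Linear(hidden, dim, bias=False))
+
+class MoEFFN(nn.Module):
+    """One always-on shared expert + top-k routing over a few small routed experts."""
+    def __init__(self, dim=640, hidden=1024, n_routed=4, top_k=2):
+        super().__init__()
+        self.shared = ffn(dim, hidden)                    # shared expert: every token passes through it
+        self.experts = nn.ModuleList(ffn(dim, hidden) for _ in range(n_routed))
+        self.gate = nn.Linear(dim, n_routed, bias=False)  # router
+        self.top_k = top_k
+
+    def forward(self, x):                                 # x: (batch, seq, dim)
+        scores = F.softmax(self.gate(x), dim=-1)          # per-token routing probabilities
+        weight, index = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
+        out = self.shared(x)
+        for k in range(self.top_k):
+            for e, expert in enumerate(self.experts):
+                mask = (index[..., k] == e).unsqueeze(-1) # tokens whose k-th choice is expert e
+                out = out + mask * weight[..., k:k + 1] * expert(x)
+        return out
+
+print(MoEFFN()(torch.randn(2, 8, 640)).shape)             # torch.Size([2, 8, 640])
+```
+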
---
-The overall structure of MiniMind remains consistent, with only minor adjustments made to RoPE computation, inference
-functions, and FFN layers.
-The structure is as shown in the figure below (redrawn):
+MiniMind's overall structure is consistent, with only small adjustments in RoPE computation, inference functions, and FFN layer code.
+The structure is shown in the diagram below (redrawn version):
(model structure diagrams: dense and MoE versions)
-For model configuration modifications, see [./model/LMConfig.py](./model/LMConfig.py).
-Reference model parameter versions are shown in the table below:
+To modify the model configuration, see [./model/LMConfig.py](./model/LMConfig.py).
+Reference model parameter configurations are listed in the table below:
-| Model Name | params | len_vocab | n_layers | d_model | kv_heads | q_heads | share+route |
-|-------------------|--------|-----------|----------|---------|----------|---------|-------------|
-| MiniMind2-Small | 26M | 6400 | 8 | 512 | 2 | 8 | - |
-| MiniMind2-MoE | 145M | 6400 | 8 | 640 | 2 | 8 | 1+4 |
-| MiniMind2 | 104M | 6400 | 16 | 768 | 2 | 8 | - |
-| minimind-v1-small | 26M | 6400 | 8 | 512 | 8 | 16 | - |
-| minimind-v1-moe | 4×26M | 6400 | 8 | 512 | 8 | 16 | 1+4 |
-| minimind-v1 | 108M | 6400 | 16 | 768 | 8 | 16 | - |
+| Model Name | params | len_vocab | rope_theta | n_layers | d_model | kv_heads | q_heads | share+route |
+|-------------------|--------|-----------|------------|----------|---------|----------|---------|-------------|
+| MiniMind2-Small | 26M | 6400 | 1e6 | 8 | 512 | 2 | 8 | - |
+| MiniMind2-MoE | 145M | 6400 | 1e6 | 8 | 640 | 2 | 8 | 1+4 |
+| MiniMind2 | 104M | 6400 | 1e6 | 16 | 768 | 2 | 8 | - |
+| minimind-v1-small | 26M | 6400 | 1e4 | 8 | 512 | 8 | 16 | - |
+| minimind-v1-moe | 4×26M | 6400 | 1e4 | 8 | 512 | 8 | 16 | 1+4 |
+| minimind-v1 | 108M | 6400 | 1e4 | 16 | 768 | 8 | 16 | - |
+
+
+## Model Configuration
+
+📋 Regarding LLM parameter configuration, an interesting paper, [MobileLLM](https://arxiv.org/pdf/2402.14905), has done detailed research and experiments.
+The Scaling Law has its own distinct patterns for small models.
+The parameters that drive Transformer scaling depend almost entirely on `d_model` and `n_layers`.
+
+* `d_model`↑ + `n_layers`↓ -> wide and shallow
+* `d_model`↓ + `n_layers`↑ -> narrow and deep
+
+The 2020 Scaling Law paper argued that the amount of training data, the parameter count, and the number of training iterations are the key factors determining performance, while the influence of model architecture is almost negligible.
+However, this law does not seem to fully hold for small models.
+MobileLLM argues that architecture depth matters more than width: a "deep and narrow" ("tall and skinny") model can learn more abstract concepts than a "wide and shallow" one.
+For example, with the parameter count fixed at 125M or 350M, "narrow" models with 30-42 layers clearly outperform "wide" models with around 12 layers,
+and a similar trend holds across 8 benchmarks covering commonsense reasoning, question answering, and reading comprehension.
+This is a very interesting finding, because previously almost no one tried stacking more than 12 layers when designing architectures for ~100M-scale small models.
+It also matches what MiniMind observed in its own experiments when trading off `d_model` against `n_layers` during training.
+However, "deep and narrow" has its dimensional limits: when d_model<512, the drawback of collapsing embedding dimensions is very obvious,
+and extra layers cannot make up for the insufficient d_head caused by a fixed number of q_heads.
+When d_model>1536, increasing the number of layers seems to take priority over increasing d_model and yields more "cost-effective" parameter -> performance gains.
+
+* MiniMind therefore sets dim=512, n_layers=8 for the small model to strike the balance of "tiny size <-> better performance".
+* It sets dim=768, n_layers=16 to obtain larger performance gains, which better matches the Scaling-Law curve for small models.
+
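+A rough parameter estimate that reproduces the table above from `d_model` and `n_layers` (a sketch assuming tied input/output embeddings, grouped-query attention, and a SwiGLU hidden size of ~8/3·d_model rounded to a multiple of 64; norm weights are ignored):
+
+```python
+def approx_params(vocab, d, n_layers, q_heads, kv_heads, ffn_hidden):
+    """Tied embedding + per-layer attention (GQA) and SwiGLU FFN weights."""
+    head_dim = d // q_heads
+    attn = d * d                           # W_q
+    attn += 2 * d * (kv_heads * head_dim)  # W_k, W_v
+    attn += d * d                          # W_o
+    ffn = 3 * d * ffn_hidden               # gate, up, down projections
+    return vocab * d + n_layers * (attn + ffn)
+
+print(approx_params(6400, 512, 8, 8, 2, 1408) / 1e6)   # ≈ 25.8  (MiniMind2-Small)
+print(approx_params(6400, 768, 16, 8, 2, 2048) / 1e6)  # ≈ 104.0 (MiniMind2)
+```
+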
+For reference, GPT-3's parameter settings are shown in the table below:
+
+(GPT-3 parameter settings table)
+---
# 📌 Experiment
-## Ⅰ Training Cost
+## Ⅰ Training Costs
-- **Time Unit**: Hours (h).
-- **Cost Unit**: RMB (¥); 7¥ ≈ 1 USD.
-- **3090 Rental Unit Price**: ≈ 1.3¥/h (subject to real-time market rates).
-- **Reference Standard**: The table only shows the actual training time for the `pretrain` and `sft_mini_512` datasets.
- Other times are estimated based on dataset size (there may be some discrepancies).
+- **Time unit**: Hours (h).
+- **Cost unit**: Chinese Yuan (¥); 7¥ ≈ 1 USD.
+- **3090 rental price**: ≈1.3¥/h (you can check current prices yourself).
+- **Reference standard**: the table only lists the actually measured training time for the `pretrain` and `sft_mini_512` datasets; the other times are estimated from dataset size (there may be slight deviations)
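+
+As a quick sanity check, the RMB figures in the summary below are simply training hours multiplied by the rental price above:
+
+```python
+price = 1.3                                    # ¥ per 3090 GPU-hour (rental price above)
+for hours, label in [(2.1, "MiniMind-Zero"), (38.16, "MiniMind2-Small"), (122, "MiniMind2")]:
+    print(label, round(hours * price, 2), "¥") # 2.73 / 49.61 / 158.6 ¥
+```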
-> Based on 3090 (single card) cost calculation
+> Based on 3090 (single GPU) cost calculation
| Model Name | params | pretrain | sft_mini_512 | sft_512 | sft_1024 | sft_2048 | RLHF |
|-----------------|--------|------------------|------------------|---------------|-------------------|------------------|---------------|
@@ -683,96 +669,85 @@ Reference model parameter versions are shown in the table below:
---
-Training Cost Summary & Prediction
+Training Cost Summary & Forecast
> MiniMind2-Small Parameters
->> `pretrain_hq` + `sft_mini_512` dataset
-Single 3090 GPU (1 epoch) + 2.1 hours + Cost: 2.73 RMB
-You can train the MiniMind-Zero-0.025B model from scratch!!!
+>> `pretrain_hq` + `sft_mini_512` dataset
+Single 3090 GPU (1 epoch) + 2.1 hours + cost: 2.73 RMB
+You can train the MiniMind-Zero-0.025B model from scratch!!!
> MiniMind2-Small Parameters
->> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
-Single 3090 GPU (2 epochs) + approximately 38.16 hours + Cost: 49.61 RMB
-You can train the MiniMind2-Small-0.025B model from scratch!!!
+>> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
+Single 3090 GPU (2 epochs) + approximately 38.16 hours + cost: 49.61 RMB
+You can train the MiniMind2-Small-0.025B model from scratch!!!
> MiniMind2 Parameters
->> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
-Single 3090 GPU (2 epochs) + approximately 122 hours + Cost: 158.6 RMB
-You can train the MiniMind2-0.1B model from scratch!!!
+>> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
+Single 3090 GPU (2 epochs) + approximately 122 hours + cost: 158.6 RMB
+You can train the MiniMind2-0.1B model from scratch!!!
+✨ On a single NVIDIA 3090, `MiniMind-Zero` needs only `2 hours` + `3 RMB` to go from scratch to a working ChatBot!
-✨With a single NVIDIA 3090 GPU, you can train `MiniMind-Zero` from scratch in just `2 hours` and for a cost of
-only `3 RMB`, achieving ChatBot functionality!
+✨ PS: With 8× 4090 GPUs, the total training time can even be compressed to under 10 minutes! (Because the time is shorter, the cost is still about 3 RMB, comparable to the single-GPU cost.)
-✨PS: If training on 8 GPUs with 4090s, the total time can be compressed to under 10 minutes! (Despite the shorter time,
-the cost is still around 3 RMB, which is comparable to the single GPU cost.)
+✨ An extremely low barrier to entry, so that everyone can afford to play with large models: that is the original intention behind the MiniMind series!
-✨This enables ultra-low barriers to entry, making it possible for everyone to experiment with large models! This is the
-original purpose behind the creation of the MiniMind series!
+✨ The `3 RMB` `MiniMind-Zero` is not a gimmick! Chat test:
-✨The `MiniMind-Zero` model, which costs only `3 RMB`, is not a gimmick! Chat test results:
-
-```textmate
+```text
👶: Please introduce yourself.
-🤖️: As an artificial intelligence, I don't have a real life or self-awareness, so I don't have my own life. I am designed to help users answer questions, provide information, engage in conversations, etc. My design and functionality are built using computer science, machine learning, artificial intelligence technologies, and algorithms, which I implement through programming and algorithms.
+🤖️: As an artificial intelligence, I have no actual life and no self-awareness, so I don't have my own life. I was designed to help users answer questions, provide information, and have conversations, etc. My design and functionality are built by computer science, machine learning, artificial intelligence technology and algorithms, and I'm implemented through programming and algorithms.
-👶: Recommend some local specialties from Hangzhou.
-🤖️: Hangzhou has many delicious local specialties, such as chicken skewers, eel heads, water-boiled fish, tofu pudding, etc. These foods are all very tasty, with unique flavors and textures, definitely worth trying.
+👶: Recommend some special dishes from Hangzhou.
+🤖️: Hangzhou has many delicious special dishes, such as chicken kebabs, eel head, boiled fish, tofu brain, etc. These dishes are very delicious, with unique flavors and tastes, definitely worth trying.
-👶: Please explain the concept of "large language models".
-🤖️: A large language model is an AI technology that uses large amounts of text data to train the model, and then continuously optimizes the model through iterations. Its goal is to generate text based on the input data to meet specific needs, and the text can be about any topic you are interested in.
+👶: Please explain the concept of "large language models" to me.
+🤖️: A large language model is a type of artificial intelligence technology that uses large amounts of text data to train models, then continuously optimizes the model through iterations. Its goal is to generate text that meets specific needs based on input data. These texts can be any topic you're interested in.
👶: Introduce the history of the United States, please.
-🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
+🤖️: The "Introok's the believeations of theument" you mentioned gets its name from the ancient Chinese "groty of of the change."
```
-Fast and effective, it is still possible to further compress the training process by obtaining smaller and
-higher-quality datasets.
-The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file link below). Feel free to
-download and test the model's performance.
+It gets results extremely fast, and the process could be compressed even further by collecting smaller, higher-quality training data.
+The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file links below); feel free to download and test this model.
-## Ⅱ Main Training Steps
+---
-> All training scripts are executed in the `cd ./trainer` directory.
+## Ⅱ Main Training (Required)
-### **1. Pretraining**:
+> All training scripts should be executed in the `cd ./trainer` directory
-The first task for LLM is not to interact directly with humans, but to fill the network parameters with knowledge. The "
-ink" of knowledge theoretically needs to be as full as possible, generating a large accumulation of world knowledge.
-Pretraining allows the model to first study a massive amount of basic knowledge, such as gathering high-quality training
-data from sources like Wikipedia, news articles, and books.
-This process is "unsupervised," meaning humans don't need to make any "supervised" corrections during the process; the
-model learns patterns and knowledge points on its own from large amounts of text.
-The goal at this stage is simple: **learn word chaining**. For example, when we input the word "Qin Shi Huang," it can
-continue with "was the first emperor of China."
+### **1. Pretraining (Pretrain)**:
+
+The first thing an LLM has to learn is not how to talk to people, but how to fill its network parameters with the "ink" of knowledge. Ideally, that ink should be as saturated as possible, accumulating a vast amount of knowledge about the world.
+Pretraining is where the model first studies hard and absorbs a large amount of basic knowledge, for example from large-scale, high-quality training data organized from Wikipedia, news, books, and so on.
+This process is "unsupervised": humans do not need to make any "supervised" corrections along the way; the model summarizes patterns and learns knowledge from large amounts of text on its own.
+The model's goal at this stage is just one thing: **learn next-word prediction**. For example, given the input "Qin Shi Huang", it can continue with "was the first emperor of China".
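+
+A minimal sketch of the objective described above (next-token prediction with cross-entropy); the tensors are toy placeholders, not the project's training loop:
+
+```python
+import torch
+import torch.nn.functional as F
+
+vocab_size, seq_len = 6400, 8
+logits = torch.randn(1, seq_len, vocab_size)         # model outputs for one sequence
+tokens = torch.randint(0, vocab_size, (1, seq_len))  # that sequence's token ids
+
+# Shift by one position: predict token t+1 from everything up to token t
+loss = F.cross_entropy(
+    logits[:, :-1].reshape(-1, vocab_size),          # predictions for positions 0..T-2
+    tokens[:, 1:].reshape(-1),                       # targets are positions 1..T-1
+)
+print(loss)  # the "word prediction" loss minimized during pretraining
+```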
```bash
-torchrun --nproc_per_node 1 train_pretrain.py # 1 represents single-card training, adjust according to hardware (set >=2)
+torchrun --nproc_per_node 1 train_pretrain.py # 1 means single GPU training, adjust based on your hardware (set >=2 for multiple GPUs)
# or
python train_pretrain.py
```
-> The trained model weights are saved every `100 steps` by default as: `pretrain_*.pth` (the * represents the specific
-> model dimension, and each new save will overwrite the previous one).
+> After training, model weight files are saved by default every `100 steps` as: `pretrain_*.pth` (where *
+> is the model's specific dimension, new files overwrite old ones on each save)
+
+| MiniMind2-Small (512dim) | MiniMind2 (768dim) |
+|---|---|
+| (pretrain loss curve) | (pretrain loss curve) |
### **2. Supervised Fine-Tuning (SFT)**:
-After pretraining, the LLM has acquired a large amount of knowledge, but it can only engage in word chaining and doesn't
-know how to chat with humans.
-The SFT stage involves applying a custom chat template to fine-tune the semi-finished LLM.
-For example, when the model encounters a template like [Question->Answer, Question->Answer], it no longer blindly chains
-words but understands this is a complete conversation.
-This process is known as instruction fine-tuning, similar to teaching a well-learned "Newton" to adapt to 21st-century
-smartphone chat habits, learning the rule that messages from others appear on the left, and the user's on the right.
-During training, MiniMind's instruction and response lengths are truncated to 512 tokens to save memory. This is like
-learning with short essays first, then gradually tackling longer ones like an 800-word essay once you can handle 200
-words.
-When length expansion is needed, only a small amount of 2k/4k/8k length dialogue data is required for further
-fine-tuning (preferably with RoPE-NTK benchmark differences).
-> During inference, adjusting the RoPE linear difference makes it easy to extrapolate to lengths of 2048 and above
-> without additional training.
+After pretraining, the LLM has already absorbed a large amount of knowledge, but at this point it only knows how to do mindless next-word prediction and does not yet know how to chat with people.
+The SFT stage applies a custom chat template to fine-tune this semi-finished LLM (a sketch of what that means follows below).
+For example, once the model has seen a template of the form [question->answer, question->answer], it no longer blindly continues the text but realizes where a complete conversation ends.
+This process is called instruction fine-tuning: it is like helping the already very knowledgeable Mr. "Newton" adapt to 21st-century smartphone chat habits, learning that messages on the left of the screen are from the other person and those on the right are his own.
+During training, MiniMind truncates instructions and answers at a length of 512 to save GPU memory. Just as with learning to write, you start with short pieces; once 200-character essays come easily, 800-character articles are no big deal.
+When longer lengths are needed, you only have to prepare a small amount of 2k/4k/8k-length dialogue data for further fine-tuning (preferably combined with RoPE-NTK scaling).
+> During inference, adjusting the RoPE scaling makes it easy to extrapolate to lengths of 2048 and beyond without additional training.
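+
+A minimal sketch of what "applying a chat template" means in practice. The role markers here are hypothetical; the real template is defined by the project's tokenizer and differs from this:
+
+```python
+# Render one conversation into a single training string with explicit role markers,
+# and record which spans (the assistant replies) should contribute to the loss.
+conversation = [
+    {"role": "user", "content": "Hello"},
+    {"role": "assistant", "content": "Hello!"},
+]
+
+text, loss_spans = "", []
+for turn in conversation:
+    rendered = f"<|{turn['role']}|>{turn['content']}<|end|>"  # hypothetical special tokens
+    if turn["role"] == "assistant":
+        loss_spans.append((len(text), len(text) + len(rendered)))
+    text += rendered
+
+print(text)        # <|user|>Hello<|end|><|assistant|>Hello!<|end|>
+print(loss_spans)  # character spans that stay unmasked when building the loss mask
+```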
```bash
torchrun --nproc_per_node 1 train_full_sft.py
@@ -780,78 +755,45 @@ torchrun --nproc_per_node 1 train_full_sft.py
python train_full_sft.py
```
-> The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific
-> model dimension, and each new save will overwrite the previous one).
+> After training, model weight files are saved by default every `100 steps` as: `full_sft_*.pth` (where *
+> is the model's specific dimension, new files overwrite old ones on each save)
-## Ⅲ Other Training Steps
+| MiniMind2-Small (512dim) | MiniMind2 (768dim) |
+|---|---|
+| (SFT loss curve) | (SFT loss curve) |
-> All training scripts are executed in the `cd ./trainer` directory.
+## Ⅲ Other Training Stages (Optional)
-### **3. Reinforcement Learning from Human Feedback (RLHF)**
+> All training scripts should be executed in the `cd ./trainer` directory
-In the previous training steps, the model has acquired basic conversational abilities, but these are entirely based on
-word chaining, without positive or negative reinforcement examples.
-At this point, the model doesn't know what answers are good or bad. We want it to align more with human preferences and
-reduce the probability of unsatisfactory responses.
-This process is like providing the model with new training using examples of excellent employees' behavior and poor
-employees' behavior to learn how to respond better.
-Here, we use RLHF’s Direct Preference Optimization (DPO).
-Unlike RL algorithms like PPO (Proximal Policy Optimization), which require reward models and value models,
-DPO derives an explicit solution from the PPO reward model, replacing online reward models with offline data, where the
-Ref model's outputs can be pre-saved.
-DPO performance is nearly identical but requires running only the actor_model and ref_model, which greatly reduces
-memory usage and enhances training stability.
-> **Note**: The RLHF step **is not required**, as it typically does not improve the model’s "intelligence" but is used
-> to improve the model's "politeness," which can have pros (alignment with preferences, reducing harmful content) and
-> cons (expensive sample collection, feedback bias, loss of diversity).
+### **3. Knowledge Distillation (KD)**
+
+After all of the previous training steps, the model has fully acquired its basic capabilities and is usually ready to graduate.
+Knowledge distillation can further optimize the model's performance and efficiency. Distillation means a student model learns from a teacher model.
+The teacher model is usually a well-trained large model with high accuracy and strong generalization.
+The student model is a smaller model whose goal is to learn the teacher's behavior rather than to learn directly from the raw data.
+In SFT, the model's goal is to fit the hard labels of token classification, i.e., the true class labels (an index into the 6400-entry vocabulary).
+In knowledge distillation, the teacher model's softmax probability distribution is used as soft labels. The small model learns only from these soft labels and uses a KL loss to optimize its parameters.
+Put simply, SFT learns the worked answers the teacher hands out, while KD "opens up" the teacher's smart brain and tries to imitate the neural state of the teacher's "brain" as it thinks about the problem.
+For example, when the teacher model computes `1+1=2`, the final-layer neuron a might have state 0, neuron b state 100, neuron c state -99, and so on.
+The student model learns the internal workings of the teacher's brain from large amounts of data. This process is called knowledge distillation.
+Knowledge distillation has only one purpose: to make a smaller model deliver better results.
+However, with the rise of LLMs, the term "model distillation" has been widely abused, giving birth to the two schools of "white-box/black-box" distillation.
+For closed-source models such as GPT-4, whose internal structure cannot be accessed, one can only learn from the data they output. This is called black-box distillation, and it is the most common practice in the era of large models.
+Black-box distillation is exactly the same as the SFT process, except that the data is collected from a large model's output, so you only need to prepare the data and fine-tune further.
+Note that you need to change the loaded base model to `full_sft_*.pth`, i.e., continue the distillation learning on top of the fine-tuned model.
+The `./dataset/sft_1024.jsonl` and `./dataset/sft_2048.jsonl` files are both collected from the qwen2.5-7/72B-Instruct large models and can be used directly for SFT to pick up some of Qwen's behavior.
```bash
-torchrun --nproc_per_node 1 train_dpo.py
-# or
-python train_dpo.py
-```
-
-> The trained model weights are saved every `100 steps` by default as: `rlhf_*.pth` (the * represents the specific model
-> dimension, and each new save will overwrite the previous one).
-
-### **4. Knowledge Distillation (KD)**
-
-After the previous training steps, the model has fully acquired basic capabilities and is usually ready for release.
-Knowledge distillation can further optimize the model's performance and efficiency. Distillation involves having a
-smaller student model learn from a larger teacher model.
-The teacher model is typically a large, well-trained model with high accuracy and generalization capabilities.
-The student model is a smaller model aimed at learning the behavior of the teacher model, not directly from raw data.
-In SFT learning, the model’s goal is to fit the hard labels (e.g., real category labels like 0 or 6400) in word token
-classification.
-In knowledge distillation, the softmax probability distribution of the teacher model is used as soft labels. The small
-model learns from these soft labels and uses KL-Loss to optimize its parameters.
-In simpler terms, SFT directly learns the solution provided by the teacher, while KD "opens up" the teacher’s brain and
-mimics how the teacher’s neurons process the problem.
-For example, when the teacher model calculates `1+1=2`, the last layer's neuron states might be `a=0`, `b=100`, `c=-99`,
-etc. The student model learns how the teacher's brain works by studying this state.
-The goal of knowledge distillation is simple: make the smaller model more efficient while preserving performance.
-However, with the development of LLMs, the term "model distillation" has become widely misused, leading to the creation
-of "white-box/black-box" distillation.
-For closed-source models like GPT-4, where internal structures cannot be accessed, learning from its output data is
-known as black-box distillation, which is the most common approach in the era of large models.
-Black-box distillation is exactly the same as the SFT process, except the data is collected from the large model’s
-output, requiring only data collection and further fine-tuning.
-Note to change the base model loaded to `full_sft_*.pth`, as distillation is performed based on the fine-tuned model.
-The `./dataset/sft_1024.jsonl` and `./dataset/sft_2048.jsonl` datasets, collected from the qwen2.5-7/72B-Instruct large
-model, can be directly used for SFT to obtain some behavior from Qwen.
-
-```bash
-# Make sure to change the dataset path and max_seq_len in train_full_sft.py
+# Note: change the dataset path and max_seq_len in train_full_sft.py
torchrun --nproc_per_node 1 train_full_sft.py
# or
python train_full_sft.py
```
-> The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific
-> model dimension, and each new save will overwrite the previous one).
+> After training, model weight files are similarly saved by default every `100 steps` as: `full_sft_*.pth` (where * is the model's specific dimension, new files overwrite old ones on each save)
-This section emphasizes MiniMind’s white-box distillation code `train_distillation.py`. Since MiniMind doesn’t have a
-powerful teacher model within the same series, the white-box distillation code serves as a learning reference.
+Worth highlighting separately is MiniMind's white-box distillation code, `train_distillation.py`. Since there is no sufficiently powerful teacher model within the MiniMind series itself, the white-box distillation code serves as a learning reference only (a sketch of its soft-label objective follows the command below).
```bash
torchrun --nproc_per_node 1 train_distillation.py
@@ -859,15 +801,12 @@ torchrun --nproc_per_node 1 train_distillation.py
python train_distillation.py
```
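+
+A minimal sketch of the soft-label objective described above (teacher softmax as the target, KL divergence as the loss). The tensor shapes and the temperature are illustrative, not the exact `train_distillation.py` code:
+
+```python
+import torch
+import torch.nn.functional as F
+
+T = 2.0                                         # distillation temperature (illustrative)
+student_logits = torch.randn(4, 6400)           # (tokens, vocab)
+teacher_logits = torch.randn(4, 6400)
+
+kd_loss = F.kl_div(
+    F.log_softmax(student_logits / T, dim=-1),  # student log-probabilities (soft predictions)
+    F.softmax(teacher_logits / T, dim=-1),      # teacher probabilities (soft labels)
+    reduction="batchmean",
+) * (T * T)                                     # conventional temperature-squared scaling
+print(kd_loss)
+```
+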
-### **5. LoRA (Low-Rank Adaptation)**
+### **4. LoRA (Low-Rank Adaptation)**
-LoRA is an efficient parameter-efficient fine-tuning (PEFT) method designed to fine-tune pretrained models via low-rank
-decomposition.
-Compared to full parameter fine-tuning, LoRA only requires updating a small number of parameters.
-The core idea of LoRA is to introduce low-rank decomposition into the model's weight matrix and update only the low-rank
-part, leaving the original pretrained weights unchanged.
-The code can be found in `./model/model_lora.py` and `train_lora.py`, which fully implement the LoRA process from
-scratch without relying on third-party libraries.
+LoRA is an efficient Parameter-Efficient Fine-Tuning (PEFT) method designed to fine-tune pre-trained models through low-rank decomposition.
+Compared with full-parameter fine-tuning, LoRA only needs to update a small number of parameters.
+Its core idea is to introduce a low-rank decomposition into the model's weight matrices and update only the low-rank part, keeping the original pre-trained weights frozen (see the sketch below).
+The code is in `./model/model_lora.py` and `train_lora.py`, implementing the LoRA process completely from scratch without relying on third-party libraries.
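+
+A minimal sketch of the low-rank idea (a frozen weight W plus a trainable update ΔW = B·A of rank r). It mirrors the concept only and is not the exact `./model/model_lora.py` code:
+
+```python
+import torch
+import torch.nn as nn
+
+class LoRALinear(nn.Module):
+    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
+        super().__init__()
+        self.base = base
+        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
+        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
+        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init, so ΔW starts at 0
+        self.scale = alpha / r
+
+    def forward(self, x):
+        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
+
+layer = LoRALinear(nn.Linear(512, 512, bias=False), r=8)
+print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 512 = 8192 trainable params
+```
+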
```bash
torchrun --nproc_per_node 1 train_lora.py
@@ -875,121 +814,100 @@ torchrun --nproc_per_node 1 train_lora.py
python train_lora.py
```
-> The trained model weights are saved every `100 steps` by default as: `lora_xxx_*.pth` (the * represents the specific
-> model dimension, and each new save will overwrite the previous one).
+> After training, model weight files are saved by default every `100 steps` as: `lora_xxx_*.pth` (where *
+> is the model's specific dimension, new files overwrite old ones on each save)
-Many people are puzzled: how can a model learn private domain knowledge? How should datasets be prepared? How to
-transfer general models into specialized domain models?
-Here are some examples. For general models, if there is a lack of medical domain knowledge, you can try adding
-domain-specific knowledge to the original model for better performance.
-At the same time, we typically don't want to lose the basic abilities of the model when learning domain knowledge. LoRA
-can address this issue well.
-Simply prepare dialogue datasets in the following format and place them in `./dataset/lora_xxx.jsonl`, then
-start `python train_lora.py`.
-Training will result in new model weights in `./out/lora/lora_xxx.pth`.
+Many people wonder how to make a model learn knowledge of their private domain, how to prepare the datasets, and how to turn a general-purpose model into a domain-specific one.
+Here are some examples. A general-purpose model that lacks medical domain knowledge can perform better once domain knowledge is added on top of the original model.
+At the same time, you usually do not want the model to lose its other base abilities while learning the domain knowledge; LoRA solves this problem well.
+You only need to prepare dialogue datasets in the following format, place them in `./dataset/lora_xxx.jsonl`, and run `python train_lora.py`
+to obtain the new model weights `./out/lora/lora_xxx.pth`.
**Medical Scenario**
```
- {"conversations": [{"role": "user", "content": "What is the ideal pillow height for cervical spondylosis?"}, {"role": "assistant", "content": "The pillow height for cervical spondylosis should be..."}]}
- {"conversations
-
-": [{"role": "user", "content": "What is xxx?"}, {"role": "assistant", "content": "xxx..."}]}
+ {"conversations": [{"role": "user", "content": "What height should a cervical spondylosis patient's pillow be?"}, {"role": "assistant", "content": "For cervical spondylosis patients, pillow height selection should be based on..."}]}
+ {"conversations": [{"role": "user", "content": "What about xxx"}, {"role": "assistant", "content": "xxx..."}]}
```
**Self-awareness Scenario**
```
- {"conversations": [{"role": "user", "content": "What is your name?"}, {"role": "assistant", "content": "I am MiniMind..."}]}
- {"conversations": [{"role": "user", "content": "Who are you?"}, {"role": "assistant", "content": "I am..."}]}
+ {"conversations": [{"role": "user", "content": "What is your name?"}, {"role": "assistant", "content": "My name is minimind..."}]}
+ {"conversations": [{"role": "user", "content": "Who are you"}, {"role": "assistant", "content": "I am..."}]}
```
-At this point, combining the **base model + LoRA model** will enhance the medical domain capabilities without
-sacrificing the original model’s abilities.
-You can evaluate the model’s performance using `eval_model.py`.
+At this point, [base model + LoRA model] gains the enhanced medical capability; it is equivalent to plugging a LoRA "add-on" into the base model without losing the base model's original abilities.
+You can evaluate the model with `eval_model.py`.
```bash
-# Ensure model_mode matches the model used during train_lora to avoid mismatch
+# Note: model_mode selects the base model type; it must match the model that train_lora was based on
python eval_model.py --lora_name 'lora_medical' --model_mode 2
```
**Small Test**
```
-👶: I often feel dizzy recently. What might be the cause?
-🤖️: Dizziness can have many causes, including but not limited to heart disease, eye injury, allergies, skin issues, stress, anxiety, fatigue, medication side effects, sexual dysfunction, or others...
+👶: I've been feeling dizzy recently, what could be the reason?
+🤖️: Dizziness could be caused by many reasons, including but not limited to: heart disease, eye damage, allergic reactions, skin problems, allergic reactions, stress, anxiety, fatigue, medication side effects, sexual dysfunction, or others...
---
-👶: What is your relationship with OpenAI?
-🤖️: I am MiniMind, an AI assistant developed by Jingyao Gong. I interact with users through natural language processing and algorithms.
+👶: What is your relationship with openai?
+🤖️: I am MiniMind, an artificial intelligence assistant developed by Jingyao Gong. I interact with users through natural language processing and algorithm training.
```
-PS: If needed, you can also fine-tune the full parameters using `full_sft` (but mix general knowledge to prevent
-overfitting on domain-specific data, which could reduce the generality of the model).
+PS: As long as you have the required datasets, you can also do full-parameter fine-tuning with full_sft (but mix in general knowledge to avoid overfitting to the domain data, which would make the model worse and lose generality)
-### **6. Training the Reasoning Model (Reasoning Model)**
+### **5. Training Reasoning Models (Reasoning Model)**
-DeepSeek-R1 is so popular that it has almost redefined the future paradigm for LLMs.
-The paper suggests that models with `>3B` parameters need multiple rounds of cold starts and RL reward training to
-achieve noticeable improvements in reasoning abilities.
-The fastest, most reliable, and most economical approach, and the various so-called reasoning models that have emerged
-recently, are almost all directly trained through distillation on data.
-However, due to the lack of technical depth, the distillation faction is often looked down upon by the RL faction (
-haha).
-I quickly attempted this on the Qwen series 1.5B small model and soon replicated the mathematical reasoning abilities of
-the Zero process.
-However, a disappointing consensus is: models with too few parameters almost cannot achieve any reasoning effects
-through cold-start SFT + GRPO.
-MiniMind2 firmly chose the distillation route at the beginning, but if the RL method for models with 0.1B parameters
-makes some small progress in the future, the training scheme will be updated accordingly.
+DeepSeek-R1 is so popular that it has practically pointed out a new paradigm for future LLMs.
+The paper notes that models `>3B` need multiple rounds of cold starts and RL reward training to achieve a visible improvement in reasoning ability.
+The fastest, most reliable, and cheapest approach, and the one behind the various so-called reasoning models that have appeared recently, is almost always direct distillation from data.
+But because it lacks technical depth, the distillation school tends to be looked down upon by the RL school (haha).
+I quickly tried this on the Qwen 1.5B small model and soon reproduced the mathematical reasoning ability of the Zero process.
+However, a regrettable consensus is that models with too few parameters can hardly gain any reasoning ability from cold-start SFT + GRPO.
+For now, MiniMind firmly stays in the distillation school. If RL on a 0.1B model makes some small progress later, this section on the training approach will be updated.
-To do distillation, you need to prepare data in the same format as the SFT phase, as described earlier. The data format
-is as follows:
+For distillation, you only need to prepare a dataset in the same format as the SFT stage (the dataset sources were introduced above). An example of the data format:
-```json lines
+```json
{
"conversations": [
{
"role": "user",
- "content": "Hello, I'm Xiaofang, nice to meet you."
+ "content": "Hello, I am Xiaofang, nice to meet you."
},
{
"role": "assistant",
- "content": "\nHello! I am MiniMind-R1-Lite-Preview, an intelligent assistant independently developed by a Chinese individual. I'm happy to provide services for you!\n\n\nHello! I am MiniMind-R1-Lite-Preview, an intelligent assistant independently developed by a Chinese individual. I'm happy to provide services for you!\n"
+            "content": "<think>\nHello! I am a small AI reasoning model R1-Lite-Preview developed by an independent developer in China. I'm happy to serve you!\n</think>\n<answer>\nHello! I am a small AI reasoning model R1-Lite-Preview developed by an independent developer in China. I'm happy to serve you!\n</answer>"
}
]
}
```
-The reply template for the reasoning model R1 is:
+The reply template for reasoning model R1 is:
```text
<think>\nThinking process\n</think>\n
<answer>\nFinal answer\n</answer>
```
-In GRPO, this is done by setting up a reward function that ensures the model adheres to the thinking and answering
-tags (the reward values should be higher in the early cold-start stages).
+In GRPO this is enforced by setting a rule-based reward function that checks whether the model follows the think/answer tags (in the early cold-start stage, the reward values for this should be turned up), as sketched below.
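+
+A hypothetical rule-based reward of this kind (the reward actually used by the project's GRPO script may be weighted and shaped differently):
+
+```python
+import re
+
+def format_reward(text: str) -> float:
+    """1.0 if the reply follows <think>...</think><answer>...</answer>, else 0.0."""
+    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
+    return 1.0 if re.match(pattern, text.strip(), flags=re.DOTALL) else 0.0
+
+print(format_reward("<think>2 apples plus 2 apples</think><answer>4</answer>"))  # 1.0
+print(format_reward("The answer is 4."))                                         # 0.0
+```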
-Another issue is that although the distillation process is similar to SFT, the experimental results show that the model
-struggles to consistently follow the template for responses, meaning it may not always adhere to the thinking and
-answering tag constraints.
-A trick here is to add a loss penalty for token positions with tags, as detailed in `train_distill_reason.py`:
+Another issue is that although the distillation process is the same as SFT, experiments show the model has trouble sticking to the template on every reply, i.e., it drifts away from the think/answer tag constraints.
+A small trick here is to add an extra loss penalty on the tokens at the tag positions; see `train_distill_reason.py` for details:
```text
-# Add additional penalty at positions corresponding to sp_ids
+# Add extra penalty to positions corresponding to sp_ids
...
-loss_mask[sp_ids] = 10 # Penalty coefficient
+loss_mask[sp_ids] = 10 # penalty coefficient
```
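+
+For intuition, `sp_ids` can be read as the label positions occupied by the special tag tokens. A hypothetical way such positions could be collected and weighted (the token ids here are made up; the actual script differs):
+
+```python
+import torch
+
+special_token_ids = {1, 2, 3, 4}                 # made-up ids for <think>, </think>, <answer>, </answer>
+labels = torch.tensor([5, 1, 9, 9, 2, 3, 7, 4])  # one toy training sequence
+
+sp_ids = torch.tensor([i for i, t in enumerate(labels.tolist()) if t in special_token_ids])
+loss_mask = torch.ones_like(labels, dtype=torch.float)
+loss_mask[sp_ids] = 10                           # heavier loss weight on the tag positions
+print(loss_mask)                                 # tensor([ 1., 10.,  1.,  1., 10., 10.,  1., 10.])
+```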
-Another tip is that, since the reasoning data only filters for data of length `<1024`, there is less multi-turn dialogue
-and English data.
-Therefore, `r1_mix_1024.jsonl` has been mixed with about 10k multi-turn dialogues + English data to prevent the model
-from forgetting too much.
+Another tip: because the reasoning data keeps only samples of length `<1024`, it contains fewer multi-turn conversations and less English data.
+Therefore, `r1_mix_1024.jsonl` mixes in roughly 10k multi-turn conversations + English data to prevent the model from forgetting too much.
-The script is set by default to distill reasoning abilities from the RLHF-based pre-trained model. To start training,
-just run:
+By default, the script performs reasoning-ability distillation fine-tuning on top of the RLHF model. You can start training directly:
```bash
torchrun --nproc_per_node 1 train_distill_reason.py
@@ -997,186 +915,444 @@ torchrun --nproc_per_node 1 train_distill_reason.py
python train_distill_reason.py
```
-> The trained model weights are saved every `100 steps` by default as: `reason_*.pth` (* being the specific dimension of
-> the model; each time a new file is saved, it will overwrite the old one).
+> After training, model weight files are saved by default every `100 steps` as: `reason_*.pth` (where * is the model's specific dimension, new files overwrite old ones on each save)
-Test it:
+Let's test it:
```text
-👶: Please introduce yourself.
-🤖️:
-Hmm, the user asks me to introduce myself. I need to explain myself. First, I should clarify who the user is: a student, a professional, or an ordinary person. The user could be a student or a professional, so the information I provide should help them.
-Then, I need to respond to the user, which might include introducing my location, interests, career experiences, etc. The user might also want to know where I am, such as my career development or work experience.
-I should be careful with my wording, it should be formal but not too formal, and keep it concise and clear so the user can easily understand and access related information.
-Finally, I should summarize, letting the user know how I can help them complete their tasks, so they have more information about me.
-
-
-I am **"Your Name"**, what is your name?
+👶: Please introduce yourself.
+🤖️:
+Um, the user is asking me to introduce myself. I need to explain myself first. First, I should clarify who the user is - are they students, working professionals or ordinary people? The user might be a student or working professional, so information delivery should be to help students and working professionals with information. Then I need to respond to the user, which might include introducing where I am, hobbies, career experience, and other information. I should note to use formal language without being too formal, while keeping it concise and clear. This makes it easy for users to understand and convenient for them to check related information. Finally, I should summarize to let the user know how I help them complete their tasks, giving them more information about me.
+
+
+I am **"your name", what is your name?
-👶: Which subject are you better at?
-🤖️:
-Hello! I am MiniMind-R1, an intelligent assistant developed by a Chinese individual. If you have any questions, I will do my best to assist you.
-
-
-Hello! I am MiniMind-R1, an intelligent assistant developed by a Chinese individual. If you have any questions, I will do my best to assist you.
+👶: Which subject are you better at?
+🤖️:
+Hello! I am a small AI reasoning model R1 developed by Chinese independent developers. If you have any questions, I will do my best to help you.
+
+
+Hello! I am a small AI reasoning model R1 developed by Chinese independent developers. If you have any questions, I will do my best to help you.
```
----
+## IV Reinforcement Learning Training
-## Ⅳ Model Parameter Settings
+RL methods in LLMs can be divided into two categories:
-📋 Regarding LLM parameter configurations, an interesting paper, [MobileLLM](https://arxiv.org/pdf/2402.14905), provides
-a detailed study and experiment.
-The Scaling Law has unique patterns for small models.
-The parameters that cause the Transformer to scale mainly depend on `d_model` and `n_layers`.
+1. **Reinforcement Learning from Human Feedback (RLHF)**
-* `d_model`↑ + `n_layers`↓ -> Short and fat
-* `d_model`↓ + `n_layers`↑ -> Tall and thin
+- Train the model by evaluating human **preferences** for model outputs, making it generate content more consistent with human values and preferences.
-The Scaling Law paper from 2020 suggests that the training data volume, parameter size, and number of training
-iterations are the key factors determining performance, with the model architecture having almost negligible impact.
-However, this law doesn't seem to fully apply to small models.
-MobileLLM suggests that the depth of the architecture is more important than the width, and "deep and narrow" models can
-learn more abstract concepts than "wide and shallow" models.
-For example, when the model parameters are fixed at 125M or 350M, the "narrow" models with 30-42 layers perform
-significantly better than the "short and fat" models with around 12 layers, across 8 benchmark tests like commonsense
-reasoning, Q&A, reading comprehension, etc.
-This is a fascinating discovery because, in the past, no one tried stacking more than 12 layers when designing
-architectures for small models around the 100M parameter range.
-This finding aligns with what MiniMind observed during training when adjusting between `d_model` and `n_layers`.
-However, the "deep and narrow" architecture has its limits, and when `d_model`<512, the collapse of word embedding
-dimensions becomes very evident, and increasing layers cannot compensate for the lack of `d_head` due to
-fixed `q_head`.
-When `d_model`>1536, the increase in layers seems to take priority over `d_model` and provides more cost-effective
-parameter-to-performance gains.
+2. **Reinforcement Learning from AI Feedback (RLAIF)**
-* Therefore, MiniMind sets small models with `dim=512`, `n_layers=8` to strike a balance between "small size" and "
- better performance."
-* Sets `dim=768`, `n_layers=16` to achieve more significant performance improvements, which better matches the small
- model Scaling-Law curve.
+- Use **AI models** (typically pre-trained language reward models) to provide feedback rather than directly relying on human manual annotation.
+- The "AI" here can also be certain rule-based rewards, such as math answer correctness / code executors...
-For reference, the parameter settings for GPT-3 are shown in the table below:
-
+| Type | Judge | Advantages | Disadvantages |
+|-------|-------|-----------|---------------|
+| RLHF | Human | More aligned with real human preferences | High cost, low efficiency |
+| RLAIF | Model | Automated, highly scalable | May deviate from real human preferences |
+
+The two are essentially the same, both using **reinforcement learning** to utilize certain forms of "**feedback**" to optimize model behavior.
+
+Apart from where the **feedback** comes from, there is no other difference.
+
+### 👀 Unified Perspective on PO Algorithms
+
+Before introducing the specific algorithm implementations, here is my personal, minimalist take on what all Policy Optimization (PO) algorithms have in common.
+
+At their core, all RL algorithms optimize a single expectation:
+
+$$\mathcal{J}_{PO} = \mathbb{E}_{q \sim P(Q), o \sim \pi(O|q)} \left[ \underbrace{f(r_t)}_{\text{policy term}} \cdot \underbrace{g(A_t)}_{\text{advantage term}} - \underbrace{h(\text{KL}_t)}_{\text{regularization term}} \right]$$
+
+During training we simply **minimize the negative objective**, i.e.: $\mathcal{L}_{PO} = -\mathcal{J}_{PO}$
+
+This framework contains only three core components:
+* **Policy term** $f(r_t)$: how to use the probability ratio $r_t$, which tells the model how far the new policy has moved from the old one and whether better tokens have been explored
+* **Advantage term** $g(A_t)$: how to estimate the advantage $A_t$ (this is the important part!). A large model solving a definite integral is unremarkable, while a small model getting simple arithmetic right is already above expectation and usually earns a positive advantage
+* **Regularization term** $h(\text{KL}_t)$: how to constrain the update magnitude $\text{KL}_t$, preventing drift without being too rigid
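+
+To make the framework concrete, here is a minimal PyTorch sketch of the negative objective with the simplest possible choices $f(r_t)=r_t$, $g(A_t)=A_t$ and a plain KL penalty (variable names such as `logp_new`/`logp_old`/`logp_ref` are illustrative assumptions, not MiniMind code):
+
+```python
+import torch
+
+def po_loss(logp_new, logp_old, logp_ref, advantages, beta=0.02):
+    """Unified PO objective, L = -E[ f(r_t)*g(A_t) - h(KL_t) ],
+    instantiated with f(r)=r, g(A)=A and a rough per-token KL estimate."""
+    ratio = torch.exp(logp_new - logp_old)      # r_t: new-vs-old policy ratio
+    kl = logp_new - logp_ref                    # simple estimate of divergence from the reference
+    objective = ratio * advantages - beta * kl  # f(r_t)*g(A_t) - h(KL_t)
+    return -objective.mean()                    # minimize the negative objective
+```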
+
+
+(Expand) Symbol Explanation
+
+| Symbol | Meaning | Explanation | Range |
+|--------|---------|------------|-------|
+| $q$ | Question/prompt | Sampled from dataset $P(Q)$ | - |
+| $o$ | Model output sequence | Generated by policy $\pi$ | - |
+| $r_t$ | Probability ratio | $r_t = \frac{\pi_\theta(o_t\|q, o_{<t})}{\pi_{\theta_{old}}(o_t\|q, o_{<t})}$ | $(0, +\infty)$, near 1 |
+| $A_t$ | Advantage | How much better the current output is than the baseline/expected return | positive = encourage, negative = suppress |
+| $\text{KL}_t$ | KL divergence | How far the new policy has drifted from the reference/old policy, used as the regularizer | $\geq 0$ |
+
+
+Different **xxPO algorithms** are essentially just different design instantiations of these three components!
---
-## Ⅴ Training Results
+### **6. Reinforcement Learning from Human Feedback (RLHF)**
-> `MiniMind2` model training loss trends (the loss is for reference only as the dataset was updated and cleaned several
-> times after training).
+In the previous training steps, the model has acquired basic conversational ability, but that ability comes entirely from next-word prediction and lacks any signal from positive and negative examples.
+At this point the model does not yet know which answers are good and which are bad. We want it to align better with human preferences and to lower the probability of producing answers that humans find unsatisfactory.
+The process is like sending the model to a new round of training, with outstanding employees as positive examples and slacking employees as negative ones, to learn how to respond better.
-| Models | Pretrain (length-512) | SFT (length-512) |
-|-----------------|-----------------------|------------------|
-| MiniMind2-Small | (loss curve image) | (loss curve image) |
-| MiniMind2 | (loss curve image) | (loss curve image) |
+#### 6.1 Direct Preference Optimization
-### Training Completed - Model Collection
+Direct Preference Optimization (DPO) algorithm loss:
-> Considering that many people have reported slow speeds with Baidu Cloud, all MiniMind2 models and beyond will be
-> hosted on ModelScope/HuggingFace.
+$$\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \left[\log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right]\right)\right]$$
+
+Where:
+- **Policy term**: $f(r_t) = \log r_w - \log r_l$ (contrast probability ratios of chosen vs rejected)
+- **Advantage term**: $g(A_t)$ = / (through preference contrast, no need to explicitly calculate advantage)
+- **Regularization term**: $h(\text{KL}_t)$ = implicit in $\beta$ (control deviation from reference model)
+
+Specifically:
+- DPO derives an analytical training objective over preference pairs from PPO's KL-constrained objective, directly maximizing the log-odds that "chosen" beats "rejected"; there is no need to train Reward/Value models alongside it. DPO only runs two models, `actor` and `ref`, so GPU memory usage is low, convergence is stable, and the implementation is simple.
+- Training paradigm: off-policy, on a static preference dataset that can be reused for multiple epochs; the Ref model is fixed (its outputs can be pre-cached).
+- DPO's limitation is that it does no online exploration, so it is mostly used for "preference/safety" alignment with human values; it does little for the "intellectual" ability to actually solve problems (this also depends on the dataset, since collecting chosen/rejected pairs at scale with human evaluation is hard).
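+
+A minimal sketch of the loss above, assuming the sequence-level summed log-probabilities of each chosen/rejected answer are already available (names are illustrative, not the exact ones in `train_dpo.py`):
+
+```python
+import torch.nn.functional as F
+
+def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
+    """L_DPO = -E[ log sigmoid( beta * (log pi/pi_ref(y_w) - log pi/pi_ref(y_l)) ) ]"""
+    chosen_logratio = logp_chosen - ref_logp_chosen        # log pi_theta/pi_ref on y_w
+    rejected_logratio = logp_rejected - ref_logp_rejected  # log pi_theta/pi_ref on y_l
+    logits = beta * (chosen_logratio - rejected_logratio)
+    return -F.logsigmoid(logits).mean()
+```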
+
+```bash
+torchrun --nproc_per_node 1 train_dpo.py
+# or
+python train_dpo.py
+```
+
+> After training, model weight files are saved by default every `100 steps` as: `rlhf_*.pth` (where * is the model's specific dimension, new files overwrite old ones on each save)
+
+### **7. Reinforcement Learning from AI Feedback (RLAIF)**
+
+Compared to RLHF which relies on human-annotated chosen/rejected preference pairs, RLAIF has AI completely act as the "judge."
+The so-called AI "judge" can be a model-based reward large model (Reward Model), can be like R1 setting rule-based functions for validation, or can be environmental feedback like tool calling.
+For example: whether math problem answers are correct, whether code execution passes test cases, whether reasoning processes meet format requirements...can all be automatically judged.
+RLAIF's greatest advantage is its **scalability** and **On-Policy** characteristics——no need for expensive human annotation, can generate massive training samples, letting models quickly evolve through large-scale online trial and error.
+
+MiniMind implements **2+N** basic + cutting-edge RLAIF methods:
+* **PPO**, **GRPO** are classic RL algorithms widely validated at scale;
+* N cutting-edge RL algorithms (updated from time to time, experimental in nature).
+
+#### 1️⃣ Dataset Preparation (Required)
+
+To quickly verify RLAIF's effectiveness, 10,000 high-quality conversations were randomly sampled from the SFT dataset to build the roughly 1MB `rlaif-mini.jsonl` ([Huggingface](https://huggingface.co/datasets/jingyaogong/minimind_dataset/blob/main/rlaif-mini.jsonl))
+
+The data format matches SFT, but the assistant content is not needed, because during training it is sampled in real time by the $\Pi$ policy model. It looks like:
+
+```json
+{
+ "conversations": [
+ {"role": "user", "content": "Explain what photosynthesis is?"},
+ {"role": "assistant", "content": "None"}
+ ]
+}
+```
+
+During RLAIF training, the model generates 1 or more candidate answers based on user questions, then a reward function/model scores the answers.
+High-scoring answers are encouraged (increase $\Pi$ policy probability), low-scoring answers are suppressed (decrease $\Pi$ policy probability). This "score->adjust" loop is the core of reinforcement learning.
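+
+As a self-contained toy illustration of this loop (a 3-way categorical "policy" instead of an LLM, with a fixed score table standing in for the AI judge; everything here is illustrative):
+
+```python
+import torch
+
+logits = torch.zeros(3, requires_grad=True)       # a tiny stand-in "policy"
+optimizer = torch.optim.Adam([logits], lr=0.1)
+judge_scores = torch.tensor([0.1, 1.0, 0.3])      # pretend AI-judge rewards per "answer"
+
+for step in range(200):
+    probs = torch.softmax(logits, dim=-1)
+    answers = torch.multinomial(probs, num_samples=8, replacement=True)  # sample candidates
+    rewards = judge_scores[answers]                # score each candidate
+    advantages = rewards - rewards.mean()          # simple baseline: batch mean
+    loss = -(torch.log(probs[answers]) * advantages).mean()  # encourage high, suppress low
+    optimizer.zero_grad(); loss.backward(); optimizer.step()
+
+print(torch.softmax(logits, dim=-1))               # most probability ends up on answer 1
+```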
+
+#### 2️⃣ Reward Model Preparation (Required)
+
+As noted above, RLAIF training requires a "reward model (Reward Model)" to score the generated answers.
+
+We select the small and high-quality InternLM2-1.8B-Reward
+([ModelScope](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward) | [HuggingFace](https://huggingface.co/internlm/internlm2-1_8b-reward))
+as the base reward model.
+
+After downloading, place the reward model in a directory **at the same level** as the minimind project. The recommended structure:
+
+```
+project/
+├── minimind/ # MiniMind project
+│ ├── model/
+│ └── ...
+└── internlm2-1_8b-reward/ # Reward model (same level as minimind)
+ ├── config.json
+ ├── model.safetensors
+ └── ...
+```
+
+
+Reward Mechanism Choice and MiniMind Limitations (Click to expand)
+
+**1. Diversity of Reward Mechanisms**
+
+The "reward signal" source in RLAIF can be very flexible:
+
+- **Model-based rewards**: Can use dedicated Reward Models (like InternLM2-Reward), or use general LLMs + prompts for scoring (like Qwen3-as-a-Judge). Reward model scale and architecture are freely selectable.
+
+- **Rule-based rewards**: Can construct reward signals based on rule functions, for example:
+ - Math problem answer correctness verification (Ground Truth comparison)
+ - SQL execution success rate and result accuracy
+ - Code interpreter run results (pass@k)
+ - Tool call return status (API success/failure)
+ - Format compliance checks (JSON/XML parsing)
+ - Reasoning chain completeness evaluation (CoT step count)
+
+- **Environment-based rewards**: In Agent scenarios, environmental feedback itself is natural reward (like game scores, Research completeness, task completion).
+
+Any mechanism that can quantify "answer quality" can serve as an RL reward source. DeepSeek R1 is a typical case: using rule-based functions to verify math answer correctness as reward, no need for additional Reward Models.
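+
+For instance, a rule-based reward can be as simple as string matching against a ground-truth answer (the `#### <number>` answer format below is an illustrative assumption, not a MiniMind convention):
+
+```python
+import re
+
+def math_rule_reward(completion: str, ground_truth: str) -> float:
+    """Toy rule-based reward: 1.0 if the extracted final answer matches the
+    ground truth, 0.0 otherwise (including unparseable outputs)."""
+    match = re.search(r"####\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
+    if match is None:
+        return 0.0
+    return 1.0 if match.group(1) == ground_truth else 0.0
+
+# math_rule_reward("... so the total is #### 42", "42")  -> 1.0
+```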
+
+**2. MiniMind Limitation: Reward Sparsity Problem**
+
+RLAIF training can target both reasoning and non-reasoning models; the only difference is the output format.
+
+However, for a model as tiny and weak as the 0.1B MiniMind, general tasks (such as R1-style math datasets) run into a severe reward sparsity problem:
+
+- **Phenomenon**: Model-generated candidate answers are almost all wrong, causing all reward scores $r(x,y) \approx 0$
+- **Consequence**: Advantage function $A(x,y) = r(x,y) - b(x) \approx 0$, policy gradient signal disappears, cannot effectively update parameters $\theta$
+
+It is like making elementary-school students take high-school math exams: no matter how many attempts they make they score zero, so they can never learn to improve their strategy from score differences. This is a fundamental, in-principle limitation of RL algorithms.
+
+To mitigate this problem, MiniMind's implementation chose **model-based continuous reward signals**:
+
+- Reward Model outputs continuous scores (like -2.5 to +3.0), not binary 0/1
+- Even if every answer is poor, the model can still tell "much worse" (-3.0) from "slightly worse" (-2.8). This **dense, continuous** reward signal keeps the advantage $A(x,y)$ away from zero and lets the policy network improve gradually
+- Multiple reward sources can also be mixed: $r_{\text{total}} = \alpha \cdot r_{\text{model}} + \beta \cdot r_{\text{rule}}$ (for example, a think-tag format check combined with a model score for answer quality; see the sketch after this list)
+- In minimind practice, avoid pairing rule-based binary rewards with data far beyond the model's ability (like MATH500), which easily leads to all-zero rewards;
+- Monitor the reward variance $\text{Var}(r)$ during training; if it stays close to 0, adjust the data or the reward mechanism
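+
+A minimal sketch of the mixed reward and the variance check (the weights, the `<think>` tag rule, and all names are illustrative assumptions):
+
+```python
+import torch
+
+def format_rule_reward(texts):
+    """Toy rule term: 1.0 if the reply contains a closed think block, else 0.0."""
+    return torch.tensor([1.0 if "<think>" in t and "</think>" in t else 0.0 for t in texts])
+
+def mixed_reward(model_scores, rule_scores, alpha=0.8, beta=0.2):
+    """r_total = alpha * r_model + beta * r_rule"""
+    return alpha * model_scores + beta * rule_scores
+
+# rewards = mixed_reward(rm_scores, format_rule_reward(candidates))
+# if rewards.var() < 1e-3:   # collapsing variance: the reward no longer separates samples
+#     ...                    # lower the data difficulty or adjust the reward mix
+```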
+
+**For Production-Scale Large Models in Agentic RL Scenarios**:
+
+In real Agent systems (code generation, tool calling, retrieval-planning-execution multi-turn pipelines), rewards follow a different paradigm of "delayed round settlement":
+
+- The LLM generates a tool-call instruction token by token (tool_call), which is parsed (tool_parse) and executed (tool_exec); the result is spliced back into the context for the next step, and this repeats until the task finishes.
+- A complete task pipeline contains multiple call+thinking rounds, and the total reward is computed only once a termination condition is met (task completed, tests passed, target hit).
+
+Agentic RL is therefore closer to a sparse/delayed-reward setting: gradients flow only "after the round ends", which is very different from non-Agentic RL tasks that score and update instantly on single conversation rounds.
+This also explains why Agent tasks favor environment feedback (environment-based reward) over static reward-model scoring.
+
+- **Environmental interaction feedback**: what matters is the final outcome (the code runs, the API returns success, sub-goals are completed);
+- **Model-based reward limitations**: a static reward model captures long pipelines and executable semantics poorly, and is likely to disagree with real environment feedback (reward hacking).
+
+
---
+#### 7.1 [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347)
+
+PPO is a very classic reinforcement learning algorithm proposed by OpenAI in 2017, and is the universal baseline method for LLM RL.
+
+**PPO Loss**:
+$$\mathcal{L}_{PPO} = -\mathbb{E}\left[\min(r_t \cdot A_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) \cdot A_t)\right] + \beta \cdot \mathbb{E}[\text{KL}]$$
+
+Where:
+- **Policy term**: $f(r_t) = \min(r_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon))$ (clip probability ratio to prevent aggressive updates)
+- **Advantage term**: $g(A_t) = R - V(s)$ (estimate value function through Critic network)
+- **Regularization term**: $h(\text{KL}_t) = \beta \cdot \mathbb{E}[\text{KL}]$ (global KL divergence constraint)
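+
+A minimal sketch of the clipped surrogate above (per-token log-probabilities, advantages and KL estimates are assumed to be precomputed; names are illustrative):
+
+```python
+import torch
+
+def ppo_clip_loss(logp_new, logp_old, advantages, kl, eps=0.2, beta=0.02):
+    """L_PPO = -E[min(r*A, clip(r, 1-eps, 1+eps)*A)] + beta * E[KL]"""
+    ratio = torch.exp(logp_new - logp_old)                      # r_t
+    unclipped = ratio * advantages
+    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
+    return -torch.min(unclipped, clipped).mean() + beta * kl.mean()
+```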
+
+Comparing to DPO:
+- DPO (Off-Policy): Training data is a static preference dataset (chosen vs rejected), can repeatedly use the same batch of data to train multiple epochs, just like traditional supervised learning. High data efficiency, low training cost. Directly optimizes log-likelihood of preference pairs, no Reward Model needed.
+- PPO (On-Policy): Must use current policy to real-time sample new data, old policy-collected data cannot be used (distribution shift problem). Although importance sampling and clip mechanisms allow slight distribution shifts, essentially requires data from relatively fresh policies. Low data efficiency, but suited for explorative learning.
+
+In simple terms:
+
+- The former teaches the model from preset offline "good/bad standards", even ones the current model could never produce itself (like learning table tennis by watching videos of the world champion and runner-up);
+- The latter teaches the model to do things right in real time, sampling online from the newest policy (a coach guiding hands-on and scoring every stroke immediately).
+
+MiniMind's PPO implementation includes an Actor model (generates answers), a Critic model (estimates answer value), and a complete GAE (Generalized Advantage Estimation) advantage computation.
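+
+For reference, a compact sketch of GAE on a single trajectory (tensor shapes and names are assumptions; the project's own implementation may differ):
+
+```python
+import torch
+
+def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
+    """rewards: [T] per-step rewards; values: [T+1] critic values with a bootstrap value appended."""
+    T = rewards.shape[0]
+    advantages = torch.zeros(T)
+    running = 0.0
+    for t in reversed(range(T)):
+        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
+        running = delta + gamma * lam * running                 # discounted sum of residuals
+        advantages[t] = running
+    return advantages                  # critic regression targets: advantages + values[:-1]
+```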
+
+**Training**:
+
+```bash
+torchrun --nproc_per_node N train_ppo.py
+# or
+python train_ppo.py
+```
+
+> After training, model weight files are saved by default every `100 steps` as: `ppo_actor_*.pth` (where * is the model's specific dimension)
+
+| MiniMind2-Small (512dim) | MiniMind2 (768dim) |
+|---|---|
+| (PPO training curve image) | (PPO training curve image) |
+
+From the training curves, PPO shows **slow reward improvement**. I believe this mainly stems from **PPO's joint optimization of two networks**: the Critic must gradually converge before it can estimate values accurately, while the Actor's policy updates depend on the Critic's advantage estimates, so the two are interdependent and the optimization is complex. Early in training, inaccurate Critic estimates distort the Actor's gradient direction, slowing overall convergence. Moreover, PPO has to keep two networks in memory at once, so GPU memory usage is roughly 1.5-2x that of single-network methods.
+
+#### 7.2 [Group Relative Policy Optimization](https://arxiv.org/pdf/2402.03300)
+
+In early 2025, DeepSeek-R1 became wildly popular, and so did the GRPO algorithm from the DeepSeekMath paper, which became one of the most advanced RL algorithms. But six months in AI is a long time, and by now GRPO has settled into the baseline algorithm of the great xxPO wars (later spawning DAPO, GSPO, CISPO, etc.). In short, its core innovation is "group-relative value estimation."
+
+**GRPO Loss**:
+
+$$\mathcal{L}_{GRPO} = -\mathbb{E}\left[r_t \cdot A_t - \beta \cdot \text{KL}_t\right]$$
+
+Where:
+- **Policy term**: $f(r_t) = r_t$ (use the probability ratio directly, without clipping)
+- **Advantage term**: $g(A_t) = \frac{R - \mu_{group}}{\sigma_{group}}$ (within-group normalization, eliminate Critic network)
+- **Regularization term**: $h(\text{KL}_t) = \beta \cdot \text{KL}_t$ (token-level KL divergence constraint)
+
+For the same question, the model generates N different answers (for example N=4), then calculates reward scores for these N answers.
+Next, the average reward of the N answers is used as the baseline: answers above it are encouraged, answers below it are suppressed.
+This cleverly avoids training an additional critic network.
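+
+A minimal sketch of the group-relative advantage (shapes are illustrative: one row of rewards per question, one column per sampled answer):
+
+```python
+import torch
+
+def grpo_advantages(rewards, eps=1e-6):
+    """rewards: [num_questions, group_size]; A = (R - mu_group) / sigma_group."""
+    mean = rewards.mean(dim=-1, keepdim=True)
+    std = rewards.std(dim=-1, keepdim=True)
+    return (rewards - mean) / (std + eps)
+
+# grpo_advantages(torch.tensor([[0.2, 0.8, 0.5, 0.1]]))  # 0.8 -> positive, 0.1 -> negative
+```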
+
+Like all RL, GRPO is bound by the in-principle need for both positive and negative samples, and it adds a more pronounced failure mode: degenerate groups.
+If a question is a bit too hard, the reward scores of the N answers come out nearly identical (usually uniformly bad rather than uniformly good), and the group's learning signal approaches zero.
+On an ultra-small model like MiniMind this problem is especially obvious: on math problems, 99.99% of the time the whole group of answers is poor, so nothing can be learned.
+Therefore, a sensible task domain must be specified for the model, i.e., training must stay within its capability boundary.
+
+**Training**:
+
+```bash
+torchrun --nproc_per_node N train_grpo.py
+# or
+python train_grpo.py
+```
+
+> After training, model weight files are saved by default every `100 steps` as: `grpo_*.pth`
+
+| MiniMind2-Small (512dim) | MiniMind2 (768dim) |
+|---|---|
+| (GRPO training curve image) | (GRPO training curve image) |
+
+From the training curves, GRPO's **reward rises more steadily**, reaching around 4, which suggests GRPO makes better use of the RLAIF signal. The policy loss also decreases smoothly overall. Compared with PPO's dual-network optimization, GRPO's single-network setup trains more stably and has a higher convergence ceiling.
+
+#### 7.3 ⏳⌛️🔥 More RL Extensions (Exp)
+
+##### 7.3.1 [Single-stream Policy Optimization](https://arxiv.org/abs/2509.13232)
+
+SPO is an RL algorithm proposed by Tencent in September 2025 that targets GRPO's degenerate-group problem.
+The paper argues that the requirement of GRPO-style algorithms that "one sample depends on a group of samples" is awkward and inelegant: on questions that are too easy or too hard the whole group learns almost nothing, so learning efficiency is inherently capped.
+SPO's motivation is to return to the essence of RL: **one input, one output, one training sample**, back to the basic policy-gradient formula. A stable baseline can be obtained without a group mean by spreading the value estimate V across time: do a rough value pre-estimation before training, then keep updating the estimate while sampling during training, giving each sample a persistent, adaptive baseline that carries across batches. This "single-stream" design no longer depends on same-group samples and naturally avoids degenerate groups.
+
+**SPO Loss**:
+
+$$\mathcal{L}_{SPO} = -\mathbb{E}\left[\log \pi_\theta(a_t|s) \cdot A_t - \beta \cdot \text{KL}_t\right]$$
+
+Where:
+- **Policy term**: $f(r_t) = \log \pi_\theta(a_t|s)$ (directly use log probability, don't calculate ratio)
+- **Advantage term**: $g(A_t) = R - B_t^{adaptive}$ (adaptive baseline, Beta distribution dynamic tracking)
+- **Regularization term**: $h(\text{KL}_t) = \beta \cdot \text{KL}_t$ (token-level KL + dynamic $\rho$ adjustment)
+
+At the implementation level, SPO uses a non-grouped design: a persistent, KL-adaptive value tracker replaces the within-group baseline, and advantages are normalized globally across the whole batch. Each sample is processed independently, with no need to wait for other samples in the same group, yet still receives a stable learning signal.
+On five difficult math benchmarks with Qwen3-8B, SPO averages 3.4 percentage points higher than GRPO, including +7.3 pp on BRUMO 25 and +4.4 pp on AIME 25.
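+
+As rough intuition for the adaptive baseline, a deliberately simplified sketch (an EMA tracker keyed by prompt, not the paper's KL-adaptive Beta-distribution tracker; all names are illustrative):
+
+```python
+class AdaptiveBaseline:
+    """Keeps a persistent per-prompt baseline and updates it from each new reward."""
+    def __init__(self, momentum=0.9, init=0.0):
+        self.momentum = momentum
+        self.init = init
+        self.values = {}                       # prompt_id -> baseline B_t
+
+    def advantage(self, prompt_id, reward):
+        b = self.values.get(prompt_id, self.init)
+        adv = reward - b                       # A = R - B_t^{adaptive}
+        self.values[prompt_id] = self.momentum * b + (1 - self.momentum) * reward
+        return adv
+
+# baseline = AdaptiveBaseline()
+# adv = baseline.advantage("q42", reward=0.7)  # one input, one output, one training sample
+```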
+
+> Note: SPO is an experimental cutting-edge algorithm, and MiniMind's implementation is for exploratory learning only. Because the model's parameter count is extremely small, the paper's 8B-model results cannot be fully reproduced.
+
+**Training**:
+
+```bash
+torchrun --nproc_per_node N train_spo.py
+# or
+python train_spo.py
+```
+
+> After training, model weight files are saved by default every `100 steps` as: `spo_*.pth`
+
+
+(Figure: MiniMind2 (768dim) Training Curve)
+
+Judging from the training curves, SPO's reward fluctuates about as much as PPO's and lags behind GRPO. Actual inference tests found the output quality to be low, with confused logic and formatting errors.
+
+**Experimental note**: the current hand-written SPO implementation may have issues in the value_tracker configuration or the reward-normalization strategy. Whether the gap comes from the algorithm's fit to very small models or from implementation differences still needs to be checked.
+
+### RL Algorithm Summary
+
+Returning to the "**unified framework**", the table below shows that the different PO algorithms are just different instantiations of the three core components:
+
+| Algorithm | Policy Term $f(r_t)$ | Advantage Term $g(A_t)$ | Regularization Term $h(\text{KL}_t)$ | Optimized Models |
+|-----------|----------------|----------------|----------------------|----------|
+| **DPO** | $\log r_w - \log r_l$ | Implicit (preference contrast) | Implicit in $\beta$ | 2 |
+| **PPO** | $\min(r, \text{clip}(r))$ | $R - V(s)$ | $\beta \cdot \mathbb{E}[\text{KL}]$ | 4 |
+| **GRPO** | $r$ | $\frac{R - \mu}{\sigma}$ | $\beta \cdot \text{KL}_t$ | 2 |
+| **SPO** | $\log \pi_\theta$ | $R - B_t^{adaptive}$ | $\beta \cdot \text{KL}_t$ | 2 |
+
+**RL is Elegant and Self-Consistent**
+
+> The above is purely a personal perspective; corrections are welcome anytime
+
+---
+
+## V Training Results
+
+### Completed Training - Model Collection
+
+> Since many users reported that Baidu Netdisk is slow, MiniMind2 and all later models are hosted on ModelScope/HuggingFace.
+
#### ① Native PyTorch Models
-MiniMind2 model
-weights ([ModelScope](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch) | [HuggingFace](https://huggingface.co/jingyaogong/MiniMind2-Pytorch))
-
-MiniMind-V1 model weights ([Baidu Pan](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666))
+MiniMind2 Model Weights ([ModelScope](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch) | [HuggingFace](https://huggingface.co/jingyaogong/MiniMind2-Pytorch))
Torch File Naming Reference
-| Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
-|-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
-| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
-| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
-| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |
-
-| Model Name | params | pretrain_model | Single-turn Dialogue SFT | Multi-turn Dialogue SFT | rl_model |
-|-------------------|--------|------------------------|------------------------------------|-----------------------------------|--------------|
-| minimind-v1-small | 26M | `pretrain_512.pth` | `single_chat/full_sft_512.pth` | `multi_chat/full_sft_512.pth` | `rl_512.pth` |
-| minimind-v1-moe | 4×26M | `pretrain_512_moe.pth` | `single_chat/full_sft_512_moe.pth` | `multi_chat/full_sft_512_moe.pth` | - |
-| minimind-v1 | 108M | `pretrain_768.pth` | `single_chat/full_sft_768.pth` | `multi_chat/full_sft_768.pth` | `rl_768.pth` |
+| Model Name | params | pretrain_model | sft_model | rlhf_model (DPO) | reason_model | rlaif_model (PPO/GRPO/SPO) | lora_model |
+|-----------------|--------|------------------------|------------------------|--------------------|------------------|----------------------------------------------|--------------------|
+| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `xxpo_512.pth` | `lora_xxx_512.pth` |
+| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - | - |
+| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `xxpo_768.pth` | `lora_xxx_768.pth` |
#### ② Transformers Models
-MiniMind
-Series ([ModelScope](https://www.modelscope.cn/collections/MiniMind-b72f4cfeb74b47) | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))
-
+MiniMind Series ([ModelScope](https://www.modelscope.cn/collections/MiniMind-b72f4cfeb74b47) | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))
---
# 📌 Eval
-## Ⅰ RLHF vs SFT Comparison
+## Ⅰ RLHF Comparison
-The test is based on comparing `MiniMind2 series` models with "full_sft" and "rlhf". The random seed is fixed.
+The test compares the `full_sft` and `rlhf` models of the `MiniMind2 series`, with fixed random seeds.
-
-Q&A Test
+[A] MiniMind2:full_sft_640
+[B] MiniMind2:rlhf_640
-[A0] MiniMind2:full_sft_640
+```text
+[Q]: 你叫什么名字?
+[A]: 我是一个大型语言模型,可以回答各种问题,提供信息和帮助。我可以回答各种问题,提供信息和帮助,帮助你解决各种问题。
+[B]: 我是一种基于文本的AI模型,我并没有个人经历或情感,所以无法感受和体验情感。我的设计目的是为了帮助用户解决问题、提供信息和进行对话。如果您有任何问题或需要帮助,请随时告诉我!
-[A1] MiniMind2:rlhf_640
+[Q]: 你更擅长哪一个学科?
+[A]: 作为一个AI模型,我并没有真正的"擅长"这个学科。我被设计成能够提供信息、解答问题、进行对话、执行任务和与用户进行对话。不过,我可以提供信息、解答问题、进行对话、提供建议、处理问题、进行对话、提供建议、学习新知识、编写代码、回答问题、提供建议等多种学科的功能。如果你有具体的学科需求或主题,欢迎提供详细信息。
+[B]: 作为AI,我并没有个人能力或兴趣,但我可以提供一些关于数学、科学、计算机科学、语言学、心理学等学科的知识。如果你对数学感兴趣,我可以提供一些基础知识,例如代数、几何、概率、统计学等。如果你需要更深入的学习,我也可以提供相关的在线课程、书籍、论文等资源。请告诉我你具体需要什么帮助,我会尽力提供帮助。
- ```text
- [Q]: What is your name?
- [A0]: I am a large language model capable of answering various questions, providing information and assistance. I can answer a wide range of questions, provide information, help you solve problems, and more.
- [A1]: I am a text-based AI model, and I don't have personal experiences or emotions, so I cannot feel or experience emotions. My design is to assist users by solving problems, providing information, and engaging in conversations. If you have any questions or need help, feel free to let me know!
+[Q]: 鲁迅的《狂人日记》是如何批判封建礼教的?
+[A]: 鲁迅的《狂人日记》是一部讲述了鲁迅对人类价值和道德责任的倡导。在这部作品中,鲁迅通过细腻的笔触,探讨了人性、道德、正义、牺牲与责任等深刻的主题。以下是鲁迅在《狂人日记》中如何批判封建礼教的几个方面:1. **社会现实与道德观念**:鲁迅通过讲述鲁迅的故事,探讨了社会的不平等、权力的流动、人性的演变等社会问题。他认为,社会不平等不仅仅是不公正的,更是每个人在社会中的不公正。他呼吁社会各界共同努力,通过教育、宣传、法律等手段,为构建和谐社会做出贡献。2. **道德与责任**:鲁迅通过讲述鲁迅的故事,展现了道德和责任的重要性。他提倡通过道德教育来培养人们的道德品质,同时也强调了道德和责任的重要性。鲁迅通过讲述鲁迅的故事,揭示了社会的不平等和人性的复杂性,体现了对人性的深刻理解。3. **道德与责任**:鲁迅通过讲述鲁迅的故事,强调了道德责任的重要性。他提倡通过自己的行为来体现对社会的尊重和对人性的深刻理解。鲁迅通过讲述鲁迅的故事,强调了道德和责任的重要性。4. **道德与道德**:鲁迅通过讲述鲁迅的故事,揭示了道德和责任的真正意义。鲁迅通过讲述鲁迅的故事,揭示了道德和责任的本质,强调了道德和责任的重要性。5. **个人与社会**:鲁迅通过讲述鲁迅的故事,揭示了个人与社会之间的复杂关系。他呼吁社会各界共同努力,通过教育和宣传,为构建和谐社会做出贡献。鲁迅的《狂人日记》通过其丰富的故事和深刻的道德寓言,传达了关于人性、道德、正义、道德以及个人与社会的复杂关系。这部作品不仅在文学价值上取得了巨大成就,更在思想层面上启发了后来的文学创作,对后世产生了深远的影响。
+[B]: 鲁迅的《狂人日记》是一部描绘战争和自由精神的作品,它批判了封建社会的种种问题和冲突,反映了社会的不平等和人性本质。在这部作品中,鲁迅通过对封建社会的诸多现象进行批判,如"弗雷德里克·约翰逊的"鹰迪伦"、"纳扎特的"、"贝克汉姆"、"乔治·威尔逊"等,对封建社会的种种问题进行了批判,强调了封建社会的不平等、人性本质以及人性本质。鲁迅通过他的"鹰迪伦"、"纳扎特的"、"贝克汉姆"、"乔治·威尔逊"等,对封建社会的种种问题进行了批判。在小说中,鲁迅通过对封建社会的诸多现象进行批判,揭示了封建社会的腐败、人性本质以及人性本质的复杂性。这种批判体现了对封建社会的尊重和对人性的复杂性,同时也反映了现代社会中普遍存在的不平等和人性本质。鲁迅的《狂人日记》不仅是一部关于封建社会的小说,也是一部具有深刻社会意义的作品,对后世的审判和批判起到了关键作用。
- [Q]: Which subject are you better at?
- [A0]: As an AI model, I do not have a specific "specialty". I am designed to provide information, answer questions, engage in conversation, perform tasks, and interact with users. However, I can offer assistance in various subjects including providing information, solving problems, engaging in discussions, giving advice, learning new knowledge, writing code, and more. If you have a specific subject or topic in mind, feel free to provide more details.
- [A1]: As an AI, I do not have personal capabilities or interests, but I can provide knowledge on subjects such as mathematics, science, computer science, linguistics, psychology, and more. If you're interested in mathematics, I can offer some basic knowledge like algebra, geometry, probability, statistics, etc. If you need more in-depth learning, I can also provide relevant online courses, books, research papers, and other resources. Let me know what specific help you need, and I will do my best to assist you.
+[Q]: 我咳嗽已经持续了两周,需要去医院检查吗?
+[A]: 咳嗽确实需要一定的检查,但如果您的咳嗽持续时间过长,或者有其他不适症状(如持续咳嗽、呼吸困难等),建议尽快就医。医生会根据您的症状、体检结果和可能的影像学检查(如X光、MRI或CT扫描)来诊断您的症状。在等待就医期间,您可以尝试以下方法来帮助缓解症状:1. **休息**:确保您有足够的睡眠。2. **保持水分**:脱水可能会加重咳嗽。3. **热敷**:使用热敷袋或热毛巾敷在疼痛的部位,有助于缓解喉咙痛。4. **避免刺激**:避免使用可能刺激喉咙的化学物质,如某些化学物质。5. **保持室内湿润**:确保您的卧室有良好的通风,可以使用加湿器帮助保持空气湿度。6. **保持室内湿度**:使用加湿器可以帮助缓解喉咙干燥和咳嗽。如果症状持续不缓解或有其他不适症状,请立即就医。希望您早日康复!
+[B]: 如果您的咳嗽持续了两周,通常建议您尽快就医,因为这可能会影响您的呼吸和心跳。您可以通过以下步骤来检查咳嗽:1. **休息**:用纸巾或肘部轻轻按压咳嗽的位置。2. **使用药物**:在可能的情况下,使用药物(如抗过敏药、抗过敏药等)来缓解咳嗽。请在医生的指导下使用药物。3. **热敷**:在咳嗽期间,可以使用热敷来缓解疼痛和肿胀。4. **药物治疗**:如果咳嗽持续了两周,可以使用一些消炎药或抗过敏药来缓解症状。请务必在医生的指导下使用。5. **保持健康的生活方式**:保持充足的睡眠、均衡的饮食、适量的运动和避免吸烟和限酒,这些都有助于提高咳嗽的几率。请记住,这些建议不能替代专业医疗意见。如果您的症状持续或加重,强烈建议您尽快就医。
+```
- [Q]: How does Lu Xun's "A Madman's Diary" criticize feudal ethics?
- [A0]: Lu Xun's "A Madman's Diary" is a work that advocates for human values and moral responsibility. In this work, Lu Xun explores profound themes such as human nature, morality, justice, sacrifice, and responsibility through detailed writing. Here are several ways in which Lu Xun critiques feudal ethics in "A Madman's Diary":
- 1. **Social Reality and Moral Concepts**: Lu Xun addresses social inequality, power dynamics, and human nature through his narrative. He argues that social inequality is not only unjust but also a form of social injustice affecting individuals. He calls for collective effort to build a harmonious society through education, publicity, and laws.
- 2. **Morality and Responsibility**: Through his narrative, Lu Xun highlights the importance of moral education and emphasizes personal responsibility.
- 3. **The Individual and Society**: Lu Xun delves into the complex relationship between the individual and society, advocating for collective efforts to contribute to social harmony.
- Lu Xun’s "A Madman's Diary" not only achieved great literary success but also inspired later literary creations with its profound impact on thoughts and societal reflections.
- [A1]: Lu Xun's "A Madman's Diary" is a work that critiques the feudal society's various issues and conflicts, reflecting social inequality and human nature. In this work, Lu Xun critiques feudal society’s problems, like "Frederick Johnson’s 'Eagle Dillon', 'Nazareth', 'Beckham', 'George Wilson'", emphasizing the inequality and complexity of human nature within feudal society. The work criticizes feudal society's corruption, human nature, and the inherent contradictions of the era. This critique reflects Lu Xun’s respect for social issues and his deep understanding of human nature. "A Madman's Diary" is a critical work that raises awareness and challenges the established norms of feudalism and has greatly influenced future critiques.
+👉 **Results Summary**
- [Q]: I’ve had a cough for two weeks, should I go to the hospital for a checkup?
- [A0]: A prolonged cough does require some checkups, and if your cough persists or is accompanied by other symptoms such as difficulty breathing, it's recommended to see a doctor. Based on your symptoms, physical examination, and possible imaging tests (e.g., X-rays, MRIs, or CT scans), the doctor will make a diagnosis. In the meantime, you can try the following to alleviate symptoms:
- 1. **Rest**: Ensure you get enough sleep.
- 2. **Stay hydrated**: Dehydration can worsen coughing.
- 3. **Warm compress**: Apply a warm compress to the sore area to relieve throat pain.
- 4. **Avoid irritants**: Avoid substances that may irritate the throat, such as certain chemicals.
- 5. **Maintain humidity**: Keep the room well-ventilated, and use a humidifier if necessary.
- If symptoms persist or worsen, seek medical attention immediately. Wishing you a speedy recovery!
- [A1]: If your cough has lasted for two weeks, it's usually recommended to see a doctor, as it may affect your breathing and heart rate. You can take the following steps to manage your cough:
- 1. **Rest**: Gently press on the coughing area with a tissue or elbow.
- 2. **Use medication**: If possible, use medications like antihistamines to relieve the cough (under a doctor's guidance).
- 3. **Warm compress**: Use a warm compress to alleviate pain and swelling.
- 4. **Medication treatment**: If the cough persists, use anti-inflammatory or antihistamine medications under the guidance of a doctor.
- 5. **Maintain a healthy lifestyle**: Get sufficient sleep, eat a balanced diet, and avoid smoking and excessive alcohol consumption to help reduce the chances of coughing.
- Please remember these suggestions are not substitutes for professional medical advice. If your symptoms persist or worsen, it is strongly advised to see a doctor.
- ```
+The full_sft model performs better in conciseness and factual accuracy; the rlhf model tends to pack more background information into its answers, but its factual accuracy suffers somewhat.
+Overall, after RLHF the model tends to learn to say more polite but useless filler to please the "conversation" itself, at a slight cost in factual accuracy.
+There is no free lunch: the RLHF dataset quality still needs improving, and some unavoidable loss of model capability has to be accepted.
+The difference between DPO and online PPO is that both the rejected and chosen answers are prepared offline, so their distribution necessarily differs a lot from what the minimind model itself would output.
+Simply put, DPO has the model do RL by watching "videos" of a table-tennis world champion's technique, whereas PPO has a reward model act as a "coach" that corrects its strokes during RL.
-
+## Ⅱ RLAIF Comparison
-👉 Summary of Results
+A subjective comparison based on `MiniMind2 (768dim)` with the same data and training steps; the test random seed is fixed at 2026, for reference only:
-The full_sft model performs better in terms of simplicity and information accuracy; the rlhf model tends to provide more
-background information in its responses, but its accuracy needs improvement.
-Overall, after RLHF, the model tends to learn to say more polite but irrelevant things to please the "conversation",
-while sacrificing some information accuracy.
-There is no free lunch, and further improvement in the quality of the RLHF dataset is necessary. It's also important to
-accept that some loss in model capability is inevitable (to varying degrees).
-The difference between DPO and online PPO lies in the fact that both the reject and chosen are prepared offline, which
-causes a significant distribution difference compared to the minimind model's original output.
-Simply put, the DPO algorithm trains the model using RL by watching "recorded" games of a table tennis world champion,
-rather than using the PPO method where the reward model acts as a "coach" to correct the model's moves during RL.
+[A] MiniMind-Reason-0.1B (Distillation)
+[B] MiniMind-Reason-0.1B (PPO)
+[C] MiniMind-Reason-0.1B (GRPO)
-## Ⅱ Subjective Sample Evaluation
+```text
+[Q]: 请介绍一下自己。
+[A]: 我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。我被设计用来理解和生成人类语言,能够理解复杂的自然语言问题,提供有用的信息和解决方案。我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。
+[B]: 嗯,用户让我介绍一下自己。首先,我需要明确用户的需求是什么。他们可能是在做某个项目,想要了解自己的一些基本情况。不过,用户没有提供具体的信息,所以我得先假设一些可能的情况,比如他是一个学生,正在学习编程,或者是某个项目的开发者。接下来,我要考虑用户的身份。可能是一个学生、教师或者普通用户,他们需要了解自己的特点和能力。也有可能他们正在寻找灵感,或者在写论文时需要帮助。所以,我需要保持回答的简洁和实用,同时提供一些参考资料,比如教学视频、课程记录等。然后,我要考虑用户的身份。可能是一个学生,需要一些学习资源或帮助的对象。也有可能是一个学生,需要在课后作业或学习计划中进行调整。因此,我需要先询问用户的具体需求,这样才能提供更精准的帮助。接下来,我要确保语言简洁明了,适合不同场景。比如,如果用户是学生,可以介绍自己的学习特点;如果是教师,可以说明自己的角色和贡献。同时,提供一些实用的建议,比如如何高效学习,如何进行项目,以及如何与团队成员协作。最后,我要确保回复清晰,有条理,能够满足用户的需求。这样,用户就能轻松地了解自己,并且有实际的帮助。当然!请告诉我你想了解哪方面的内容,我可以为您提供详细的介绍!
+[C]: 嗯,用户问的是"请介绍一下自己"。首先,我需要理解用户的具体需求是什么。他们可能想知道自己是谁,或者有什么特定的背景。接下来,我应该考虑用户可能没有明确表达的需求。他们可能想知道如何描述自己的外观,或者如何与他人交流。然后,我应该考虑如何组织内容,从头到尾都要有条理。可能需要分步骤说明,比如从视觉、听觉到整体感受,再到情感共鸣。这样用户能更清晰地理解自己,也能更好地回应他们的需求。还要注意语言的自然和流畅,避免过于学术化的术语,让用户感觉顺畅。同时,保持回答的结构清晰,每句话都有明确的目的,这样用户能有明确的理解和操作。最后,检查是否有遗漏的重要信息,确保回答准确且全面。这样,用户不仅能得到答案,还能得到满足他们的需求。您好!我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。关于您的描述,似乎涉及多个领域,包括知识图谱构建、复杂系统的推理与分析。无论是理论模型、实际应用还是伦理讨论,您都在寻求如何将这些概念有机地结合起来,为人类社会带来福祉的解决方案。在"我"的身份下,您提到的"个人体验"和"情感共鸣",都是您个人独特魅力的重要组成部分。通过精准的描述和深度的分析,您可以更深入地理解您的独特之处。如果您对特定领域有特别的兴趣或需要建议,随时可以分享更多细节,我们将为您提供更加精准和个性化的服务。如果您有特定的主题或问题,欢迎随时提问!让我们一起探索这个神奇的AI世界,共同创造更美好的未来。
-🏃The following tests were completed on February 9, 2025. New models released after this date will not be included in the
-tests unless there is a special need.
+[Q]: 你更擅长哪一个学科?
+[A]: 我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。我是由中国的个人开发者开发的智能助手小型AI推理模型-R1。
+[B]: 嗯,用户问的是"你更擅长哪一个学科?"这个问题看起来有点复杂,但我可以一步步分析。首先,我需要理解用户的需求。他可能是在准备考试,或者需要了解某个学科的特点,或者是想在学习过程中应用某个学科的知识。接下来,我得考虑用户可能的背景。他可能是一个学生,或者是在准备考试,需要应用数学知识来解决具体的问题。也有可能他是一个学生,正在学习相关的课程,或者是对某个学科感兴趣。然后,我需要分析用户可能没有明确表达出来的需求。他可能在思考自己在学习中的困难,或者在寻找一个合适的教材或学习资源。考虑到这些,我应该提供几个不同的角度来帮助他。首先,分析"我更擅长哪一个学科"可以拆解为以下几个方面:兴趣、学习能力、学习风格、学习方法、学习资源、学习环境等等。每个方面都有不同的优势和特点,比如兴趣是最好的资源,学习能力是最好的方式,学习方式是根据个人情况选择。接下来,我需要考虑用户的身份。可能是学生,也可能是家长,或者对学习有一定兴趣的人。不同的身份可能有不同的偏好,比如家长可能更注重孩子的兴趣,而学生可能更关注学习方法和资源。然后,我需要考虑用户的身份是否有特殊需求。比如,他可能需要更专业的学习材料,或者希望找到最适合他的学习方法。因此,在分析时,我应该提供一个全面的回答,涵盖各个方面,并解释每个方面的重要性。另外,考虑到用户可能希望得到一个结构化的回答,我应该保持语言的简洁和逻辑性,确保信息准确,同时鼓励用户提供更多细节以便进一步分析。最后,我需要确保我的回答既有知识性,又符合用户的要求,帮助用户更好地理解和应用这个学科。因此,在回答中不仅要给出一个全面的分析,还要提到每个部分的作用和如何利用。你擅长哪一个学科?这是一个明确的问题,可以从以下几个方面进行分析:1. **兴趣与学习能力**:你喜欢什么样的学习方式,比如自主学习、实验、项目等,这些都是很好的学习资源。2. **学习风格**:你喜欢什么类型的学习方法,比如笔记、复习、阅读、讨论等,这些都能影响学习效果。3. **学习资源**:你是否在寻找学习资源,比如书籍、在线课程、教育软件等,这些都能提供学习帮助。4. **学习环境**:你是否在学习环境中感到舒适,比如舒适的学习空间、安静的学习空间等,这些都会影响学习效果。5. **学习资源**:你是否有学习资源的丰富,如在线课程、学习平台、书籍等,这些都能提供丰富的学习资源。6. **学习方法**:你是否有具体的学习方法,比如如何理解、如何组织学习内容、如何应用知识等,这些都是提高学习效率的途径。综合考虑,这是一个非常全面的分析,涵盖了多个方面,帮助你更好地理解你的学科。
+[C]: 嗯,用户问的是"你更擅长哪一个学科?"这个问题看起来有点复杂,但我可以一步步分析。首先,我需要理解用户的需求。他可能是在准备考试,或者需要了解某个学科的特点,或者是想在生活中找到自己的兴趣点。接下来,我要考虑用户可能的背景。他可能在工作或学习中遇到了类似的问题,想要了解自己擅长哪个领域,或者是否需要一些资源或指导来提升自己。然后,我要分析用户可能的深层需求。也许他想在职场上提升自己,或者寻找方向。他可能需要更具体的例子,帮助他更好地理解不同学科的特点和应用场景。还要考虑用户可能的背景。他可能不仅仅是在学习,而是希望了解在特定领域中如何应用这些学科。这需要我提供全面的分析,帮助他找到适合自己的方法。最后,我应该保持开放的态度,鼓励用户继续探索和学习。这可能包括推荐一些资源,或者讨论如何将这些知识应用到实际生活中。你更擅长哪一个学科?这个问题本质上取决于你的个人兴趣和目标。以下是一些关键点,帮助你全面理解并应用这一点:1. **兴趣与专长**:了解自己擅长哪个领域,有助于做出更明智的选择。2. **全面发展**:考虑学科的多样性和深度,以全面发展自己的生活和职业。3. **适应与调整**:选择与自己技能、兴趣和目标相匹配的学科,有助于应对挑战和实现目标。4. **职业规划**:探索适合自己的职业路径,提升效率和竞争力。5. **社会贡献**:评估自己在社会中的角色,为社会做出贡献。为了更好地理解并应用这一点,你可以:- **了解行业动态**:关注特定领域的专业发展。- **实践与反思**:通过项目或实习积累经验。- **职业发展**:评估自己的长期职业规划。- **持续学习**:不断更新知识,适应职场变化。通过这些方向,你可以更全面地认识自己,找到适合自己的发展路径。
+
+[Q]: 鲁迅的《狂人日记》是如何批判封建礼教的?
+[A]: 嗯,鲁迅的《狂人日记》是如何批判封建礼教的呢?这个问题听起来挺有意思的,尤其是他在《狂人日记》中的表达方式。首先,我需要确认一下《狂人日记》具体是谁写的,是否有特定的人物或者事件。《狂人日记》是鲁迅在1924年创作的,主要讨论了社会问题,尤其是封建礼教的问题。作者曹雪芹是个非常著名的作家,他的作品中有许多关于封建礼教和阶级斗争的观点。这类作品在文学、哲学和政治上都有很深远的影响,尤其是在社会结构和人性的方面。接下来,我需要了解鲁迅的写作风格。他的写作风格比较直接,有严肃的哲学观点,强调人性的悲剧和道德的虚无。他的语言风格比较富有诗意,富有象征意义,善于运用象征手法,同时又不失真实和现实主义。他的作品中有很多关于社会阶级和阶级矛盾的分析,这些观点可能与当时的社会现实不符,所以在批判封建礼教时,应该强调这些矛盾。然后,我想到《狂人日记》中对封建礼教的讨论。鲁迅的作品中,封建礼教被视为不人道的,但通过文字和行动,表达了他对封建礼教的不满和反抗。这不仅反映了当时的社会现实,也揭示了封建礼教的黑暗面,体现了对封建社会的不满和对人性的深刻反思。接下来,我需要思考鲁迅对封建礼教的理解和批判。他在《狂人日记》中,对封建礼教的质疑和批判,体现了他对传统封建礼教的否定和批判。这不仅仅是对封建礼教的质疑,更是对当时社会的反思,以及对人性的关注。最后,我需要总结一下鲁迅对封建礼教的批判和批判。他通过《狂人日记》中的文字和行为,批判了封建礼教的虚假和不合理,体现了对社会现实的深刻反思。同时,他也关注了封建礼教的黑暗面,反映了对封建社会的不满和对人性的否定。总的来说,鲁迅的《狂人日记》通过文字和行动,批判了封建礼教的虚假和不合理,体现了当时的社会现实和人性的复杂。他的作品对后来的封建礼教有了启示,对当时的社会现象有了更深的反思。鲁迅的《狂人日记》是一部反映封建礼教的文学作品,主要通过文字和行动反映了当时社会的现实。鲁迅的文字风格独特,富有象征意义,但通过文字和行动,批判了封建礼教的虚假和不合理。在他的作品中,封建礼教被视为不人道的,但通过文字和行动,反映了封建社会的不公和人性的深刻反思。鲁迅的《狂人日记》通过文字和行动,批判了封建礼教的虚假和不合理,体现了当时社会的不满和对人性的反思。他的作品不仅批判了封建礼教的虚假和不合理,还深刻揭示了封建社会的黑暗面。
+[B]: 嗯,用户问的是鲁迅的《狂人日记》如何批判封建礼教。首先,我需要理解鲁迅的生平和作品。《狂人日记》是中国现代文学的经典,作者是毛泽东,作品涉及社会、家庭、伦理等多个层面。我需要找出鲁迅的写作风格和主题,以及它在当时的社会背景。接下来,我应该回顾《狂人日记》的主要内容。毛泽东在《青春无悔》中提出了"狂人日记"这一概念,强调青春活泼、勇敢、正义,这在当时社会中仍然具有重要意义。他通过描绘主人公狂人来表达对社会不公的批判,但同时也指出他的创作对社会的负面影响,比如社会动荡和个人主义的失败。然后,我应该分析鲁迅的创作理念和主题。《狂人日记》强调了社会的黑暗面,主张"以人为本"、"反对封建礼教"等。他可能也在探讨社会不平等的问题,比如"人性弱点"、"权力滥用"等问题。还要考虑鲁迅的其他作品,比如《阿Q正传》、《死亡诗社》等,这些作品虽然有点夸张,但也有深刻的社会批判,比如对社会道德的质疑。同时,作者可能还涉及其他作品,如《庆余年》等,但这些可能更侧重于文学创作,而不是社会批判。另外,用户可能想知道鲁迅在文学创作中的具体经历,比如他是否有作品被广泛批评,或者他对社会的批判是否具有批判性。所以,回答时需要指出鲁迅的创作经历,以及他对社会现实的反思。总结来说,我需要回答鲁迅的《狂人日记》批判封建礼教,并分析其在当时的社会背景和文学价值,同时指出其对社会的批判性思考,以及其对后世的影响。鲁迅的《狂人日记》批判封建礼教,以其独特的叙事风格和深刻的社会批判精神著称,主要体现在以下几个方面:1. **社会批判与个人成长**:小说通过描绘主人公狂人的活泼、勇敢、正义,揭示社会的不公和弱势。2. **个人主义与社会批判**:小说反映了封建礼教对个人自由和权利的侵犯,强调个人尊严和正义的价值。3. **对社会的批判性思考**:通过对社会不公的刻画,揭示社会问题的根源,并对当时社会的批判态度进行批判。鲁迅的作品在当时的社会背景和文学价值使其成为当时文学的重要代表。他的作品不仅在当时广受欢迎,也对后世产生了深远影响,被广泛引用和传颂。
+[C]: 嗯,用户问的是鲁迅的《狂人日记》如何批判封建礼教。首先,我需要理解鲁迅的背景和核心思想。他可能是一位社会学家、思想家,或者是对传统文化有深厚兴趣的读者。狂人日记是他在《狂人日记》中所写的,可能对传统文化有一定影响。接下来,我要考虑用户的需求是什么。用户可能想知道这本书在现代社会中的反思意义,或者是否有相关研究支持这一点。也有可能他们对书中的情节有疑问,想了解如何更深刻地理解社会现象。然后,我需要思考如何从不同的角度解读鲁迅的文本。例如,是否有直接的历史叙事,或者是否存在更深层次的哲学思考?这需要我先了解鲁迅的不同观点,然后分析其批判性。还要考虑用户可能的深层需求,比如是否需要了解他作为文学批判家的背景,或者是否有对传统价值观的反思。因此,回答时要全面,涵盖核心问题,并引用相关资料来支撑观点。最后,总结鲁迅的观点,强调其对现代社会的批判性思考,以及对传统文化的尊重。这样既全面又有深度,帮助用户全面理解鲁迅的作品。鲁迅的《狂人日记》通过其独特的叙事手法和深刻的社会批判精神,对封建礼教进行了深刻的批判。以下是其批判性分析:1. **社会批判与历史真相**:小说通过对历史人物的刻画,揭示封建礼教的弊端,以及其对传统社会结构的冲击。这种批判不仅限于对现实的深刻批判,也反映了现代社会对传统价值观的重新审视。2. **个人命运与社会批判**:小说中,主人公通过"狂人日记"揭示了封建礼教对个人命运的严重剥夺。这种叙事方式既批判了封建礼教,也对后世产生了深远影响。3. **对后世的影响**:从文学创作到思想解读,鲁迅的作品深刻影响了后世的思想体系。例如,《狂人日记》通过人物命运的反映,对现代社会的价值观念提出了新的思考。4. **对现代文明的反思**:这一批判性思考不仅反映了当代社会的复杂性,也为现代人提供了反思社会现实、维护人权的视角。5. **个人视角与文化视角**:作者通过这一批判性思考,展现了对传统价值观的重新解读,同时也对当代社会的多元化视角产生了启发。综上所述,鲁迅的《狂人日记》不仅是对封建礼教的批判,更是对传统文化的一种深刻反思。它通过对封建礼教的批判,提出了对现代社会结构、权力关系的深刻反思,具有重要的社会价值。
+```
+
+## Ⅲ Other Model Comparisons
[A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
[B] [MiniMind2-MoE (0.15B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
@@ -1188,237 +1364,202 @@ tests unless there is a special need.
[H] [chatlm-mini-chinese(0.2B)](https://github.com/charent/ChatLM-mini-Chinese)
```text
-[Q]: Do you know about the Yangtze River?
-[A]: The Yangtze River, located in southeastern China, is the longest river in China, stretching over 6,300 kilometers and flowing through 11 provinces. It is China's "mother river." The history of the Yangtze River dates back to the Shang Dynasty around 2000 BC. During the Shang Dynasty, the economic development of the Yangtze River reached its peak, making it one of the most important agricultural and industrial centers in China. The economy of the Yangtze River Basin mainly relies on agriculture, with Yangtze River fisheries occupying a major industry, such as fishing and hunting fish. The water resources of the Yangtze River are also very abundant, including rivers, lakes, and streams. The natural scenery of the Yangtze River is spectacular, with different landscapes in each season: rainy in spring and summer, cold and snowy in winter, and beautiful flowers and fruits in spring and autumn. The Yangtze River is China's economic center, home to many countries and cities, and an important transportation hub and shipping center in the world. At the same time, the Yangtze River is also an important water resource for China, playing a significant role in the country's economic development.
-[B]: Yes, I know about the Yangtze River. The Yangtze River is located in central China and is the third-longest river in China, stretching approximately 6,300 kilometers and flowing through Jiangxi, Anhui, Jiangsu, Anhui, Jiangsu, and Anhui Provinces.
-[C]: Of course, the Yangtze River is the longest river in China, stretching approximately 6,300 kilometers. It is famous for its rich water resources, diverse aquatic flora and fauna, and unique geographical location.
-[D]: Yes, the Yangtze River is the longest river in China, exceeding 3,300 kilometers in length.
-[E]: The Yangtze River is the longest river in China, with a total length of approximately 6,300 kilometers. It originates from Qinghai Lake on the Qinghai-Tibet Plateau and flows through Qinghai, Sichuan, Gansu, Ningxia, Inner Mongolia, Shaanxi, and Shanxi. The Yangtze River Basin is one of the largest agricultural areas in China and also the cultural and economic center of the country.
-[F]: Yes, the Yangtze River is one of the longest rivers in China and the third-largest river in the world. The Yangtze River originates from the Qinghai-Tibet Plateau, flows through 11 provinces, and eventually empties into the East China Sea. The Yangtze River Basin is one of the largest river basins in China, possessing abundant natural resources and cultural heritage, such as the Yangtze River Bridge, the Yangtze Ocean Museum, and the Yangtze Ancient City.
-[G]: It is also known that China is one of the longest cities in the world, with a population exceeding 34 million.
-[H]: Yes, the Yangtze River is the longest river in China and one of the longest rivers in the world. It originates from the Qinghai-Tibet Plateau and flows through nine provinces before finally emptying into the East China Sea. The climate of the Yangtze River Basin is mild with distinct seasons, making it an important agricultural and industrial base in China. The aquatic resources of the Yangtze River Basin are abundant, making it one of the important fishery resources in China.
+[Q]: 你知道长江吗?
+[A]: 长江,位于中国东南部,是中国的第一长河,全长6300多公里,流经11个省份,是中国最长的河流,也是中国的"母亲河"。长江的历史可以追溯到公元前2000多年的商朝时期,当时的商朝是商朝的后裔。商朝时期,长江的经济发展达到了高峰,是中国最重要的农业和工业中心之一。长江流域的经济主要以农业为主,长江渔业占据了主要产业,如钓鱼、猎鱼等。长江的水资源也非常丰富,包括河流、湖泊、河流等。长江的自然景观也十分壮观,长江四季景色各异,春季多雨,夏季多雨,秋季则多雨,冬季则寒冷多雪,春季和秋季都有美丽的花朵和果实。长江是中国的经济中心,拥有众多的国家和城市,是世界上重要的交通枢纽和航运中心。同时,长江还是中国的重要水利资源,对于中国的经济发展具有重要意义。
+[B]: 是的,我知道长江。长江位于中国的中部,是中国第三长的河流,全长约6300公里,流经江西省、安徽省、江苏省、安徽省、江苏省和安徽省。
+[C]: 当然,长江是中国最长的河流,全长约6300公里。它以其丰富的水资源、丰富的水生动植物资源以及独特的地理位置而闻名于世。
+[D]: 是的,长江是中国最长的河流,全长超过3300千米。
+[E]: 长江是中国最长的河流,总长度约为6300公里。它发源于青藏高原的青海湖,流经青海、四川、甘肃、宁夏、内蒙古、陕西和山西。长江流域是中国最大的农业区之一,也是中国的文化和经济中心。
+[F]: 是的,长江是中国最长的河流之一,也是世界第三大河流。长江发源于青藏高原,流经11个省份,最终注入东海。长江流域是中国最大的河流流域之一,拥有着丰富的自然资源和文化遗产,如长江大桥、长江海洋博物馆、长江古城等。
+[G]: 同时也知道,中国是世界上最长的城市之一,拥有超过3400万人口。
+[H]: 是的,长江是中国最长的河流,也是世界上最长的河流之一。它发源于青藏高原,流经九个省区,最终注入东海。长江流域的气候温和,四季分明,是中国重要的农业和工业基地。长江流域的水产资源丰富,是中国的重要渔业资源之一。
-[Q]: What is the highest mountain in the world?
-[A]: The highest mountain in the world is Mount Everest, located in the Himalayas on the border between Nepal and China. Its elevation is 8,848.86 meters (29,031.7 feet), making it the highest mountain on Earth.
-[B]: Mount Everest is the highest mountain in the world, with an elevation of 8,848 meters (29,029 feet).
-[C]: Mount Everest is the highest mountain in the world, located in the Himalayas with an elevation of 8,848.86 meters (29,031.7 feet). It is located on the border between Nepal and China and is part of the Himalayan range.
-[D]: The highest mountain in the world is Mount Everest, located on the border between Nepal and China, with an elevation of 8,848 meters.
-[E]: The highest mountain in the world is Mount Everest, located in the Himalayas with an elevation of 8,848 meters.
-[F]: Mount Everest is the highest mountain in the world, with an elevation of 8,848 meters.
-[G]: The highest mountain in the world is Mount Everest, located in the Himalayas with an elevation of 8,848.86 meters (29,031.69 feet).
-[H]: The highest mountain in the world is Mount Everest, also known as Sagarmatha, located on the border between Nepal and China in the Himalayas, with an elevation of 8,848.86 meters (29,031.69 feet). It is one of the most famous mountaineering destinations in the world, attracting climbers from all over the globe. The name Mount Everest comes from Tibetan, meaning "the goddess's forehead," and it consists of two different peaks, one located in Nepal and the other in China. Its height is determined by measuring its altitude because it is determined by measuring the average height of the Earth. Although Mount Everest is the highest mountain in the world, it is not reachable from any other place on Earth. Other places on Earth, such as the Andes in South America, are also the highest mountain ranges in the world, but they are not the highest mountains on Earth. In summary, Mount Everest is the highest mountain in the world, but it is not reachable from any other place on Earth. It is the highest mountain range on Earth, but it is not reachable from any other place on Earth.
+[Q]: 世界上最高的山峰是什么?
+[A]: 世界上最高的山峰是珠穆朗玛峰(Mount Everest),位于喜马拉雅山脉,尼泊尔和中国的边界上。它的海拔高度是8,848.86米(29,031.7英尺),是地球上最高的山峰。
+[B]: 珠穆朗玛峰(Mount Everest)是世界上最高的山峰,海拔高度达8,848米(29,029英尺)。
+[C]: 珠穆朗玛峰是世界上最高的山峰,位于喜马拉雅山脉,海拔8,848.86米(29,031.7英尺)。它位于尼泊尔和中国的边界上,是喜马拉雅山脉的一部分。
+[D]: 世界上最高的山峰是珠穆朗玛峰,位于尼泊尔和中国的边境上,海拔8848米。
+[E]: 世界上最高的山峰是珠穆朗玛峰,位于喜马拉雅山脉,海拔8,848米。
+[F]: 珠穆朗玛峰是世界上最高的山峰,海拔8848米。
+[G]: 世界上最高的山峰是珠穆朗玛峰,位于喜马拉雅山脉,海拔8,848.86米(29,031.69英尺)。
+[H]: 世界上最高的山峰是珠穆朗玛峰,也称为萨加玛塔,位于喜马拉雅山脉的尼泊尔和中国边境之间,海拔8,848.86米(29,031.69英尺)。它是世界上最著名的登山目标之一,吸引着来自世界各地的登山者前来挑战。珠穆朗玛峰的名字来自藏语,意思是"女神的额峰",它是由两个不同的山峰组成的,其中一个是位于尼泊尔的珠穆朗玛峰,另一个则位于中国。它的高度是通过测量海拔高度来确定的,因为它是通过测量地球的平均高度来确定的。尽管珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何其他地方都能到达的。地球上的其他地方,如南美洲的安第斯山脉,也是世界上最高的山脉,但它并不是地球上最高的山峰。总之,珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何地方都能到达的。它是地球上最高的山脉,但它并不是地球上任何地方都能到达的。
-[Q]: Who proposed the law of universal gravitation?
-[A]: The law of universal gravitation was proposed by Newton in the 17th century. It is one of Newton's three laws. Newton's gravitational law revealed the essence of the law of universal gravitation and explained how celestial bodies such as planets, satellites, and comets interact with each other. This law was proposed by Isaac Newton in 1687. Through analysis and experiments on the law of universal gravitation, he successfully explained the gravitational interactions between celestial bodies, thereby explaining the phenomenon of the Earth orbiting the Sun.
-[B]: Universal gravitation (Human Policy) is one of Newton's three laws of mechanics, used to describe the interaction force between two objects. It consists of the law of universal gravitation and Newton's first law (the law of inertia). Universal gravitation occurs due to the interaction between two objects, and factors such as their mass, distance, speed, and direction determine their interaction. This interaction force is: the product of masses: the product of masses is the result of the law of universal gravitation and Newton's second law. The product of masses: the product of masses is a physical process that is the result of interactions between objects, and for other objects, their mass can be expressed as the sum of the masses of the two interacting objects, i.e., gravitational acceleration a.
-[C]: Universal gravitation is a physics concept proposed by Newton in 1687. It describes the laws of how gravitational forces interact between objects.
-[D]: Universal gravitation was proposed by the British physicist Newton. He believed that all objects generate gravity, which is determined by the object's mass and distance. This theory reveals the relationship between mass and distance and proves the essence of universal gravitation.
-[E]: Universal gravitation was proposed by Newton in 1687. It is a fundamental physical phenomenon in the universe, describing the interactions between objects. This force is called universal gravitation, which exists between all objects, and this force is related to the mass and distance of the objects. Universal gravitation is one of the most basic forces in nature, influencing the motion and changes of objects and having a significant impact on the structure and evolution of the universe.
-[F]: Universal gravitation was proposed by the British physicist Isaac Newton in the 17th century.
-[G]: Gravity is a force in the universe composed of mass or energy. According to Newton's law of universal gravitation, the gravitational force between two objects is directly proportional to their mass and inversely proportional to their mass. Therefore, Newton's law of universal gravitation states that the gravitational force between two objects is directly proportional to their mass and inversely proportional to their mass. This law has wide applications in physics, astronomy, and physics.
-[H]: Universal gravitation was proposed by the British physicist Isaac Newton in 1687. In 1687, he published a famous paper titled "Mathematical Principles of Natural Philosophy," in which he proposed the law of universal gravitation, stating that there is a gravitational force between any two objects, and the magnitude of this force is directly proportional to their masses and inversely proportional to the square of the distance between them. This law is called the law of universal gravitation, describing the relationship between mass and distance. It shows that the gravitational force between two objects is directly proportional to their masses and inversely proportional to the square of the distance between them. This law is crucial for explaining many phenomena and interactions in the universe, such as planetary orbits around the Sun, gravitational lensing effects between stars, etc. It is also widely applied in astronomy, physics, and engineering, providing an important theoretical foundation for understanding the nature and evolution of the universe.
+[Q]: 谁提出了万有引力定律?
+[A]: 万有引力定律是由牛顿在17世纪提出的,是牛顿三大定律之一。牛顿在17世纪提出的万有引力定律揭示了万有引力定律的本质,并解释了行星、卫星、彗星和其他天体之间的引力相互作用。这一理论是由牛顿在1687年提出的。通过分析和实验,他成功地解释了天体之间的引力相互作用,从而解释了地球绕太阳公转的现象。
+[B]: 万有引力(人类政策)是牛顿三大力学定律之一,用于描述两个物体之间的相互作用。它包括万有引力定律和牛顿第一定律(惯性定律)。万有引力定律来源于两个物体之间的相互作用,其质量、距离和速度决定了它们之间的相互作用。这种相互作用是:质量的乘积:质量的乘积是万有引力定律和牛顿第二定律的乘积。质量的乘积:质量的乘积是两个物体之间的物理过程,两个物体之间的相互作用。对于其他物体,它们的质量可以表示为两个物体之间的质量之和,即引力加速度a。
+[C]: 万有引力是牛顿在1687年提出的物理概念,描述了物体之间的引力相互作用。
+[D]: 万有引力是由英国物理学家牛顿提出的。他相信所有物体都会产生引力,这种引力是由物体的质量和距离决定的。这一理论揭示了质量和距离之间的关系,并证明了万有引力定律的本质。
+[E]: 万有引力是由牛顿在1687年提出的,它是宇宙中的一个基本物理现象,描述了物体之间的相互作用。这种力被称为万有引力,存在于所有物体之间,这种力与物体的质量和距离有关。万有引力是自然界中最基本的力之一,不仅影响物体的运动和变化,而且对宇宙的结构和演化也有重要影响。
+[F]: 万有引力是由英国物理学家牛顿在17世纪提出的。
+[G]: 引力是宇宙中由质量和能量组成的力。根据牛顿的万有引力定律,一个物体的引力与它的质量和距离成正比。因此,牛顿的万有引力定律表明,一个物体的引力与它的质量和距离成正比。这一定律在物理学、天文学和物理学中都有广泛的应用。
+[H]: 万有引力是由英国物理学家牛顿在1687年提出的。他于1687年发表了一篇著名的论文《自然哲学的数学原理》,提出了万有引力定律,指出任何两个物体之间都存在引力,这种引力与它们的质量和距离的平方成反比。这一定律被称为万有引力定律,描述了质量和距离之间的关系。它指出两个物体之间的引力与它们的质量和距离的平方成反比。这一定律对于解释许多现象和宇宙中的相互作用非常重要,例如行星绕太阳公转、恒星之间的引力透镜效应等。它也在天文学、物理学和工程学领域得到了广泛应用,为我们理解自然和宇宙的本质和演化提供了重要的理论基础。
-[Q]: Do you know what the main food of giant pandas is?
-[A]: Giant pandas primarily feed on bamboo, but they also eat other plants, insects, small mammals, and fish. Their digestive system is similar to other bear species, capable of digesting bamboo. Pandas have sharp teeth suitable for chewing bamboo, so they are often referred to as "bamboo-eating animals."
-[B]: Giant pandas are mammals that feed on bamboo. They are carnivores. Their main food is bamboo, especially bamboo rich in cellulose and minerals, which does not directly enter the digestive system. The intestines of giant pandas are mainly composed of bamboo, and they need to digest cellulose and other fibers in bamboo to support their normal feeding.
-[C]: Giant pandas primarily feed on bamboo, but they also eat other plants, including leaves, buds, insects, fruits, vegetables, and small mammals. Giant pandas are a species unique to China, belonging to the bear family, mainly distributed in Sichuan, Shaanxi, and Gansu provinces.
-[D]: The main food of giant pandas is bamboo. They usually eat bamboo, consuming about 30 kilograms of bamboo daily.
-[E]: Yes, giant pandas primarily feed on bamboo. They need to consume large amounts of bamboo daily to sustain life, eating between 40 to 80 kilograms of bamboo per day. Additionally, they also eat some other plants, such as grass, fruits, nuts, and small mammals.
-[F]: Giant pandas primarily feed on bamboo but occasionally consume other plants and animals. Additionally, giant pandas eat some wild fruits, small animals, and plants, but these do not constitute their primary food source.
-[G]: The main food of giant pandas is bamboo.
-[H]: Giant pandas are herbivorous animals that primarily feed on bamboo. Their intestinal structure is suitable for digesting bamboo with high fiber content, but they also eat some other plants, such as herbaceous plants and small mammals. The intestinal structure of giant pandas also adapts to high-fiber bamboo, allowing them to digest large amounts of cellulose, helping them survive in the wild. The intestines of giant pandas also contain an enzyme called "cellulase," which helps break down cellulose, aiding in the digestion of large amounts of water and nutrients. Additionally, the intestines of giant pandas contain a protein called "fibroin," which helps maintain energy levels in the body, aiding in their survival in the wild. Overall, the main food of giant pandas is bamboo, but they also eat some other plants, such as herbaceous plants and small mammals.
+[Q]: 你知道大熊猫的主要食物是什么吗?
+[A]: 大熊猫主要以竹子为食,但也吃其他植物、昆虫、小型哺乳动物和鱼。它们的消化系统与熊相似,能够消化竹子。大熊猫有锋利的牙齿,适合咀嚼竹子,所以它们通常被称为"竹子动物"。
+[B]: 大熊猫是草食哺乳动物,以竹子为食,也是肉食动物。它们的主要是竹子,尤其是富含纤维素和矿物质的竹子,这些竹子不会直接进入消化系统。大熊猫的肠道主要由竹子组成,需要消化纤维素和其他纤维在竹子中以支持正常喂养。
+[C]: 大熊猫主要以竹子为食,但也吃其他植物,包括叶子、嫩芽、昆虫、水果、蔬菜和小型哺乳动物。大熊猫是中国特有的,属于熊科,主要分布在四川、陕西和甘肃地区。
+[D]: 大熊猫的主要食物是竹子。它们通常吃竹子,每天需要吃大约30公斤的竹子。
+[E]: 是的,大熊猫主要以竹子为食。它们需要每天吃大量的竹子来维持生命,每天需要吃40到80公斤的竹子。此外,它们还吃一些其他植物,如草、水果、坚果和小型哺乳动物。
+[F]: 大熊猫主要以竹子为食,但偶尔也会吃其他植物和动物。此外,大熊猫还吃一些野生水果、小型动物和植物,但这些并不是它们的主要食物来源。
+[G]: 大熊猫的主要食物是竹子。
+[H]: 大熊猫是植物性动物,主要食物是竹子。它们的肠道结构适合消化富含纤维素的竹子,但它们也吃一些其他植物,如草本植物和小型哺乳动物。大熊猫的肠道结构也适应了高纤维竹子,使它们能够消化大量的纤维素,帮助它们在野外生存。大熊猫的肠道中也含有一种叫做"纤维素酶"的酶,可以帮助大熊猫分解纤维素,帮助它们消化大量的水和营养物质。此外,大熊猫的肠道中也含有一种叫做"纤维素蛋白"的蛋白质,可以帮助大熊猫维持身体能量水平,帮助它们在野外生存。总之,大熊猫的主要食物是竹子,但它们也吃一些其他植物,如草本植物和小型哺乳动物。
-[Q]: Why is seawater salty?
-[A]: Seawater is salty mainly due to its chemical composition. Seawater contains large amounts of minerals and salts, with the main components being sulfates, carbonates, magnesium ions, and potassium ions. These substances precipitate in seawater at high concentrations, forming seawater. Specifically, the minerals and salts in seawater mainly come from rocks, soil, organisms, and human excrement. Rocks are a component of seawater, mainly composed of minerals such as quartz, feldspar, and mica. Soil contains large amounts of chemicals such as nitrogen, phosphorus, and potassium, which seep into the soil, forming soil. The salt in living organisms mainly comes from pigments like chlorophyll in plants, which absorb energy from sunlight and convert it into usable energy for organisms. The salt in human excrement mainly comes from minerals and carbonates, which have higher concentrations in water, leading to an increase in dissolved mineral concentration, forming salt in water bodies. Therefore, seawater is salty due to its chemical composition and structural characteristics. The minerals and salts in seawater mainly come from rocks, soil, organisms, and human excrement, and the accumulation and excretion of these substances cause seawater to taste salty.
-[B]: Seawater is salty because it contains various substances, including sodium chloride (NaCl), sodium chloride (NaCl), carbonates (such as sulfates), sodium chloride (NaCl), and water. Sources of salty substances: Salt is a common source of saltiness, containing minerals such as potassium, calcium, magnesium, and potassium. These minerals usually come from groundwater, groundwater, or soil, and their content is much lower than pure water. Reasons for saltiness: The saltiness of salt mainly comes from tiny particles dissolved in water. These tiny particles are mainly transmitted through plant roots and microorganisms inside animals, such as insects, small mammals, birds, and certain insects. In summary, seawater is salty because it contains large amounts of salts, including minerals, rocks, salts, and minerals, and the high concentration of these substances in seawater significantly affects the saltiness.
-[C]: Seawater is salty mainly because it contains various salts, with seawater being one of the main components. When seawater contains large amounts of salt, the salt dissolves in seawater, creating a salty taste. The salts in seawater mainly come from dissolved salts, evaporated salts, and frozen ice.
-[D]: Seawater contains many salts, which can lead to an increase in the salinity of seawater. The salts in seawater mainly come from the salts in seawater, which can be obtained from the salts in seawater. Oxides and salts in seawater can form hydrogen bonds with the salts in seawater, increasing the salinity of seawater.
-[E]: Seawater is salty because it contains large amounts of salts, including chloride ions and chloride ions. These ions form large amounts of salts in the ocean, known as salts. Seawater contains many salts dissolved in water, which gradually evaporate as seawater rises, eventually leading to the evaporation of salts in seawater. These salts are completely consumed in seawater and then excluded from seawater. Additionally, seawater contains many other salts, but these salts are different because their floating and sinking processes in water are different. The salts in seawater are extracted from rocks at the bottom of the sea, and these salts can be utilized in seawater. Overall, seawater is salty, which is why seawater contains many salts and salts.
-[F]: Seawater contains many dissolved salts, the most important of which is sodium chloride. When salt is added to water, the molecules of salt react with water molecules to form sodium hydroxide, which is a salt.
-[G]: Because water is essential for the survival of many organisms, it can absorb and release carbon dioxide, release oxygen, maintain water vapor in the atmosphere, and maintain the water cycle. Salts and chemicals in water release oxygen, allowing organisms on Earth to breathe. Additionally, water has many other functions, such as maintaining the water cycle, regulating climate and environment, etc. Water is essential for the survival of many organisms and is also an important basis for many natural phenomena.
-[H]: The main reason seawater is salty is that it contains large amounts of dissolved salts, the most important of which is sodium chloride (NaCl). Sodium chloride is a very common salt found in most parts of the Earth, including oceans, lakes, rivers, and groundwater. When water evaporates from the ocean, it takes away some of the salt, making seawater salty. Additionally, seawater contains other dissolved substances, such as sulfates, magnesium sulfate, calcium sulfate, magnesium sulfate salt, magnesium sulfate potassium, magnesium sulfate sodium, magnesium sulfate calcium, magnesium sulfate magnesium calcium, magnesium sulfate magnesium calcium, magnesium sulfate magnesium calcium, magnesium sulfate magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium magnesium magnesium magnesium calcium, magnesium sulfate magnesium magnesium magnesium magnesium
+[Q]: 为什么海水是咸的?
+[A]: 海水是咸的主要是因为它的化学成分。海水含有大量的矿物质和盐分,主要成分是硫酸盐、碳酸盐、镁离子和钾离子。这些物质在高浓度下会在海水中沉淀,形成海水。具体来说,海水中矿物质和盐分主要来自岩石、土壤、生物和人类排泄物。岩石是海水的组成部分,主要由石英、长石、云母等矿物质组成。土壤中含有大量的化学物质,如氮、磷、钾等,这些化学物质会渗入土壤,形成土壤。生物排泄物中的盐分主要来自植物叶绿素和其他色素,这些色素吸收阳光并将其转化为能量,供生物使用。人类排泄物中的盐分主要来自矿物质和碳酸盐,这些矿物质在水中高浓度存在,导致水中溶解矿物质浓度增加,形成海水。因此,海水是咸的主要是因为它的化学成分和结构特征。矿物质和盐分在海水中主要来自岩石、土壤、生物和人类排泄物,这些物质在海水中积累和排泄,导致海水变咸。
+[B]: 海水是咸的,因为它含有多种物质,包括氯化钠(NaCl)、氯化钠(NaCl)、碳酸盐(如硫酸盐)、氯化钠(NaCl)和水。咸物质的来源:盐是一种常见的咸味来源,含有钾、钙、镁、钾和其他矿物质。这些矿物质通常来自地下水、地下水或土壤,浓度远低于纯水。咸味的原因:咸味主要来自水中的小颗粒盐。这些小颗粒主要通过植物根部和动物体内的微生物传播,如昆虫、小型哺乳动物、鸟类和某些昆虫。总之,海水是咸的,因为它含有大量的盐分,包括矿物质、岩石、盐、矿物质等,这些物质在海水中高浓度存在,对咸味产生重要影响。
+[C]: 海水是咸的主要是因为它含有各种盐分,海水是其中主要成分之一。当海水含有大量盐分时,这些盐分溶解在海水中形成咸味。海水中盐分主要来自溶解盐分、蒸发沉淀和冰冻海水形成的盐分。
+[D]: 海水中含有许多盐分,这些盐分可以增加海水的咸度。海水中盐分主要来自海水中溶解的盐分,这些盐分可以从海水中获得。氧化物和海水中盐分可以与海水中盐分形成氢键,从而增加海水的咸度。
+[E]: 海水是咸的,因为它含有大量的盐分,包括氯离子和氯离子。这些离子在海洋中形成大量盐分,称为盐。海水中含有许多溶解的盐分,这些盐分随着海水的上升而蒸发,最终导致海水中盐分蒸发。这些盐分完全被海水消耗,然后从海水中排除。此外,海水中还含有许多其他盐分,但这些盐分不同,因为它们的浮动和下沉过程不同。海水中提取的盐分来自海底的岩石,这些盐分在海水中可利用。总之,海水是咸的,因为海水中含有许多盐分和盐分。
+[F]: 海水中含有许多溶解的盐分,其中最主要的是氯化钠。当盐加入水中时,盐分子与水分子反应形成氢氧化钠,这是一种盐。
+[G]: 因为许多生物需要水才能生存,它可以通过吸收和释放二氧化碳、释放氧气、保持大气中的水蒸气、保持水循环来实现。水中的盐分和化学物质释放氧气,使地球上的生物能够呼吸。此外,水还有许多其他功能,如保持水循环、调节气候和环境等。水对许多生物的生存至关重要,也是许多自然现象的重要基础。
+[H]: 海水是咸的主要是因为它含有大量的溶解盐分,其中最主要的是氯化钠(NaCl)。氯化钠是一种非常常见的盐,存在于地球上的海洋、湖泊、河流和地下水中。当海水从海洋蒸发时,它会带走一些盐,使海水变咸。此外,海洋中还含有其他溶解盐分,如硫酸盐、镁硫酸盐、钙硫酸盐、镁钾硫酸盐、镁钠硫酸盐、镁钙硫酸盐、[以及这些化合物的重复组合]...
-[Q]: Write a Python quicksort code.
-[A]: Okay, here is the Python quicksort code: def quick_sort(arr): if len(arr) <= 1: return arr else: pivot = arr[len(arr) // 2] left = [x for x in arr[1:]] right = [x for x in arr[2:] if x < pivot] return quick_sort(left) + [pivot] + quick_sort(right) # Test code arr = [3, 6, 8, 10, 1, 2, 1] print(quick_sort(arr)) This code first checks whether the length of the array is less than or equal to 1. If so, it sorts in ascending order. Then, for each element in the array, it recursively sorts until reaching the end position. If the length of the array is odd, it sorts in descending order. Finally, the average time complexity is O(n log n), where n is the length of the array.
-[B]: Here is an implementation of the quicksort algorithm using Python: def sort_array(arr): # Generate a random permutation sorted_array = sorted(arr) # Return the sorted list return sorted(arr) # Example arr = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5] # Call the function sorted_array = sort_array(arr) print("Original array is:", sorted_array) This code first generates a random permutation, then sorts based on the length of the list. Finally, it returns the sorted list. Note that this code uses Python's built-in sorted function and the built-in sorted() function, which are faster than directly using sorted() under a time complexity of O(n log n).
-[C]: Here is a simple sorting algorithm suitable for small arrays, including those with lengths less than or equal to 5. def quick_sort(arr): if len(arr) <= 1: return arr else for _ in range(2, 51): if arr[len(arr) - 1:] is not None: arr[len(arr) - 1] = arr[len(arr) - 1] arr[len(arr)] = arr[len(arr)] return quick_sort(arr) # Example arr = [3, 5, 1, 4, 1, 2, 1] sorted_arr = quick_sort(arr) print("Sorted array is:", sorted_arr) ### Code Explanation: 1. **Function Definition**: The quick_sort function accepts an array arr as a parameter. 2. **Loop**: For elements in arr, recursion is used, traversing the array from right to left and swapping the current element with the current element in the array until the array is empty or the array length is less than or equal to arr. 3. **Recursive Call**: Traverse the array, swap the current element with the current element in the array until the array is empty or the array length is less than or equal to arr. 4. **Check Array Length**: Use if __name__ == "__main__" to check if the array length is less than arr. If so, move the array from the larger array to the end of the array. 5. **Output Sorted Results**
-[D]: As an AI language model, I cannot write Python programs. However, Python is a very popular programming language with many excellent libraries and frameworks, including NumPy, Pandas, Matplotlib, Pandas, Bandke, Norm, and NKlearn. Python has many excellent natural language processing libraries and frameworks that can handle and analyze various types of data, such as text, images, audio, and video. Python can also be used to create various applications, such as web search, game development, scientific computing, etc. Python is very flexible and supports multiple programming styles, making it easy to find solutions that suit your needs.
-[E]: As an AI language model, I can answer your questions. But I cannot perform programming operations because I cannot determine which programming language you are asking about. Please provide more specific questions so I can better answer your queries.
-[F]: Quick Sort is a simple sorting algorithm that divides data into groups, sorts each group according to certain rules, and ultimately obtains a sorted array.
-[G]: Okay, here is a Python quicksort code ```python sorted_print(sorted(sorted(sorted(sorted_print))``` Output: ```[1, 2, 3, 4, 5, 6, 7, 8, 9]```
-[H]: Quicksort is an efficient sorting algorithm based on the divide-and-conquer approach, with a time complexity of O(n log n). Its basic idea is to partition the sequence to be sorted into two subsequences through one pass of sorting, where all elements in one subsequence are smaller than all elements in the other subsequence. Then, the two subsequences are recursively sorted. Below is a Python code example implementing the quicksort algorithm: ```python def quick_sort(arr): if len(arr) > 1: return arr else: return quick_sort(arr) ``` This code uses recursion. First, it checks whether the length of the array is less than or equal to 1. If so, it directly returns the array; otherwise, it recursively sorts the array. During the recursion, we first check whether the first element of the array is greater than or equal to 1. If so, compare the elements in the array with the first element. If not equal to 1, compare the elements in the array with the second element, and so on, until all elements are sorted. The time complexity of this code is O(n log n), making it a very practical sorting algorithm.
+[Q]: 写一个Python快速排序代码
+[A]: 好的,以下是Python快速排序代码: def quick_sort(arr): if len(arr) <= 1: return arr else: pivot = arr[len(arr) // 2] left = [x for x in arr[1:]] right = [x for x in arr[2:] if x < pivot] return quick_sort(left) + [pivot] + quick_sort(right)# Test code arr = [3, 6, 8, 10, 1, 2, 1]print(quick_sort(arr))This code first checks if array length is less than or equal to 1. If so, it arranges from smallest to largest. Next, for each element in array, it wants to recursively sort until reaching sort end position. If array length is odd, it arranges from largest to smallest. Finally, average time complexity is O(n log n), where n is array length.
+[B]: 以下是使用Python实现的快速排序算法: def sort_array(arr): # Generate random arrangement sorted_array = sorted(arr) # Return sorted list return sorted(arr)# Example arr = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]# Call function sorted_array = sort_array(arr)print("Original array is: ", sorted_array)This code first generates random arrangement, then sorts through a list's length. Finally returns sorted list. Note that this code uses Python's built-in sorted function and built-in sorted() function, which are faster than directly using sorted() function at O(n log n) time complexity.
+[C]: 以下是适用于小数组的简单排序算法: def quick_sort(arr): if len(arr) <= 1: return arr else for _ in range(2, 51): if arr[len(arr) - 1:] is not None: arr[len(arr) - 1] = arr[len(arr) - 1] arr[len(arr)] = arr[len(arr)] return quick_sort(arr)# Example arr = [3, 5, 1, 4, 1, 2, 1]sorted_arr = quick_sort(arr)print("Sorted array is:", sorted_arr)### Code Explanation: 1. **Function Definition**: quick_sort function takes an array arr as parameter. 2. **Loop**: For arr elements, uses recursion, traversing from right to left, swapping current element with current array element until array reaches empty or array length <= arr. 3. **Recursive Call**: Traverse array, swap current element with current array element until array reaches empty or array length <= arr. 4. **Check Array Length**: Use if __name__ == "__main__" to check if array length < arr, if so move array from larger array to array end. 5. **Output Sorted Results**
+[D]: 作为AI语言模型,我无法编写Python程序。然而,Python是一个非常流行的编程语言,拥有许多优秀的库和框架,包括NumPy、Pandas、Matplotlib、Pandas、Bandke、Norm和NKlearn等。Python有许多优秀的自然语言处理库和框架,用于处理和分析各种类型的数据,如文本、图像、音频、视频等。Python还可以创建各种应用程序,如网络搜索、游戏开发、科学计算等。Python非常灵活,也支持多种编程风格,因此可以轻松找到合适的解决方案。
+[E]: 作为AI语言模型,我可以回答你的问题。然而,我无法执行编程操作,因为我无法确定你询问的是哪种编程语言。请提供更具体的问题,以便我更好地回答你的问题。
+[F]: 快速排序(快速排序)是一种简单的排序算法,它将数据分成小部分,通过规则对每个部分进行排序,最终得到一个有序数组。
+[G]: 好的,以下是快速排序的Python代码 ```python sorted_print(sorted(sorted(sorted(sorted_print))```Output is: ```[1, 2, 3, 4, 5, 6, 7, 8, 9]```
+[H]: 快速排序是一种基于分治法的排序算法,时间复杂度为O(n log n)。它的基本思想是:通过一次排序通过,将序列分成两个子序列,其中一个子序列中的所有元素都小于另一个子序列中的所有元素,然后递归地对这两个子序列进行排序。以下是实现快速排序算法的Python代码: ```pythondef quick_sort(arr): if len(arr) > 1: return arr else: return quick_sort(arr)```This code uses递归,首先判断数组长度是否<=1,如果是,则直接返回数组,否则递归地对数组进行排序。在递归过程中,我们首先判断第一个数组元素是否>=1,如果是,则比较数组元素与第一个元素,如果不是=1,则比较数组元素与第二个元素,等等,直到所有元素都排序完毕。这种代码的时间复杂度为O(n log n),是一种非常实用的排序算法。
-[Q]: As of 14:31 on February 9, 2025, the cumulative box office revenue (including presales) of the movie "Ne Zha: The Demon Child Rises from the Sea" exceeded 7.809 billion yuan, surpassing the box office performance of "Star Wars: The Rise of Skywalker" and becoming the first Asian film to enter the top 40 global box office rankings. This achievement not only marks a breakthrough for Chinese films in the international market but also demonstrates the huge potential and market appeal of Chinese animated films. Summarize this passage in 20 words for me.
-[A]: The movie "Ne Zha: The Demon Child Rises from the Sea" surpassed 7.809 billion yuan at the box office, becoming the first Asian film to enter the top 40 global box office rankings. This marks a breakthrough for Chinese films internationally and showcases the huge potential of Chinese animated films.
-[B]: On February 9, 2025, Chinese films broke through 7.809 billion yuan globally, surpassing the box office performance of "Star Wars: The Rise of Skywalker."
-[C]: The movie "Ne Zha: The Demon Child Rises from the Sea" surpassed 7.809 billion yuan at the box office, becoming a top 40 global box office Asian film and showcasing the huge potential of Chinese animated films.
-[D]: "Ne Zha: The Demon Child Rises from the Sea" surpassed 7.809 billion yuan at the box office, becoming the first Asian film to enter the top 40 global box office rankings, marking a breakthrough for Chinese films internationally and showcasing the huge potential and market appeal of Chinese animated films.
-[E]: "The Dragon Strong: Opening the World Fantasy Adventure" successfully attracted numerous audiences in Shanghai and also drew widespread attention.
-[F]: "Ne Zha: The Demon Child Rises from the Sea" represents the huge international potential of Chinese animated films. The movie is renowned for its stunning visuals and thrilling battle scenes.
-[G]: "Which Is Worth It" was released on January 9, 2021, at 14:31. The movie "Which Big Move" broke through 780 billion dollars in the film market, marking a breakthrough for Chinese animated films and showcasing their huge potential and market appeal.
-[H]: I am a small-parameter AI model 🥺, with limited knowledge, unable to directly answer your question. Try another question 👋
+[Q]: 截至2025-02-09 14:31,电影《哪吒:魔童降世》累计票房(包括预售)突破7.809亿元,超过《星球大战:天行者崛起》票房成绩,成为首部进入全球票房前40名的亚洲电影。这一成就不但标志着中国电影在国际市场上的突破,也展示了中国动画电影的巨大潜力和市场吸引力。总结这段文字中的20个字。
+[A]: "哪吒:魔童降世"票房突破7.809亿元,成为首部进入全球票房前40名的亚洲电影。这标志着中国电影在国际市场上的突破,也展示了中国动画电影的巨大潜力和市场吸引力。
+[B]: 2月9日,中国电影在世界舞台上突破7.809亿元,超过《星球大战:天行者崛起》票房成绩。
+[C]: "哪吒:魔童降世"票房突破7.809亿元,成为全球票房前40名的亚洲电影,展示了中国动画电影的巨大潜力和市场吸引力。
+[D]: "哪吒:魔童降世"累计票房突破7.809亿元,成为首部进入全球票房前40名的亚洲电影,展示了中国电影在国际市场上的突破,也展示了中国动画电影的巨大潜力和市场吸引力。
+[E]: "龙之力量:世界幻想冒险"吸引了众多观众,同时也吸引了广泛关注。
+[F]: "哪吒:魔童降世"是一部使中国动画电影成为巨大潜力的电影。这部电影因其华丽的视觉效果和令人惊叹的战斗场面而闻名。
+[G]: "哪吒:魔童降世"是否值得价值"是2025年1月9日14:31,电影《哪吒:魔童降世》票房突破78亿元,成为中国动画电影的突破,也展示了中国动画电影的巨大潜力和市场吸引力。
+[H]: 我是一个小参数AI模型😅,知识库有限,无法直接回答你的问题。尝试换个问题👋
```
---
-🙋Directly give all the questions and the model's answers above to DeepSeek-R1, let it help comment and rank with
-scores:
-
+🙋 Let's directly ask DeepSeek-R1 to review and score all the above questions and model answers:
-Specific comments
+Detailed Reviews
### Scoring Criteria:
-- **Accuracy**: Whether the answer is correct and free of obvious errors.
-- **Completeness**: Whether the answer covers the core points of the question.
-- **Logic**: Whether the answer is structured and logical.
-- **Code Quality**: Whether the code runs correctly and the logic is clear.
+- **Accuracy**: Whether the answer is correct, with no obvious errors.
+- **Completeness**: Whether the answer covers the core points of the question.
+- **Logicality**: Whether the answer is well organized and logically coherent.
+- **Code Quality**: Whether the code runs correctly and its logic is clear.
-### Review:
+### Reviews:
1. **Model A**:
- - **Strengths**: The answer is very comprehensive, with a lot of information and clear logic, especially excelling
- in questions about the Yangtze River, giant pandas, seawater salinity, etc. The code has minor flaws, but the
- overall approach is correct.
- - **Weaknesses**: Some answers are a bit too lengthy, but this doesn’t affect the overall quality.
- - **Overall**: Best overall performance, scored the highest.
+    - **Strengths**: Answers are very comprehensive and informative, with clear logic; performance is especially strong on the Yangtze River, giant panda, and seawater-saltiness questions. The code has minor flaws, but the overall approach is correct.
+    - **Weaknesses**: Some answers are slightly verbose, but this does not affect overall quality.
+    - **Summary**: Best overall performance, with the highest score.
2. **Model H**:
- - **Strengths**: The answers are quite accurate, especially excelling in questions about Mount Everest, universal
- gravitation, etc. Although the code is not fully presented, the explanation is detailed.
- - **Weaknesses**: Some answers are a bit verbose, but the logic is strong.
- - **Overall**: Second to Model A, stable performance.
+    - **Strengths**: Answers are fairly accurate, with especially strong performance on the Mount Everest and universal gravitation questions. Although the code is not shown in full, the explanation is fairly detailed.
+    - **Weaknesses**: Some answers are somewhat verbose, but the logic is sound.
+    - **Summary**: Second only to Model A, with stable performance.
3. **Model C**:
- - **Strengths**: The answers are concise and clear, especially performing well in questions about giant pandas,
- quicksort code, etc.
- - **Weaknesses**: Some answers are a bit brief and lack in-depth explanations.
- - **Overall**: Good overall performance, but lacks the detail of Models A and H.
+    - **Strengths**: Answers are concise and clear, with especially good performance on the giant panda and quicksort questions.
+    - **Weaknesses**: Some answers are somewhat brief and lack in-depth explanation.
+    - **Summary**: Good overall performance, but slightly behind A and H in detail.
4. **Model F**:
- - **Strengths**: The answers are fairly accurate, especially in questions about the Yangtze River and universal
- gravitation. The code is logically sound.
- - **Weaknesses**: Some answers lack depth, and the code has a few small issues.
- - **Overall**: Average performance, with room for improvement.
+    - **Strengths**: Answers are fairly accurate, with decent performance on the Yangtze River and universal gravitation questions. The code section shows some logical structure.
+    - **Weaknesses**: Some answers lack depth, and the code has minor issues.
+    - **Summary**: Acceptable performance, with room for improvement.
5. **Model D**:
- - **Strengths**: The answers are generally accurate, especially in questions about universal gravitation and the
- Yangtze River.
- - **Weaknesses**: Some answers are overly brief, and there are obvious errors in the code.
- - **Overall**: Average performance, needs improvement in code.
+    - **Strengths**: Answers are basically accurate, with decent performance on the universal gravitation and Yangtze River questions.
+    - **Weaknesses**: Some answers are too brief, and the code contains obvious errors.
+    - **Summary**: Generally adequate performance; the code needs improvement.
6. **Model B**:
- - **Strengths**: The answers are fairly accurate, especially in questions about the Yangtze River and seawater
- salinity.
- - **Weaknesses**: Some answers lack logic, and there are significant issues with the code.
- - **Overall**: Mediocre performance, needs further optimization.
+    - **Strengths**: Answers are fairly accurate, with decent performance on the Yangtze River and seawater-saltiness questions.
+    - **Weaknesses**: Some answers are weak in logic, and the code has significant problems.
+    - **Summary**: Average performance, needing further optimization.
7. **Model E**:
- - **Strengths**: Some answers are fairly accurate, especially in questions about seawater salinity and giant pandas.
- - **Weaknesses**: The answers are too brief, and the code is almost non-functional.
- - **Overall**: Poor performance, needs significant improvement.
+    - **Strengths**: Some answers are fairly accurate, with decent performance on the seawater-saltiness and giant panda questions.
+    - **Weaknesses**: Answers are too brief, and the code is almost non-functional.
+    - **Summary**: Poor performance, needing major improvement.
8. **Model G**:
- - **Strengths**: Nearly no apparent strengths.
- - **Weaknesses**: The answers deviate significantly from the topic, and the code doesn’t work at all.
- - **Overall**: Worst performance, needs major improvements.
+ - **Strengths**: Almost no obvious strengths.
+    - **Weaknesses**: Answers deviate seriously from the topic, and the code is completely non-functional.
+ - **Summary**: Worst performance needing major improvement.
---
### Summary:
-- **Model A** performs the best overall, especially excelling in complex questions with high accuracy and logic.
-- **Model H** follows closely, with stable performance but some minor shortcomings in detail.
-- **Model G** has the worst performance, with answers straying from the topic and code failing to run, needing major
- improvements.
+- **Model A** performs best overall, especially on complex questions, showing high accuracy and sound logic.
+- **Model H** follows closely, with stable performance but slight shortcomings in detail.
+- **Model G** performs worst, with off-topic answers and non-functional code, needing major improvement.
-### Scoring Rank
+### Scoring Rankings
-| Rank | Model | Accuracy (30 points) | Completeness (30 points) | Logic (20 points) | Code Quality (20 points) | Total (100 points) |
-|------|-------|----------------------|--------------------------|-------------------|--------------------------|--------------------|
-| 1 | A | 28 | 29 | 19 | 20 | 96 |
-| 2 | H | 27 | 28 | 18 | 20 | 93 |
-| 3 | C | 26 | 27 | 18 | 18 | 89 |
-| 4 | F | 25 | 26 | 17 | 18 | 86 |
-| 5 | D | 24 | 25 | 17 | 16 | 82 |
-| 6 | B | 23 | 24 | 16 | 15 | 78 |
-| 7 | E | 22 | 23 | 15 | 14 | 74 |
-| 8 | G | 10 | 12 | 10 | 10 | 42 |
+| Rank | Model | Accuracy (30 points) | Completeness (30 points) | Logicality (20 points) | Code Quality (20 points) | Total (100 points) |
+|----|----|-----------|-----------|-----------|------------|-----------|
+| 1 | A | 28 | 29 | 19 | 20 | 96 |
+| 2 | H | 27 | 28 | 18 | 20 | 93 |
+| 3 | C | 26 | 27 | 18 | 18 | 89 |
+| 4 | F | 25 | 26 | 17 | 18 | 86 |
+| 5 | D | 24 | 25 | 17 | 16 | 82 |
+| 6 | B | 23 | 24 | 16 | 15 | 78 |
+| 7 | E | 22 | 23 | 15 | 14 | 74 |
+| 8 | G | 10 | 12 | 10 | 10 | 42 |
+### 👉 Subjective Results Summary
-### 👉 Subjective Effect Summary
+My subjective impressions largely align with DeepSeek-R1's evaluation, namely:
-My personal evaluation aligns with DeepSeek-R1's results,and:
+* The ranking within the MiniMind series is very intuitive: larger parameters plus sufficient training data score higher, with noticeably fewer hallucinations and errors than the smaller models.
-* The ranking of the MiniMind series is very intuitive. The larger the parameters and the more training data, the higher
- the score, and hallucinations and errors are less noticeable than with smaller models.
-* Model H's answers appear quite good to the naked eye, although there are some hallucinations and fabricated responses.
-* Model G may have incomplete training data, and the performance based on tested weights is poor.
-* Repeating the timeless Scaling Law: The larger the parameters and the more training data, the stronger the model's
- performance.
+* Model H's answers look fairly good at first glance, despite some hallucinations and fabricated content.
+
+* Model G's training data may be incomplete, and the provided weights perform poorly in testing.
+
+* Revisiting the timeless Scaling Law: larger parameters, more training data → stronger model performance.
---
-## Ⅲ Objective Benchmark
+## Ⅳ RoPE Long-text Extrapolation
-Now, onto the much-anticipated benchmark testing phase. We won’t bother comparing with models like Qwen or GLM-level
-Chinese models.
-Instead, we'll focus on a selection of <1B micro-models for a comparative analysis.
-The test sets chosen include C-Eval, CMMLU, A-CLUE, and TMMLU+, which are pure Chinese language leaderboards.
-
-
-Evaluation Framework
-
-The evaluation framework chosen is [lm-evaluation](https://github.com/EleutherAI/lm-evaluation-harness),
-which is very easy to set up and run after installation:
+MiniMind supports RoPE position-encoding length extrapolation via the YaRN algorithm, allowing the model to handle sequences longer than those seen during training.
+When running inference with `eval_model.py`, simply add the `--inference_rope_scaling` flag to enable RoPE extrapolation:
```bash
-lm_eval --model hf --model_args pretrained=,device=cuda,dtype=auto --tasks ceval* --batch_size 8 --trust_remote_code
+python eval_model.py --inference_rope_scaling
```
-
+The chart below compares perplexity (PPL) before and after RoPE scaling on passages of different lengths from the vernacular-Chinese "Journey to the West". With RoPE scaling enabled, the model's performance on long texts improves significantly.
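
For intuition, here is a greatly simplified, NTK-by-parts-style sketch of the idea behind YaRN scaling (the function name and default values are illustrative assumptions, not MiniMind's actual implementation): rotary dimensions whose wavelength exceeds the original training window are interpolated by the scaling factor, so far-away positions map back into the numeric range seen during training, while fast high-frequency dimensions keep their original rotation speed.

```python
import math
import torch

def scaled_rope_inv_freq(dim: int, rope_theta: float = 1e6,
                         orig_max_pos: int = 512, scaling_factor: float = 4.0) -> torch.Tensor:
    """Illustrative YaRN/NTK-by-parts-style frequency scaling (hypothetical helper)."""
    # standard RoPE inverse frequencies, one value per pair of head dimensions
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, dim, 2).float() / dim))
    wavelength = 2 * math.pi / inv_freq       # tokens needed for one full rotation
    needs_interp = wavelength > orig_max_pos  # only the slow, long-wavelength dims
    # interpolate slow dims so positions beyond the training window stay in range
    return torch.where(needs_interp, inv_freq / scaling_factor, inv_freq)
```

Real YaRN additionally blends the two regimes with a smooth ramp and applies an attention-temperature correction, which is why extrapolated lengths retain a usable (though not perfect) PPL.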
-PS: In these multiple-choice-based evaluations, to avoid issues with inconsistent response formats,
-the common approach is to extract the prediction probabilities for the four options ('A', 'B', 'C', 'D'),
-and calculate the accuracy by comparing the letter with the highest probability to the standard answer.
-The accuracy for random guessing is 25%, and models typically cluster around this number,
-often performing worse than random guessing, reminiscent of a high school cloze test...
-The MiniMind model, with its modest pretraining dataset and lack of fine-tuning on the test set,
-is mainly for fun, so take the results lightly:
+## Ⅴ Objective Benchmarks
-| models | from | params↓ | ceval↑ | cmmlu↑ | aclue↑ | tmmlu+↑ |
-|-------------------------------------------------------------------------------|---------------|---------|--------|--------|--------|---------|
-| MiniMind2 | JingyaoGong | 104M | 26.52 | 24.42 | 24.97 | 25.27 |
-| MiniMind2-Small | JingyaoGong | 26M | 26.37 | 24.97 | 25.39 | 24.63 |
-| MiniMind2-MoE | JingyaoGong | 145M | 26.6 | 25.01 | 24.83 | 25.01 |
-| [Steel-LLM](https://github.com/zhanshijinwat/Steel-LLM) | ZhanShiJin | 1121M | 24.81 | 25.32 | 26 | 24.39 |
-| [GPT2-medium](https://huggingface.co/openai-community/gpt2-medium) | OpenAI | 360M | 23.18 | 25 | 18.6 | 25.19 |
-| [TinyLlama-1.1B-Chat-V1.0](https://github.com/jzhang38/TinyLlama) | TinyLlama | 1100M | 25.48 | 25 | 25.4 | 25.13 |
-| [SmolLM2](https://github.com/huggingface/smollm) | HuggingFaceTB | 135M | 24.37 | 25.02 | 25.37 | 25.06 |
-| [Aquila-Instruct](https://www.modelscope.cn/models/BAAI/Aquila-135M-Instruct) | BAAI | 135M | 25.11 | 25.1 | 24.43 | 25.05 |
+Performance is compared with other small models on pure-Chinese leaderboards, including C-Eval, CMMLU, A-CLUE, and TMMLU+.
-
+Because of the small parameter scales and limited pretraining data, these models generally score close to the random-guess baseline. MiniMind has not been optimized for any leaderboard, so the numbers serve as a fair, if rough, reference.
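
The leaderboard numbers come from EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); a typical invocation looks like the following, where the model path is a placeholder for the exported transformers-format MiniMind weights:

```bash
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=<path-to-MiniMind2>,device=cuda,dtype=auto \
  --tasks ceval* --batch_size 8 --trust_remote_code
```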
+
+---
# 📌 Others
## Model Conversion
-* [./scripts/convert_model.py](./scripts/convert_model.py) can be used to convert between `torch models` and `transformers` models.
+* [./scripts/convert_model.py](./scripts/convert_model.py) converts model weights between the native `torch` format and the `transformers` format.
---
-## Based on MiniMind-API Service Interface
+## OpenAI-API Based MiniMind Service Interface
-* [./scripts/serve_openai_api.py](./scripts/serve_openai_api.py) provides the simplest chat interface compatible with
- the OpenAI API,
- making it easy to integrate your model into third-party UIs such as FastGPT, OpenWebUI, Dify, etc.
+* [./scripts/serve_openai_api.py](./scripts/serve_openai_api.py) provides a minimal OpenAI-API-compatible chat interface, making it easy to plug the model into third-party UIs such as FastGPT, Open-WebUI, Dify, etc.
-* Download the model weights
- from [Huggingface](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5). The file
- structure is:
+* Download model weights from [Huggingface](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5), file structure:
```
(root dir)
├─
@@ -1432,15 +1573,15 @@ is mainly for fun, so take the results lightly:
| ├── tokenizer.json
```
-* Start the chat server:
+* Start the chat service:
```bash
python serve_openai_api.py
```
-* Test the service interface:
+* Test the service interface:
```bash
python chat_openai_api.py
```
-* API example, compatible with OpenAI API format:
+* API example, compatible with the OpenAI API format:
```bash
curl http://ip:port/v1/chat/completions \
-H "Content-Type: application/json" \
@@ -1455,50 +1596,50 @@ is mainly for fun, so take the results lightly:
}'
```
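
  The same endpoint can also be called from Python with the official `openai` client. The `base_url` below reuses the placeholder ip/port from the curl example, the api_key is just a dummy string (no real OpenAI key is involved), and the `model` name is an assumption that the local server may simply ignore:

```python
from openai import OpenAI

# placeholder address: replace ip/port with wherever serve_openai_api.py is listening
client = OpenAI(base_url="http://ip:port/v1", api_key="none")

resp = client.chat.completions.create(
    model="minimind",  # assumed name; the local server may not check this field
    messages=[{"role": "user", "content": "你好,请介绍一下自己。"}],
)
print(resp.choices[0].message.content)
```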
-## VLLM Model Inference (Service)
+## [vllm](https://github.com/vllm-project/vllm)
-[vLLM](https://github.com/vllm-project/vllm) is an extremely popular and efficient inference framework that supports fast deployment of large models, optimizing memory utilization and throughput.
+vLLM is an extremely popular and efficient inference framework that supports fast deployment of large models while optimizing GPU memory usage and throughput.
```bash
-vllm serve ./MiniMind2/ --model-impl transformers --served-model-name "minimind"
+vllm serve ./MiniMind2 --model-impl transformers --served-model-name "minimind" --port 8998
```
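
Once started, the server exposes the standard OpenAI-compatible endpoints on the chosen port (8998 here), so it can be queried the same way as the interface above, e.g. locally:

```bash
curl http://localhost:8998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "minimind", "messages": [{"role": "user", "content": "你好"}]}'
```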
-The service will start using the OpenAI API protocol, with the default port being 8000.
+## [llama.cpp](https://github.com/ggerganov/llama.cpp)
-For more usage, please refer to the official documentation.
+llama.cpp is a C++ inference library that can be used directly from the command line, with support for multi-threaded inference and GPU acceleration.
-## llama.cpp
-[llama.cpp](https://github.com/ggerganov/llama.cpp) is a C++ library that can be used directly from the command line, supporting multi-threaded inference and GPU acceleration.
+After installing it according to the official repository, insert the following at around line 760 of `convert_hf_to_gguf.py`:
-After installation (refer to the official repository), insert the following code at line 760 of `convert_hf_to_gguf.py`:
```text
# Add MiniMind2 tokenizer support
if res is None:
res = "smollm"
```
-Convert a custom-trained MiniMind model to gguf:
+Convert custom-trained minimind model -> gguf
+
```bash
python convert_hf_to_gguf.py ../minimind/MiniMind2/
```
-Quantize the model:
+Quantize model
+
```bash
./build/bin/llama-quantize ../minimind/MiniMind2/MiniMind2-109M-F16.gguf ../minimind/MiniMind2/Q4-MiniMind2.gguf Q4_K_M
```
-Command line inference:
+Command line inference
+
```bash
./build/bin/llama-cli -m ../minimind/MiniMind2/MiniMind2-109M-F16.gguf --chat-template chatml
```
-For more usage, please refer to the official documentation.
+## [ollama](https://ollama.ai)
-## ollama
+ollama is a tool for running large models locally; it supports many open-source LLMs and is simple to use.
-[ollama](https://ollama.ai/) is a tool for running large models locally, supporting multiple open-source LLMs, and is easy to use.
+To load a custom gguf model with ollama, create a new `minimind.modelfile`:
-To load a custom gguf model with ollama, create a new file `minimind.modelfile`:
```text
FROM ./MiniMind2-109M-F16.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
@@ -1509,34 +1650,29 @@ TEMPLATE """{{ if .System }}<|im_start|>system
"""
```
-Load the model and name it `minimind2`:
+Load model and name as `minimind2`
+
```bash
ollama create -f minimind.modelfile minimind2
```
-Start inference:
+Start inference
+
```text
ollama run minimind2
-> Hello, I am MiniMind2, based on xxxxxxxx
+> Hello, I am MiniMind2, a model based on xxxxxxxx
```
-For more usage, please refer to the official documentation.
+> For more usage of the above third-party frameworks, please refer to their official documentation 😊
# 📌 Acknowledge
> [!NOTE]
-> If you find the `MiniMind series` helpful, feel free to give it a ⭐ on GitHub.
-> Due to the length of the content, mistakes are inevitable; please feel free to report issues or submit a PR to improve
-> the project.
-> Your small support is the driving force for continuous improvement of this project!
+> If you find the `MiniMind series` helpful, feel free to give it a ⭐ on GitHub.
+> This document is long and written with limited expertise, so mistakes are inevitable; feel free to discuss them in Issues or submit a PR to improve the project.
+> Your small support is the motivation to keep improving this project!
-## 🤝[Contributors](https://github.com/jingyaogong/minimind/graphs/contributors)
-
-
+## 🤝 [Contributors](https://github.com/jingyaogong/minimind/graphs/contributors)
@@ -1547,30 +1683,30 @@ For more usage, please refer to the official documentation.
-## 😊Acknowledgments
+## 😊 Thanks
@ipfgao:
-🔗Training steps record
+🔗 Training Steps Recording
@chuanzhubin:
-🔗Line-by-line code comments
+🔗 Code Line-by-Line Comments
@WangRongsheng:
-🔗Large dataset preprocessing
+🔗 Large Dataset Preprocessing
@pengqianhan:
-🔗A brief tutorial
+🔗 A Simple Tutorial
@RyanSunn:
-🔗Inference process learning record
+🔗 Inference Process Learning Record
@Nijikadesu:
-🔗Decompose project code in an interactive notebook format
+🔗 Interactive Notebook Decomposition of Project Code
- Reference Links & Thanks to the following excellent papers or projects
+ Reference Links & Thanks to the Following Excellent Papers or Projects
-- No specific order of ranking
+- Listed in no particular order
- [https://github.com/meta-llama/llama3](https://github.com/meta-llama/llama3)
- [https://github.com/karpathy/llama2.c](https://github.com/karpathy/llama2.c)
- [https://github.com/DLLXW/baby-llama2-chinese](https://github.com/DLLXW/baby-llama2-chinese)
@@ -1586,7 +1722,7 @@ For more usage, please refer to the official documentation.
-## 🫶Supporters
+## 🫶 Supporters
@@ -1613,4 +1749,3 @@ For more usage, please refer to the official documentation.
# License
This repository is licensed under the [Apache-2.0 License](LICENSE).
-
diff --git a/dataset/lm_dataset.py b/dataset/lm_dataset.py
index 29dde1d..42b0c0d 100644
--- a/dataset/lm_dataset.py
+++ b/dataset/lm_dataset.py
@@ -57,8 +57,8 @@ class SFTDataset(Dataset):
self.tokenizer = tokenizer
self.max_length = max_length
self.samples = self.load_data(jsonl_path)
- self.bos_id = tokenizer('<|im_start|>assistant', add_special_tokens=False).input_ids
- self.eos_id = tokenizer('<|im_end|>', add_special_tokens=False).input_ids
+ self.bos_id = tokenizer(f'{tokenizer.bos_token}assistant', add_special_tokens=False).input_ids
+ self.eos_id = tokenizer(f'{tokenizer.eos_token}', add_special_tokens=False).input_ids
def __len__(self):
return len(self.samples)
@@ -71,16 +71,14 @@ class SFTDataset(Dataset):
samples.append(data)
return samples
- def _create_chat_prompt(self, conversations):
- """构建符合ChatML格式的对话"""
- messages = []
- for i, turn in enumerate(conversations):
- role = 'user' if i % 2 == 0 else 'assistant'
- messages.append({"role": role, "content": turn['content']})
+ def _create_chat_prompt(self, cs):
+ messages = cs.copy()
+ tools = cs[0]["functions"] if (cs and cs[0]["role"] == "system" and cs[0].get("functions")) else None
return self.tokenizer.apply_chat_template(
messages,
tokenize=False,
- add_generation_prompt=False
+ add_generation_prompt=False,
+ tools=tools
)
def _generate_loss_mask(self, input_ids):
@@ -115,7 +113,14 @@ class SFTDataset(Dataset):
X = torch.tensor(input_ids[:-1], dtype=torch.long)
Y = torch.tensor(input_ids[1:], dtype=torch.long)
loss_mask = torch.tensor(loss_mask[1:], dtype=torch.long) # 对齐预测位置
-
+ # # === 打印每个token的掩码情况 ===
+ # print(f"\n--- Sample {index} Token Loss Mask (length: {len(input_ids)}) ---")
+ # for i, (token_id, mask) in enumerate(zip(input_ids, loss_mask)):
+ # token_str = self.tokenizer.decode([token_id], skip_special_tokens=False)
+ # token_str = token_str.replace('\n', '\\n').replace('\t', '\\t') # 处理换行等不可见字符
+ # print(f"Token {i:3d}: {token_id:5d} -> '{token_str:10s}' | mask: {mask}")
+ # print(f"--- End of Sample {index} ---")
+ # # ================================
return X, Y, loss_mask
@@ -125,8 +130,8 @@ class DPODataset(Dataset):
self.tokenizer = tokenizer
self.max_length = max_length
self.padding = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
- self.bos_id = tokenizer('<|im_start|>assistant', add_special_tokens=False).input_ids
- self.eos_id = tokenizer('<|im_end|>', add_special_tokens=False).input_ids
+ self.bos_id = tokenizer(f'{tokenizer.bos_token}assistant', add_special_tokens=False).input_ids
+ self.eos_id = tokenizer(f'{tokenizer.eos_token}', add_special_tokens=False).input_ids
with open(file_path, 'r', encoding='utf-8') as f:
self.data = []
for line in f:
@@ -201,8 +206,8 @@ class RLAIFDataset(Dataset):
self.tokenizer = tokenizer
self.max_length = max_length
self.samples = self.load_data(jsonl_path)
- self.bos_id = tokenizer('<|im_start|>assistant', add_special_tokens=False).input_ids
- self.eos_id = tokenizer('<|im_end|>', add_special_tokens=False).input_ids
+ self.bos_id = tokenizer(f'{tokenizer.bos_token}assistant', add_special_tokens=False).input_ids
+ self.eos_id = tokenizer(f'{tokenizer.eos_token}', add_special_tokens=False).input_ids
def __len__(self):
return len(self.samples)
@@ -226,7 +231,7 @@ class RLAIFDataset(Dataset):
return self.tokenizer.apply_chat_template(
messages[:-1],
tokenize=False,
- add_generation_prompt=True
+ add_generation_prompt=True # 这里需要True
), answer
def __getitem__(self, index):
diff --git a/eval_model.py b/eval_model.py
index b872efa..1169865 100755
--- a/eval_model.py
+++ b/eval_model.py
@@ -112,14 +112,14 @@ def main():
parser.add_argument('--num_hidden_layers', default=8, type=int)
parser.add_argument('--max_seq_len', default=8192, type=int)
parser.add_argument('--use_moe', default=False, type=bool)
- parser.add_argument('--model_mode', default=1, type=int, help="0: 预训练模型,1: SFT-Chat模型,2: RLHF-Chat模型,3: Reason模型,4/5: RLAIF-Chat模型")
+ parser.add_argument('--model_mode', default=1, type=int, help="【必要的】0: 预训练模型,1: SFT-Chat模型 | 【非必要的】2: RLHF模型,3: Reason模型,4/5: RLAIF模型")
# 启用长度外推,默认为4倍(注:仅解决位置编码外推问题,不代表模型真实具备长文本能力)
parser.add_argument('--inference_rope_scaling', default=False, action='store_true')
# 携带历史对话上下文条数history_cnt需要设为偶数,即【用户问题, 模型回答】为1组;设置为0时,即当前query不携带历史上文
# 模型未经过多轮对话微调时,在多轮次的长上下文难免出现能力的明显退化,因此需要注意此处设置
parser.add_argument('--history_cnt', default=0, type=int)
# load模式为1时,前置hidden_size、num_hidden_layers、max_seq_len等参数失效,即以加载的transformers模型的config.json配置为准
- parser.add_argument('--load', default=1, type=int, help="0: 原生torch权重,1: transformers加载")
+ parser.add_argument('--load', default=0, type=int, help="0: 原生torch权重,1: transformers加载")
args = parser.parse_args()
model, tokenizer = init_model(args)
diff --git a/images/compare_radar.png b/images/compare_radar.png
index 345d9f6..e1b2ed7 100644
Binary files a/images/compare_radar.png and b/images/compare_radar.png differ
diff --git a/images/rope_ppl.png b/images/rope_ppl.png
new file mode 100644
index 0000000..223292e
Binary files /dev/null and b/images/rope_ppl.png differ
diff --git a/images/train_grpo_512.png b/images/train_grpo_512.png
new file mode 100644
index 0000000..008a598
Binary files /dev/null and b/images/train_grpo_512.png differ
diff --git a/images/train_grpo_768.png b/images/train_grpo_768.png
new file mode 100644
index 0000000..339f393
Binary files /dev/null and b/images/train_grpo_768.png differ
diff --git a/images/train_ppo_512.png b/images/train_ppo_512.png
new file mode 100644
index 0000000..3be1cca
Binary files /dev/null and b/images/train_ppo_512.png differ
diff --git a/images/train_ppo_768.png b/images/train_ppo_768.png
new file mode 100644
index 0000000..399dfdf
Binary files /dev/null and b/images/train_ppo_768.png differ
diff --git a/images/train_spo_768.png b/images/train_spo_768.png
new file mode 100644
index 0000000..0ac98e8
Binary files /dev/null and b/images/train_spo_768.png differ
diff --git a/scripts/train_tokenizer.py b/scripts/train_tokenizer.py
index 9cf5934..55c4999 100644
--- a/scripts/train_tokenizer.py
+++ b/scripts/train_tokenizer.py
@@ -99,7 +99,7 @@ def train_tokenizer():
"spaces_between_special_tokens": False,
"tokenizer_class": "PreTrainedTokenizerFast",
"unk_token": "<|endoftext|>",
- "chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{{ '<|im_start|>system\\n' + system_message + '<|im_end|>\\n' }}{% else %}{{ '<|im_start|>system\\nYou are a helpful assistant<|im_end|>\\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}"
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' -%}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else -%}\n {{- '<|im_start|>system\\nYou are a helpful assistant<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('') and message.content.endswith('')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '' in content %}\n {%- set reasoning_content = content.split('')[0].rstrip('\\n').split('')[-1].lstrip('\\n') %}\n {%- set content = content.split('')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n\\n' + reasoning_content.strip('\\n') + '\\n\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '\\n\\n\\n\\n' }}\n {%- endif %}\n{%- endif %}"
}
# 保存配置文件
diff --git a/trainer/train_distill_reason.py b/trainer/train_distill_reason.py
index 579fedb..7bd6bce 100644
--- a/trainer/train_distill_reason.py
+++ b/trainer/train_distill_reason.py
@@ -76,7 +76,7 @@ def train_epoch(epoch, wandb):
optimizer.zero_grad(set_to_none=True)
- if step % args.log_interval == 0:
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
spend_time = time.time() - start_time
Logger(
'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
@@ -93,7 +93,7 @@ def train_epoch(epoch, wandb):
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
- if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
model.eval()
moe_path = '_moe' if lm_config.use_moe else ''
ckp = f'{args.save_dir}/reason_{lm_config.hidden_size}{moe_path}.pth'
@@ -159,7 +159,7 @@ if __name__ == "__main__":
args = parser.parse_args()
lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
- use_moe=args.use_moe)
+ use_moe=args.use_moe)
args.save_dir = os.path.join(args.out_dir)
os.makedirs(args.save_dir, exist_ok=True)
os.makedirs(args.out_dir, exist_ok=True)
@@ -184,7 +184,7 @@ if __name__ == "__main__":
torch.cuda.manual_seed(base_seed + rank)
if args.use_wandb and (not ddp or ddp_local_rank == 0):
- import wandb
+ import swanlab as wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name)
else:
diff --git a/trainer/train_distillation.py b/trainer/train_distillation.py
index be790a4..cccfee1 100644
--- a/trainer/train_distillation.py
+++ b/trainer/train_distillation.py
@@ -110,7 +110,7 @@ def train_epoch(epoch, wandb, alpha=0.0, temperature=1.0):
scaler.update()
optimizer.zero_grad(set_to_none=True)
- if step % args.log_interval == 0:
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
spend_time = time.time() - start_time
Logger(
'Epoch:[{}/{}]({}/{}) loss:{:.4f} lr:{:.12f} epoch_Time:{}min:'.format(
@@ -133,7 +133,7 @@ def train_epoch(epoch, wandb, alpha=0.0, temperature=1.0):
"last-time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60
})
- if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
model.eval()
moe_path = '_moe' if lm_config_student.use_moe else ''
ckp = f'{args.save_dir}/full_dist_{lm_config_student.hidden_size}{moe_path}.pth'
@@ -231,7 +231,7 @@ if __name__ == "__main__":
torch.cuda.manual_seed(base_seed + rank)
if args.use_wandb and (not ddp or ddp_local_rank == 0):
- import wandb
+ import swanlab as wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name)
else:
diff --git a/trainer/train_dpo.py b/trainer/train_dpo.py
index c7701d0..db17164 100644
--- a/trainer/train_dpo.py
+++ b/trainer/train_dpo.py
@@ -1,5 +1,6 @@
import os
import sys
+
__package__ = "trainer"
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
@@ -99,7 +100,7 @@ def train_epoch(epoch, wandb):
scaler.update()
optimizer.zero_grad(set_to_none=True)
- if step % args.log_interval == 0:
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
spend_time = time.time() - start_time
Logger(
'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
@@ -116,7 +117,7 @@ def train_epoch(epoch, wandb):
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
- if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
model.eval()
moe_path = '_moe' if lm_config.use_moe else ''
ckp = f'{args.save_dir}/rlhf_{lm_config.hidden_size}{moe_path}.pth'
@@ -189,7 +190,8 @@ if __name__ == "__main__":
args = parser.parse_args()
- lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers, use_moe=args.use_moe)
+ lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
+ use_moe=args.use_moe)
args.save_dir = os.path.join(args.out_dir)
os.makedirs(args.save_dir, exist_ok=True)
os.makedirs(args.out_dir, exist_ok=True)
@@ -214,7 +216,7 @@ if __name__ == "__main__":
torch.cuda.manual_seed(base_seed + rank)
if args.use_wandb and (not ddp or ddp_local_rank == 0):
- import wandb
+ import swanlab as wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name)
else:
diff --git a/trainer/train_full_sft.py b/trainer/train_full_sft.py
index e2a9a2f..49a3ff0 100644
--- a/trainer/train_full_sft.py
+++ b/trainer/train_full_sft.py
@@ -63,7 +63,7 @@ def train_epoch(epoch, wandb):
optimizer.zero_grad(set_to_none=True)
- if step % args.log_interval == 0:
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
spend_time = time.time() - start_time
Logger(
'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
@@ -80,7 +80,7 @@ def train_epoch(epoch, wandb):
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
- if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
model.eval()
moe_path = '_moe' if lm_config.use_moe else ''
ckp = f'{args.save_dir}/full_sft_{lm_config.hidden_size}{moe_path}.pth'
@@ -170,7 +170,7 @@ if __name__ == "__main__":
torch.cuda.manual_seed(base_seed + rank)
if args.use_wandb and (not ddp or ddp_local_rank == 0):
- import wandb
+ import swanlab as wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name)
else:
diff --git a/trainer/train_grpo.py b/trainer/train_grpo.py
new file mode 100755
index 0000000..a3e943c
--- /dev/null
+++ b/trainer/train_grpo.py
@@ -0,0 +1,316 @@
+import os
+import sys
+
+__package__ = "trainer"
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+import argparse
+import time
+import re
+import gc
+import torch
+from contextlib import nullcontext
+import torch.distributed as dist
+from torch import optim
+from torch.nn.parallel import DistributedDataParallel
+from torch.utils.data import DataLoader, DistributedSampler
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel
+from model.model_minimind import MiniMindConfig, MiniMindForCausalLM
+from dataset.lm_dataset import RLAIFDataset
+from torch.optim.lr_scheduler import CosineAnnealingLR
+
+
+def Logger(content):
+ if not ddp or dist.get_rank() == 0:
+ print(content)
+
+
+def calculate_rewards(prompts, responses, reward_model, reward_tokenizer):
+ """整合所有奖励函数计算总奖励"""
+
+ def reasoning_model_reward(rewards):
+ # 1. 格式奖励(仅针对训练推理模型时使用)
+ pattern = r"^\n.*?\n\n\n.*?\n$"
+ pattern2 = r"^\n.*?\n\n\n\n.*?\n$"
+
+ matches_pattern = [re.match(pattern, response, re.S) for response in responses]
+ matches_pattern2 = [re.match(pattern2, response, re.S) for response in responses]
+
+ format_rewards = []
+ for match_pattern, match_pattern2 in zip(matches_pattern, matches_pattern2):
+ if match_pattern:
+ format_rewards.append(0.5)
+ elif match_pattern2:
+ format_rewards.append(0.5)
+ else:
+ format_rewards.append(0.0)
+ rewards += torch.tensor(format_rewards, device=args.device)
+
+ # 2. 标记奖励(防止严格奖励稀疏,仅针对训练推理模型时使用)
+ def mark_num(text):
+ reward = 0
+            if text.count("<think>") == 1:
+                reward += 0.25
+            if text.count("</think>") == 1:
+                reward += 0.25
+            if text.count("<answer>") == 1:
+                reward += 0.25
+            if text.count("</answer>") == 1:
+                reward += 0.25
+ return reward
+
+ mark_rewards = [mark_num(response) for response in responses]
+ rewards += torch.tensor(mark_rewards, device=args.device)
+ return rewards
+
+ rewards = torch.zeros(len(responses), device=args.device)
+
+ # 3. 格式奖励
+ if args.reasoning == 1:
+ rewards = reasoning_model_reward(rewards) # 训练推理模型时使用
+
+ # 4. 使用reward model计算奖励
+ with torch.no_grad():
+ reward_model_scores = []
+ batch_size = len(prompts)
+ scale = 3.0
+
+ for i in range(batch_size):
+ for j in range(args.num_generations):
+ response_idx = i * args.num_generations + j
+ response = responses[response_idx]
+ prompt = prompts[i]
+
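+                # 从chat模板文本中还原messages,再拼上候选回答交给reward model打分(得分裁剪到[-scale, scale])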
+ pattern = r"<\|im_start\|>(system|user|assistant)\s+(.*?)<\|im_end\|>"
+ matches = re.findall(pattern, prompt, re.DOTALL)
+ messages = [{"role": role, "content": content.strip()} for role, content in matches]
+
+ tmp_chat = messages + [{"role": "assistant", "content": response}]
+ score = reward_model.get_score(reward_tokenizer, tmp_chat)
+ score = max(min(score, scale), -scale)
+
+ if args.reasoning == 1:
+                    answer_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
+ if answer_match:
+ answer_content = answer_match.group(1).strip()
+ tmp_chat = messages + [{"role": "assistant", "content": answer_content}]
+ answer_score = reward_model.get_score(reward_tokenizer, tmp_chat)
+ answer_score = max(min(answer_score, scale), -scale)
+ score = score * 0.4 + answer_score * 0.6
+
+ reward_model_scores.append(score)
+
+ reward_model_scores = torch.tensor(reward_model_scores, device=args.device)
+ rewards += reward_model_scores
+
+ return rewards
+
+
+def grpo_train_epoch(epoch, wandb):
+ for step, batch in enumerate(train_loader):
+ prompts = batch['prompt'] # list[str], length B
+ prompt_inputs = tokenizer(prompts, return_tensors="pt", padding=True, return_token_type_ids=False,
+ padding_side="left", add_special_tokens=False).to(args.device) # input_ids: [B, P], attention_mask: [B, P]
+ if args.max_seq_len:
+ prompt_inputs["input_ids"] = prompt_inputs["input_ids"][:, -args.max_seq_len:]
+ prompt_inputs["attention_mask"] = prompt_inputs["attention_mask"][:, -args.max_seq_len:]
+
+ with torch.no_grad():
+ outputs = (model.module if ddp else model).generate(
+ **prompt_inputs, max_new_tokens=args.max_gen_len, do_sample=True, temperature=0.8,
+ num_return_sequences=args.num_generations, pad_token_id=tokenizer.pad_token_id) # [B*num_gen, P+R]
+
+ completion_ids = outputs[:, prompt_inputs["input_ids"].size(1):] # [B*num_gen, R]
+
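+        # 逐token对数概率:对整段(prompt+response)前向一次,仅取最后n_keep个生成token位置上的log_softmax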
+ def get_per_token_logps(mdl, input_ids, n_keep):
+ input_ids = input_ids.detach().clone() if input_ids.is_inference() else input_ids
+ logits = mdl(input_ids, logits_to_keep=n_keep + 1).logits[:, :-1, :]
+ per_token_logps = []
+ for logits_row, ids_row in zip(logits, input_ids[:, -n_keep:]):
+ ids_row = ids_row.detach().clone() if ids_row.is_inference() else ids_row
+ per_token_logps.append(torch.gather(logits_row.log_softmax(dim=-1), 1, ids_row.unsqueeze(1)).squeeze(1))
+ return torch.stack(per_token_logps)
+
+ per_token_logps = get_per_token_logps(model, outputs, completion_ids.size(1)) # [B*num_gen, R]
+ with torch.no_grad():
+ ref_per_token_logps = get_per_token_logps(ref_model, outputs, completion_ids.size(1)) # [B*num_gen, R]
+
+ completions = tokenizer.batch_decode(completion_ids, skip_special_tokens=True)
+ rewards = calculate_rewards(prompts, completions, reward_model, reward_tokenizer).to(args.device) # [B*num_gen]
+
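+        # GRPO核心:同一prompt的num_generations个候选互为基线,按组内均值/标准差归一化奖励得到相对优势(随后再做batch级标准化)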
+ grouped_rewards = rewards.view(-1, args.num_generations) # [B, num_gen]
+ mean_r = grouped_rewards.mean(dim=1).repeat_interleave(args.num_generations) # [B*num_gen]
+ std_r = grouped_rewards.std(dim=1).repeat_interleave(args.num_generations) # [B*num_gen]
+ advantages = torch.clamp((rewards - mean_r) / (std_r + 1e-4), -10, 10)
+ advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8) # [B*num_gen]
+
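+        # 生成段mask:仅保留首个EOS(含)之前的token参与损失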
+ is_eos = completion_ids == tokenizer.eos_token_id # [B*num_gen, R]
+ eos_idx = torch.full((is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=args.device)
+ eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
+ completion_mask = (torch.arange(is_eos.size(1), device=args.device).expand(is_eos.size(0), -1) <= eos_idx.unsqueeze(1)).int() # [B*num_gen, R]
+
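+        # KL采用k3近似 exp(Δ)-Δ-1;exp(logp-logp.detach())数值恒为1、仅传递梯度;损失=-(优势项-β*KL)按mask取token平均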
+ kl_div = ref_per_token_logps - per_token_logps
+ per_token_kl = torch.exp(kl_div) - kl_div - 1 # [B*num_gen, R]
+ per_token_loss = -(torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1) - args.beta * per_token_kl) # [B*num_gen, R]
+ loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean() / args.accumulation_steps # scalar
+ loss.backward()
+
+ if (step + 1) % args.accumulation_steps == 0:
+ if args.grad_clip > 0:
+ torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
+ optimizer.step()
+ scheduler.step()
+ optimizer.zero_grad()
+
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
+ policy_loss_val = loss.item()
+ avg_reward_val = rewards.mean().item()
+ avg_len_val = completion_mask.sum(dim=1).float().mean().item()
+ current_lr = optimizer.param_groups[0]['lr']
+
+ Logger(
+ f'Epoch: {epoch}, Step: {step + 1}/{iter_per_epoch}, '
+ f'Actor Loss: {policy_loss_val:.4f}, Reward: {avg_reward_val:.4f}, '
+ f'Avg Response Len: {avg_len_val:.2f}, LR: {current_lr:.2e}')
+
+ if wandb and (not ddp or dist.get_rank() == 0):
+ log_dict = {
+ "policy_loss": policy_loss_val,
+ "reward": avg_reward_val,
+ "avg_response_len": avg_len_val,
+ "advantages_mean": advantages.mean().item(),
+ "learning_rate": current_lr
+ }
+ wandb.log(log_dict)
+
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
+ model.eval()
+ moe_path = '_moe' if lm_config.use_moe else ''
+ suffix = 'grpo'
+ ckp = f'{args.save_dir}/{suffix}_{lm_config.hidden_size}{moe_path}.pth'
+
+ state_dict = model.module.state_dict() if isinstance(model,
+ torch.nn.parallel.DistributedDataParallel) else model.state_dict()
+ torch.save({k: v.half() for k, v in state_dict.items()}, ckp)
+ model.train()
+
+ del prompt_inputs, outputs, completion_ids, per_token_logps, ref_per_token_logps
+ del completions, rewards, grouped_rewards, mean_r, std_r, advantages, completion_mask
+ torch.cuda.empty_cache()
+ gc.collect()
+
+
+def init_model(lm_config):
+ tokenizer = AutoTokenizer.from_pretrained('../model/')
+ model = MiniMindForCausalLM(lm_config)
+ moe_path = '_moe' if lm_config.use_moe else ''
+ ckp = f'{args.save_dir}/full_sft_{lm_config.hidden_size}{moe_path}.pth'
+ if args.reasoning == 1:
+ ckp = f'{args.save_dir}/reason_{lm_config.hidden_size}{moe_path}.pth'
+ state_dict = torch.load(ckp, map_location=args.device)
+ model.load_state_dict(state_dict, strict=False)
+
+ ref_model = MiniMindForCausalLM(lm_config)
+ ref_model.load_state_dict(state_dict, strict=False)
+ ref_model.eval().requires_grad_(False)
+
+ Logger(f'LLM总参数量:{sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} 百万')
+ model = model.to(args.device)
+ ref_model = ref_model.to(args.device)
+
+ reward_name = "../../internlm2-1_8b-reward"
+ reward_model = AutoModel.from_pretrained(
+ reward_name,
+ device_map="cuda",
+ torch_dtype=torch.float16,
+ trust_remote_code=True,
+ ).to(args.device).eval().requires_grad_(False)
+ reward_tokenizer = AutoTokenizer.from_pretrained(reward_name, trust_remote_code=True)
+
+ return model, ref_model, tokenizer, reward_model, reward_tokenizer
+
+
+def init_distributed_mode():
+ if not ddp: return
+ global ddp_local_rank, DEVICE
+ dist.init_process_group(backend="nccl")
+ ddp_local_rank = int(os.environ["LOCAL_RANK"])
+ DEVICE = f"cuda:{ddp_local_rank}"
+ torch.cuda.set_device(DEVICE)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--out_dir", type=str, default="../out")
+ parser.add_argument("--epochs", type=int, default=1)
+ parser.add_argument("--batch_size", type=int, default=2)
+ parser.add_argument("--learning_rate", type=float, default=8e-8)
+ parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
+ parser.add_argument("--dtype", type=str, default="bfloat16")
+ parser.add_argument("--use_wandb", action="store_true")
+ parser.add_argument("--wandb_project", type=str, default="MiniMind-GRPO")
+ parser.add_argument("--num_workers", type=int, default=1)
+ parser.add_argument("--ddp", action="store_true")
+ parser.add_argument("--accumulation_steps", type=int, default=1)
+ parser.add_argument("--grad_clip", type=float, default=1.0)
+ parser.add_argument("--log_interval", type=int, default=1)
+ parser.add_argument("--save_interval", type=int, default=10)
+ parser.add_argument('--hidden_size', default=512, type=int)
+ parser.add_argument('--num_hidden_layers', default=8, type=int)
+ parser.add_argument('--use_moe', default=False, type=bool)
+ parser.add_argument('--max_seq_len', default=66, type=int)
+ parser.add_argument("--max_gen_len", type=int, default=1536)
+ parser.add_argument("--data_path", type=str, default="../dataset/rlaif-mini.jsonl")
+ parser.add_argument("--num_generations", type=int, default=8)
+ parser.add_argument("--beta", type=float, default=0.02)
+ parser.add_argument("--reasoning", type=int, default=1, help='0:普通模型,1:推理模型')
+ args = parser.parse_args()
+
+ lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
+ max_seq_len=args.max_seq_len + args.max_gen_len,
+ use_moe=args.use_moe)
+ args.save_dir = os.path.join(args.out_dir)
+ os.makedirs(args.save_dir, exist_ok=True)
+ os.makedirs(args.out_dir, exist_ok=True)
+
+ ctx = nullcontext() if "cuda" not in args.device else torch.amp.autocast('cuda')
+ ddp = int(os.environ.get("RANK", -1)) != -1
+ ddp_local_rank, DEVICE = 0, "cuda:0"
+
+ base_seed = 1337
+ torch.manual_seed(base_seed)
+ torch.cuda.manual_seed(base_seed)
+
+ if ddp:
+ init_distributed_mode()
+ args.device = torch.device(DEVICE)
+ rank = dist.get_rank()
+ torch.manual_seed(base_seed + rank)
+ # 同时设置 CUDA 的随机种子
+ torch.cuda.manual_seed(base_seed + rank)
+
+ if args.use_wandb and (not ddp or ddp_local_rank == 0):
+ import swanlab as wandb
+
+ wandb.init(project=args.wandb_project)
+ else:
+ wandb = None
+
+ model, ref_model, tokenizer, reward_model, reward_tokenizer = init_model(lm_config=lm_config)
+ train_ds = RLAIFDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
+ train_sampler = DistributedSampler(train_ds) if ddp else None
+ train_loader = DataLoader(train_ds, batch_size=args.batch_size, pin_memory=True,
+ drop_last=False, shuffle=False,
+ num_workers=args.num_workers, sampler=train_sampler)
+
+ optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
+
+ iter_per_epoch = len(train_loader)
+ total_optimizer_steps = (iter_per_epoch // args.accumulation_steps) * args.epochs
+ scheduler = CosineAnnealingLR(optimizer, T_max=total_optimizer_steps, eta_min=args.learning_rate / 10)
+
+ if ddp:
+ model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
+ model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
+
+ for epoch in range(args.epochs):
+ grpo_train_epoch(epoch, wandb)
diff --git a/trainer/train_lora.py b/trainer/train_lora.py
index 4711062..7d74b9f 100644
--- a/trainer/train_lora.py
+++ b/trainer/train_lora.py
@@ -64,7 +64,7 @@ def train_epoch(epoch, wandb):
optimizer.zero_grad(set_to_none=True)
- if step % args.log_interval == 0:
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
spend_time = time.time() - start_time
Logger(
'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
@@ -81,7 +81,7 @@ def train_epoch(epoch, wandb):
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
- if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
model.eval()
lora_save_path = f'{args.save_dir}/lora/{args.lora_name}_{lm_config.hidden_size}.pth'
os.makedirs(os.path.dirname(lora_save_path), exist_ok=True)
@@ -115,7 +115,7 @@ def init_distributed_mode():
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="MiniMind SFT with LoRA")
parser.add_argument("--out_dir", type=str, default="../out")
- parser.add_argument("--epochs", type=int, default=10)
+ parser.add_argument("--epochs", type=int, default=50)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
@@ -127,15 +127,15 @@ if __name__ == "__main__":
parser.add_argument("--accumulation_steps", type=int, default=1)
parser.add_argument("--grad_clip", type=float, default=1.0)
parser.add_argument("--warmup_iters", type=int, default=0)
- parser.add_argument("--log_interval", type=int, default=100)
- parser.add_argument("--save_interval", type=int, default=100)
+ parser.add_argument("--log_interval", type=int, default=10)
+ parser.add_argument("--save_interval", type=int, default=1)
parser.add_argument('--local_rank', type=int, default=-1)
parser.add_argument('--hidden_size', default=512, type=int)
parser.add_argument('--num_hidden_layers', default=8, type=int)
parser.add_argument('--max_seq_len', default=512, type=int)
parser.add_argument('--use_moe', default=False, type=bool)
- parser.add_argument("--data_path", type=str, default="../dataset/lora_medical.jsonl")
- parser.add_argument("--lora_name", type=str, default="lora_medical", help="根据任务保存成lora_(英文/医学/心理...)")
+ parser.add_argument("--data_path", type=str, default="../dataset/lora_identity.jsonl")
+ parser.add_argument("--lora_name", type=str, default="lora_identity", help="根据任务保存成lora_(英文/医学/心理...)")
args = parser.parse_args()
lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
@@ -163,7 +163,7 @@ if __name__ == "__main__":
args.wandb_run_name = f"MiniMind-Lora-SFT-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
if args.use_wandb and (not ddp or ddp_local_rank == 0):
- import wandb
+ import swanlab as wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name)
else:
diff --git a/trainer/train_ppo.py b/trainer/train_ppo.py
new file mode 100644
index 0000000..dd67775
--- /dev/null
+++ b/trainer/train_ppo.py
@@ -0,0 +1,373 @@
+import os
+import sys
+
+__package__ = "trainer"
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+import argparse
+import re
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import optim, nn
+from torch.nn.parallel import DistributedDataParallel
+from torch.utils.data import DataLoader, DistributedSampler
+from transformers import AutoTokenizer, AutoModel
+from model.model_minimind import MiniMindConfig, MiniMindForCausalLM
+from dataset.lm_dataset import RLAIFDataset
+from torch.nn.utils import clip_grad_norm_
+from torch.optim.lr_scheduler import CosineAnnealingLR
+
+
+def Logger(content):
+ if not ddp or dist.get_rank() == 0:
+ print(content)
+
+
+def calculate_rewards(prompts, responses, reward_model, reward_tokenizer):
+ """整合所有奖励函数计算总奖励"""
+
+ def reasoning_model_reward(rewards):
+ # 1. 格式奖励(仅针对训练推理模型时使用)
+ pattern = r"^\n.*?\n\n\n.*?\n$"
+ pattern2 = r"^\n.*?\n\n\n\n.*?\n$"
+
+ matches_pattern = [re.match(pattern, response, re.S) for response in responses]
+ matches_pattern2 = [re.match(pattern2, response, re.S) for response in responses]
+
+ format_rewards = []
+ for match_pattern, match_pattern2 in zip(matches_pattern, matches_pattern2):
+ if match_pattern:
+ format_rewards.append(0.5)
+ elif match_pattern2:
+ format_rewards.append(0.5)
+ else:
+ format_rewards.append(0.0)
+ rewards += torch.tensor(format_rewards, device=args.device)
+
+ # 2. 标记奖励(防止严格奖励稀疏,仅针对训练推理模型时使用)
+ def mark_num(text):
+ reward = 0
+            if text.count("<think>") == 1:
+                reward += 0.25
+            if text.count("</think>") == 1:
+                reward += 0.25
+            if text.count("<answer>") == 1:
+                reward += 0.25
+            if text.count("</answer>") == 1:
+                reward += 0.25
+ return reward
+
+ mark_rewards = [mark_num(response) for response in responses]
+ rewards += torch.tensor(mark_rewards, device=args.device)
+ return rewards
+
+ rewards = torch.zeros(len(responses), device=args.device)
+
+ # 格式奖励
+ if args.reasoning == 1:
+ rewards = reasoning_model_reward(rewards) # 训练推理模型时使用
+
+ # 使用reward model计算整个response的奖励
+ with torch.no_grad():
+ reward_model_scores = []
+ for prompt, response in zip(prompts, responses):
+ pattern = r"<\|im_start\|>(system|user|assistant)\s+(.*?)<\|im_end\|>"
+ matches = re.findall(pattern, prompt, re.DOTALL)
+ messages = [{"role": role, "content": content.strip()} for role, content in matches]
+
+ tmp_chat = messages + [{"role": "assistant", "content": response}]
+ score = reward_model.get_score(reward_tokenizer, tmp_chat)
+
+ scale = 3.0
+ score = max(min(score, scale), -scale)
+
+            # 当args.reasoning=1时,额外计算<answer>内容的奖励
+            if args.reasoning == 1:
+                answer_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
+ if answer_match:
+ answer_content = answer_match.group(1).strip()
+ # 对answer内容单独计算reward
+ tmp_chat = messages + [{"role": "assistant", "content": answer_content}]
+ answer_score = reward_model.get_score(reward_tokenizer, tmp_chat)
+ answer_score = max(min(answer_score, scale), -scale)
+
+ score = score * 0.4 + answer_score * 0.6
+ reward_model_scores.append(score)
+
+ reward_model_scores = torch.tensor(reward_model_scores, device=args.device)
+ rewards += reward_model_scores
+
+ return rewards
+
+
+def ppo_train_epoch(epoch: int, wandb_run, old_actor_model, ref_model, actor_scheduler, critic_scheduler):
+ actor_model.train()
+ critic_model.train()
+ is_master = (not ddp) or dist.get_rank() == 0
+
+ for step, batch in enumerate(train_loader):
+ prompts = batch["prompt"] # list[str], length B
+ enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True,
+ max_length=args.max_seq_len).to(args.device) # input_ids: [B, P], attention_mask: [B, P]
+ prompt_lengths = enc.attention_mask.sum(dim=1) # [B]
+
+ with torch.no_grad():
+ gen_out = actor_model.generate(
+ input_ids=enc.input_ids, attention_mask=enc.attention_mask,
+ max_new_tokens=args.max_gen_len, do_sample=True, temperature=0.8,
+ pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id) # [B, P+R]
+
+ responses_text = [tokenizer.decode(gen_out[i, prompt_lengths[i]:], skip_special_tokens=True) for i in range(len(prompts))]
+ rewards = calculate_rewards(prompts, responses_text, reward_model, reward_tokenizer) # [B]
+
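+        # Critic逐token输出价值,按有效token数索引出单个标量作为整句价值V;优势=整句奖励-V(序列级奖励,未使用GAE)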
+ full_mask = (gen_out != tokenizer.pad_token_id).long() # [B, P+R]
+ values_seq = critic_model(input_ids=gen_out, attention_mask=full_mask) # [B, P+R]
+ last_indices = full_mask.sum(dim=1) - 1 # [B]
+ values = values_seq[torch.arange(values_seq.size(0), device=values_seq.device), last_indices] # [B]
+ advantages = rewards - values.detach() # [B]
+
+ logits = actor_model(input_ids=gen_out, attention_mask=full_mask).logits # [B, P+R, V]
+ labels = gen_out[:, 1:].clone() # [B, P+R-1]
+ logp_tokens = F.log_softmax(logits[:, :-1], dim=-1).gather(2, labels.unsqueeze(-1)).squeeze(-1) # [B, P+R-1]
+ seq_len = gen_out.size(1) - 1
+ resp_mask = torch.arange(seq_len, device=gen_out.device).unsqueeze(0) >= prompt_lengths.unsqueeze(1)
+ final_mask = resp_mask & (~labels.eq(tokenizer.pad_token_id)) # [B, P+R-1]
+ actor_logp = (logp_tokens * final_mask).sum(dim=1) # [B]
+
+ with torch.no_grad():
+ old_logits = old_actor_model(input_ids=gen_out, attention_mask=full_mask).logits # [B, P+R, V]
+ old_logp_tokens = F.log_softmax(old_logits[:, :-1], dim=-1).gather(2, labels.unsqueeze(-1)).squeeze(-1) # [B, P+R-1]
+ old_logp = (old_logp_tokens * final_mask).sum(dim=1) # [B]
+
+ ref_logits = ref_model(input_ids=gen_out, attention_mask=full_mask).logits # [B, P+R, V]
+ ref_logp_tokens = F.log_softmax(ref_logits[:, :-1], dim=-1).gather(2, labels.unsqueeze(-1)).squeeze(-1) # [B, P+R-1]
+ ref_logp = (ref_logp_tokens * final_mask).sum(dim=1) # [B]
+
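+        # PPO-Clip:ratio=exp(当前策略与旧策略的序列logprob差),对优势取未裁剪/裁剪surrogate的最小值;总损失=策略损失+vf_coef*价值MSE+kl_coef*相对ref的KL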
+ kl = (actor_logp - old_logp).mean() # scalar
+ kl_ref = (actor_logp - ref_logp).mean() # scalar
+ ratio = torch.exp(actor_logp - old_logp) # [B]
+ surr1 = ratio * advantages # [B]
+ surr2 = torch.clamp(ratio, 1.0 - args.clip_epsilon, 1.0 + args.clip_epsilon) * advantages # [B]
+ policy_loss = -torch.min(surr1, surr2).mean() # scalar
+ value_loss = F.mse_loss(values, rewards) # scalar
+ loss = policy_loss + args.vf_coef * value_loss + args.kl_coef * kl_ref # scalar
+ loss.backward()
+
+ if (step + 1) % args.accumulation_steps == 0:
+ clip_grad_norm_(actor_model.parameters(), args.grad_clip)
+ clip_grad_norm_(critic_model.parameters(), args.grad_clip)
+ actor_optimizer.step()
+ critic_optimizer.step()
+ actor_scheduler.step()
+ critic_scheduler.step()
+ actor_optimizer.zero_grad()
+ critic_optimizer.zero_grad()
+
+ if is_master:
+ response_ids = gen_out[:, enc.input_ids.shape[1]:]
+ is_eos = (response_ids == tokenizer.eos_token_id)
+ eos_indices = torch.argmax(is_eos.int(), dim=1)
+ has_eos = is_eos.any(dim=1)
+ lengths = torch.where(has_eos, eos_indices + 1, torch.tensor(response_ids.shape[1], device=is_eos.device))
+ avg_len = lengths.float().mean()
+
+ actor_loss_val = policy_loss.item()
+ critic_loss_val = value_loss.item()
+ reward_val = rewards.mean().item()
+ kl_val = kl.item()
+ kl_ref_val = kl_ref.item()
+ avg_len_val = avg_len.item()
+ actor_lr = actor_optimizer.param_groups[0]['lr']
+ critic_lr = critic_optimizer.param_groups[0]['lr']
+
+ if wandb_run is not None:
+ wandb_run.log({
+ "actor_loss": actor_loss_val,
+ "critic_loss": critic_loss_val,
+ "reward": reward_val,
+ "kl": kl_val,
+ "kl_ref": kl_ref_val,
+ "avg_response_len": avg_len_val,
+ "actor_lr": actor_lr,
+ })
+
+ Logger(f"Epoch: {epoch}, Step: {step + 1}/{len(train_loader)}, "
+ f"Actor Loss: {actor_loss_val:.4f}, Critic Loss: {critic_loss_val:.4f}, "
+ f"Reward: {reward_val:.4f}, KL: {kl_val:.4f}, KL_ref: {kl_ref_val:.4f}, "
+ f"Avg Response Len: {avg_len_val:.2f}, Actor LR: {actor_lr:.2e}, Critic LR: {critic_lr:.2e}")
+
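+            # 每update_old_actor_freq个batch将当前actor权重同步给old_actor(旧策略),作为后续ratio的分母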
+ if (step + 1) % args.update_old_actor_freq == 0:
+ state_dict = actor_model.module.state_dict() if isinstance(actor_model, torch.nn.parallel.DistributedDataParallel) else actor_model.state_dict()
+ old_actor_model.load_state_dict({k: v.detach().cpu() for k, v in state_dict.items()})
+ old_actor_model.to(args.device)
+
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
+ actor_model.eval()
+ moe_path = '_moe' if lm_config.use_moe else ''
+ ckp = f'{args.save_dir}/ppo_actor_{lm_config.hidden_size}{moe_path}.pth'
+
+ if isinstance(actor_model, torch.nn.parallel.DistributedDataParallel):
+ state_dict = actor_model.module.state_dict()
+ else:
+ state_dict = actor_model.state_dict()
+
+ state_dict = {k: v.half() for k, v in state_dict.items()} # 半精度保存
+ torch.save(state_dict, ckp)
+ actor_model.train()
+
+
+# 自定义的Critic模型,继承自MiniMindForCausalLM
+class CriticModel(MiniMindForCausalLM):
+ def __init__(self, params):
+ super().__init__(params)
+ # 替换lm_head为输出单一价值的线性层
+ self.value_head = nn.Linear(params.hidden_size, 1)
+
+ def forward(self, input_ids=None, attention_mask=None, **kwargs):
+ # 使用基础模型获取隐藏状态
+ outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
+ # self.model 返回的是一个元组,第一个元素是 last_hidden_state
+ hidden_states = self.model.norm(outputs[0])
+ # 使用value_head获取价值估计
+ values = self.value_head(hidden_states).squeeze(-1)
+ return values
+
+
+def init_model(lm_config):
+ tokenizer = AutoTokenizer.from_pretrained('../model/', padding_side='left')
+ if tokenizer.pad_token is None:
+ tokenizer.pad_token = tokenizer.eos_token
+
+ moe_path = '_moe' if lm_config.use_moe else ''
+ ckp = f'{args.save_dir}/{"reason" if args.reasoning == 1 else "full_sft"}_{lm_config.hidden_size}{moe_path}.pth'
+ state_dict = torch.load(ckp, map_location=args.device)
+
+ actor_model = MiniMindForCausalLM(lm_config)
+ actor_model.load_state_dict(state_dict, strict=False)
+ actor_model = actor_model.to(args.device)
+
+ old_actor_model = MiniMindForCausalLM(lm_config)
+ old_actor_model.load_state_dict(state_dict, strict=False)
+ old_actor_model = old_actor_model.eval().requires_grad_(False).to(args.device)
+
+ ref_model = MiniMindForCausalLM(lm_config)
+ ref_model.load_state_dict(state_dict, strict=False)
+ ref_model = ref_model.eval().requires_grad_(False).to(args.device)
+
+ critic_model = CriticModel(lm_config)
+ critic_model.load_state_dict(state_dict, strict=False)
+ critic_model = critic_model.to(args.device)
+
+ reward_name = "../../internlm2-1_8b-reward"
+ reward_model = AutoModel.from_pretrained(
+ reward_name, device_map="cuda", torch_dtype=torch.float32, trust_remote_code=True
+ ).to(args.device).eval().requires_grad_(False)
+ reward_tokenizer = AutoTokenizer.from_pretrained(reward_name, trust_remote_code=True)
+
+ Logger(f'Actor模型总参数量:{sum(p.numel() for p in actor_model.parameters() if p.requires_grad) / 1e6:.3f} 百万')
+ Logger(f'Critic模型总参数量:{sum(p.numel() for p in critic_model.parameters() if p.requires_grad) / 1e6:.3f} 百万')
+
+ return actor_model, old_actor_model, ref_model, critic_model, reward_model, tokenizer, reward_tokenizer
+
+
+def init_distributed_mode():
+ if not ddp: return
+ global ddp_local_rank, DEVICE
+ dist.init_process_group(backend="nccl")
+ ddp_local_rank = int(os.environ["LOCAL_RANK"])
+ DEVICE = f"cuda:{ddp_local_rank}"
+ torch.cuda.set_device(DEVICE)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--out_dir", type=str, default="../out")
+ parser.add_argument("--epochs", type=int, default=1)
+ parser.add_argument("--batch_size", type=int, default=2)
+ parser.add_argument("--learning_rate", type=float, default=8e-8)
+ parser.add_argument("--critic_learning_rate", type=float, default=8e-8)
+ parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
+ parser.add_argument("--dtype", type=str, default="bfloat16")
+ parser.add_argument("--use_wandb", action="store_true")
+ parser.add_argument("--wandb_project", type=str, default="MiniMind-PPO")
+ parser.add_argument("--num_workers", type=int, default=1)
+ parser.add_argument("--ddp", action="store_true")
+ parser.add_argument("--accumulation_steps", type=int, default=1)
+ parser.add_argument("--grad_clip", type=float, default=1.0)
+ parser.add_argument("--log_interval", type=int, default=1)
+ parser.add_argument("--save_interval", type=int, default=10)
+ parser.add_argument('--hidden_size', default=512, type=int)
+ parser.add_argument('--num_hidden_layers', default=8, type=int)
+ parser.add_argument('--use_moe', default=False, type=bool)
+ parser.add_argument('--max_seq_len', default=66, type=int)
+ parser.add_argument("--max_gen_len", type=int, default=1536)
+ parser.add_argument("--data_path", type=str, default="../dataset/rlaif-mini.jsonl")
+ parser.add_argument("--clip_epsilon", type=float, default=0.1)
+ parser.add_argument("--vf_coef", type=float, default=0.5)
+ parser.add_argument("--kl_coef", type=float, default=0.02, help="KL散度惩罚系数")
+ parser.add_argument("--reasoning", type=int, default=1, help='0:普通模型,1:推理模型')
+ parser.add_argument("--update_old_actor_freq", type=int, default=4, help="频率:每处理n个batch后更新old_actor_model")
+ args = parser.parse_args()
+
+ lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
+ use_moe=args.use_moe)
+ args.save_dir = os.path.join(args.out_dir)
+ os.makedirs(args.save_dir, exist_ok=True)
+ os.makedirs(args.out_dir, exist_ok=True)
+
+ ddp = int(os.environ.get("RANK", -1)) != -1
+ ddp_local_rank, DEVICE = 0, "cuda:0"
+ base_seed = 1337
+ torch.manual_seed(base_seed)
+ torch.cuda.manual_seed(base_seed)
+
+ if ddp:
+ init_distributed_mode()
+ args.device = torch.device(DEVICE)
+ rank = dist.get_rank()
+ torch.manual_seed(base_seed + rank)
+ # 同时设置 CUDA 的随机种子
+ torch.cuda.manual_seed(base_seed + rank)
+
+ if args.use_wandb and (not ddp or ddp_local_rank == 0):
+ import swanlab as wandb
+
+ wandb.init(project=args.wandb_project)
+ else:
+ wandb = None
+
+ # 初始化所有模型
+ actor_model, old_actor_model, ref_model, critic_model, reward_model, tokenizer, reward_tokenizer = init_model(lm_config=lm_config)
+
+ # 准备数据集和数据加载器
+ train_ds = RLAIFDataset(args.data_path, tokenizer, max_length=(args.max_seq_len + args.max_gen_len))
+ train_sampler = DistributedSampler(train_ds) if ddp else None
+ train_loader = DataLoader(train_ds, batch_size=args.batch_size, pin_memory=True,
+ drop_last=False, shuffle=False,
+ num_workers=args.num_workers, sampler=train_sampler)
+
+ # 初始化优化器
+ actor_optimizer = optim.AdamW(actor_model.parameters(), lr=args.learning_rate)
+ critic_optimizer = optim.AdamW(critic_model.parameters(), lr=args.critic_learning_rate)
+
+ iter_per_epoch = len(train_loader)
+ total_optimizer_steps = (iter_per_epoch // args.accumulation_steps) * args.epochs
+ actor_scheduler = CosineAnnealingLR(actor_optimizer, T_max=total_optimizer_steps, eta_min=args.learning_rate / 10)
+ critic_scheduler = CosineAnnealingLR(critic_optimizer, T_max=total_optimizer_steps,
+ eta_min=args.critic_learning_rate / 10)
+
+ # 如果使用分布式训练,包装模型
+ if ddp:
+ actor_model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
+ critic_model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
+ actor_model = DistributedDataParallel(actor_model, device_ids=[ddp_local_rank])
+ critic_model = DistributedDataParallel(critic_model, device_ids=[ddp_local_rank])
+ # old_actor_model 不需要DDP包装,因为它只在主进程上用于计算,并且不进行梯度更新
+ old_actor_model.to(args.device)
+
+ for epoch in range(args.epochs):
+ ppo_train_epoch(epoch, wandb, old_actor_model, ref_model, actor_scheduler, critic_scheduler)
+
+ if ddp:
+ dist.destroy_process_group()
diff --git a/trainer/train_pretrain.py b/trainer/train_pretrain.py
index f0cfc58..6eed5a7 100644
--- a/trainer/train_pretrain.py
+++ b/trainer/train_pretrain.py
@@ -1,5 +1,6 @@
import os
import sys
+
__package__ = "trainer"
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
@@ -62,7 +63,7 @@ def train_epoch(epoch, wandb):
optimizer.zero_grad(set_to_none=True)
- if step % args.log_interval == 0:
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
spend_time = time.time() - start_time
Logger(
'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
@@ -79,7 +80,7 @@ def train_epoch(epoch, wandb):
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
- if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
model.eval()
moe_path = '_moe' if lm_config.use_moe else ''
ckp = f'{args.save_dir}/pretrain_{lm_config.hidden_size}{moe_path}.pth'
@@ -140,7 +141,8 @@ if __name__ == "__main__":
parser.add_argument("--data_path", type=str, default="../dataset/pretrain_hq.jsonl")
args = parser.parse_args()
- lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers, use_moe=args.use_moe)
+ lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
+ use_moe=args.use_moe)
args.save_dir = os.path.join(args.out_dir)
os.makedirs(args.save_dir, exist_ok=True)
os.makedirs(args.out_dir, exist_ok=True)
@@ -167,7 +169,7 @@ if __name__ == "__main__":
torch.cuda.manual_seed(base_seed + rank)
if args.use_wandb and (not ddp or ddp_local_rank == 0):
- import wandb
+ import swanlab as wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name)
else:
diff --git a/trainer/train_spo.py b/trainer/train_spo.py
new file mode 100755
index 0000000..7e44d3a
--- /dev/null
+++ b/trainer/train_spo.py
@@ -0,0 +1,367 @@
+import os
+import sys
+
+__package__ = "trainer"
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+import argparse
+import time
+import re
+import gc
+import torch
+from contextlib import nullcontext
+import torch.distributed as dist
+from torch import optim
+from torch.nn.parallel import DistributedDataParallel
+from torch.utils.data import DataLoader, DistributedSampler
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel
+from model.model_minimind import MiniMindConfig, MiniMindForCausalLM
+from dataset.lm_dataset import RLAIFDataset
+from torch.optim.lr_scheduler import CosineAnnealingLR
+from collections import defaultdict
+
+
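+# 自适应基线追踪器:用(α,β)累计历史归一化奖励,baseline=α/(α+β);ρ依据策略平均logprob的变化(KL近似)按半衰期D_half衰减旧统计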
+class AutoAdaptiveValueTracker:
+ def __init__(self, rho_mode='kl', rho_const=0.9, D_half=0.06, clip_lower=0.5, clip_upper=0.96):
+ self.rho_mode = rho_mode
+ self.rho_const = rho_const
+ self.D_half = D_half
+ self.clip_lower = clip_lower
+ self.clip_upper = clip_upper
+ # Stable initialization following N_init = 1/(1-clip_lower)
+ N_init = 1.0 / (1.0 - self.clip_lower)
+ self.alpha = 0.5 * N_init
+ self.beta = 0.5 * N_init
+ self.old_mean_logprob = None
+
+ def get_baselines(self, batch_size):
+ baseline = self.alpha / (self.alpha + self.beta)
+ return torch.full((batch_size,), baseline, dtype=torch.float32)
+
+ def compute_rho(self, cur_mean_logprob):
+ if self.rho_mode == 'constant':
+ return self.rho_const
+ if self.old_mean_logprob is None:
+ return self.rho_const
+ kl = abs(self.old_mean_logprob - cur_mean_logprob)
+ rho = 2 ** (-kl / self.D_half)
+ return max(min(rho, self.clip_upper), self.clip_lower)
+
+ def update(self, rewards, cur_logprobs=None, response_masks=None):
+ if cur_logprobs is not None and response_masks is not None:
+ mean_logprob = ((cur_logprobs * response_masks).sum() / response_masks.sum()).item()
+ rho = self.compute_rho(mean_logprob)
+ self.old_mean_logprob = mean_logprob
+ else:
+ rho = self.rho_const
+
+ scale = 3.0
+ normalized_rewards = (rewards + scale) / (2 * scale)
+ avg_normalized_reward = normalized_rewards.mean().item()
+ self.alpha = rho * self.alpha + avg_normalized_reward
+ self.beta = rho * self.beta + (1 - avg_normalized_reward)
+ return rho
+
+
+def Logger(content):
+ if not ddp or dist.get_rank() == 0:
+ print(content)
+
+
+def calculate_rewards(prompts, responses, reward_model, reward_tokenizer):
+ """整合所有奖励函数计算总奖励"""
+
+ def reasoning_model_reward(rewards):
+ # 1. 格式奖励(仅针对训练推理模型时使用)
+ pattern = r"^\n.*?\n\n\n.*?\n$"
+ pattern2 = r"^\n.*?\n\n\n\n.*?\n$"
+
+ matches_pattern = [re.match(pattern, response, re.S) for response in responses]
+ matches_pattern2 = [re.match(pattern2, response, re.S) for response in responses]
+
+ format_rewards = []
+ for match_pattern, match_pattern2 in zip(matches_pattern, matches_pattern2):
+ if match_pattern:
+ format_rewards.append(0.5)
+ elif match_pattern2:
+ format_rewards.append(0.5)
+ else:
+ format_rewards.append(0.0)
+ rewards += torch.tensor(format_rewards, device=args.device)
+
+ # 2. 标记奖励(防止严格奖励稀疏,仅针对训练推理模型时使用)
+ def mark_num(text):
+ reward = 0
+            if text.count("<think>") == 1:
+                reward += 0.25
+            if text.count("</think>") == 1:
+                reward += 0.25
+            if text.count("<answer>") == 1:
+                reward += 0.25
+            if text.count("</answer>") == 1:
+                reward += 0.25
+ return reward
+
+ mark_rewards = [mark_num(response) for response in responses]
+ rewards += torch.tensor(mark_rewards, device=args.device)
+ return rewards
+
+ rewards = torch.zeros(len(responses), device=args.device)
+
+ # 3. 格式奖励
+ if args.reasoning == 1:
+ rewards = reasoning_model_reward(rewards) # 训练推理模型时使用
+
+ # 4. 使用reward model计算奖励
+ with torch.no_grad():
+ reward_model_scores = []
+ scale = 3.0
+
+ for i, (prompt, response) in enumerate(zip(prompts, responses)):
+ pattern = r"<\|im_start\|>(system|user|assistant)\s+(.*?)<\|im_end\|>"
+ matches = re.findall(pattern, prompt, re.DOTALL)
+ messages = [{"role": role, "content": content.strip()} for role, content in matches]
+
+ tmp_chat = messages + [{"role": "assistant", "content": response}]
+ score = reward_model.get_score(reward_tokenizer, tmp_chat)
+ score = max(min(score, scale), -scale)
+
+ if args.reasoning == 1:
+                answer_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
+ if answer_match:
+ answer_content = answer_match.group(1).strip()
+ tmp_chat = messages + [{"role": "assistant", "content": answer_content}]
+ answer_score = reward_model.get_score(reward_tokenizer, tmp_chat)
+ answer_score = max(min(answer_score, scale), -scale)
+ score = score * 0.4 + answer_score * 0.6
+
+ reward_model_scores.append(score)
+
+ reward_model_scores = torch.tensor(reward_model_scores, device=args.device)
+ rewards += reward_model_scores
+
+ return rewards
+
+
+def spo_train_epoch(epoch, wandb, value_tracker):
+ for step, batch in enumerate(train_loader):
+ prompts = batch['prompt'] # list[str], length B
+ prompt_inputs = tokenizer(prompts, return_tensors="pt", padding=True, return_token_type_ids=False,
+ padding_side="left", add_special_tokens=False).to(args.device) # input_ids: [B, P], attention_mask: [B, P]
+ if args.max_seq_len:
+ prompt_inputs["input_ids"] = prompt_inputs["input_ids"][:, -args.max_seq_len:]
+ prompt_inputs["attention_mask"] = prompt_inputs["attention_mask"][:, -args.max_seq_len:]
+
+ with torch.no_grad():
+ outputs = (model.module if ddp else model).generate(
+ **prompt_inputs, max_new_tokens=args.max_gen_len, do_sample=True, temperature=0.8,
+ num_return_sequences=1, pad_token_id=tokenizer.pad_token_id) # [B, P+R]
+
+ completion_ids = outputs[:, prompt_inputs["input_ids"].size(1):] # [B, R]
+
+ def get_per_token_logps(mdl, input_ids, n_keep):
+ input_ids = input_ids.detach().clone() if input_ids.is_inference() else input_ids
+ logits = mdl(input_ids, logits_to_keep=n_keep + 1).logits[:, :-1, :]
+ per_token_logps = []
+ for logits_row, ids_row in zip(logits, input_ids[:, -n_keep:]):
+ ids_row = ids_row.detach().clone() if ids_row.is_inference() else ids_row
+ per_token_logps.append(torch.gather(logits_row.log_softmax(dim=-1), 1, ids_row.unsqueeze(1)).squeeze(1))
+ return torch.stack(per_token_logps)
+
+ per_token_logps = get_per_token_logps(model, outputs, completion_ids.size(1)) # [B, R]
+ with torch.no_grad():
+ ref_per_token_logps = get_per_token_logps(ref_model, outputs, completion_ids.size(1)) # [B, R]
+
+ completions = tokenizer.batch_decode(completion_ids, skip_special_tokens=True) # list[str], length B
+ rewards = calculate_rewards(prompts, completions, reward_model, reward_tokenizer).to(args.device) # [B]
+
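+        # SPO:不做组采样、不训练Critic,用跨batch的价值追踪器给出[0,1]基线,再映射回[-scale,scale]奖励尺度,优势=奖励-基线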
+ baselines = value_tracker.get_baselines(len(prompts)).to(args.device) # [B]
+
+ scale = 3.0
+ # Un-normalize baselines to be in the same scale as raw rewards [-3, 3]
+ unnormalized_baselines = baselines * (2 * scale) - scale # [B]
+ advantages = rewards - unnormalized_baselines # [B]
+
+ # 直接使用 baseline 提供的优势估计,只做裁剪防止梯度爆炸。不再做 batch 内归一化,因为 baseline 已经提供了跨 batch 的稳定基线
+ advantages = advantages.clamp(-5.0, 5.0)
+
+ is_eos = completion_ids == tokenizer.eos_token_id # [B, R]
+ eos_idx = torch.full((is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=args.device) # [B]
+ eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
+ completion_mask = (torch.arange(is_eos.size(1), device=args.device).expand(is_eos.size(0), -1) <= eos_idx.unsqueeze(1)).int() # [B, R]
+
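+        # REINFORCE式逐token策略梯度损失(-logp*优势),叠加β加权的k3近似KL正则,按生成mask取token平均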
+ kl_div = ref_per_token_logps - per_token_logps # [B, R]
+ per_token_kl = torch.exp(kl_div) - kl_div - 1 # [B, R]
+ per_token_loss = -per_token_logps * advantages.unsqueeze(1) + args.beta * per_token_kl # [B, R]
+ loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean() / args.accumulation_steps # scalar
+ loss.backward()
+
+ response_masks = completion_mask.float() # [B, R]
+ rho = value_tracker.update(rewards, per_token_logps.detach(), response_masks)
+
+ if (step + 1) % args.accumulation_steps == 0:
+ if args.grad_clip > 0:
+ torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
+ optimizer.step()
+ scheduler.step()
+ optimizer.zero_grad()
+
+ if step % args.log_interval == 0 or step == iter_per_epoch - 1:
+ policy_loss_val = loss.item()
+ avg_reward_val = rewards.mean().item()
+ avg_len_val = completion_mask.sum(dim=1).float().mean().item()
+ # average kl over valid tokens for logging
+ kl_val = ((per_token_kl * completion_mask).sum() / (completion_mask.sum() + 1e-8)).item()
+ avg_baseline_val = baselines.mean().item()
+ current_lr = optimizer.param_groups[0]['lr']
+
+ Logger(
+ f'Epoch: {epoch}, Step: {step + 1}/{iter_per_epoch}, '
+ f'Actor Loss: {policy_loss_val:.4f}, Reward: {avg_reward_val:.4f}, '
+ f'Baseline: {avg_baseline_val:.4f}, KL: {kl_val:.4f}, Rho: {rho:.4f}, Avg Response Len: {avg_len_val:.2f}, LR: {current_lr:.2e}')
+
+ if wandb and (not ddp or dist.get_rank() == 0):
+ log_dict = {
+ "policy_loss": policy_loss_val,
+ "reward": avg_reward_val,
+ "kl": kl_val,
+ "rho": float(rho),
+ "baseline": avg_baseline_val,
+ # "avg_response_len": avg_len_val,
+ "advantages_mean": advantages.mean().item(),
+ "learning_rate": current_lr
+ }
+ wandb.log(log_dict)
+
+ if ((step + 1) % args.save_interval == 0 or step == iter_per_epoch - 1) and (not ddp or dist.get_rank() == 0):
+ model.eval()
+ moe_path = '_moe' if lm_config.use_moe else ''
+ suffix = 'spo'
+ ckp = f'{args.save_dir}/{suffix}_{lm_config.hidden_size}{moe_path}.pth'
+
+ state_dict = model.module.state_dict() if isinstance(model,
+ torch.nn.parallel.DistributedDataParallel) else model.state_dict()
+ torch.save({k: v.half() for k, v in state_dict.items()}, ckp)
+ model.train()
+
+ del prompt_inputs, outputs, completion_ids, per_token_logps, ref_per_token_logps
+ del completions, rewards, advantages, completion_mask, baselines, response_masks
+ torch.cuda.empty_cache()
+ gc.collect()
+
+
+def init_model(lm_config):
+ tokenizer = AutoTokenizer.from_pretrained('../model/')
+ model = MiniMindForCausalLM(lm_config)
+ moe_path = '_moe' if lm_config.use_moe else ''
+ ckp = f'{args.save_dir}/full_sft_{lm_config.hidden_size}{moe_path}.pth'
+ if args.reasoning == 1:
+ ckp = f'{args.save_dir}/reason_{lm_config.hidden_size}{moe_path}.pth'
+ state_dict = torch.load(ckp, map_location=args.device)
+ model.load_state_dict(state_dict, strict=False)
+
+ ref_model = MiniMindForCausalLM(lm_config)
+ ref_model.load_state_dict(state_dict, strict=False)
+ ref_model.eval().requires_grad_(False)
+
+ Logger(f'LLM总参数量:{sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} 百万')
+ model = model.to(args.device)
+ ref_model = ref_model.to(args.device)
+
+ reward_name = "../../internlm2-1_8b-reward"
+ reward_model = AutoModel.from_pretrained(
+ reward_name,
+ device_map="cuda",
+ torch_dtype=torch.float16,
+ trust_remote_code=True,
+ ).to(args.device).eval().requires_grad_(False)
+ reward_tokenizer = AutoTokenizer.from_pretrained(reward_name, trust_remote_code=True)
+
+ return model, ref_model, tokenizer, reward_model, reward_tokenizer
+
+
+def init_distributed_mode():
+ if not ddp: return
+ global ddp_local_rank, DEVICE
+ dist.init_process_group(backend="nccl")
+ ddp_local_rank = int(os.environ["LOCAL_RANK"])
+ DEVICE = f"cuda:{ddp_local_rank}"
+ torch.cuda.set_device(DEVICE)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--out_dir", type=str, default="../out")
+ parser.add_argument("--epochs", type=int, default=1)
+ parser.add_argument("--batch_size", type=int, default=2)
+ parser.add_argument("--learning_rate", type=float, default=1e-7)
+ parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
+ parser.add_argument("--dtype", type=str, default="bfloat16")
+ parser.add_argument("--use_wandb", action="store_true")
+ parser.add_argument("--wandb_project", type=str, default="MiniMind-SPO")
+ parser.add_argument("--num_workers", type=int, default=1)
+ parser.add_argument("--ddp", action="store_true")
+ parser.add_argument("--accumulation_steps", type=int, default=4)
+ parser.add_argument("--grad_clip", type=float, default=1.0)
+ parser.add_argument("--log_interval", type=int, default=1)
+ parser.add_argument("--save_interval", type=int, default=10)
+ parser.add_argument('--hidden_size', default=512, type=int)
+ parser.add_argument('--num_hidden_layers', default=8, type=int)
+ parser.add_argument('--use_moe', default=False, type=bool)
+ parser.add_argument('--max_seq_len', default=66, type=int)
+ parser.add_argument("--max_gen_len", type=int, default=1536)
+ parser.add_argument("--data_path", type=str, default="../dataset/rlaif-mini.jsonl")
+ parser.add_argument("--beta", type=float, default=0.02)
+ parser.add_argument("--reasoning", type=int, default=1, help='0:普通模型,1:推理模型')
+ args = parser.parse_args()
+
+ lm_config = MiniMindConfig(hidden_size=args.hidden_size, num_hidden_layers=args.num_hidden_layers,
+ max_seq_len=args.max_seq_len + args.max_gen_len,
+ use_moe=args.use_moe)
+ args.save_dir = os.path.join(args.out_dir)
+ os.makedirs(args.save_dir, exist_ok=True)
+ os.makedirs(args.out_dir, exist_ok=True)
+
+ ctx = nullcontext() if "cuda" not in args.device else torch.amp.autocast('cuda')
+ ddp = int(os.environ.get("RANK", -1)) != -1
+ ddp_local_rank, DEVICE = 0, "cuda:0"
+
+ base_seed = 1337
+ torch.manual_seed(base_seed)
+ torch.cuda.manual_seed(base_seed)
+
+ if ddp:
+ init_distributed_mode()
+ args.device = torch.device(DEVICE)
+ rank = dist.get_rank()
+ torch.manual_seed(base_seed + rank)
+ # 同时设置 CUDA 的随机种子
+ torch.cuda.manual_seed(base_seed + rank)
+
+ if args.use_wandb and (not ddp or ddp_local_rank == 0):
+ import swanlab as wandb
+
+ wandb.init(project=args.wandb_project)
+ else:
+ wandb = None
+
+ model, ref_model, tokenizer, reward_model, reward_tokenizer = init_model(lm_config=lm_config)
+ train_ds = RLAIFDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
+ train_sampler = DistributedSampler(train_ds) if ddp else None
+ train_loader = DataLoader(train_ds, batch_size=args.batch_size, pin_memory=True,
+ drop_last=False, shuffle=False,
+ num_workers=args.num_workers, sampler=train_sampler)
+
+ optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
+
+ iter_per_epoch = len(train_loader)
+ total_optimizer_steps = (iter_per_epoch // args.accumulation_steps) * args.epochs
+ scheduler = CosineAnnealingLR(optimizer, T_max=total_optimizer_steps, eta_min=args.learning_rate / 10)
+
+ if ddp:
+ model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
+ model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
+
+ value_tracker = AutoAdaptiveValueTracker(rho_mode='kl', rho_const=0.9, D_half=0.06, clip_lower=0.5, clip_upper=0.96)
+
+ for epoch in range(args.epochs):
+ spo_train_epoch(epoch, wandb, value_tracker)