[update] readme

This commit is contained in:
jingyaogong 2026-01-05 23:15:25 +08:00
parent 4e73f34823
commit 9830915d87
2 changed files with 6 additions and 6 deletions


@ -939,7 +939,7 @@ For now, MiniMind2 can only firmly take the distillation route; later, based on the 0.1B
In GRPO this is constrained by a rule-based reward function that forces the model to comply with the thinking and reply tags (in the early cold-start stage the reward value should be set somewhat higher).
Another issue is that, although the distillation process is the same as SFT, experiments show the model struggles to produce template-compliant replies every time, i.e., it drifts away from the thinking and reply tag constraints.
A small trick here is to increase the loss penalty on the marker-position tokens; see `train_distill_reason.py` for details:
A small trick here is to increase the loss penalty on the marker-position tokens; see `train_reason.py` for details:
```text
# Add an extra penalty at the positions corresponding to sp_ids
@ -953,9 +953,9 @@ loss_mask[sp_ids] = 10 # penalty coefficient
By default the script performs reasoning-ability distillation fine-tuning on the base model after RLHF; simply launch training as follows:
```bash
torchrun --nproc_per_node 1 train_distill_reason.py
torchrun --nproc_per_node 1 train_reason.py
# or
python train_distill_reason.py
python train_reason.py
```
> After training, model weight files are saved by default every `100 steps` as: `reason_*.pth` (where * is the model's specific dimension; each save overwrites the previous file)
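
The hunks above only show a fragment of this trick (`loss_mask[sp_ids] = 10`). Below is a minimal sketch of the idea, assuming a token-level cross-entropy weighted by a per-token mask; the tag token IDs, tensor shapes, and function name are hypothetical and not taken from `train_reason.py`:

```python
import torch

# Hypothetical IDs of the tokens forming <think> / </think> / <answer> / </answer>;
# in practice they would come from tokenizer("<think>", add_special_tokens=False).input_ids, etc.
SPECIAL_TOKEN_IDS = torch.tensor([1, 2, 3, 4])

def boost_marker_positions(labels: torch.Tensor, loss_mask: torch.Tensor,
                           penalty: float = 10.0) -> torch.Tensor:
    """Return a loss mask whose weight is raised wherever the target token is a tag token."""
    sp_ids = torch.isin(labels, SPECIAL_TOKEN_IDS)  # boolean mask of tag-token positions
    boosted = loss_mask.clone().float()
    boosted[sp_ids] = penalty                       # weigh tag tokens e.g. 10x in the loss
    return boosted

# Weighted token-level loss (logits: (batch, seq, vocab), labels/loss_mask: (batch, seq)):
# per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
# loss = (per_token * boost_marker_positions(labels, loss_mask)).sum() / loss_mask.sum()
```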


@ -928,7 +928,7 @@ The reply template for reasoning model R1 is:
This is constrained in GRPO by setting a rule-based reward function to make the model comply with the thinking and reply tags (in the early stage of the cold start, the reward value should be set higher).
Another issue is that, although the distillation process is the same as SFT, experimental results show the model has difficulty producing template-compliant replies every time, i.e., it drifts away from the thinking and reply tag constraints.
A small trick here is to increase the loss penalty for marker position tokens. See details in `train_distill_reason.py`:
A small trick here is to increase the loss penalty for marker position tokens. See details in `train_reason.py`:
```text
# Add extra penalty to positions corresponding to sp_ids
@ -942,9 +942,9 @@ Therefore, `r1_mix_1024.jsonl` mixed approximately 10k multi-turn conversations
By default, the script performs reasoning-ability distillation fine-tuning on top of the RLHF-tuned base model. You can start training directly:
```bash
torchrun --nproc_per_node 1 train_distill_reason.py
torchrun --nproc_per_node 1 train_reason.py
# or
python train_distill_reason.py
python train_reason.py
```
> After training, model weight files are saved by default every `100 steps` as: `reason_*.pth` (where * is the model's specific dimension; each save overwrites the previous file)
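
As a companion to the rule-based reward mentioned above, here is a minimal sketch of a format-compliance reward term, assuming an R1-style `<think>...</think><answer>...</answer>` template; the tag strings, function name, and reward values are illustrative and not the repository's actual GRPO reward:

```python
import re

# Assumed R1-style template; the real tag strings should match the project's chat template.
_TEMPLATE = re.compile(r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(completion: str, cold_start_scale: float = 1.0) -> float:
    """Rule-based reward: positive if the reply follows the thinking/answer template, else 0.
    cold_start_scale lets the format term be weighted more heavily early in training."""
    return cold_start_scale * (1.0 if _TEMPLATE.match(completion) else 0.0)

# Example: early in the cold start the format term can be scaled up (e.g. 2x)
# so the policy locks onto the tag structure before the task reward dominates.
print(format_reward("<think>reasoning steps</think><answer>42</answer>", cold_start_scale=2.0))  # 2.0
```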