mirror of https://github.com/jingyaogong/minimind.git
synced 2026-01-13 19:57:20 +08:00

[update] readme

This commit is contained in:
parent 4e73f34823
commit 9830915d87
@@ -939,7 +939,7 @@ MiniMind2第一时间只能坚定不移的选择做蒸馏派,日后基于0.1B
 In GRPO this is enforced with a rule-based reward function that constrains the model to follow the thinking and reply tags (in the early cold-start stages the reward value should be set somewhat higher).
 
 Another issue is that although the distillation process is the same as SFT, in practice the model struggles to produce a template-compliant reply every time, i.e., it drifts away from the thinking/reply tag constraints.
-The small trick here is to add an extra loss penalty on the tokens at the tag positions; see `train_distill_reason.py` for details:
+The small trick here is to add an extra loss penalty on the tokens at the tag positions; see `train_reason.py` for details:
 
 ```text
 # Add an extra penalty at the positions corresponding to sp_ids
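The diff only shows a fragment of that snippet (the rest falls between the two hunks). As a minimal, self-contained sketch of the trick, not the repository's exact code, the example below up-weights the per-token loss wherever the target token belongs to one of the special tags: the name `sp_ids` and the weight of 10 come from the README snippet, while the tag token list, tensor shapes, and the helper itself are assumptions.

```python
import torch

def boost_tag_token_loss(loss_mask: torch.Tensor,
                         labels: torch.Tensor,
                         tag_token_ids: list[int],
                         penalty: float = 10.0) -> torch.Tensor:
    """Up-weight the loss at positions whose target is a think/answer tag token.

    loss_mask: (batch, seq_len) per-token loss weights (1 for normal tokens, 0 for padding).
    labels:    (batch, seq_len) target token ids.
    tag_token_ids: ids of the tokens that make up the special tags (assumed).
    """
    # Positions whose target token belongs to a special tag.
    sp_ids = torch.isin(labels, torch.tensor(tag_token_ids, device=labels.device))
    loss_mask = loss_mask.clone()
    # Add the extra penalty at the positions corresponding to sp_ids.
    loss_mask[sp_ids] = penalty
    return loss_mask
```

The weighted mask is then used when averaging the per-token cross-entropy, so mistakes on the tag tokens cost roughly ten times more than mistakes elsewhere.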
@@ -953,9 +953,9 @@ loss_mask[sp_ids] = 10  # penalty coefficient
 By default the script performs reasoning-ability distillation fine-tuning on top of the RLHF base model; just start training directly:
 
 ```bash
-torchrun --nproc_per_node 1 train_distill_reason.py
+torchrun --nproc_per_node 1 train_reason.py
 # or
-python train_distill_reason.py
+python train_reason.py
 ```
 
 > After training, the model weight file is saved by default every `100 steps` as `reason_*.pth` (* is the model's dimension; each new save overwrites the old file).
@@ -928,7 +928,7 @@ The reply template for reasoning model R1 is:
 In GRPO this is enforced by a rule-based reward function that constrains the model to follow the thinking and reply tags (in the early cold-start stages the reward value should be set higher).
 
 Another issue is that although the distillation process is the same as SFT, experiments show the model has difficulty producing template-compliant replies every time, i.e., it drifts away from the thinking/reply tag constraints.
-A small trick here is to increase the loss penalty on the tokens at the marker positions. See `train_distill_reason.py` for details:
+A small trick here is to increase the loss penalty on the tokens at the marker positions. See `train_reason.py` for details:
 
 ```text
 # Add extra penalty to positions corresponding to sp_ids
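The rule-based reward mentioned at the top of this hunk can be as simple as a regular-expression check on the sampled completion. The sketch below is one way to write such a check, not MiniMind's actual implementation; the `<think>`/`<answer>` tag names and the reward values are assumptions based on a typical R1-style template.

```python
import re

# Assumed R1-style template: a thinking block followed by an answer block.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(completion: str, bonus: float = 1.0) -> float:
    """Rule-based reward: pay `bonus` only for replies that follow the tag template.

    As the README suggests, `bonus` can be set higher during the early cold-start
    phase so that format compliance dominates the GRPO reward signal.
    """
    return bonus if FORMAT_RE.match(completion.strip()) else 0.0

# format_reward("<think>2+2=4</think><answer>4</answer>")  -> 1.0
# format_reward("The answer is 4.")                         -> 0.0
```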
@@ -942,9 +942,9 @@ Therefore, `r1_mix_1024.jsonl` mixed approximately 10k multi-turn conversations
 By default the script performs reasoning-ability distillation fine-tuning on top of the RLHF model; start training directly:
 
 ```bash
-torchrun --nproc_per_node 1 train_distill_reason.py
+torchrun --nproc_per_node 1 train_reason.py
 # or
-python train_distill_reason.py
+python train_reason.py
 ```
 
 > After training, model weight files are saved by default every `100 steps` as `reason_*.pth` (where * is the model's dimension; each new save overwrites the old file).
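A minimal sketch of the overwrite-on-save behaviour described in the note above; the file-name pattern comes from the README, while the `model`/`dim` arguments and the configurable interval are assumptions.

```python
import torch

def maybe_save(model: torch.nn.Module, step: int, dim: int, interval: int = 100) -> None:
    # Every `interval` steps, write to the same path (e.g. reason_512.pth),
    # so each new checkpoint overwrites the previous one.
    if step > 0 and step % interval == 0:
        torch.save(model.state_dict(), f"reason_{dim}.pth")
```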