[update] readme

2026-06-06 00:04:50 +00:00 · 2026-04-09 19:00:41 +08:00
parent 939dc8ff42
commit b2488e6440
2 changed files with 2 additions and 4 deletions
@@ -1582,7 +1582,7 @@ HF_ENDPOINT=https://hf-mirror.com lm_eval --model hf --model_args pretrained="/p

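 MiniMind is trained on far less data than the other models in the table, and its training mix skews toward Chinese, so its English performance is poor. By default it has also not been fine-tuned to align with this multiple-choice evaluation format, so its scores are relatively weak; the results are for entertainment only: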

-| model name | from | params | ZH (ceval / cmmlu) | EN (arc / piqa / OBQA / HellaSwag / siqa) |
+| model name | from | params | zh (ceval / cmmlu) | en (arc / piqa / openbookqa / hellaswag / social_iqa) |
 |---|---|---|---|---|
 | minimind-3 | current | 64M | 24.89 / 25.38 | 28.49 / 50.65 / 23.60 / 28.28 / 34.19 |
 | minimind-3-moe | current | 198M | 25.48 / 24.32 | 27.74 / 50.71 / 26.20 / 27.43 / 34.03 |
@@ -1603,7 +1603,6 @@ minimind-3-exam 不是更大的基座模型，也几乎没有额外注入新知
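 All this experiment aims to show is that, for this kind of benchmark, a small model's bottleneck may lie not only in knowledge itself but also in an input format that is not aligned. With only a small amount of format alignment, minimind-3-exam improves by about 2.9 percentage points on average across the 7 tasks above.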

 </details>
-<br/>

 ![benchmark_radar](./images/benchmark_radar.jpg)

@@ -1579,7 +1579,7 @@ HF_ENDPOINT=https://hf-mirror.com lm_eval --model hf --model_args pretrained="/p

 MiniMind is trained on far less data than the other models listed here, and its training mix is heavily skewed toward Chinese, so its English performance is relatively weak. It is also not specifically aligned to this multiple-choice evaluation format by default, so the results here are only for entertainment:

-| model name | from | params | ZH (ceval / cmmlu) | EN (arc / piqa / OBQA / HellaSwag / siqa) |
+| model name | from | params | zh (ceval / cmmlu) | en (arc / piqa / openbookqa / hellaswag / social_iqa) |
 |---|---|---|---|---|
 | minimind-3 | current | 64M | 24.89 / 25.38 | 28.49 / 50.65 / 23.60 / 28.28 / 34.19 |
 | minimind-3-moe | current | 198M | 25.48 / 24.32 | 27.74 / 50.71 / 26.20 / 27.43 / 34.03 |
@@ -1600,7 +1600,6 @@ The 7 benchmarks used in this section have no sample overlap with the alignment
 What this experiment suggests is simple: for this kind of benchmark, the bottleneck of a small model may lie not only in knowledge itself but also in whether the input format is aligned. With only a small amount of format alignment, minimind-3-exam improves by about 2.9 percentage points on average across the 7 tasks above.

 </details>
-<br/>

 ![benchmark_radar](./images/benchmark_radar.jpg)