From 6efba3249ada08443c5f7cdf527112084a4a544d Mon Sep 17 00:00:00 2001
From: jingyaogong
Date: Fri, 24 Oct 2025 01:18:33 +0800
Subject: [PATCH] [feat] update readme

---
 README.md    | 3 +--
 README_en.md | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index ee3b8a1..ac07f81 100644
--- a/README.md
+++ b/README.md
@@ -990,7 +990,7 @@ LLM里的强化学习方法可分两类:
 
 $$\mathcal{J}_{PO} = \mathbb{E}_{q \sim P(Q), o \sim \pi(O|q)} \left[ \underbrace{f(r_t)}_{\text{策略项}} \cdot \underbrace{g(A_t)}_{\text{优势项}} - \underbrace{h(\text{KL}_t)}_{\text{正则项}} \right]$$
 
-训练时,只需**最小化负目标函数**,即: $\mathcal{L_{PO}}=\mathcal{J_{PO}}$
+训练时,只需**最小化负目标函数**,即: $\mathcal{L_{PO}}=-\mathcal{J_{PO}}$
 
 这个框架只包含三个核心组件:
 * **策略项** $f(r_t)$: 如何使用概率比 $r_t$? 即告诉模型新旧策略偏差有多大,是否探索到了更好的token
@@ -1009,7 +1009,6 @@ $$\mathcal{J}_{PO} = \mathbb{E}_{q \sim P(Q), o \sim \pi(O|q)} \left[ \underbrac
 | $\text{KL}_t$ | KL散度 | 防止策略偏离参考模型太远 | $[0, +\infty)$ |
-
 
 
 
 不同的**xxPO算法**本质上只是对这三个组件的不同设计的实例化!
 

diff --git a/README_en.md b/README_en.md
index 4c2759c..8f218a1 100644
--- a/README_en.md
+++ b/README_en.md
@@ -968,7 +968,7 @@ The essence of all RL algorithms is only optimizing one expectation:
 
 $$\mathcal{J}_{PO} = \mathbb{E}_{q \sim P(Q), o \sim \pi(O|q)} \left[ \underbrace{f(r_t)}_{\text{policy term}} \cdot \underbrace{g(A_t)}_{\text{advantage term}} - \underbrace{h(\text{KL}_t)}_{\text{regularization term}} \right]$$
 
-During training, only **minimize the negative objective function**, i.e.: $\mathcal{L_{PO}}=\mathcal{J_{PO}}$
+During training, only **minimize the negative objective function**, i.e.: $\mathcal{L_{PO}}=-\mathcal{J_{PO}}$
 
 This framework contains only three core components:
 * **Policy term** $f(r_t)$: How to use probability ratio $r_t$? Tell the model how large the deviation between new and old policies is, whether better tokens are explored
@@ -987,7 +987,6 @@ This framework contains only three core components:
 | $\text{KL}_t$ | KL divergence | Prevent policy from deviating too far from reference model | $[0, +\infty)$ |
-
 
 
 
 Different **xxPO algorithms** are essentially just different design instantiations of these three components!
 

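The sign fix in this patch ($\mathcal{L}_{PO}=-\mathcal{J}_{PO}$) can be sketched per token in plain Python. This is an illustration, not the repository's code: the function and parameter names are hypothetical, and the PPO-clip choices for $f(r_t)$ and $g(A_t)$ plus the k3 estimator for $\text{KL}_t$ are one assumed instantiation of the framework among the many xxPO designs the README describes.

```python
import math

def po_loss(logp_new, logp_old, logp_ref, advantage,
            clip_eps=0.2, kl_beta=0.01):
    """Per-token loss L_PO = -J_PO for the generic framework.

    Components instantiated PPO-clip style (one possible design):
      f(r_t):  clipped probability ratio r_t = pi_new / pi_old
      g(A_t):  the advantage A_t itself
      h(KL_t): beta-weighted KL estimate against the reference policy
    """
    r_t = math.exp(logp_new - logp_old)                  # probability ratio, in (0, +inf)
    clipped = max(min(r_t, 1 + clip_eps), 1 - clip_eps)  # keep r_t near 1
    f_times_g = min(r_t * advantage, clipped * advantage)  # pessimistic bound
    log_ratio = logp_ref - logp_new
    kl_t = math.exp(log_ratio) - log_ratio - 1           # k3 estimator, always >= 0
    j_po = f_times_g - kl_beta * kl_t                    # objective J_PO (to maximize)
    return -j_po                                         # loss = -J_PO (to minimize)

# When new, old, and reference policies agree, r_t = 1 and KL_t = 0,
# so the loss reduces to minus the advantage:
print(po_loss(-1.0, -1.0, -1.0, 2.0))  # -2.0
```

With the pre-patch sign ($\mathcal{L}=\mathcal{J}$), gradient descent would push the advantage term down instead of up, i.e. it would actively suppress high-advantage tokens, which is why the one-character fix matters.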