Policy Gradient
The goal of policy gradient methods is to find an optimal policy $\pi^*_{\theta}$ that maximizes the expected cumulative return $J(\theta)$ the agent collects while interacting with the environment.
The most direct way to optimize this is gradient ascent, i.e., updating the parameters along the direction of the gradient:
$$\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}J(\theta)$$
Here $\alpha$ is the learning rate and $\nabla_{\theta}J(\theta)$ is the gradient of the objective $J(\theta)$ with respect to the parameters $\theta$, i.e., the policy gradient.
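As a quick illustration of the update rule itself, here is a minimal gradient-ascent sketch on a toy differentiable objective (a made-up quadratic, not an RL objective):

```python
import torch

# Toy gradient ascent: theta_new = theta_old + alpha * grad J(theta).
# J here is a made-up concave quadratic, maximized at (1, -2); in policy
# gradient, J would instead be an estimate of the expected return.
theta = torch.zeros(2, requires_grad=True)
alpha = 0.1

for _ in range(200):
    J = -((theta - torch.tensor([1.0, -2.0])) ** 2).sum()
    (grad,) = torch.autograd.grad(J, theta)
    with torch.no_grad():
        theta += alpha * grad   # ascend, since we are maximizing J

print(theta)  # converges to approximately [1.0, -2.0]
```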
Computing the gradient of $J(\theta)$ directly is hard, because $J(\theta)$ depends on the probability distribution over the whole trajectories produced by interacting with the environment, and this distribution itself depends on $\theta$.
Policy gradient methods therefore rely on the following expression for the gradient (the policy gradient theorem), which can be estimated:
$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla_{\theta} \pi(a|s, \theta)$$
- $\pi(a|s, \theta)$: our policy, i.e., the probability of choosing action $a$ in state $s$
- $\nabla_{\theta} \pi(a|s, \theta)$: the gradient of the policy with respect to the parameters $\theta$. This vector points in the direction along which updating $\theta$ increases the probability $\pi(a|s, \theta)$ of choosing action $a$ in state $s$ the fastest
- $q_{\pi}(s, a)$: the action-value function (Q-function), the expected total return from taking action $a$ in state $s$ and then following the current policy $\pi$
- $\mu(s)$: the stationary distribution of states $s$ under the current policy $\pi$
Analysis:

- Inner sum ($\sum_{a}$): for a particular state $s$, we iterate over all possible actions $a$
- $q_{\pi}(s, a) \cdot \nabla_{\theta} \pi(a|s, \theta)$: a weighted gradient. $\nabla_{\theta} \pi(a|s, \theta)$ gives the direction that increases the probability of action $a$, and $q_{\pi}(s, a)$ assigns a weight to that direction
- Outer sum ($\sum_{s}$): summing the gradient contributions of all states $s$, weighted by $\mu(s)$, gives the final policy gradient
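To see the two sums in code, here is a minimal sketch that evaluates the expression exactly for a tiny tabular problem; the state distribution $\mu$ and the Q-values are made-up constants standing in for quantities we normally cannot access:

```python
import torch

# Exact evaluation of  sum_s mu(s) sum_a q_pi(s,a) grad_theta pi(a|s,theta)
# for a tiny 2-state / 2-action tabular softmax policy.
theta = torch.zeros(2, 2, requires_grad=True)   # softmax logits theta[s, a]
mu = torch.tensor([0.7, 0.3])                   # state distribution mu(s)
q = torch.tensor([[1.0, 0.0],                   # action values q_pi(s, a)
                  [0.2, 0.5]])

pi = torch.softmax(theta, dim=-1)               # pi(a | s, theta)

# mu and q are constants, so differentiating this scalar w.r.t. theta yields
# exactly the double sum above.
surrogate = (mu[:, None] * q * pi).sum()
(policy_grad,) = torch.autograd.grad(surrogate, theta)
print(policy_grad)                              # direction proportional to grad J(theta)
```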
However, this expression cannot be computed directly, because we know neither the true state distribution $\mu(s)$ nor the true action values $q_{\pi}(s, a)$. So we approximate it by sampling.
$$\theta_{t+1} = \theta_t + \alpha \sum_{a} \hat{q}(S_t, a, w) \nabla_{\theta} \pi(a|S_t, \theta)$$
The changes here are:

- Instead of summing over all states $s$, at a given time step $t$ we use the current state $S_t$. This amounts to sampling a state from the state distribution $\mu(s)$
- A learned approximate value function $\hat{q}(S_t, a, w)$ replaces the true $q_{\pi}(S_t, a)$. This $\hat{q}$ is the Critic, trained to approximate the true Q-function

This update is called the All-Actions method, because in principle it computes $\hat{q}$ and the policy gradient for every action $a$ in the current state $S_t$.
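A minimal sketch of one all-actions update, assuming a small tabular softmax policy and a hypothetical critic table `q_hat` (both stand-ins, not tied to any particular library):

```python
import torch

# Tabular softmax policy: theta[s, a] are logits; q_hat is a critic's estimate
# of q_pi(s, a) (random numbers here as a stand-in).
num_states, num_actions = 5, 3
theta = torch.zeros(num_states, num_actions, requires_grad=True)
q_hat = torch.rand(num_states, num_actions)
alpha = 0.05

S_t = 2                                            # state sampled at time step t
pi_s = torch.softmax(theta[S_t], dim=-1)           # pi(. | S_t, theta)

# With q_hat held fixed, the gradient of this scalar w.r.t. theta equals
# sum_a q_hat(S_t, a) * grad_theta pi(a | S_t, theta).
surrogate = (q_hat[S_t] * pi_s).sum()
(grad,) = torch.autograd.grad(surrogate, theta)

with torch.no_grad():
    theta += alpha * grad                          # theta_{t+1} = theta_t + alpha * ...
```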
REINFORCE
The REINFORCE algorithm is a simpler Monte Carlo method that further simplifies and approximates the all-actions update above:
- No Critic: REINFORCE does not learn a $\hat{q}$ function
- No "All-Actions": it only considers the action $a_t$ actually taken at time step $t$
- $G_t$ replaces $q_{\pi}(S_t, a_t)$: the full return $G_t$ from time $t$ to the end of the episode (a concrete, noisy sample) serves as an unbiased estimate of $q_{\pi}(S_t, a_t)$
The REINFORCE update therefore becomes:
$$\theta_{t+1} = \theta_t + \alpha\, G_t \nabla_{\theta} \log \pi(a_t|S_t, \theta)$$
Note: using $\nabla \log \pi$ here is a mathematically equivalent reformulation known as the log-derivative trick; it targets the same objective as $\nabla \pi$ but is more stable and convenient to compute.

As you can see, REINFORCE updates only the action actually taken, using the future return $G_t$ of the trajectory as the measure of how good that action was, which is why it is called Monte Carlo policy gradient.
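A minimal REINFORCE update in PyTorch, with a small softmax policy and made-up rollout data standing in for real environment interaction:

```python
import torch
import torch.nn as nn

# Minimal REINFORCE update for one rollout. The states/actions/rewards below
# are stand-ins for data collected by running the current policy.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

states = torch.randn(10, 4)              # S_0 .. S_9
actions = torch.randint(0, 2, (10,))     # a_0 .. a_9 actually taken
rewards = torch.rand(10)                 # rewards received after each step

# Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards.
returns = torch.zeros(10)
G = 0.0
for t in reversed(range(10)):
    G = rewards[t] + gamma * G
    returns[t] = G

dist = torch.distributions.Categorical(logits=policy(states))
log_probs = dist.log_prob(actions)       # log pi(a_t | S_t, theta)

# Ascending sum_t G_t * grad log pi(a_t|S_t)  ==  descending on its negative.
loss = -(returns * log_probs).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```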
REINFORCE Leave-One-Out (RLOO)
In standard REINFORCE, we compute one return $G_t$ per sampled trajectory and use it directly as the advantage. This is unbiased, but the variance is large.
The core idea of RLOO is to sample $K$ trajectories for the same state (or the same prompt) at once, and to use the average return of the other $K-1$ trajectories as the baseline for the current one, forming a Leave-One-Out (LOO) advantage. This significantly reduces variance while remaining unbiased.
Let $q$ be the prompt, $a_k$ the $k$-th sampled answer, and $R(q, a_k)$ its reward. Then
$$\begin{aligned} b(q,a_k) &= \frac{1}{K-1}\sum_{i=1,i\neq k}^{K} R(q,a_i),\\ A(q,a_k) &= R(q,a_k)-b(q,a_k) = \frac{K}{K-1}\Bigl(R(q,a_k) - \tfrac{1}{K}\sum_{i=1}^{K} R(q,a_i)\Bigr). \end{aligned}$$
The policy is still updated along $\nabla_\theta\log\pi_\theta(a_k|q)\,A(q,a_k)$, but because the baseline is built from the same batch of samples, the variance drops substantially.
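A small sketch of the leave-one-out advantage for $K$ sampled answers to one prompt (made-up rewards):

```python
import torch

K = 4
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])      # R(q, a_k) for K sampled answers

baseline = (rewards.sum() - rewards) / (K - 1)    # mean of the other K-1 rewards
advantages = rewards - baseline                   # A(q, a_k)

# Equivalent form: K/(K-1) * (R(q, a_k) - mean over all K rewards)
assert torch.allclose(advantages, K / (K - 1) * (rewards - rewards.mean()))
```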
REINFORCE++
Follow-up work found that single-prompt, multi-trajectory methods such as RLOO/GRPO easily overfit simple prompts in RLHF, can exhibit reward hacking, and reduce batch diversity. REINFORCE++ was proposed to address this, with the following improvements:
Global batch-normalized baseline: instead of computing a baseline per prompt, use the mean return $\overline{R}_{\text{batch}}$ of the entire mini-batch (see the sketch further below):
$$\begin{aligned} A_{q,o_t} &= r(o_{1:t},q) - \beta\sum_{i=t}^{T}\mathrm{KL}(i), \\ \mathrm{KL}(t) &= \log\frac{\pi_{\text{RL},\theta_{\text{old}}}(o_t \mid q, o_{<t})}{\pi_{\text{SFT}}(o_t \mid q, o_{<t})}, \\ \widetilde{A}_{q,o_t} &= \frac{A_{q,o_t}-\mu_{\text{batch}}}{\sigma_{\text{batch}}} \end{aligned}$$
where $(\mu_{\text{batch}},\sigma_{\text{batch}})$ are the mean and standard deviation of the rewards over the entire batch, giving a batch-normalized advantage that is unbiased and has much lower variance.
PPO clipping, but no Critic: plug the $\widetilde{A}$ above directly into the PPO surrogate loss and keep the probability-ratio clipping, so the step size stays safe while no critic model is needed:
$$L(\theta)=\mathbb{E}_{q,o}\Bigl[\tfrac{1}{|o|}\sum_{t} \min\bigl(r_t(\theta)\,\widetilde{A}_{q,o_t},\; \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,\widetilde{A}_{q,o_t}\bigr)\Bigr],$$
$$r_t(\theta)=\frac{\pi_{\theta}(o_t\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,o_{<t})}.$$
One answer per prompt: each prompt samples only one output, with the batch normalization providing a shared baseline. This avoids the prompt-level overfitting of RLOO/GRPO and leaves the GPU batch capacity for more distinct prompts, further improving diversity.
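A rough sketch of the REINFORCE++ advantage and clipped loss for one batch, using made-up tensors. It assumes the outcome reward is attributed to the whole response (undiscounted return) and the KL penalty is summed over each suffix, which is one way to read the formulas above:

```python
import torch

def suffix_sum(x: torch.Tensor) -> torch.Tensor:
    """sum_{i>=t} x_i along the last (token) dimension."""
    return torch.flip(torch.cumsum(torch.flip(x, dims=[1]), dim=1), dims=[1])

batch, T = 8, 16
beta, eps = 0.01, 0.2

outcome = torch.rand(batch)                          # scalar reward per response
logp_old = -3.0 + 0.1 * torch.randn(batch, T)        # log pi_old(o_t | q, o_<t)
logp_sft = -3.0 + 0.1 * torch.randn(batch, T)        # log pi_SFT(o_t | q, o_<t)
logp_new = logp_old + 0.05 * torch.randn(batch, T)   # log pi_theta (would carry grad)

token_reward = torch.zeros(batch, T)
token_reward[:, -1] = outcome                        # outcome reward on the last token
kl = logp_old - logp_sft                             # per-token KL(t) estimate

# A_{q,o_t} = (undiscounted return from t) - beta * sum_{i>=t} KL(i)
A = suffix_sum(token_reward) - beta * suffix_sum(kl)

# Global batch normalization: z-score with mu_batch, sigma_batch.
A = (A - A.mean()) / (A.std() + 1e-8)

# PPO-style clipped surrogate without a critic.
ratio = torch.exp(logp_new - logp_old)               # r_t(theta)
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
loss = -torch.minimum(ratio * A, clipped * A).mean(dim=-1).mean()
```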
verl practice
In verl, two advantage computations are provided for REINFORCE++:
```python
from collections import defaultdict

import torch
import verl.utils.torch_functional as verl_F


def compute_reinforce_plus_plus_outcome_advantage(
    token_level_rewards: torch.Tensor, response_mask: torch.Tensor, gamma: torch.Tensor
):
    """
    Compute advantage for REINFORCE++.
    This implementation is based on the paper: https://arxiv.org/abs/2501.03262

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)

    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """
    with torch.no_grad():
        returns = torch.zeros_like(token_level_rewards)
        running_return = 0

        for t in reversed(range(token_level_rewards.shape[1])):
            running_return = token_level_rewards[:, t] + gamma * running_return
            returns[:, t] = running_return
            # Zero the running return for positions outside the response mask
            running_return = running_return * response_mask[:, t]

        advantages = verl_F.masked_whiten(returns, response_mask)
        advantages = advantages * response_mask

    return advantages, returns


def compute_reinforce_plus_plus_baseline_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: torch.Tensor,
    epsilon: float = 1e-6,
):
    """
    Compute advantage for RF++-baseline (https://arxiv.org/abs/2501.03262),
    operating only on Outcome reward (with only one scalar reward for each response).

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)

    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """
    response_length = token_level_rewards.shape[-1]
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    id2mean = {}

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            id2score[index[i]].append(scores[i])
        for idx in id2score:
            if len(id2score[idx]) == 1:
                id2mean[idx] = torch.tensor(0.0)
            elif len(id2score[idx]) > 1:
                id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
            else:
                raise ValueError(f"no score in prompt index: {idx}")
        # Subtract the per-prompt group mean from each response's score
        for i in range(bsz):
            scores[i] = scores[i] - id2mean[index[i]]

        # Broadcast the scalar advantage over tokens, then batch-whiten
        scores = scores.unsqueeze(-1).tile([1, response_length]) * response_mask
        scores = verl_F.masked_whiten(scores, response_mask) * response_mask

    return scores, scores
```
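Both functions rely on `verl_F.masked_whiten`, which standardizes values using statistics computed only over the masked (valid) positions. A minimal sketch of the idea (not verl's exact implementation):

```python
import torch

def masked_whiten_sketch(values: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8):
    """Standardize `values` to zero mean / unit variance, using only positions
    where `mask` is 1 to compute the statistics."""
    mean = (values * mask).sum() / mask.sum()
    var = ((values - mean) ** 2 * mask).sum() / mask.sum()
    return (values - mean) / torch.sqrt(var + eps)
```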
| `reinforce_plus_plus_outcome_advantage` | `reinforce_plus_plus_baseline_outcome_advantage` |
| --- | --- |
| Computes a per-token return (Monte Carlo return): $G_t = r_t + \gamma G_{t+1}$ | Uses only the outcome reward: all rewards of a response are summed into one scalar, and every token in that response shares it |
| Global batch normalization (batch z-score) | Subtracts the per-prompt group mean first, then applies global batch normalization (batch z-score) |

`reinforce_plus_plus_baseline_outcome_advantage` therefore reduces variance at two levels, per-prompt group normalization followed by a global batch z-score, and ends up with lower variance.
By comparison, GRPO:
```python
# From the same verl module as above (torch, numpy as np, defaultdict, verl_F already imported).
def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
):
    """
    Compute advantage for GRPO, operating only on Outcome reward
    (with only one scalar reward for each response).

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape is (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape is (bs, response_length)
        norm_adv_by_std_in_grpo: (bool)
            whether to scale the GRPO advantage.
            If True, the advantage is scaled by the std, as in the original GRPO.
            If False, the advantage is not scaled, as in Dr.GRPO (https://arxiv.org/abs/2503.20783).

    Returns:
        advantages: `(torch.Tensor)`
            shape is (bs, response_length)
        returns: `(torch.Tensor)`
            shape is (bs, response_length)
    """
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    id2mean = {}
    id2std = {}

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            id2score[index[i]].append(scores[i])
        for idx in id2score:
            if len(id2score[idx]) == 1:
                id2mean[idx] = torch.tensor(0.0)
                id2std[idx] = torch.tensor(1.0)
            elif len(id2score[idx]) > 1:
                id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
                id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
            else:
                raise ValueError(f"no score in prompt index: {idx}")
        # Per-prompt group normalization: subtract mean, optionally divide by std
        for i in range(bsz):
            if norm_adv_by_std_in_grpo:
                scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
            else:
                scores[i] = scores[i] - id2mean[index[i]]
        scores = scores.unsqueeze(-1) * response_mask

    return scores, scores
```
Like `compute_reinforce_plus_plus_baseline_outcome_advantage`, it is outcome-only: it first computes one overall reward per response, then computes the mean and standard deviation within each prompt group.
`norm_adv_by_std_in_grpo` controls whether to divide by the standard deviation (plus epsilon). Dr. GRPO argues that dividing by the std introduces a difficulty bias (prompt groups with low reward variance, i.e., very easy or very hard ones, get scaled up), but without this scaling Dr. GRPO has slightly higher variance.