REINFORCE / RLOO / REINFORCE++


Policy Gradient

The goal of policy gradient methods is to find an optimal policy $\pi^*_{\theta}$ that maximizes the expected cumulative return $J(\theta)$ the agent collects while interacting with the environment.

The most direct optimization method is gradient ascent, i.e., updating the parameters along the direction of the gradient:

$$\theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_{\theta} J(\theta)$$

Here $\alpha$ is the learning rate and $\nabla_{\theta} J(\theta)$ is the gradient of the objective $J(\theta)$ with respect to the parameters $\theta$, i.e., the policy gradient.

Computing the gradient of $J(\theta)$ directly is hard, because $J(\theta)$ depends on the probability distribution over entire trajectories produced by interacting with the environment, and that distribution itself depends on $\theta$.

So policy gradient methods rely on an estimate of the following form (the policy gradient theorem):

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a)\, \nabla_{\theta} \pi(a|s, \theta)$$

  • $\pi(a|s, \theta)$: our policy, i.e., the probability of choosing action $a$ in state $s$
  • $\nabla_{\theta} \pi(a|s, \theta)$: the gradient of the policy with respect to the parameters $\theta$. This vector points in the direction along which updating $\theta$ increases the probability $\pi(a|s, \theta)$ of choosing action $a$ in state $s$ the fastest
  • $q_{\pi}(s, a)$: the action-value function (Q-function), i.e., the expected total return obtained by taking action $a$ in state $s$ and then following the current policy $\pi$
  • $\mu(s)$: the stationary distribution over states $s$ under the current policy $\pi$

Interpretation:

  1. Inner sum ($\sum_{a}$): for a given state $s$, we iterate over all possible actions $a$
  2. $q_{\pi}(s, a) \cdot \nabla_{\theta} \pi(a|s, \theta)$: a weighted gradient. $\nabla_{\theta} \pi(a|s, \theta)$ tells us the direction that increases the probability of action $a$, while $q_{\pi}(s, a)$ weights that direction by how valuable the action is
  3. Outer sum ($\sum_{s}$): weighting the per-state contributions by $\mu(s)$ and summing them yields the final policy gradient

This expression still cannot be computed directly, because we know neither the true state distribution $\mu(s)$ nor the true action values $q_{\pi}(s, a)$. We therefore approximate it by sampling:

$$\theta_{t+1} = \theta_t + \alpha \sum_{a} \hat{q}(S_t, a, w)\, \nabla_{\theta} \pi(a|S_t, \theta)$$

The changes here are:

  1. Instead of summing over all states $s$, we use the current state $S_t$ at time step $t$, which amounts to drawing one sample from the state distribution $\mu(s)$
  2. The true $q_{\pi}(S_t, a)$ is replaced by an approximate value function $\hat{q}(S_t, a, w)$. This $\hat{q}$ is the critic, learned to approximate the true Q-function

This update rule is called the all-actions method, because in principle it evaluates $\hat{q}$ and the policy gradient for every action $a$ in the current state $S_t$.
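To make this concrete, here is a minimal PyTorch sketch of the all-actions update; `policy_net` (returning action logits) and `q_hat` (a learned critic returning one value per action) are hypothetical names for illustration, not from any specific library:

```python
import torch

def all_actions_update(policy_net, q_hat, state, optimizer):
    """One all-actions policy-gradient step at state S_t (illustrative sketch)."""
    probs = torch.softmax(policy_net(state), dim=-1)   # pi(a | S_t, theta), one entry per action
    with torch.no_grad():
        q_values = q_hat(state)                        # q_hat(S_t, a, w), one entry per action
    # The gradient of this objective w.r.t. theta is sum_a q_hat(S_t, a, w) * grad pi(a | S_t, theta),
    # because q_values is treated as a constant (no_grad).
    objective = (q_values * probs).sum()
    optimizer.zero_grad()
    (-objective).backward()                            # gradient ascent = minimize the negative
    optimizer.step()
```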

REINFORCE

The REINFORCE algorithm is a simpler Monte Carlo method that further simplifies and approximates the update above:

  1. No critic: REINFORCE does not learn a $\hat{q}$ function
  2. No all-actions sum: it only considers the action $a_t$ that was actually taken in state $S_t$
  3. $G_t$ instead of $q_{\pi}(S_t, a_t)$: it uses the full return $G_t$ from time $t$ until the end of the episode (a concrete, stochastic sample) as an unbiased estimate of $q_{\pi}(S_t, a_t)$

The REINFORCE update therefore becomes:

$$\theta_{t+1} = \theta_t + \alpha\, G_t\, \nabla_{\theta} \log \pi(a_t|S_t, \theta)$$

Note: using $\nabla \log \pi$ here is an equivalent transformation known as the log-derivative trick, $\nabla_{\theta} \log \pi = \nabla_{\theta} \pi / \pi$; it optimizes the same objective as $\nabla \pi$ but is more stable and convenient to compute.

In other words, REINFORCE updates only the action that was actually executed, using the future return $G_t$ of the sampled trajectory as the measure of how good it was; this is why it is called Monte Carlo policy gradient.
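A minimal PyTorch sketch of one REINFORCE update from a finished episode (the `policy_net` name and the episode format are assumptions for illustration):

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    """One Monte Carlo policy-gradient step from a single finished episode."""
    # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    logits = policy_net(torch.stack(states))                       # (T, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    taken = log_probs[torch.arange(len(actions)), torch.tensor(actions)]

    # Minimizing -G_t * log pi(a_t | S_t) performs gradient ascent on J(theta)
    loss = -(returns * taken).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```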

REINFORCE Leave-One-Out (RLOO)


In standard REINFORCE we compute one return $G_t$ per sampled trajectory and use it directly as the advantage. This is unbiased, but the variance is high.

The core idea of RLOO is to sample $K$ trajectories for the same state (or the same prompt) and to use the average reward of the other $K-1$ trajectories as the baseline for the current one. This Leave-One-Out (LOO) advantage significantly reduces variance while remaining unbiased.

Let $q$ be the prompt, $a_k$ the $k$-th sampled answer, and $R(q,a_k)$ its reward; then

$$\begin{aligned} b(q,a_k) &= \frac{1}{K-1}\sum_{i=1,\,i\neq k}^{K} R(q,a_i),\\ A(q,a_k) &= R(q,a_k)-b(q,a_k) = \frac{K}{K-1}\Bigl(R(q,a_k) - \frac{1}{K}\sum_{i=1}^{K} R(q,a_i)\Bigr). \end{aligned}$$

The policy update still follows the direction of $\nabla_\theta\log\pi_\theta(a_k|q)\,A(q,a_k)$, but because the baseline for $a_k$ is built only from the other samples in the same batch (and is therefore independent of $a_k$), the estimator stays unbiased while its variance drops sharply.
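A small sketch of the leave-one-out computation, assuming one scalar reward per sampled answer:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for K sampled answers to the same prompt.

    rewards: shape (K,), one scalar R(q, a_k) per answer.
    """
    k = rewards.numel()
    baseline = (rewards.sum() - rewards) / (k - 1)   # mean of the other K-1 rewards
    return rewards - baseline                        # equals K/(K-1) * (R_k - mean(R))

# Example: rloo_advantages(torch.tensor([1.0, 0.0, 0.5, 0.5]))
# -> tensor([ 0.6667, -0.6667,  0.0000,  0.0000])
```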

REINFORCE++

Follow-up work found that single-prompt, multi-sample methods such as RLOO/GRPO tend to overfit easy prompts in RLHF and can exhibit reward hacking, while also reducing batch diversity.

Hence REINFORCE++, whose changes are:

  • Global batch-normalized baseline: rather than computing a baseline per prompt, use the mean return of the entire mini-batch, $\overline{R}_{\text{batch}}$

$$\begin{aligned} A_{q,o_t} &= r(o_{1:t},q) - \beta\sum_{i=t}^{T}\mathrm{KL}(i),\\ \mathrm{KL}(t) &= \log\frac{\pi_{\text{RL},\theta_{\text{old}}}(o_t \mid q, o_{<t})}{\pi_{\text{SFT}}(o_t \mid q, o_{<t})},\\ \widetilde{A}_{q,o_t} &= \frac{A_{q,o_t}-\mu_{\text{batch}}}{\sigma_{\text{batch}}} \end{aligned}$$

where $(\mu_{\text{batch}},\sigma_{\text{batch}})$ are the mean and standard deviation of the rewards over the entire batch. This batch-normalized advantage is unbiased and substantially reduces variance (a minimal sketch follows this list)

  • PPO clipping, but no critic: plug $\widetilde{A}$ from above directly into the PPO surrogate loss and keep the probability-ratio clipping, which keeps step sizes safe while avoiding a critic model

$$L(\theta)=\mathbb{E}_{q,o}\Bigl[\frac{1}{|o|}\sum_{t} \min\bigl(r_t(\theta)\,\widetilde{A}_{q,o_t},\; \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,\widetilde{A}_{q,o_t}\bigr)\Bigr],$$

$$r_t(\theta)=\frac{\pi_{\theta}(o_t\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,o_{<t})}.$$

  • One answer per prompt: each prompt samples only a single output, and batch normalization provides the shared baseline. This avoids the prompt-level overfitting of RLOO/GRPO and leaves GPU batch capacity for more distinct prompts, further improving diversity
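A minimal sketch of the advantage construction above, assuming per-token log-probabilities from the old RL policy and the SFT policy are already available; the function name and tensor layout are illustrative, not verl's API:

```python
import torch

def reinforce_pp_advantages(reward, logp_rl, logp_sft, beta=0.01, eps=1e-6):
    """Token-level REINFORCE++ advantages for a batch of responses (illustrative).

    reward:   (bs,)    scalar outcome reward per response
    logp_rl:  (bs, T)  log pi_RL(o_t | q, o_<t) under the old policy
    logp_sft: (bs, T)  log pi_SFT(o_t | q, o_<t)
    """
    kl = logp_rl - logp_sft                                    # per-token KL(t) estimate
    # Suffix sum: sum_{i >= t} KL(i), computed by flip -> cumsum -> flip back
    kl_tail = torch.flip(torch.cumsum(torch.flip(kl, [1]), dim=1), [1])
    adv = reward.unsqueeze(-1) - beta * kl_tail                # A_{q, o_t}
    # Global batch normalization (z-score over the whole batch)
    return (adv - adv.mean()) / (adv.std() + eps)
```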

verl practice

For REINFORCE++, verl provides two ways of computing advantages:

```python
import numpy as np
import torch
from collections import defaultdict

# verl_F provides masked_whiten (masked mean/std normalization)
import verl.utils.torch_functional as verl_F


def compute_reinforce_plus_plus_outcome_advantage(token_level_rewards: torch.Tensor, response_mask: torch.Tensor, gamma: torch.Tensor):
    """
    Compute advantage for REINFORCE++.
    This implementation is based on the paper: https://arxiv.org/abs/2501.03262

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)

    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        Returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """

    with torch.no_grad():
        returns = torch.zeros_like(token_level_rewards)
        running_return = 0

        # Accumulate the discounted return token by token, from the end backwards
        for t in reversed(range(token_level_rewards.shape[1])):
            running_return = token_level_rewards[:, t] + gamma * running_return
            returns[:, t] = running_return
            # Reset after EOS
            running_return = running_return * response_mask[:, t]

        # Global batch normalization (z-score) of the returns over all valid tokens
        advantages = verl_F.masked_whiten(returns, response_mask)
        advantages = advantages * response_mask

    return advantages, returns
```



```python
def compute_reinforce_plus_plus_baseline_outcome_advantage(token_level_rewards: torch.Tensor, response_mask: torch.Tensor, index: torch.Tensor, epsilon: float = 1e-6):
    """
    Compute advantage for RF++-baseline (https://arxiv.org/abs/2501.03262), operating only on Outcome reward
    (with only one scalar reward for each response).

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)

    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        Returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """
    response_length = token_level_rewards.shape[-1]
    # One scalar outcome reward per response
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    id2mean = {}

    with torch.no_grad():
        bsz = scores.shape[0]
        # Group responses by prompt index and subtract the per-prompt mean reward
        for i in range(bsz):
            id2score[index[i]].append(scores[i])
        for idx in id2score:
            if len(id2score[idx]) == 1:
                id2mean[idx] = torch.tensor(0.0)
            elif len(id2score[idx]) > 1:
                id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
            else:
                raise ValueError(f"no score in prompt index: {idx}")
        for i in range(bsz):
            scores[i] = scores[i] - id2mean[index[i]]

        # Broadcast the centered score to every token, then apply the batch z-score
        scores = scores.unsqueeze(-1).tile([1, response_length]) * response_mask
        scores = verl_F.masked_whiten(scores, response_mask) * response_mask

    return scores, scores
```
| `reinforce_plus_plus_outcome_advantage` | `reinforce_plus_plus_baseline_outcome_advantage` |
| --- | --- |
| Computes a per-token return (Monte Carlo return): $G_t = r_t + \gamma\, G_{t+1}$ | Uses only the outcome reward: all rewards of a response are summed into one scalar, and every token in that response shares it |
| Global batch normalization (batch z-score) | Subtract the per-prompt group mean first, then apply global batch normalization (batch z-score) |

By first subtracting the per-prompt group mean and then applying a global batch z-score, `reinforce_plus_plus_baseline_outcome_advantage` performs two stages of variance reduction and therefore yields lower-variance advantages.
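As a quick check of the difference, the two functions above can be exercised on a toy batch; the tensors below are illustrative, and the snippet assumes the functions and imports from the code above are in scope:

```python
import numpy as np
import torch

# Toy batch: 4 responses, max length 3; the outcome reward sits on the last valid token.
token_level_rewards = torch.tensor([[0.0, 0.0, 1.0],
                                    [0.0, 0.0, 0.0],
                                    [0.0, 1.0, 0.0],
                                    [0.0, 0.0, 1.0]])
response_mask = torch.tensor([[1.0, 1.0, 1.0],
                              [1.0, 1.0, 1.0],
                              [1.0, 1.0, 0.0],
                              [1.0, 1.0, 1.0]])
index = np.array(["p0", "p0", "p1", "p1"])  # two prompts, two samples each

# Batch z-score over discounted per-token returns
adv, ret = compute_reinforce_plus_plus_outcome_advantage(
    token_level_rewards, response_mask, gamma=torch.tensor(1.0))

# Per-prompt mean subtraction first, then batch z-score
adv_b, _ = compute_reinforce_plus_plus_baseline_outcome_advantage(
    token_level_rewards, response_mask, index)
```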

By comparison, GRPO:

```python
# NOTE(sgm): this implementation only consider outcome supervision, where the reward is a scalar.
def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
):
    """
    Compute advantage for GRPO, operating only on Outcome reward
    (with only one scalar reward for each response).

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape is (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape is (bs, response_length)
        norm_adv_by_std_in_grpo: (bool)
            whether to scale the GRPO advantage.
            If True, the advantage is scaled by the std, as in the original GRPO.
            If False, the advantage is not scaled, as in Dr.GRPO (https://arxiv.org/abs/2503.20783).

    Returns:
        advantages: `(torch.Tensor)`
            shape is (bs, response_length)
        Returns: `(torch.Tensor)`
            shape is (bs, response_length)
    """
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    id2mean = {}
    id2std = {}

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            id2score[index[i]].append(scores[i])
        for idx in id2score:
            if len(id2score[idx]) == 1:
                id2mean[idx] = torch.tensor(0.0)
                id2std[idx] = torch.tensor(1.0)
            elif len(id2score[idx]) > 1:
                id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
                id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
            else:
                raise ValueError(f"no score in prompt index: {idx}")
        for i in range(bsz):
            if norm_adv_by_std_in_grpo:
                scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
            else:
                scores[i] = scores[i] - id2mean[index[i]]
        scores = scores.unsqueeze(-1) * response_mask

    return scores, scores
```

compute_reinforce_plus_plus_baseline_outcome_advantage一样,都采用outcome-only先算一个整体reward,然后计算同一prompt的组内均值和标准差

`norm_adv_by_std_in_grpo` controls whether to divide by the group standard deviation (plus epsilon). Dr. GRPO argues that dividing by the standard deviation introduces a question-level difficulty bias, but dropping the division leaves slightly higher variance.

