REINFORCE / RLOO / REINFORCE++ Policy Gradient The goal of a policy gradient method is to find an optimal policy $\pi^*_{\theta}$ that maximizes $J(\theta)$, the expected cumulative return the agent collects while interacting with the environment. The most direct way to optimize it is gradient ascent, i.e. updating the parameters along the gradient direction: $\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}J(\theta)$ 2025-07-01 Work #LLM
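As a quick illustration of the update rule above, here is a minimal REINFORCE-style sketch in PyTorch (the `reinforce_step` function, its arguments, and the manual parameter loop are illustrative assumptions, not code from the post):

```python
import torch

def reinforce_step(policy: torch.nn.Module,
                   log_probs: torch.Tensor,
                   returns: torch.Tensor,
                   learning_rate: float = 1e-3) -> None:
    """One Monte Carlo policy-gradient step (illustrative sketch).

    `log_probs` are log pi_theta(a_t|s_t) for a sampled trajectory (they must
    be produced by `policy` so gradients flow); `returns` are rewards-to-go.
    """
    # Surrogate loss whose gradient is -grad_theta J(theta) (REINFORCE estimator).
    loss = -(log_probs * returns.detach()).sum()
    loss.backward()
    with torch.no_grad():
        for p in policy.parameters():
            if p.grad is not None:
                # Descent on the loss == ascent on J(theta):
                # theta_new = theta_old + alpha * grad_theta J(theta)
                p -= learning_rate * p.grad
                p.grad.zero_()
```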
Three Estimators of KL Divergence Approximating KL Divergence Verl Implementation def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> 2025-06-18 Work #LLM
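The excerpt above is cut off; for context, the three estimators it refers to are the k1/k2/k3 estimators from Schulman's "Approximating KL Divergence" note. A minimal sketch (the function name and return convention here are mine, not verl's `kl_penalty` signature), assuming per-token log-probabilities of samples drawn from the current policy:

```python
import torch

def kl_estimators(logprob: torch.Tensor, ref_logprob: torch.Tensor):
    """k1/k2/k3 estimators of KL(pi || pi_ref), evaluated on samples from pi."""
    log_ratio = logprob - ref_logprob           # log pi(x) - log pi_ref(x)
    k1 = log_ratio                              # unbiased, high variance
    k2 = 0.5 * log_ratio.pow(2)                 # biased, lower variance
    k3 = torch.exp(-log_ratio) - 1 + log_ratio  # unbiased, low variance, always >= 0
    return k1, k2, k3
```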
LLM Reasoning Model - Math Training Notes Infra: https://github.com/volcengine/verl Train Data: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k Eval Data: https://huggingface.co/datasets/HuggingFaceH4/aime_2024 https://huggin 2025-06-18 Work #LLM
A Numerically Stable Way to Compute Entropy def entropy_from_logits(logits: torch.Tensor): """Calculate entropy from logits.""" pd = torch.nn.functional.softmax(logits, dim=-1) entropy = torch.logsume 2025-05-28 Work #LLM
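The code excerpt above is truncated; a likely completion (my reconstruction, assuming the post uses the logsumexp identity, not necessarily its exact code) is:

```python
import torch

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Calculate entropy from logits.

    Uses H(p) = logsumexp(z) - sum_i p_i * z_i, which avoids log(softmax(z))
    and the -inf it produces when a probability underflows to zero."""
    pd = torch.nn.functional.softmax(logits, dim=-1)
    entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)
    return entropy
```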
A unified perspective of RLHF Currently popular RLHF methods To this day, the post-training paradigm for LLMs is still CPT, SFT, and RLHF, and there are currently no signs that this will change. Focusing on RLHF, I will attempt t 2024-10-29 Work #LLM RLHF