REINFORCE / RLOO / REINFORCE++ Policy Gradient The goal of a policy gradient method is to find an optimal policy $\pi^*_{\theta}$ that maximizes $J(\theta)$, the expected cumulative return the agent collects while interacting with the environment. The most direct way to optimize it is gradient ascent, i.e. updating the parameters along the gradient direction: $\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}J(\theta)$ 2025-07-01 Work #LLM
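As a quick illustration of the update rule above, here is a minimal REINFORCE-style sketch in PyTorch (the `reinforce_step` function, its arguments, and the manual parameter loop are illustrative assumptions, not code from the post):

```python
import torch

def reinforce_step(policy: torch.nn.Module,
                   log_probs: torch.Tensor,
                   returns: torch.Tensor,
                   learning_rate: float = 1e-3) -> None:
    """One Monte Carlo policy-gradient step (illustrative sketch).

    `log_probs` are log pi_theta(a_t|s_t) for a sampled trajectory (they must
    be produced by `policy` so gradients flow); `returns` are rewards-to-go.
    """
    # Surrogate loss whose gradient is -grad_theta J(theta) (REINFORCE estimator).
    loss = -(log_probs * returns.detach()).sum()
    loss.backward()
    with torch.no_grad():
        for p in policy.parameters():
            if p.grad is not None:
                # Descent on the loss == ascent on J(theta):
                # theta_new = theta_old + alpha * grad_theta J(theta)
                p -= learning_rate * p.grad
                p.grad.zero_()
```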
Three Estimators of KL Divergence Approximating KL Divergence Verl Implementation def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> 2025-06-18 Work #LLM
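The excerpt above is cut off; for context, the three estimators it refers to are the k1/k2/k3 estimators from Schulman's "Approximating KL Divergence" note. A minimal sketch (the function name and return convention here are mine, not verl's `kl_penalty` signature), assuming per-token log-probabilities of samples drawn from the current policy:

```python
import torch

def kl_estimators(logprob: torch.Tensor, ref_logprob: torch.Tensor):
    """k1/k2/k3 estimators of KL(pi || pi_ref), evaluated on samples from pi."""
    log_ratio = logprob - ref_logprob           # log pi(x) - log pi_ref(x)
    k1 = log_ratio                              # unbiased, high variance
    k2 = 0.5 * log_ratio.pow(2)                 # biased, lower variance
    k3 = torch.exp(-log_ratio) - 1 + log_ratio  # unbiased, low variance, always >= 0
    return k1, k2, k3
```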
LLM Reasoning Model - Math Training Notes Infra: https://github.com/volcengine/verl Train Data: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k Eval Data: https://huggingface.co/datasets/HuggingFaceH4/aime_2024 https://huggin 2025-06-18 Work #LLM
A Numerically Stable Way to Compute Entropy def entropy_from_logits(logits: torch.Tensor): """Calculate entropy from logits.""" pd = torch.nn.functional.softmax(logits, dim=-1) entropy = torch.logsume 2025-05-28 Work #LLM
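The code excerpt above is truncated; a likely completion (my reconstruction, assuming the post uses the logsumexp identity, not necessarily its exact code) is:

```python
import torch

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Calculate entropy from logits.

    Uses H(p) = logsumexp(z) - sum_i p_i * z_i, which avoids log(softmax(z))
    and the -inf it produces when a probability underflows to zero."""
    pd = torch.nn.functional.softmax(logits, dim=-1)
    entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)
    return entropy
```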
A unified perspective of RLHF Currently popular RLHF methods To this day, the post-training paradigm for LLMs is still CPT, SFT, and RLHF, and there are currently no signs that this will change. Focusing on RLHF, I will attempt t 2024-10-29 Work #LLM RLHF