Shilong Li's Blog

REINFORCE / RLOO / REINFORCE++

Policy Gradient: the goal of policy gradient methods is to find an optimal policy $\pi^*_{\theta}$ that maximizes the expected cumulative return $J(\theta)$ the agent collects while interacting with the environment. The most direct way to optimize it is gradient ascent, i.e., updating the parameters along the direction of the gradient: $\theta_{new} = \theta_{old} + \alpha \nabla_{\theta} J(\theta)$ (a minimal sketch of this update follows this entry).
2025-07-01
Work
#LLM
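
As a rough illustration of the gradient-ascent update quoted in the excerpt above, here is a minimal REINFORCE-style sketch in PyTorch. The names `reinforce_loss`, `log_probs`, and `returns` are assumptions for illustration only, not code from the post.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Minimal REINFORCE objective: maximize E[log pi(a|s) * G_t].

    log_probs: log-probabilities of the actions actually taken, shape (T,)
    returns:   (optionally baseline-subtracted) returns G_t, shape (T,)
    Returning the negative lets a standard optimizer (which minimizes)
    perform gradient ascent on J(theta).
    """
    return -(log_probs * returns).sum()

# Hypothetical usage: one step of theta_new = theta_old + alpha * grad J(theta)
# policy = ...                                  # a torch.nn.Module producing action logits
# optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
# loss = reinforce_loss(log_probs, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```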

Three Estimation Methods for KL Divergence

Approximating KL Divergence — Verl Implementation: `def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) ->` (an illustrative sketch of the three estimators follows this entry).
2025-06-18
Work
#LLM
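
A brief sketch of the three per-token KL estimators from Schulman's "Approximating KL Divergence" note, written against log-probabilities as in the `kl_penalty` signature above. This is an illustrative reimplementation under that assumption, not the verl source.

```python
import torch

def kl_estimators(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor):
    """Three per-token estimators of KL(pi || pi_ref).

    With r = pi_ref(x) / pi(x), i.e. log r = ref_logprob - logprob:
      k1 = -log r           (unbiased, can be negative, high variance)
      k2 = (log r)^2 / 2    (biased, low variance)
      k3 = (r - 1) - log r  (unbiased, always >= 0, lower variance)
    """
    log_ratio = ref_logprob - logprob
    k1 = -log_ratio
    k2 = 0.5 * log_ratio.square()
    k3 = log_ratio.exp() - 1.0 - log_ratio
    return k1, k2, k3
```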

LLM Reasoning Model - Math Training Log

Infra: https://github.com/volcengine/verl Train Data: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k Eval Data: https://huggingface.co/datasets/HuggingFaceH4/aime_2024 https://huggin
2025-06-18
Work
#LLM

Numerically Stable Computation of Entropy

`def entropy_from_logits(logits: torch.Tensor): """Calculate entropy from logits.""" pd = torch.nn.functional.softmax(logits, dim=-1) entropy = torch.logsume` (see the sketch after this entry).
2025-05-28
Work
#LLM
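
The truncated snippet above appears to compute entropy directly from logits via `logsumexp` rather than through `log(softmax(...))`. Below is a minimal sketch of that identity, H = logsumexp(logits) - sum(softmax(logits) * logits), offered as an illustration rather than the post's exact code.

```python
import torch

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Entropy over the last dimension, computed from raw logits.

    Using H = logsumexp(logits) - sum(softmax(logits) * logits) avoids
    taking the log of tiny softmax probabilities, which is the numerically
    unstable step in the naive -sum(p * log p) formulation.
    """
    pd = torch.nn.functional.softmax(logits, dim=-1)
    return torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)
```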

A unified perspective of RLHF

Currently popular RLHF methods — To this day, the post-training paradigm for LLMs is still CPT, SFT, and RLHF, and there is currently no sign that this paradigm will change. Focusing on RLHF, I will attempt t
2024-10-29
Work
#LLM #RLHF
