Shilong Li's Blog

2026-01-17

The Solution of Weighted Cross-Entropy under the Probability Simplex Constraint

The problem comes from the paper Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning, where the LLM objective is defined as $\mathcal{L}_{CE}(\phi)=\mathbb{E}_{(s,a)\sim\mathcal{D}}[\log \pi_{\phi}(a|s)]$ …
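The closed-form answer the title refers to follows from a standard Lagrange-multiplier argument; a hedged sketch (the weight notation $w_i$ is an assumption, not taken from the post body):

```latex
% Maximize a weighted log-likelihood over the probability simplex.
% Assumption: weights w_i >= 0, not all zero; p lies in the simplex \Delta.
\max_{p \in \Delta} \sum_i w_i \log p_i,
\qquad \Delta = \Big\{ p : p_i \ge 0,\ \textstyle\sum_i p_i = 1 \Big\}.
% Lagrangian: L(p, \lambda) = \sum_i w_i \log p_i - \lambda \big(\sum_i p_i - 1\big).
% Stationarity: w_i / p_i = \lambda \;\Rightarrow\; p_i = w_i / \lambda.
% The constraint \sum_i p_i = 1 fixes \lambda = \sum_j w_j, hence
p_i^{*} = \frac{w_i}{\sum_j w_j}.
```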
2026-01-16
Work
#LLM RL

Verl Optimization

Change Log: 2026-01-08: Initial release of the article. Online-RL: The figure above is a simplified flowchart of online RL, omitting the ref model and the critic model. Starting from verl, we analyze the GPU memory footprint of each role during online-RL training, as well as the acceleration schemes currently available. Without considering the critic model, we generally have three …
2026-01-05
Work
#RLHF

A Numerically Stable Method for Computing Entropy

def entropy_from_logits(logits: torch.Tensor):
    """Calculate entropy from logits."""
    pd = torch.nn.functional.softmax(logits, dim=-1)
    entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)
    return entropy
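The snippet above relies on the identity $H(p) = \mathrm{logsumexp}(z) - \sum_i p_i z_i$, which avoids forming log-probabilities explicitly. A minimal pure-Python sketch of that identity (the function names and example logits are illustrative assumptions, not from the post):

```python
import math

def entropy_stable(logits):
    # logsumexp with a max-shift, so no individual exp() can overflow
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    # softmax probabilities via the shifted logits: p_i = exp(z_i - lse)
    probs = [math.exp(z - lse) for z in logits]
    # H = logsumexp(z) - sum_i p_i * z_i  (algebraically equal to -sum p log p)
    return lse - sum(p * z for p, z in zip(probs, logits))

def entropy_via_plogp(logits):
    # Reference computation -sum p log p; note it also needs a max-shift,
    # since exp(1002.0) on its own would overflow a float.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    probs = [e / s for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Large logits: a shift-free softmax would overflow here,
# but both stabilized routes agree.
logits = [1000.0, 1001.0, 1002.0]
print(entropy_stable(logits))
print(entropy_via_plogp(logits))
```

Both functions return the same value up to floating-point error; the logsumexp form is the one that vectorizes cleanly over a logits tensor.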
2025-05-28
Work
#LLM
