👋 Welcome to Xiaobo’s Blog

I’m Xiaobo Yang, a PhD student at Peking University.

Gradient Estimation of KL Divergence in Large Language Model Reinforcement Learning

Theory

In large language model reinforcement learning, there are two ways to incorporate a KL divergence constraint: reward shaping and a KL loss. The former subtracts the KL estimate directly from the reward, yielding a new reward $r \leftarrow r - \beta \cdot \mathrm{KL}$; the latter places the KL estimation term in the loss function, computing $\nabla_\theta \operatorname{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ for backpropagation. Schulman (2020) proposed three estimators of KL divergence, among which the k3 estimator is considered a good choice because it is both unbiased and low-variance. DeepSeek’s GRPO algorithm (Guo et al. 2025) adopts this approach: ...
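A minimal PyTorch sketch of both placements, assuming Schulman’s k3 estimator $k_3 = r - 1 - \log r$ with $r = \pi_{\text{ref}}(x)/\pi_\theta(x)$ for tokens $x$ sampled from $\pi_\theta$; the function name, `beta`, and the toy tensors below are hypothetical placeholders, not the post’s actual code:

```python
import torch

def kl_k3(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Schulman's k3 estimator of KL(pi_theta || pi_ref), per token.

    Inputs are log-probabilities of the *sampled* tokens (drawn from
    pi_theta). With r = pi_ref / pi_theta:
        k3 = r - 1 - log r
    E[r - 1] = 0 under pi_theta, so k3 is unbiased, and k3 >= 0
    keeps its variance low.
    """
    log_ratio = logp_ref - logp_theta              # log r
    return torch.exp(log_ratio) - 1.0 - log_ratio  # r - 1 - log r

# Toy per-token log-probs: batch of 4 sequences, 16 tokens each.
logp_theta = torch.randn(4, 16)  # current policy
logp_ref = torch.randn(4, 16)    # frozen reference policy
reward = torch.randn(4)          # sequence-level rewards
beta = 0.1                       # KL penalty coefficient

# Reward shaping: fold the summed per-token KL penalty into the reward.
shaped_reward = reward - beta * kl_k3(logp_theta, logp_ref).sum(dim=-1)

# KL loss (the GRPO-style placement): keep the estimate in the loss and
# backpropagate through logp_theta only; logp_ref is detached because
# the reference model is frozen.
kl_loss = beta * kl_k3(logp_theta, logp_ref.detach()).mean()
```

The difference between the two placements is where gradients flow: reward shaping only rescales the scalar reward, so the estimator is not differentiated, while the KL-loss variant backpropagates through the estimator itself, which is what $\nabla_\theta \operatorname{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ above refers to.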

March 14, 2025 · 10 min · 2021 words · Xiaobo Yang