Xiaobo's Blog
field notes on reinforcement learning, language models, and the math underneath.
Posts — 1
- Gradient Estimation of KL Divergence in Large Language Model Reinforcement Learning
Three estimators, one loss, and the subtle gap between ∇E[·] and E[∇·].