Intro to REINFORCE
This series is written as I review Stanford's Computer Science course CS224r. Thanks to the instructors for making their slides and homework publicly available.
How do we evaluate a policy?
To introduce policy gradient methods, we begin with the objective we aim to optimize.
A policy $\pi_\theta(a \mid s)$ defines a probability distribution over actions $a$ given a state $s$, parameterized by $\theta$.

Since both the action space and the state space can be large or continuous, we typically represent the policy as a neural network with parameters $\theta$, which takes a state as input and outputs a distribution over actions.
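To make this concrete, here is a minimal sketch of such a policy network (PyTorch and a discrete action space are just illustrative choices, not requirements of the method):

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """A small MLP mapping a state to a distribution over discrete actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),  # unnormalized log-probabilities (logits)
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))
```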
Our goal is for the agent to behave optimally throughout its interaction with the environment, which terminates either upon reaching a goal (termination) or being interrupted prematurely (truncation). A common definition of “better behavior” is one that accumulates more rewards before stopping.
We define a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$ as the sequence of states and actions in one episode, and its return as the total reward collected along the way:

$$R(\tau) = \sum_{t=0}^{T-1} r(s_t, a_t).$$
To evaluate a policy, we compute the expected return over all possible trajectories induced by $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ R(\tau) \right].$$

This function $J(\theta)$ is the objective we aim to maximize.
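Even before computing any gradients, we can estimate $J(\theta)$ directly by rolling out episodes and averaging their returns. Here is a sketch assuming the Gymnasium API (`env.reset()` / `env.step()`) and the `CategoricalPolicy` above; the environment choice is illustrative:

```python
import gymnasium as gym
import torch

def estimate_J(policy: CategoricalPolicy, env_id: str = "CartPole-v1",
               n_episodes: int = 20) -> float:
    """Monte Carlo estimate of J(theta): average return over sampled episodes."""
    env = gym.make(env_id)
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = int(dist.sample())
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += float(reward)
            done = terminated or truncated  # stop on termination or truncation
        returns.append(ep_return)
    return sum(returns) / len(returns)
```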
How do we compute this expectation?
To optimize $J(\theta)$, we first need the probability of a trajectory under the policy. A trajectory is generated by drawing an initial state $s_0 \sim p(s_0)$ and then, at each step $t$:

- Sampling an action $a_t$ from $\pi_\theta(a_t \mid s_t)$;
- Transitioning to the next state $s_{t+1}$ with probability $p(s_{t+1} \mid s_t, a_t)$.
The full probability of trajectory $\tau$ is therefore

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t).$$
While this expression defines $p_\theta(\tau)$ exactly, differentiating a long product directly is unwieldy. Instead, we work with its logarithm.
Since log turns products into sums, this simplifies differentiation. We have:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} \Big[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \Big],$$

and therefore

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

This works because the transition probabilities $p(s_{t+1} \mid s_t, a_t)$ and the initial state distribution $p(s_0)$ do not depend on $\theta$, so their gradients vanish.
Final expression for $\nabla_\theta J(\theta)$
Now we can rewrite the gradient of $J(\theta)$ using the log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, R(\tau) \right].$$

Substituting our earlier result:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right].$$
This is the policy gradient expression. To estimate it in practice, we use Monte Carlo approximation[^1]: sample $N$ trajectories $\tau^1, \dots, \tau^N$ with the current policy and average their contributions:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) R(\tau^i).$$
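In code, this estimator is usually implemented as a surrogate loss whose autodiff gradient equals the expression above. A sketch, again assuming PyTorch:

```python
import torch

def policy_gradient_loss(batch_log_probs: list[torch.Tensor],
                         batch_returns: list[float]) -> torch.Tensor:
    """Surrogate loss for the Monte Carlo policy-gradient estimate.

    batch_log_probs[i] stacks log pi_theta(a_t | s_t) over the steps of
    trajectory i; batch_returns[i] is that trajectory's return R(tau_i).
    The loss is negated so that *minimizing* it ascends J(theta).
    """
    terms = [lp.sum() * R for lp, R in zip(batch_log_probs, batch_returns)]
    return -torch.stack(terms).mean()
```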
Thus, the REINFORCE algorithm works as follows:
1. Sample $N$ trajectories using the current policy $\pi_\theta$;
2. Estimate the policy gradient using the equation above;
3. Update the parameters via gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$;
4. Go back to 1, or exit.
This is known as the REINFORCE algorithm, or the vanilla policy gradient method, based on the policy gradient theorem[^2].
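Putting the pieces together, here is a minimal end-to-end training loop. It reuses the `CategoricalPolicy` and `policy_gradient_loss` sketches above, and `CartPole-v1` is just a convenient stand-in environment:

```python
import gymnasium as gym
import torch

def train_reinforce(env_id: str = "CartPole-v1", n_iters: int = 200,
                    episodes_per_iter: int = 8, lr: float = 1e-2) -> CategoricalPolicy:
    env = gym.make(env_id)
    policy = CategoricalPolicy(env.observation_space.shape[0], env.action_space.n)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(n_iters):
        batch_log_probs, batch_returns = [], []
        for _ in range(episodes_per_iter):  # 1. sample N trajectories
            obs, _ = env.reset()
            log_probs, rewards, done = [], [], False
            while not done:
                dist = policy(torch.as_tensor(obs, dtype=torch.float32))
                action = dist.sample()
                log_probs.append(dist.log_prob(action))
                obs, reward, terminated, truncated, _ = env.step(int(action))
                rewards.append(float(reward))
                done = terminated or truncated
            batch_log_probs.append(torch.stack(log_probs))
            batch_returns.append(sum(rewards))
        loss = policy_gradient_loss(batch_log_probs, batch_returns)  # 2. estimate the gradient
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # 3. gradient ascent on J (descent on the negated loss)
    return policy
```

Note that each update uses freshly sampled trajectories: REINFORCE is on-policy, so data collected under old parameters cannot be reused directly.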
I highly recommend implementing this algorithm yourself. It’s the foundation for many modern RL algorithms like PPO, DDPG, and RPO. While REINFORCE works for simple environments like Goal2D, it suffers from high variance and inefficiency. In the next articles, we’ll explore how later algorithms address these issues. Together, these improvements shaped the algorithms we rely on in modern reinforcement learning.
[^1]: An introduction to Monte Carlo approximation. The method is straightforward: to estimate $\mathbb{E}_{x \sim p}[f(x)]$, draw samples $x_1, \dots, x_N$ from $p$, then estimate the expectation as their average $\frac{1}{N} \sum_{i=1}^{N} f(x_i)$.

[^2]: See Chapter 13.2 of *Reinforcement Learning: An Introduction* by Sutton & Barto. A concise overview is also available on Lilian Weng's blog.
- Title: Intro to REINFORCE
- Author: Harry Huang (aka Wenyuan Huang, 黄问远)
- Created at: 2025-08-13 16:19:40
- Updated at: 2025-11-16 19:36:37
- Link: https://whuang369.com/blog/2025/08/13/CS/Machine_Learning/Reinforcement_Learning/Policy_Gradient_Algorithm_1/
- License: This work is licensed under CC BY-NC-SA 4.0.