Intro to REINFORCE

Harry Huang (aka Wenyuan Huang, 黄问远)

This series is written as I review Stanford's computer science course CS224R. Thanks to the instructors for making their slides and homework publicly available.

How do we evaluate a policy?

To introduce policy gradient methods, we begin with the objective we aim to optimize.

A policy $\pi$ defines a distribution over actions $a$ conditioned on states $s$. It determines the agent's behavior when interacting with the environment.

Since both the action space and the state space are often large or continuous, we cannot represent the policy with a lookup table. In deep reinforcement learning (DRL), we instead use a neural network (hence the "deep" in DRL) to parameterize the policy. This network maps a given state to a distribution over actions. Letting $\theta$ denote the neural network parameters, the policy becomes $\pi_\theta(a \mid s)$.
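As a concrete (and purely illustrative) sketch, here is one way such a policy network might look in PyTorch for a discrete action space. The class name, layer sizes, and framework choice are my own assumptions, not something prescribed by the course:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Maps a state vector to a distribution over discrete actions, i.e. pi_theta(a | s)."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_actions),  # one logit per action
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        logits = self.net(state)
        return Categorical(logits=logits)  # the action distribution pi_theta(. | s)

# Example usage: sample an action for a (hypothetical) 4-dimensional state.
policy = PolicyNetwork(state_dim=4, n_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action)  # log pi_theta(a | s), needed later for the policy gradient
```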

Our goal is for the agent to behave optimally throughout its interaction with the environment, which ends either when a terminal state is reached (termination) or when the episode is cut off prematurely (truncation). A common definition of “better” behavior is behavior that accumulates more reward before stopping.

We define a trajectory $\tau$ to be the sequence of states and actions experienced by the agent: $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$. The cumulative reward of a trajectory is:

$$R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$$

To evaluate a policy, we compute the expected return over all possible trajectories induced by $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\bigl[R(\tau)\bigr]$$

This function $J(\theta)$ serves as our optimization target.
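To make the expectation concrete, here is a minimal Monte Carlo sketch of policy evaluation: roll the policy out for several episodes and average the returns. It assumes the gymnasium package, CartPole-v1 as an illustrative environment, and a policy object like the PolicyNetwork sketched above; none of these choices come from the original post.

```python
import gymnasium as gym
import torch

def estimate_return(policy, env_name: str = "CartPole-v1", n_episodes: int = 20) -> float:
    """Monte Carlo estimate of J(theta): average return over sampled trajectories."""
    env = gym.make(env_name)
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            with torch.no_grad():  # evaluation only, no gradients needed
                dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample().item()
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated  # episode ends by termination or truncation
        returns.append(total_reward)       # R(tau) for this trajectory
    return sum(returns) / len(returns)     # approximates E_tau[R(tau)]
```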

How do we compute this expectation?

To optimize $J(\theta)$, we must understand the distribution over trajectories. A trajectory $\tau$ consists of a sequence of transitions. Each transition involves:

  • Sampling an action $a_t$ from the policy $\pi_\theta(a_t \mid s_t)$,
  • Transitioning to the next state $s_{t+1}$ with probability $p(s_{t+1} \mid s_t, a_t)$.

The full probability of trajectory $\tau$ under $\pi_\theta$ is:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

While this expression defines $p_\theta(\tau)$, it’s difficult to directly differentiate due to the product form. Fortunately, we can apply the log-derivative trick:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$

Since the logarithm turns products into sums, this simplifies differentiation. We have:

$$\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \left[ \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t) \right] = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

This works because the transition probabilities $p(s_{t+1} \mid s_t, a_t)$ (and the initial-state distribution $p(s_0)$) are properties of the environment and do not depend on $\theta$, so their gradients with respect to $\theta$ vanish.

Final expression for $\nabla_\theta J(\theta)$

Now we can rewrite the gradient of $J(\theta)$ as:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\bigl[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\bigr]$$

Substituting our earlier result for $\nabla_\theta \log p_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$

This is the policy gradient expression. To estimate it in practice, we use Monte Carlo approximation1 with $N$ trajectories $\tau^{(1)}, \dots, \tau^{(N)}$ sampled from $\pi_\theta$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr) \right) R\bigl(\tau^{(i)}\bigr)$$

Thus, the REINFORCE algorithm works as follows:

  1. Sample $N$ trajectories $\tau^{(1)}, \dots, \tau^{(N)}$ using the current policy $\pi_\theta$;
  2. Estimate the policy gradient using the equation above;
  3. Update the parameters via gradient ascent:
    $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate.
  4. Go back to 1, or exit.

This is known as the REINFORCE algorithm, or the vanilla policy gradient method, based on the policy gradient theorem2.
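Putting the four steps together, here is a minimal end-to-end sketch in PyTorch on gymnasium's CartPole-v1 (the environment, network size, and hyperparameters are my illustrative assumptions, not from the course). The key implementation detail is that differentiating the surrogate loss $-\frac{1}{N}\sum_i \bigl(\sum_t \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\bigr) R(\tau^{(i)})$ with autograd yields exactly the Monte Carlo gradient estimate above, so a single optimizer step performs gradient ascent on $J(\theta)$.

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical setup: CartPole-v1 is used only for illustration.
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# pi_theta(a | s): a small MLP producing logits over discrete actions.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

n_iterations, n_trajectories = 200, 8
for iteration in range(n_iterations):
    # Step 1: sample N trajectories with the current policy.
    log_prob_sums, returns = [], []
    for _ in range(n_trajectories):
        state, _ = env.reset()
        done, log_probs, total_reward = False, [], 0.0
        while not done:
            dist = Categorical(logits=policy(torch.as_tensor(state, dtype=torch.float32)))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))            # log pi_theta(a_t | s_t)
            state, reward, terminated, truncated, _ = env.step(action.item())
            total_reward += reward
            done = terminated or truncated
        log_prob_sums.append(torch.stack(log_probs).sum())     # sum_t log pi_theta(a_t | s_t)
        returns.append(total_reward)                           # R(tau)

    # Steps 2-3: the gradient of this surrogate loss equals the negative of the
    # Monte Carlo policy-gradient estimate, so one Adam step is gradient ascent on J(theta).
    loss = -(torch.stack(log_prob_sums) * torch.as_tensor(returns)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Step 4: repeat (optionally log progress).
    if iteration % 20 == 0:
        print(f"iter {iteration:3d}  mean return {sum(returns) / len(returns):.1f}")
```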

I highly recommend implementing this algorithm yourself. It’s the foundation for many modern RL algorithms like PPO, DDPG, and RPO. While REINFORCE works for simple environments like Goal2D, it suffers from high variance and sample inefficiency. In the following articles, we’ll explore how later algorithms address these issues. Together, these improvements shaped the algorithms we rely on in modern reinforcement learning.


  1. An introduction to Monte Carlo approximation. This method is straightforward: to estimate an expectation $\mathbb{E}_{x \sim p(x)}[f(x)]$, draw samples $x_1, \dots, x_N$ from $p(x)$, then estimate it as their average $\frac{1}{N} \sum_{i=1}^{N} f(x_i)$.↩︎

  2. See Section 13.2 of Reinforcement Learning: An Introduction by Sutton & Barto. A concise overview is also available on Lilian Weng’s blog.↩︎
