Intro to REINFORCE

This series is written as I review Stanford's Computer Science course CS224r. Thanks to the instructors for making their slides and homework publicly available.

How do we evaluate a policy?

To introduce policy gradient methods, we begin with the objective we aim to optimize.

A policy $\pi(a|s)$ defines a distribution over actions $a \in A$ conditioned on states $s \in S$. It determines the agent's behavior when interacting with the environment.

Since both the action space $A$ and the state space $S$ are often large or continuous, we cannot represent the policy with a lookup table. In deep reinforcement learning (DRL), we instead use a neural network (hence the "deep" in DRL) to parameterize the policy. This network maps a given state to a distribution over actions. Letting $\theta$ denote the neural network parameters, the policy becomes $\pi_\theta$.

Our goal is for the agent to behave optimally throughout its interaction with the environment, which terminates either upon reaching a goal (termination) or being interrupted prematurely (truncation). A common definition of “better behavior” is one that accumulates more rewards before stopping.

We define a trajectory $\tau$ to be the sequence of states and actions experienced by the agent: $\tau = \{(s_1, a_1), (s_2, a_2), \dots, (s_T, a_T)\}$. The cumulative reward of a trajectory is:

$$ R(\tau) = \sum_{t=1}^T r(s_t, a_t) $$

To evaluate a policy, we compute the expected return over all possible trajectories induced by $\pi_\theta$:

$$ J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] $$

This function $J(\theta)$ serves as our optimization target.

How do we compute this expectation?

To optimize $J(\theta)$, we must understand the distribution $p_\theta(\tau)$ over trajectories. A trajectory consists of a sequence of transitions. Each transition involves:

Sampling an action $a_t$ from $\pi_\theta(a_t|s_t)$,
Transitioning to the next state $s_{t+1}$ with probability $p(s_{t+1} | s_t, a_t)$.

The full probability of trajectory $\tau$ is:

$$ \begin{align*} p(\tau) &= p(s_1)\pi_\theta(a_1 | s_1)p(s_2 | s_1, a_1)\pi_\theta(a_2 | s_2) \cdots \pi_\theta(a_T | s_T)p(s_{T+1} | s_T, a_T) \\ &= p(s_1)\prod_{t=1}^{T} \pi_\theta(a_t | s_t) \cdot p(s_{t+1} | s_t, a_t) \end{align*} $$

While this expression defines $p(\tau)$, it’s difficult to directly differentiate due to the product form. Fortunately, we can apply the log-derivative trick:

$$ \begin{align*} \nabla_\theta p_\theta(\tau) &= p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \\ &= p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \end{align*} $$

Since log turns products into sums, this simplifies differentiation. We have:

$$ \begin{align*} \nabla_\theta \log p_\theta(\tau) &= \nabla_\theta \log \left[\prod_{t=1}^{T} \pi_\theta(a_t | s_t) p(s_{t+1} | s_t, a_t)\right] \\ &= \nabla_\theta \left[\log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t | s_t) + \log p(s_{t+1} | s_t, a_t)\right] \\ &= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \end{align*} $$

This works because the transition probabilities $p(s_{t+1}|s_t,a_t)$ are properties of the environment and do not depend on $\theta$.

Final expression for $\nabla_\theta J(\theta)$

Now we can rewrite the gradient of $J(\theta)$ as:

$$ \begin{align*} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau) R(\tau) d\tau \\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) d\tau \\ &= \mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau) R(\tau)] \end{align*} $$

Substituting our earlier result:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \right) R(\tau) \right] $$

This is the policy gradient expression. To estimate it in practice, we use Monte Carlo approximation^[1]:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i, t} | s_{i, t}) \right) R(\tau^i) \right] $$

Thus, the REINFORCE algorithm works as follows:

Sample $N$ trajectories $\{\tau^i\}$ using the current policy $\pi_\theta$;
Estimate the policy gradient using the equation above;
Update the parameters via gradient ascent:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
Go back to 1, or exit.

This is known as the REINFORCE algorithm, or the vanilla policy gradient method, based on the policy gradient theorem^[2].

I highly recommend implementing this algorithm yourself. It’s the foundation for many modern RL algorithms like PPO, DDPG, and RPO. While REINFORCE works for simple environments like Goal2D, it suffers from high variance and inefficiency. In the next articles, we’ll explore how later algorithms address these issues. Together, these improvements shaped the algorithms we rely on in modern reinforcement learning.

An introduction to Monte Carlo approximation. This method is straightforward: draw samples $X_1, X_2, \dots, X_N$ from $P(X)$, then estimate $\mathbb{E}[X]$ as their average. ↩
See Chapter 13.2 of Reinforcement Learning: An Introduction by Sutton & Barto. A concise overview is also available on Lilian Weng’s blog. ↩

How do we evaluate a policy?

How do we compute this expectation?

Final expression for $\nabla_\theta J(\theta)$

References