On-policy and Off-policy

This series is written as I review Stanford's Computer Science course CS224r. Thanks to the instructors for making their slides and homework publicly available.

Before reading this article, please make sure you understand the REINFORCE algorithm. You can review it in the previous article.

Intuition Behind REINFORCE

In the previous article, we derived the expression for the policy gradient $\nabla_\theta J(\theta)$ as:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i, t} | s_{i, t}) \right) R(\tau^i) \right] $$

This is essentially a weighted average, where each trajectory’s contribution is scaled by its total return $R(\tau^i)$. Trajectories with higher rewards are reinforced, while those with lower rewards contribute less. If you’re familiar with imitation learning, you may notice that the formula is almost identical—except for the presence of the reward term.^[1]

Reducing Noise in the Gradient

Let’s expand the total return to examine the structure:

$$ \begin{align*} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \right) \left( \sum_{t'=1}^T r(s_{i, t'}, a_{i, t'}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \cdot \left( \sum_{t'=1}^{T} r(s_{i, t'}, a_{i, t'}) \right) \end{align*} $$

Here’s the issue: at time step $t$, we are using the total trajectory reward—even rewards from steps before $t$, which clearly cannot be affected by the current action $a_t$. This introduces unnecessary noise. Intuitively, the gradient $\nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t})$ should only be weighted by the rewards that come after action $a_t$.

Thus, we refine the formula to only consider future rewards:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \cdot \left( \sum_{t'=t}^{T} r(s_{i, t'}, a_{i, t'}) \right) $$

This is often referred to as using the return-to-go as the weight for each gradient term.

Reducing Sensitivity to Reward Scale

Even with the return-to-go, there’s another issue: if all rewards are positive (but some suboptimal), the agent may still be reinforced for bad behavior.

For example, imagine training a humanoid robot to walk forward. A trajectory in which the robot takes a few steps and falls may still have positive return, but it should not be encouraged. This motivates the use of a baseline: a reference value representing average performance.

We subtract this baseline from each return, so that only above-average behavior is encouraged:

$$ b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^i) $$

The gradient then becomes:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \cdot (R(\tau) - b)\right] $$

Combining both return-to-go and baseline:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i, t} | s_{i, t}) \cdot \left( \left( \sum_{t'=t}^{T} r(s_{i, t'}, a_{i, t'}) \right) - b \right) $$

What Is On-Policy?

Let’s summarize the policy gradient algorithm so far:

Sample trajectories $\{\tau^i\}$ using the current policy $\pi_\theta$;
Estimate the gradient $\nabla_\theta J(\theta)$ from these trajectories;
Update parameters via gradient ascent:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$;
Repeat or terminate.

This process requires collecting new data after every policy update. Algorithms that only use data generated by the current policy are called on-policy algorithms.

But can we reuse data from previous policies? That leads us to off-policy learning.

From On-Policy to Off-Policy: Reusing Old Data

The challenge is: if we update $\theta$, the saved trajectories are no longer sampled from $p_\theta(\tau)$, violating the assumptions of our Monte Carlo estimator.

We need a way to estimate expectations under $p_\theta(\tau)$ using samples from a different distribution $p_{\theta'}(\tau)$. This is where importance sampling^[2] comes in:

$$ \mathbb{E}_{x \sim p(x)}[f(x)] = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right] $$

Applying this to policy gradients:

$$ \begin{align*} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} [f(\tau)] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} f(\tau) \right] \end{align*} $$

Where $f(\tau)$ contains all gradient and reward terms. The importance weight becomes:

$$ \frac{p_\theta(\tau)}{p_{\theta'}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta'}(a_t | s_t)} $$

Hence, the off-policy policy gradient is:

$$ \begin{align*} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left[ \prod_{t=1}^{T} \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta'}(a_t | s_t)} \cdot \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot \left( \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) - b \right) \right) \right] \end{align*} $$

And its Monte Carlo approximation is:

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ \frac{\pi_\theta(a_{i,t} | s_{i,t})}{\pi_{\theta'}(a_{i,t} | s_{i,t})} \cdot \nabla_\theta \log \pi_\theta(a_{i, t} | s_{i, t}) \cdot \left( \left( \sum_{t'=t}^{T} r(s_{i, t'}, a_{i, t'}) \right) - b \right) \right] $$

This enables an off-policy algorithm:

Sample trajectories $\{\tau^i\}$ using some (possibly older) policy $\pi_{\theta'}$;
Estimate the gradient using all saved trajectories;
Update the current policy $\theta$;
Repeat from step 2.

Challenges in Off-Policy Learning

Despite being mathematically sound, off-policy learning faces practical issues:

If $\theta$ and $\theta'$ differ too much, the importance weights can explode or vanish;
For long trajectories, the product of importance weights can become unstable;
Very old trajectories may no longer reflect relevant behavior for the updated policy.

A common solution is to constrain the policy update so that the new policy doesn't diverge too far:

$$ \mathbb{E}_{s \sim \pi_\theta} \left[ D_{\text{KL}}(\pi_{\theta'}(\cdot|s) \parallel \pi_\theta(\cdot|s)) \right] \leq \delta $$

This idea leads to trust-region methods like TRPO and clipping-based approaches like PPO, which we’ll explore in future articles.

Behavior cloning gradient (imitation learning).

$\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x)f(x)dx = \int q(x)\frac{p(x)}{q(x)}f(x)dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]$

$\nabla_\theta J_{BC}(\theta) \approx \frac{1}{n} \sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)$ ↩
Importance sampling identity: ↩