TRPO and PPO

This series is written as I review Stanford's Computer Science course CS224r. Thanks to the instructors for making their slides and homework publicly available. Thanks also to Claude (Anthropic) for help organizing the derivations and tightening the writing.

Before reading this article, please make sure you understand the on-policy and off-policy algorithms. You can review it in the previous article.

Where We Left Off

In the previous article we arrived at an off-policy estimator and immediately ran into trouble: if the new policy $\pi_\theta$ drifts too far from the behavior policy $\pi_{\theta'}$, the importance weights blow up or collapse. We ended by writing down a fix—constrain the update so the two policies stay close:

$$ \mathbb{E}_{s \sim \pi_{\theta'}} \left[ D_{\text{KL}}(\pi_{\theta'}(\cdot|s) \parallel \pi_\theta(\cdot|s)) \right] \leq \delta $$

This article turns that single inequality into two concrete algorithms: TRPO, which enforces the constraint exactly, and PPO, which approximates it with a cheap clipping trick. But first, we need to upgrade our baseline.

From Baseline to Advantage

In the last article our baseline was the crudest possible choice—the average return $b = \frac{1}{N}\sum_i R(\tau^i)$, a single number for the whole batch. It reduces variance, but it is blind to which state we were in. Falling over from a stable standing pose and falling over from an already-stumbling pose are not equally bad, yet a global baseline scores them the same.

A better reference is state-dependent: how good is state $s_t$ on average? That is exactly the value function $V^\pi(s_t)$. Subtracting it gives the advantage:

$$ A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) $$

which measures how much better action $a_t$ is than the policy's average behavior in $s_t$. We learn $V^\pi$ with a second network (the critic), and estimate the advantage from sampled returns. A common, low-variance estimator is Generalized Advantage Estimation (GAE)^[1]:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

where $\gamma$ is the discount and $\lambda \in [0,1]$ trades off bias against variance. From here on, we replace the bare return-to-go-minus-baseline term with $\hat{A}_t$. This is the actor-critic setup: the actor is $\pi_\theta$, the critic is $V$.

The Surrogate Objective

Recall the off-policy gradient from last time. Stripping it down to a single time step and writing the importance ratio as

$$ r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta'}(a_t | s_t)}, $$

the per-step contribution was $r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t|s_t) \, \hat{A}_t$. Here is the key observation: this is exactly the gradient of a much simpler quantity. Since $\nabla_\theta r_t(\theta) = r_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t)$, we have

$$ \nabla_\theta \big[ r_t(\theta) \, \hat{A}_t \big] = r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t|s_t) \, \hat{A}_t. $$

So instead of differentiating a gradient estimator by hand, we can just maximize a scalar objective and let autodiff do the rest:

$$ L(\theta) = \mathbb{E}_{t} \left[ r_t(\theta) \, \hat{A}_t \right]. $$

This $L(\theta)$ is called the surrogate objective. It is "surrogate" because it is only a faithful stand-in for the true return as long as $\pi_\theta$ stays near $\pi_{\theta'}$—precisely the regime where importance sampling is trustworthy. Maximizing it without restraint pushes $r_t(\theta)$ arbitrarily large and breaks the approximation. We need the KL leash from above.

TRPO: Enforce the Constraint Exactly

Trust Region Policy Optimization (TRPO)^[2] takes the constraint literally. At each step it solves a constrained optimization:

$$ \max_{\theta} \;\; \mathbb{E}_{t}\!\left[ r_t(\theta)\, \hat{A}_t \right] \quad \text{subject to} \quad \mathbb{E}_{t}\!\left[ D_{\text{KL}}(\pi_{\theta'}(\cdot|s_t) \parallel \pi_\theta(\cdot|s_t)) \right] \le \delta. $$

The "trust region" is the ball of radius $\delta$ (in KL) around the current policy—the region where we trust the surrogate. TRPO solves this by approximating the objective to first order and the KL constraint to second order, then taking a step via the conjugate gradient method (to avoid forming the Hessian explicitly), followed by a line search to guarantee the constraint actually holds.

TRPO works and gives monotonic improvement guarantees, but the machinery—second-order curvature, conjugate gradients, line searches—is heavy and awkward to implement, especially with shared actor-critic networks. This is the pain point PPO removes.

PPO: Approximate the Constraint with Clipping

Proximal Policy Optimization (PPO)^[3] asks: can we get TRPO's "don't move too far" behavior using only first-order optimization (plain SGD/Adam)? The answer is yes, with a remarkably simple trick. Define the clipped surrogate objective:

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min\Big( r_t(\theta)\, \hat{A}_t,\;\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right] $$

where $\epsilon$ is a small hyperparameter (often $0.2$). The intuition:

When $\hat{A}_t > 0$ (a good action), we want to increase its probability, so $r_t$ rises. But once $r_t > 1+\epsilon$, the clipped term flattens out—there is no more gradient reward for pushing further. The policy cannot run away.
When $\hat{A}_t < 0$ (a bad action), we want to decrease its probability, so $r_t$ falls. Once $r_t < 1-\epsilon$, the term again flattens.
The outer $\min$ makes the objective a pessimistic lower bound: clipping only ever removes incentive to move, it never rewards overshooting.

In effect, clipping mimics the trust region without any second-order math—the $\min$ kills the gradient whenever the policy has already moved "far enough." Some implementations instead use a KL-penalty variant, folding the constraint directly into the objective:

$$ L^{\text{KLPEN}}(\theta) = \mathbb{E}_{t} \left[ r_t(\theta)\, \hat{A}_t \right] - \beta \, \mathbb{E}_{t}\!\left[ D_{\text{KL}}(\pi_{\theta'} \parallel \pi_\theta) \right] $$

with $\beta$ adapted to keep the KL near a target. The clipped version is the more popular default because it needs no penalty tuning.

Putting It Together

The full PPO loop looks like this:

Run the current policy $\pi_{\theta'}$ to collect a batch of trajectories;
Compute advantages $\hat{A}_t$ (e.g. via GAE) using the critic $V$;
For several epochs, optimize $L^{\text{CLIP}}(\theta)$ on minibatches with SGD/Adam, reusing the same batch (this is the off-policy reuse the clip makes safe);
Update the critic by regressing $V$ toward the observed returns;
Set $\theta' \leftarrow \theta$ and repeat from step 1.

Notice that step 3 reuses one batch for multiple gradient steps—exactly the data efficiency we wanted from off-policy learning in the last article, now made stable by clipping. This combination of simplicity, stability, and sample reuse is why PPO became the default policy-gradient algorithm for everything from robotics to RLHF.

What's Next

PPO carries two networks: an actor and a critic for the advantage. In the next article we'll look at GRPO, which asks whether we can drop the critic entirely—estimating the advantage from a group of sampled responses instead of a learned value function—and why that turns out to be a natural fit for fine-tuning large language models.

Schulman et al., High-Dimensional Continuous Control Using Generalized Advantage Estimation, arXiv:1506.02438. Setting $\lambda = 0$ recovers the one-step TD advantage (low variance, higher bias); $\lambda = 1$ recovers the Monte Carlo return-to-go minus baseline (high variance, unbiased). ↩
Schulman et al., Trust Region Policy Optimization, arXiv:1502.05477. ↩
Schulman et al., Proximal Policy Optimization Algorithms, arXiv:1707.06347. See also Lilian Weng's policy gradient post for a unified overview. ↩