Introduction to Reinforcement Learning

Harry Huang (aka Wenyuan Huang, 黄问远)

Introduction to Reinforcement Learning (RL)

Reinforcement Learning (RL) is a field of machine learning focused on decision-making. A simple way to understand RL is through the game of Go. At any moment, a player sees the current board configuration (the state) and must choose one of many possible legal actions (moves). A strong RL algorithm for Go (such as AlphaGo) is able to choose an effective action for any given state.

RL differs from other types of machine learning, such as image classification in computer vision, in one fundamental way:
RL agents learn through continuous interaction with an external “environment,” primarily through the rewards they receive.

What Is an Environment?

The environment includes everything the agent interacts with. It accepts the agent’s actions and produces feedback, typically in the form of new states and rewards.

In Go, the environment receives a move, updates the board, checks if the game ended, and provides a corresponding reward (e.g., +1 for a win, 0 for a draw, −1 for a loss). Over repeated interactions, the RL agent learns which strategies produce higher long-term rewards.
This trial-and-error learning process is what distinguishes RL from other machine learning paradigms.
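
To make this interaction loop concrete, here is a minimal sketch of an agent-environment interface. The `Environment` class, the `run_episode` helper, and the reset/step signatures are illustrative assumptions (loosely modeled on the common reset/step convention), not the API of any particular library.

```python
# Minimal sketch of the agent-environment interaction loop.
# Class names and signatures here are illustrative assumptions.

class Environment:
    def reset(self):
        """Start a new episode and return the initial state."""
        raise NotImplementedError

    def step(self, action):
        """Apply the agent's action and return (next_state, reward, done)."""
        raise NotImplementedError


def run_episode(env, policy):
    """Run one episode, letting `policy` pick an action for each state."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                    # the agent decides
        state, reward, done = env.step(action)    # the environment responds
        total_reward += reward                    # feedback accumulates
    return total_reward
```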

The Markov Property

In many RL settings, the agent must choose actions based only on the current observation. For this to work, we require the Markov Property:

The future evolution of the process depends only on its current state, not on the sequence of past events that led there.

This ensures that an optimal policy can be defined as a function of the current state alone.
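
In symbols, writing $s_t$ and $a_t$ for the state and action at time $t$, the Markov Property says that the next state depends only on the current state and action:

$$
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t).
$$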

Of course, many real-world scenarios violate the Markov Property.
For example, in RoboCup, a robot’s camera cannot capture the full soccer field. The agent’s observation is incomplete, so the true state is only partially observable. Researchers often design clever observation functions or memory mechanisms (e.g., recurrent networks) to approximate Markovian behavior, but a perfectly Markovian state is rarely achievable in practice.

Formalizing Reinforcement Learning

A typical RL problem is defined by the following components:

  1. State space (S): all possible states the agent can observe.
  2. Action space (A): all actions the agent can take.
  3. Reward function ($r(s, a)$): the immediate reward after taking action $a$ in state $s$.
  4. Policy ($\pi(s)$): a rule or function describing how the agent selects actions.

The objective in RL is to find an optimal policy—one that maximizes the expected long-term reward.
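
One common way to write this objective (introducing a discount factor $\gamma \in [0, 1)$, a standard ingredient not listed above) is

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right],
$$

where the expectation is taken over trajectories generated by following $\pi$.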

Example: Gridworld

To make this concrete, consider a simple environment called Gridworld.

An agent moves on an $N \times M$ grid and tries to reach a specific goal cell $(G_x, G_y)$. Each step allows the agent to move up, down, left, or right. When it reaches the goal, the episode ends and the agent receives a reward of +1; otherwise, each step incurs a small penalty of −0.01 to discourage wandering.

The formal components are:

  1. State space:
    $$
    \big( (A_x, A_y),\ (G_x, G_y) \big), \qquad A_x, G_x \in \{1, \dots, N\}, \quad A_y, G_y \in \{1, \dots, M\}.
    $$
    Here $(A_x, A_y)$ is the agent’s location, and $(G_x, G_y)$ is the goal’s location.

  2. Action space:
    $$
    a \in \{(0, 1),\ (0, -1),\ (1, 0),\ (-1, 0)\},
    $$
    corresponding to the four possible movement directions.

  3. Reward function:

    • $+1$ if the agent reaches the goal
    • $-0.01$ otherwise
  4. Optimal policy:
    Move toward the goal using the shortest path.

This is one of the simplest examples of an RL task, but it already illustrates how states, actions, and rewards interact.
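
As a concrete sketch, a minimal implementation of this Gridworld might look like the code below. The default 5×5 grid size, the random placement of agent and goal, and the flattened state tuple are illustrative choices, not part of the definition above.

```python
import random

class Gridworld:
    """Toy N x M grid: +1 on reaching the goal, -0.01 for every other step."""

    ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # the four movement directions

    def __init__(self, n=5, m=5):
        self.n, self.m = n, m

    def reset(self):
        # Place the agent and the goal on random, distinct cells.
        self.agent = (random.randint(1, self.n), random.randint(1, self.m))
        self.goal = (random.randint(1, self.n), random.randint(1, self.m))
        while self.goal == self.agent:
            self.goal = (random.randint(1, self.n), random.randint(1, self.m))
        return (*self.agent, *self.goal)  # state flattened as (A_x, A_y, G_x, G_y)

    def step(self, action):
        dx, dy = action
        x = min(max(self.agent[0] + dx, 1), self.n)  # clip to grid bounds
        y = min(max(self.agent[1] + dy, 1), self.m)
        self.agent = (x, y)
        done = self.agent == self.goal
        reward = 1.0 if done else -0.01
        return (*self.agent, *self.goal), reward, done


def shortest_path_policy(state):
    """The optimal policy described above: step toward the goal."""
    ax, ay, gx, gy = state
    if ax != gx:
        return (1, 0) if gx > ax else (-1, 0)
    return (0, 1) if gy > ay else (0, -1)
```

With the `run_episode` helper sketched earlier, `run_episode(Gridworld(), shortest_path_policy)` would walk the agent straight to the goal and return the episode’s total reward.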

Final Thoughts

This post gives only a brief introduction to RL. Real RL research involves far more complexity. Even seemingly simple components—like the reward function—can dramatically affect an agent’s ability to learn. For instance, in the Gridworld example, the agent rarely reaches the goal early in training, so relying solely on the sparse +1 reward makes learning extremely difficult.
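
One common remedy, sketched below purely for illustration, is reward shaping: adding a small dense bonus for making progress toward the goal. The Manhattan-distance bonus and the 0.1 coefficient are arbitrary assumptions, not part of the Gridworld definition above.

```python
def shaped_reward(prev_state, next_state, done):
    """Sparse Gridworld reward plus a small bonus for moving closer to the goal."""
    px, py, _, _ = prev_state
    ax, ay, gx, gy = next_state
    prev_dist = abs(px - gx) + abs(py - gy)   # Manhattan distance before the move
    next_dist = abs(ax - gx) + abs(ay - gy)   # Manhattan distance after the move
    base = 1.0 if done else -0.01             # the original sparse reward
    return base + 0.1 * (prev_dist - next_dist)
```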

Designing the state representation, action space, and reward function is often one of the hardest parts of building an RL system.

Among the many RL algorithm families, Policy Gradient methods have become increasingly popular in modern research and applications.
In the next article, we will begin exploring them—starting from the most fundamental algorithm: REINFORCE.
