How Reinforcement Learning Works
RL frames learning as an agent interacting with an environment. At each step the agent observes a state, takes an action, and receives a reward signal. The goal: learn a policy that maximises cumulative future reward.
Unlike supervised learning (which needs labelled data), RL learns from trial and error. The agent explores the environment, gradually discovering which actions lead to high cumulative reward.
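This loop is easy to make concrete in code. A minimal sketch of the interaction loop using Gymnasium's CartPole-v1 environment and a random placeholder policy (the environment choice is purely illustrative):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # observe the initial state
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder "policy": act at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # accumulate the reward signal
    done = terminated or truncated
print("Episode return:", total_reward)
```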
Core Concepts
State (s)
A description of the environment at a point in time. Can be a game board, robot joint angles, stock prices, or text tokens.
Action (a)
A choice the agent can make. Discrete (move left/right) or continuous (torque to apply). The set of all actions is the action space.
Reward (r)
A scalar signal telling the agent how good its action was. Reward engineering is critical — shaping good reward functions is half the challenge.
Policy (π)
The agent's decision rule: given a state, which action to take. Can be deterministic, a = π(s), or stochastic, with π(a|s) giving a probability distribution over actions.
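A toy sketch of both flavours, where the action-preference table is made up purely for illustration:

```python
import numpy as np

prefs = np.array([[2.0, 0.5, -1.0],     # hypothetical action preferences for state 0
                  [0.1, 0.1,  3.0]])    # ... and for state 1

def deterministic_policy(state):
    return int(np.argmax(prefs[state]))            # π(s) -> a

def stochastic_policy(state):
    logits = prefs[state]
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> π(a|s)
    return np.random.choice(len(probs), p=probs)   # sample an action

print(deterministic_policy(0), stochastic_policy(0))
```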
Value Function (V)
Expected cumulative future reward from state s. Tells the agent how "good" it is to be in a given state under a policy.
Q-Function (Q)
Expected cumulative future reward starting from state s, taking action a, then following policy π. Q(s,a) drives action selection in many algorithms.
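The two are linked: under a policy π, the value of a state is the policy-weighted average of its action values, V(s) = Σ_a π(a|s) Q(s,a). A quick numerical check with hypothetical numbers:

```python
import numpy as np

Q_s = np.array([1.0, 4.0, 2.0])   # hypothetical Q(s, a) for three actions
pi_s = np.array([0.2, 0.5, 0.3])  # hypothetical π(a|s), sums to 1

V_s = np.dot(pi_s, Q_s)           # V(s) = sum over a of π(a|s) * Q(s, a)
print(V_s)                        # 2.8
```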
Discount Factor (γ)
How much to value future rewards vs immediate ones. γ=0.99 means the agent plans ahead; γ=0 means purely greedy.
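A small illustration of how γ changes what the agent cares about, using an arbitrary reward sequence:

```python
rewards = [1.0, 0.0, 0.0, 10.0]          # arbitrary reward sequence over four steps

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.99))  # ~10.7: the distant reward still matters
print(discounted_return(rewards, 0.0))   # 1.0: only the immediate reward counts
```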
Exploration vs Exploitation
The core dilemma: explore new actions to gather info, or exploit known good actions. ε-greedy, UCB, and entropy bonuses balance this.
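Both ideas fit in a few lines for a simple bandit setting; the value estimates and visit counts below are hypothetical:

```python
import numpy as np

q_est = np.array([0.2, 0.5, 0.1])   # hypothetical value estimates per action
counts = np.array([10, 2, 5])       # how often each action has been tried
t = counts.sum()

def epsilon_greedy(epsilon=0.1):
    if np.random.random() < epsilon:
        return np.random.randint(len(q_est))   # explore: pick uniformly at random
    return int(np.argmax(q_est))               # exploit: pick the current best

def ucb(c=2.0):
    bonus = c * np.sqrt(np.log(t) / counts)    # uncertainty bonus for rarely-tried actions
    return int(np.argmax(q_est + bonus))       # optimism in the face of uncertainty

print(epsilon_greedy(), ucb())
```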
Key Algorithms
| Algorithm | Type | Key Idea | Best For |
|---|---|---|---|
| Q-Learning | Model-free, off-policy | Learn Q(s,a) using Bellman equation | Discrete actions, tabular problems |
| SARSA | Model-free, on-policy | Update Q using actual next action taken | Safe/conservative policies |
| DQN | Model-free, off-policy | Neural net approximates Q; replay buffer + target net | Atari games, discrete control |
| Policy Gradient | Policy-based, on-policy | Directly optimise the policy with gradient ascent on expected reward (see sketch below) | Continuous action spaces |
| PPO | On-policy actor-critic | Proximal Policy Optimisation: clips policy updates for stability | Robotics, LLM fine-tuning (RLHF) |
| SAC | Off-policy actor-critic | Maximises entropy + reward for exploration | Continuous control, robotics |
| AlphaZero / MuZero | Model-based | Self-play + MCTS planning; MuZero also learns a world model | Games, planning problems |
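The policy-gradient idea in the table is the easiest to make concrete. A minimal REINFORCE sketch on CartPole, assuming Gymnasium and PyTorch are available (network size, learning rate, and episode count are arbitrary):

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns G_t, computed backwards over the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # crude baseline via normalisation

    # Gradient ascent on expected return = gradient descent on -sum(log π(a|s) * G)
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```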
Deep Reinforcement Learning
Deep RL replaces hand-crafted features with neural networks to approximate value functions and policies directly from raw observations (pixels, sensor data, text).
Actor-Critic
Two networks: the actor outputs actions; the critic estimates state value. This reduces variance compared to pure policy gradients. PPO, A3C, and SAC are all actor-critic methods.
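In code the split is just two small networks (a hypothetical PyTorch sketch for a 4-dimensional state and 2 discrete actions); the critic's value estimate serves as a baseline when computing the policy gradient:

```python
import torch.nn as nn

# Actor: maps a state to action logits (a stochastic policy)
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

# Critic: maps a state to a scalar value estimate V(s)
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
```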
Experience Replay
Store past (s, a, r, s') transitions in a buffer. Sample mini-batches randomly to break temporal correlations and improve sample efficiency.
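A minimal replay buffer sketch (capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```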
Target Networks
A slowly-updated copy of the Q-network used for computing training targets. Prevents oscillations and improves stability in DQN and its variants.
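The target network is literally a lagged copy of the online network. A sketch of the two common update schemes for PyTorch modules (the network here is a placeholder):

```python
import copy
import torch.nn as nn

q_net = nn.Linear(4, 2)              # placeholder Q-network
target_net = copy.deepcopy(q_net)    # lagged copy used to compute training targets

def hard_update(target, source):
    """Copy weights wholesale every N steps (DQN-style)."""
    target.load_state_dict(source.state_dict())

def soft_update(target, source, tau=0.005):
    """Polyak averaging: target <- tau * source + (1 - tau) * target (DDPG/SAC-style)."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.mul_(1 - tau).add_(tau * s_param.data)
```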
Reward Shaping
Adding extra reward signals (curiosity, potential-based shaping) to guide learning when the true reward is sparse — e.g., robot only rewarded on task completion.
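Potential-based shaping adds F(s, s') = γΦ(s') − Φ(s) to the environment reward, which leaves the optimal policy unchanged. A toy sketch where the potential Φ is (hypothetically) the negative Manhattan distance to a goal cell:

```python
gamma = 0.99

def potential(state, goal=(3, 3)):
    # Hypothetical potential: closer to the goal -> higher Φ(s)
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(reward, state, next_state):
    # r' = r + γ Φ(s') - Φ(s); the optimal policy is provably unchanged
    return reward + gamma * potential(next_state) - potential(state)

print(shaped_reward(0.0, (0, 0), (1, 0)))  # small positive signal for moving toward the goal
```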
💡 Key challenge: Deep RL is famously sample-inefficient — a human learns Pong in minutes; DQN needs tens of millions of frames. Model-based RL and offline RL are active research areas to address this.
RLHF — Aligning Language Models
Reinforcement Learning from Human Feedback (RLHF) is the technique behind ChatGPT, Claude, and most modern aligned LLMs. It adapts RL to fine-tune language models using human preference data.
Step 1: SFT
Supervised Fine-Tuning on (prompt, ideal response) pairs created by human labellers. Gives the model the right output format and style.
Step 2: Reward Model
Humans compare two responses and pick the better one. A separate reward model (RM) is trained to predict these preferences, so it can then score any response.
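The RM is typically trained with a pairwise Bradley-Terry style loss: maximise the probability that the human-preferred response scores higher. A minimal sketch with hypothetical scalar scores:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for a batch of (chosen, rejected) response pairs
chosen_scores = torch.tensor([1.2, 0.3, 2.0])
rejected_scores = torch.tensor([0.4, 0.9, -0.5])

# Bradley-Terry pairwise loss: -log σ(r_chosen - r_rejected)
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
print(loss.item())
```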
Step 3: PPO
The LLM is fine-tuned with PPO to maximise reward model scores, with a KL penalty to prevent straying too far from the SFT checkpoint.
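The quantity optimised in step 3 is usually the RM score minus a KL penalty against the frozen SFT (reference) policy. A sketch with hypothetical per-token log-probabilities; the coefficient β is arbitrary:

```python
import torch

beta = 0.1                                          # KL penalty coefficient
rm_score = torch.tensor(2.3)                        # hypothetical reward-model score for a response
policy_logprobs = torch.tensor([-1.1, -0.7, -2.0])  # log π_RL(token) for each generated token
ref_logprobs = torch.tensor([-1.3, -0.9, -1.5])     # log π_SFT(token) for the same tokens

kl_per_token = policy_logprobs - ref_logprobs       # approximate per-token KL contribution
reward = rm_score - beta * kl_per_token.sum()       # penalise drifting from the SFT checkpoint
print(reward.item())
```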
Constitutional AI
Anthropic's variant: instead of human rankings, the model critiques its own outputs against a set of principles (the "constitution"). Scales better.
Real-World Applications
Games & Simulation
AlphaGo Zero mastered Go purely by self-play. OpenAI Five beat Dota 2 pros. RL agents achieve superhuman performance in hundreds of games.
Robotics
Boston Dynamics uses RL for locomotion. OpenAI's Dexterous Hand solved a Rubik's cube. RL enables robots to learn from simulation before real-world deployment.
Recommendation Systems
YouTube, Netflix, and TikTok model user engagement as RL problems — actions are content recommendations; reward is watch time and satisfaction.
Drug Discovery
RL optimises molecular structures by treating chemistry as a game. AlphaFold-inspired systems propose novel drug candidates with target binding properties.
Data Center Cooling
Google used RL to reduce data center cooling energy by ~40%. The agent continuously adjusts HVAC parameters for optimal energy efficiency.
Autonomous Vehicles
Waymo, Tesla, and others use RL for lane changing, merging, and complex manoeuvring in simulation before deploying to real vehicles.
Code Example — Q-Learning from Scratch
```python
import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n   # 16
n_actions = env.action_space.n       # 4

# Q-table initialised to zeros
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.8      # learning rate
gamma = 0.95     # discount factor
epsilon = 1.0    # exploration rate
episodes = 2000

for ep in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = np.argmax(Q[state])         # exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Bellman (temporal-difference) update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
    epsilon = max(0.01, epsilon * 0.995)   # decay exploration after each episode

print("Learned Q-table:", Q)
```
💡 Try it: install Gymnasium (pip install gymnasium) and run this. The agent learns to navigate a 4×4 grid to reach the goal without falling into holes, purely from reward signals.