Foundations

Reinforcement Learning

Reinforcement Learning (RL) trains agents to make sequences of decisions by rewarding desired behaviours. It powers game-playing AIs, robotic control, recommendation systems, and RLHF — the technique that aligns ChatGPT and Claude with human preferences.

1992: Q-Learning introduced
5,000+: Elo rating reached by AlphaGo Zero
RLHF: powers ChatGPT & Claude

How Reinforcement Learning Works

RL frames learning as an agent interacting with an environment. At each step the agent observes a state, takes an action, and receives a reward signal. The goal: learn a policy that maximises cumulative future reward.

Agent → action → Environment → state, reward → Agent

Unlike supervised learning (which needs labelled data), RL learns from trial and error. The agent explores the environment, gradually discovering which actions lead to high cumulative reward.

Core Concepts

State (s)

A description of the environment at a point in time. Can be a game board, robot joint angles, stock prices, or text tokens.

Action (a)

A choice the agent can make. Discrete (move left/right) or continuous (torque to apply). The set of all actions is the action space.

Reward (r)

A scalar signal telling the agent how good its action was. Reward engineering is critical — shaping good reward functions is half the challenge.

Policy (π)

The agent's decision rule: given a state, which action to take. Can be deterministic, π(s) = a, or stochastic, π(a|s) giving a probability distribution over actions.

Value Function (V)

Expected cumulative future reward from state s. Tells the agent how "good" it is to be in a given state under a policy.

Q-Function (Q)

Expected cumulative reward starting from state s, taking action a, then following policy π. Q(s,a) drives action selection in many algorithms.

Discount Factor (γ)

How much to value future rewards vs immediate ones. γ=0.99 means the agent plans ahead; γ=0 means purely greedy.
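As a minimal sketch of what γ does, the discounted return G = r₀ + γr₁ + γ²r₂ + … can be computed by folding rewards back to front (the reward values below are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Fold rewards back to front: G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Hypothetical episode: a small reward mid-way, a big one at the end
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_return(rewards, gamma=0.99))  # far-sighted agent
print(discounted_return(rewards, gamma=0.0))   # greedy: only r_0 counts
```

With γ=0 the fold collapses to the first reward, matching the "purely greedy" case above.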

Exploration vs Exploitation

The core dilemma: explore new actions to gather info, or exploit known good actions. ε-greedy, UCB, and entropy bonuses balance this.
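Two of these rules can be sketched in a few lines; the bandit setting and the exploration constant c are illustrative assumptions, not fixed choices:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the current best."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

def ucb(q_values, counts, t, c=2.0):
    """Upper Confidence Bound: favour arms with high estimates OR few pulls."""
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-9))
    return int(np.argmax(q_values + bonus))

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.9]), 0.0, rng))  # always exploits
print(ucb(np.array([0.5, 0.9]), np.array([0.0, 100.0]), t=100))
```

Note how UCB picks the never-tried arm even though its estimate is lower: the uncertainty bonus does the exploring.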

Key Algorithms

| Algorithm | Type | Key Idea | Best For |
| --- | --- | --- | --- |
| Q-Learning | Model-free, off-policy | Learn Q(s,a) using the Bellman equation | Discrete actions, tabular problems |
| SARSA | Model-free, on-policy | Update Q using the actual next action taken | Safe/conservative policies |
| DQN | Deep Q-Network | Neural net approximates Q; replay buffer + target net | Atari games, discrete control |
| Policy Gradient | Policy-based | Directly optimise the policy with gradient ascent on expected reward | Continuous action spaces |
| PPO | Actor-Critic | Proximal Policy Optimisation: clips updates for stability | Robotics, LLM fine-tuning (RLHF) |
| SAC | Off-policy Actor-Critic | Maximises entropy + reward for exploration | Continuous control, robotics |
| AlphaZero / MuZero | Model-based | Self-play + MCTS + learned world model | Games, planning problems |
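The policy-gradient idea can be sketched on a toy two-armed bandit with a softmax policy (the REINFORCE rule); the bandit means, learning rate, and step count below are all illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # hypothetical bandit: arm 1 pays more
theta = np.zeros(2)                # policy parameters, one logit per arm
lr = 0.1

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample an action
    r = rng.normal(true_means[a], 0.1)         # sample a reward
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi              # gradient ascent on E[r]

print(softmax(theta))  # probabilities should favour arm 1
```

Real policy-gradient methods subtract a baseline (or use a critic, as below) to reduce the variance of this update.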

Deep Reinforcement Learning

Deep RL replaces hand-crafted features with neural networks to approximate value functions and policies directly from raw observations (pixels, sensor data, text).

Actor-Critic

Two networks: the actor outputs actions; the critic estimates value. Reduces variance compared to pure policy gradients. PPO, A3C, SAC are all actor-critic.
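A minimal sketch of how the critic feeds the actor, assuming a one-step TD advantage (many implementations use GAE instead):

```python
def td_advantage(r, v_s, v_s_next, gamma=0.99, done=False):
    """One-step advantage estimate: A = r + gamma*V(s') - V(s).
    The critic supplies V; the actor is then updated in the
    direction A * grad log pi(a|s)."""
    target = r + (0.0 if done else gamma * v_s_next)
    return target - v_s

# Hypothetical critic values: the action did better than expected
print(td_advantage(r=1.0, v_s=0.5, v_s_next=0.8))
```

A positive advantage pushes the policy toward the action taken; a negative one pushes away from it.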

Experience Replay

Store past (s, a, r, s') transitions in a buffer. Sample mini-batches randomly to break temporal correlations and improve sample efficiency.
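A minimal replay buffer sketch (the capacity and batch size below are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # overfill: the first 50 are evicted
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(32)
```

Prioritised replay (sampling by TD error rather than uniformly) is a common refinement.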

Target Networks

A slowly-updated copy of the Q-network used for computing training targets. Prevents oscillations and improves stability in DQN and its variants.
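DQN itself copies the online weights to the target network every N steps; a common variant (used in DDPG and SAC) is a soft Polyak update, sketched here with plain NumPy arrays standing in for network weights:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau)*target + tau*online.
    Small tau means the target drifts slowly toward the online net."""
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]

online = [np.ones((2, 2))]    # stand-in for trained weights
target = [np.zeros((2, 2))]   # stand-in for the lagged copy
target = soft_update(target, online, tau=0.1)
```

Because the training target r + γ max Q_target(s', a') now changes slowly, the regression problem the online network solves stays nearly stationary.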

Reward Shaping

Adding extra reward signals (curiosity, potential-based shaping) to guide learning when the true reward is sparse — e.g., robot only rewarded on task completion.
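Potential-based shaping can be sketched directly from its definition, r' = r + gamma*Phi(s') - Phi(s), a form shown by Ng et al. to leave the optimal policy unchanged; the distance-to-goal potential below is a made-up example:

```python
def shaped_reward(r, s, s_next, gamma, potential):
    """Potential-based reward shaping: r' = r + gamma*Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)

# Hypothetical potential: negative distance to a goal cell on a 1-D track
goal = 10
phi = lambda s: -abs(goal - s)

# Moving from cell 5 to cell 6 earns a small bonus even with zero true reward
print(shaped_reward(r=0.0, s=5, s_next=6, gamma=1.0, potential=phi))
```

The agent gets dense feedback for progress toward the goal while the sparse true reward still defines the task.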

💡 Key challenge: Deep RL is famously sample-inefficient — a human learns Pong in minutes; DQN needs tens of millions of frames. Model-based RL and offline RL are active research areas to address this.

RLHF — Aligning Language Models

Reinforcement Learning from Human Feedback (RLHF) is the technique behind ChatGPT, Claude, and most modern aligned LLMs. It adapts RL to fine-tune language models using human preference data.

Base LLM → SFT → Human Rankings → Reward Model → PPO Fine-tune → Aligned LLM

Step 1: SFT

Supervised Fine-Tuning on (prompt, ideal response) pairs created by human labellers. Gives the model the right output format and style.

Step 2: Reward Model

Humans compare two responses and pick the better one. A separate reward model (RM) is trained to predict these preferences, so it can score any response.

Step 3: PPO

The LLM is fine-tuned with PPO to maximise reward model scores, with a KL penalty to prevent straying too far from the SFT checkpoint.
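A sketch of the per-sample objective, assuming a simple per-token KL estimate; beta and all inputs below are illustrative, not values from any particular implementation:

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """RLHF objective sketch: reward model score minus a KL penalty
    that keeps the policy close to the SFT reference model.
    kl here is the standard per-sample estimator log pi(y) - log ref(y)."""
    kl = logp_policy - logp_ref
    return rm_score - beta * kl

# If the policy matches the reference exactly, there is no penalty
print(rlhf_reward(rm_score=2.0, logp_policy=-1.2, logp_ref=-1.2))
```

The penalty grows as the policy assigns its outputs much higher probability than the reference would, discouraging reward hacking that drifts far from the SFT checkpoint.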

Constitutional AI

Anthropic's variant: instead of human rankings, the model critiques its own outputs against a set of principles (the "constitution"). Scales better.

Real-World Applications

Games & Simulation

AlphaGo Zero mastered Go purely by self-play. OpenAI Five beat Dota 2 pros. RL agents achieve superhuman performance in hundreds of games.

Robotics

Boston Dynamics uses RL for locomotion. OpenAI's Dexterous Hand solved a Rubik's cube. RL enables robots to learn from simulation before real-world deployment.

Recommendation Systems

YouTube, Netflix, and TikTok model user engagement as RL problems — actions are content recommendations; reward is watch time and satisfaction.

Drug Discovery

RL optimises molecular structures by treating chemistry as a game. AlphaFold-inspired systems propose novel drug candidates with target binding properties.

Data Center Cooling

Google used RL to reduce data center cooling energy by ~40%. The agent continuously adjusts HVAC parameters for optimal energy efficiency.

Autonomous Vehicles

Waymo, Tesla, and others use RL for lane changing, merging, and complex manoeuvring in simulation before deploying to real vehicles.

Code Example — Q-Learning from Scratch

Python
import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=False)
n_states  = env.observation_space.n   # 16
n_actions = env.action_space.n        # 4

# Q-table initialised to zeros
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha   = 0.8   # learning rate
gamma   = 0.95  # discount factor
epsilon = 1.0   # exploration rate
episodes = 2000

for ep in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = np.argmax(Q[state])        # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Bellman update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state

    epsilon = max(0.01, epsilon * 0.995)  # decay exploration

print("Learned Q-table:", Q)

💡 Try it: Run pip install gymnasium, then run the script. The agent learns to navigate a 4×4 grid to reach the goal without falling into holes — purely from reward signals.