How Reinforcement Learning Works
RL frames learning as an agent interacting with an environment. At each step the agent observes a state, takes an action, and receives a reward signal. The goal: learn a policy that maximises cumulative future reward.
Unlike supervised learning (which needs labelled data), RL learns from trial and error. The agent explores the environment, gradually discovering which actions lead to high cumulative reward.
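This loop is easy to make concrete in code. A minimal sketch of the interaction loop using Gymnasium's CartPole-v1 environment and a random placeholder policy (the environment choice is purely illustrative):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # observe the initial state
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder "policy": act at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # accumulate the reward signal
    done = terminated or truncated
print("Episode return:", total_reward)
```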
Core Concepts
State (s)
A description of the environment at a point in time. Can be a game board, robot joint angles, stock prices, or text tokens.
Action (a)
A choice the agent can make. Discrete (move left/right) or continuous (torque to apply). The set of all actions is the action space.
Reward (r)
A scalar signal telling the agent how good its action was. Reward engineering is critical — shaping good reward functions is half the challenge.
Policy (π)
The agent's decision rule: given a state, which action to take. Can be deterministic, a = π(s), or stochastic, with π(a|s) giving a probability distribution over actions.
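A toy sketch of both flavours, where the action-preference table is made up purely for illustration:

```python
import numpy as np

prefs = np.array([[2.0, 0.5, -1.0],     # hypothetical action preferences for state 0
                  [0.1, 0.1,  3.0]])    # ... and for state 1

def deterministic_policy(state):
    return int(np.argmax(prefs[state]))            # π(s) -> a

def stochastic_policy(state):
    logits = prefs[state]
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> π(a|s)
    return np.random.choice(len(probs), p=probs)   # sample an action

print(deterministic_policy(0), stochastic_policy(0))
```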
Value Function (V)
Expected cumulative future reward from state s. Tells the agent how "good" it is to be in a given state under a policy.
Q-Function (Q)
Expected cumulative future reward starting from state s, taking action a, then following policy π. Q(s,a) drives action selection in many algorithms.
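The two are linked: under a policy π, the value of a state is the policy-weighted average of its action values, V(s) = Σ_a π(a|s) Q(s,a). A quick numerical check with hypothetical numbers:

```python
import numpy as np

Q_s = np.array([1.0, 4.0, 2.0])   # hypothetical Q(s, a) for three actions
pi_s = np.array([0.2, 0.5, 0.3])  # hypothetical π(a|s), sums to 1

V_s = np.dot(pi_s, Q_s)           # V(s) = sum over a of π(a|s) * Q(s, a)
print(V_s)                        # 2.8
```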
Discount Factor (γ)
How much to value future rewards vs immediate ones. γ=0.99 means the agent plans ahead; γ=0 means purely greedy.
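A small illustration of how γ changes what the agent cares about, using an arbitrary reward sequence:

```python
rewards = [1.0, 0.0, 0.0, 10.0]          # arbitrary reward sequence over four steps

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.99))  # ~10.7: the distant reward still matters
print(discounted_return(rewards, 0.0))   # 1.0: only the immediate reward counts
```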
Exploration vs Exploitation
The core dilemma: explore new actions to gather info, or exploit known good actions. ε-greedy, UCB, and entropy bonuses balance this.
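Both ideas fit in a few lines for a simple bandit setting; the value estimates and visit counts below are hypothetical:

```python
import numpy as np

q_est = np.array([0.2, 0.5, 0.1])   # hypothetical value estimates per action
counts = np.array([10, 2, 5])       # how often each action has been tried
t = counts.sum()

def epsilon_greedy(epsilon=0.1):
    if np.random.random() < epsilon:
        return np.random.randint(len(q_est))   # explore: pick uniformly at random
    return int(np.argmax(q_est))               # exploit: pick the current best

def ucb(c=2.0):
    bonus = c * np.sqrt(np.log(t) / counts)    # uncertainty bonus for rarely-tried actions
    return int(np.argmax(q_est + bonus))       # optimism in the face of uncertainty

print(epsilon_greedy(), ucb())
```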
Key Algorithms
| Algorithm | Type | Key Idea | Best For |
|---|---|---|---|
| Q-Learning | Model-free, off-policy | Learn Q(s,a) using Bellman equation | Discrete actions, tabular problems |
| SARSA | Model-free, on-policy | Update Q using actual next action taken | Safe/conservative policies |
| DQN | Model-free, off-policy | Neural net approximates Q; replay buffer + target net | Atari games, discrete control |
| Policy Gradient | Policy-based, on-policy | Directly optimise the policy with gradient ascent on expected reward (see sketch below) | Continuous action spaces |
| PPO | On-policy actor-critic | Proximal Policy Optimisation: clips policy updates for stability | Robotics, LLM fine-tuning (RLHF) |
| SAC | Off-policy actor-critic | Maximises entropy + reward for exploration | Continuous control, robotics |
| AlphaZero / MuZero | Model-based | Self-play + MCTS planning; MuZero also learns a world model | Games, planning problems |
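The policy-gradient idea in the table is the easiest to make concrete. A minimal REINFORCE sketch on CartPole, assuming Gymnasium and PyTorch are available (network size, learning rate, and episode count are arbitrary):

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns G_t, computed backwards over the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # crude baseline via normalisation

    # Gradient ascent on expected return = gradient descent on -sum(log π(a|s) * G)
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```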
Deep Reinforcement Learning
Deep RL replaces hand-crafted features with neural networks to approximate value functions and policies directly from raw observations (pixels, sensor data, text).
Actor-Critic
Two networks: the actor outputs actions; the critic estimates state value. This reduces variance compared to pure policy gradients. PPO, A3C, and SAC are all actor-critic methods.
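In code the split is just two small networks (a hypothetical PyTorch sketch for a 4-dimensional state and 2 discrete actions); the critic's value estimate serves as a baseline when computing the policy gradient:

```python
import torch.nn as nn

# Actor: maps a state to action logits (a stochastic policy)
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

# Critic: maps a state to a scalar value estimate V(s)
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
```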
Experience Replay
Store past (s, a, r, s') transitions in a buffer. Sample mini-batches randomly to break temporal correlations and improve sample efficiency.
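A minimal replay buffer sketch (capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```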
Target Networks
A slowly-updated copy of the Q-network used for computing training targets. Prevents oscillations and improves stability in DQN and its variants.
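The target network is literally a lagged copy of the online network. A sketch of the two common update schemes for PyTorch modules (the network here is a placeholder):

```python
import copy
import torch.nn as nn

q_net = nn.Linear(4, 2)              # placeholder Q-network
target_net = copy.deepcopy(q_net)    # lagged copy used to compute training targets

def hard_update(target, source):
    """Copy weights wholesale every N steps (DQN-style)."""
    target.load_state_dict(source.state_dict())

def soft_update(target, source, tau=0.005):
    """Polyak averaging: target <- tau * source + (1 - tau) * target (DDPG/SAC-style)."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.mul_(1 - tau).add_(tau * s_param.data)
```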
Reward Shaping
Adding extra reward signals (curiosity, potential-based shaping) to guide learning when the true reward is sparse — e.g., robot only rewarded on task completion.
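Potential-based shaping adds F(s, s') = γΦ(s') − Φ(s) to the environment reward, which leaves the optimal policy unchanged. A toy sketch where the potential Φ is (hypothetically) the negative Manhattan distance to a goal cell:

```python
gamma = 0.99

def potential(state, goal=(3, 3)):
    # Hypothetical potential: closer to the goal -> higher Φ(s)
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(reward, state, next_state):
    # r' = r + γ Φ(s') - Φ(s); the optimal policy is provably unchanged
    return reward + gamma * potential(next_state) - potential(state)

print(shaped_reward(0.0, (0, 0), (1, 0)))  # small positive signal for moving toward the goal
```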
💡 Key challenge: Deep RL is famously sample-inefficient — a human learns Pong in minutes; DQN needs tens of millions of frames. Model-based RL and offline RL are active research areas to address this.
RLHF — Aligning Language Models
Reinforcement Learning from Human Feedback (RLHF) is the technique behind ChatGPT, Claude, and most modern aligned LLMs. It adapts RL to fine-tune language models using human preference data.
Step 1: SFT
Supervised Fine-Tuning on (prompt, ideal response) pairs created by human labellers. Gives the model the right output format and style.
Step 2: Reward Model
Humans compare two responses and pick the better one. A separate reward model (RM) is trained to predict these preferences, so it can then score any response.
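The RM is typically trained with a pairwise Bradley-Terry style loss: maximise the probability that the human-preferred response scores higher. A minimal sketch with hypothetical scalar scores:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for a batch of (chosen, rejected) response pairs
chosen_scores = torch.tensor([1.2, 0.3, 2.0])
rejected_scores = torch.tensor([0.4, 0.9, -0.5])

# Bradley-Terry pairwise loss: -log σ(r_chosen - r_rejected)
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
print(loss.item())
```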
Step 3: PPO
The LLM is fine-tuned with PPO to maximise reward model scores, with a KL penalty to prevent straying too far from the SFT checkpoint.
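The quantity optimised in step 3 is usually the RM score minus a KL penalty against the frozen SFT (reference) policy. A sketch with hypothetical per-token log-probabilities; the coefficient β is arbitrary:

```python
import torch

beta = 0.1                                          # KL penalty coefficient
rm_score = torch.tensor(2.3)                        # hypothetical reward-model score for a response
policy_logprobs = torch.tensor([-1.1, -0.7, -2.0])  # log π_RL(token) for each generated token
ref_logprobs = torch.tensor([-1.3, -0.9, -1.5])     # log π_SFT(token) for the same tokens

kl_per_token = policy_logprobs - ref_logprobs       # approximate per-token KL contribution
reward = rm_score - beta * kl_per_token.sum()       # penalise drifting from the SFT checkpoint
print(reward.item())
```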
Constitutional AI
Anthropic's variant: instead of human rankings, the model critiques its own outputs against a set of principles (the "constitution"). Scales better.
Real-World Applications
Games & Simulation
AlphaGo Zero mastered Go purely by self-play. OpenAI Five beat Dota 2 pros. RL agents achieve superhuman performance in hundreds of games.
Robotics
Boston Dynamics uses RL for locomotion. OpenAI's Dexterous Hand solved a Rubik's cube. RL enables robots to learn from simulation before real-world deployment.
Recommendation Systems
YouTube, Netflix, and TikTok model user engagement as RL problems — actions are content recommendations; reward is watch time and satisfaction.
Drug Discovery
RL optimises molecular structures by treating chemistry as a game. AlphaFold-inspired systems propose novel drug candidates with target binding properties.
Data Center Cooling
Google used RL to reduce data center cooling energy by ~40%. The agent continuously adjusts HVAC parameters for optimal energy efficiency.
Autonomous Vehicles
Waymo, Tesla, and others use RL for lane changing, merging, and complex manoeuvring in simulation before deploying to real vehicles.
Code Example — Q-Learning from Scratch
```python
import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n   # 16
n_actions = env.action_space.n       # 4

# Q-table initialised to zeros
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.8      # learning rate
gamma = 0.95     # discount factor
epsilon = 1.0    # exploration rate
episodes = 2000

for ep in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = np.argmax(Q[state])         # exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Bellman (temporal-difference) update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
    epsilon = max(0.01, epsilon * 0.995)   # decay exploration after each episode

print("Learned Q-table:", Q)
```
💡 Try it: install Gymnasium (pip install gymnasium) and run this. The agent learns to navigate a 4×4 grid to reach the goal without falling into holes, purely from reward signals.