What is Deep Learning?
Deep Learning (DL) is a subset of machine learning that uses neural networks with many layers — typically 3 or more — to model complex patterns in data. The word "deep" refers to the depth of the network: the number of successive layers through which data is transformed.
The critical difference from classical ML: deep learning learns its own features from raw data, rather than relying on domain experts to hand-engineer them. A CNN shown millions of images will discover edges, textures, shapes, and object parts on its own — without anyone telling it what to look for.
Deep Learning vs Classical ML
| Aspect | Classical ML | Deep Learning |
|---|---|---|
| Features | Hand-engineered by experts | Learned automatically from data |
| Data needed | Works well with small datasets | Needs large datasets (tens of thousands to billions of examples) |
| Interpretability | Often interpretable | Black-box (explainability is active research) |
| Hardware | CPU is usually sufficient | Requires GPU / TPU for large models |
| Best domains | Tabular, structured data | Images, text, audio, video, sequences |
Representation Learning Hierarchy
Each layer in a deep network learns a progressively more abstract representation. For image data, a typical hierarchy looks like: raw pixels → edges → textures → object parts → whole objects.
This hierarchy of representations is automatically learned through gradient descent — no hand-coding required.
Neural Network Basics
Every deep learning system is built from the same fundamental components. Understanding these building blocks is essential before exploring architectures.
🧠 Neuron
Computes a weighted sum of its inputs, adds a bias, then applies an activation function: output = f(w·x + b). Inspired loosely by biological neurons.
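A single neuron is small enough to write out directly; here is a minimal sketch in plain Python with made-up weights and inputs, using a sigmoid as the activation f:

```python
import math

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # w·x + b
    return 1 / (1 + math.exp(-z))                 # f(z) = sigmoid(z)

# Illustrative values only: three inputs, three weights, one bias
print(neuron(x=[0.5, -1.2, 3.0], w=[0.8, 0.1, -0.4], b=0.05))  # ≈ 0.30
```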
📐 Layers
Neurons are organised into layers: Input layer receives data, hidden layers transform representations, and the output layer produces predictions.
⚡ Activation Functions
Introduce non-linearity so networks can learn complex functions. Common choices: ReLU (max(0,x)), Sigmoid (0–1 output), Tanh (−1 to 1), GELU (used in transformers).
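A quick sketch of these four activations in PyTorch (the same library used in the MNIST example below), applied to a handful of sample values:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)  # sample inputs from -3 to 3
print(torch.relu(x))          # max(0, x): negatives become zero
print(torch.sigmoid(x))       # squashed into (0, 1)
print(torch.tanh(x))          # squashed into (-1, 1)
print(F.gelu(x))              # smooth ReLU variant common in transformers
```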
📉 Loss Function
Measures how wrong the model's predictions are. Cross-entropy for classification, MSE for regression. The training objective is to minimise this value.
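A minimal sketch of both losses with made-up predictions and targets:

```python
import torch
import torch.nn as nn

# Classification: cross-entropy takes raw logits and integer class labels
logits = torch.tensor([[2.0, 0.5, -1.0]])    # one sample, three classes
label = torch.tensor([0])                    # the true class index
print(nn.CrossEntropyLoss()(logits, label))  # low loss: class 0 already has the largest logit

# Regression: mean squared error between predictions and targets
pred = torch.tensor([2.5, 0.0])
target = torch.tensor([3.0, -0.5])
print(nn.MSELoss()(pred, target))            # mean of squared differences = 0.25
```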
🔁 Backpropagation
Computes gradients of the loss with respect to every parameter using the chain rule of calculus. These gradients tell us which direction to adjust each weight to reduce the loss.
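PyTorch's autograd applies the chain rule for you; a tiny sketch with one trainable weight and a squared error:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # a single trainable parameter
x, y = torch.tensor(3.0), torch.tensor(10.0)

loss = (w * x - y) ** 2  # (2*3 - 10)^2 = 16
loss.backward()          # chain rule: dloss/dw = 2*(w*x - y)*x = -24
print(w.grad)            # tensor(-24.)
```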
📊 Batch Normalization
Normalises layer activations across a mini-batch. Stabilises training, allows higher learning rates, and acts as a mild regulariser. Introduced in 2015, now ubiquitous.
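In PyTorch, batch norm is simply a layer placed between a linear (or convolutional) layer and its activation; a sketch for a fully connected block with made-up sizes:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalise the 64 activations across the mini-batch
    nn.ReLU(),
)
x = torch.randn(32, 20)  # mini-batch of 32 samples with 20 features each
print(block(x).shape)    # torch.Size([32, 64])
```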
Key Architectures
Different problems call for different architectures. Here are the most important neural network families:
| Architecture | Type | Key Innovation | Used For |
|---|---|---|---|
| MLP | Feedforward | Universal approximator via depth | Tabular data, classification, regression |
| CNN | Convolutional | Local weight sharing + spatial hierarchy | Image classification, object detection, segmentation |
| RNN | Recurrent | Hidden state for sequence memory | Time series, basic NLP (largely superseded) |
| LSTM | Recurrent | Gating mechanisms to control memory | Long sequences, speech, legacy NLP |
| Transformer | Attention-based | Self-attention over full context in parallel | LLMs, vision transformers, multimodal AI |
| GAN | Generative | Adversarial generator vs. discriminator | Image generation, data augmentation |
| Diffusion | Generative | Iterative denoising from noise to signal | Image/video/audio generation |
| Autoencoder | Self-supervised | Bottleneck forces compressed representation | Dimensionality reduction, anomaly detection |
Training Process
Training a neural network is an iterative optimisation loop. Each pass through the data refines the model's weights to minimise the loss.
The Training Loop in Detail
- Sample a mini-batch of data from the training set.
- Forward pass: pass inputs through the network layer by layer to produce predictions.
- Compute the loss by comparing predictions to ground-truth labels.
- Backward pass (backprop): use the chain rule to compute the gradient of the loss with respect to every parameter.
- Optimiser step: update weights using the gradients (e.g., w ← w − lr × ∂loss/∂w); a minimal sketch of this update follows the list.
- Repeat for all batches across multiple epochs.
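A minimal sketch of this loop with a single hand-updated weight and toy values (the MNIST example at the end of this section uses an optimiser object instead):

```python
import torch

# Toy problem: fit y = 3x with one weight, applying the update rule by hand
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1
x, y = torch.tensor(2.0), torch.tensor(6.0)

for step in range(20):
    loss = (w * x - y) ** 2  # forward pass + loss
    loss.backward()          # backward pass fills in w.grad
    with torch.no_grad():
        w -= lr * w.grad     # update rule: w ← w − lr × ∂loss/∂w
    w.grad.zero_()           # clear the gradient for the next iteration

print(w.item())              # ≈ 3.0
```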
Learning Rate: The most important hyperparameter. Too large → training diverges. Too small → training is slow and may get stuck. Common heuristic: start around 1e-3 with Adam, then schedule downward.
Epoch: One full pass through the entire training dataset. Most models train for tens to thousands of epochs. Early stopping halts training when validation loss stops improving.
Optimization Techniques
Getting deep networks to train well requires more than just gradient descent. Here are the key techniques every practitioner uses:
🚀 SGD
Stochastic Gradient Descent. The classic optimiser. Updates weights using one mini-batch at a time. Can be combined with momentum to accelerate convergence.
⚡ Adam
Adaptive Moment Estimation. Tracks per-parameter first and second moment estimates to adapt the learning rate. Most popular default optimiser for deep learning.
🔧 AdamW
Adam with decoupled weight decay regularisation. Fixes a subtle flaw in Adam's L2 penalty. Preferred for training transformers and LLMs.
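Side by side, the three optimisers above differ only in how they are constructed; the hyperparameter values here are common defaults, not prescriptions:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # any model's parameters are handled the same way

sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)           # SGD with momentum
adam = optim.Adam(model.parameters(), lr=1e-3)                       # adaptive per-parameter steps
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # decoupled weight decay
```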
📅 LR Scheduling
Learning rate isn't fixed during training. Common schedules: cosine annealing, linear warmup + decay, one-cycle policy. Critical for achieving best results.
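Schedulers wrap an optimiser and adjust its learning rate as training progresses; a sketch of cosine annealing stepped once per epoch (T_max chosen arbitrarily here):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)  # anneal over 50 epochs

for epoch in range(50):
    # ... train one epoch here ...
    scheduler.step()  # learning rate follows a cosine curve from 1e-3 down towards 0
```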
💧 Dropout
Randomly zeroes a fraction of neuron activations during training. Forces the network to learn redundant representations, acting as a powerful regulariser against overfitting.
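Dropout behaves differently in training and evaluation mode, which is why the MNIST example below calls model.train() and model.eval(); a quick sketch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity at inference time: all ones pass through unchanged
```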
🛑 Early Stopping
Monitor validation loss during training and stop when it stops improving. Prevents overfitting and saves compute. Usually combined with saving the best checkpoint.
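A minimal sketch of patience-based early stopping; train_one_epoch, validate, and save_checkpoint are placeholders for your own functions, and the patience of 5 is arbitrary:

```python
best_val, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(1000):
    train_one_epoch()       # placeholder: one pass over the training data
    val_loss = validate()   # placeholder: returns current validation loss

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        save_checkpoint()   # keep the best model seen so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for `patience` epochs in a row
            print(f"Early stopping at epoch {epoch}")
            break
```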
✂️ Gradient Clipping
Cap gradient norms to a maximum value (e.g., 1.0) before the optimiser step. Prevents exploding gradients — especially important for RNNs and deep transformers.
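Clipping is a single extra call between the backward pass and the optimiser step; a self-contained sketch with a toy model and random data:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.MSELoss()(model(x), y)
loss.backward()                                          # compute gradients as usual
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # rescale if the total norm exceeds 1.0
optimizer.step()                                         # then apply the update
```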
Code Example — PyTorch Neural Network on MNIST
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# ── 1. Hyperparameters ─────────────────────────────────────────
BATCH_SIZE = 64
EPOCHS = 5
LR = 1e-3
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# ── 2. Data loading ─────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean/std
])
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    batch_size=BATCH_SIZE, shuffle=True
)
test_loader = DataLoader(
    datasets.MNIST("data", train=False, download=True, transform=transform),
    batch_size=BATCH_SIZE
)
# ── 3. Model definition ─────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                     # 28×28 → 784
            nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 10)                                # 10 digit classes
        )

    def forward(self, x):
        return self.net(x)
model = MNISTNet().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
# ── 4. Training loop ────────────────────────────────────────────
def train_epoch(epoch):
    model.train()
    total_loss = 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()    # backpropagation
        optimizer.step()   # weight update
        total_loss += loss.item()
    print(f"Epoch {epoch} loss: {total_loss/len(train_loader):.4f}")
# ── 5. Evaluation ───────────────────────────────────────────────
def evaluate():
    model.eval()
    correct = 0
    with torch.no_grad():
        for imgs, labels in test_loader:
            imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
            preds = model(imgs).argmax(dim=1)
            correct += (preds == labels).sum().item()
    acc = correct / len(test_loader.dataset)
    print(f"Test accuracy: {acc:.2%}")

for ep in range(1, EPOCHS + 1):
    train_epoch(ep)
    evaluate()
# → Epoch 5 loss: 0.0721
# → Test accuracy: 97.83%