What is Deep Learning?
Deep Learning (DL) is a subset of machine learning that uses neural networks with many layers — typically 3 or more — to model complex patterns in data. The word "deep" refers to the depth of the network: the number of successive layers through which data is transformed.
The critical difference from classical ML: deep learning learns its own features from raw data, rather than relying on domain experts to hand-engineer them. A CNN shown millions of images will discover edges, textures, shapes, and object parts on its own — without anyone telling it what to look for.
Deep Learning vs Classical ML
| Aspect | Classical ML | Deep Learning |
|---|---|---|
| Features | Hand-engineered by experts | Learned automatically from data |
| Data needed | Works well with small datasets | Needs large datasets (tens of thousands to billions of examples) |
| Interpretability | Often interpretable | Black-box (explainability is active research) |
| Hardware | CPU is usually sufficient | Requires GPU / TPU for large models |
| Best domains | Tabular, structured data | Images, text, audio, video, sequences |
Representation Learning Hierarchy
Each layer in a deep network learns a progressively more abstract representation. For image data, a typical hierarchy looks like: raw pixels → edges → textures → object parts → whole objects.
This hierarchy of representations is automatically learned through gradient descent — no hand-coding required.
Neural Network Basics
Every deep learning system is built from the same fundamental components. Understanding these building blocks is essential before exploring architectures.
🧠 Neuron
Computes a weighted sum of its inputs, adds a bias, then applies an activation function: output = f(w·x + b). Inspired loosely by biological neurons.
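A single neuron is small enough to write out directly; here is a minimal sketch in plain Python with made-up weights and inputs, using a sigmoid as the activation f:

```python
import math

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # w·x + b
    return 1 / (1 + math.exp(-z))                 # f(z) = sigmoid(z)

# Illustrative values only: three inputs, three weights, one bias
print(neuron(x=[0.5, -1.2, 3.0], w=[0.8, 0.1, -0.4], b=0.05))  # ≈ 0.30
```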
📐 Layers
Neurons are organised into layers: Input layer receives data, hidden layers transform representations, and the output layer produces predictions.
⚡ Activation Functions
Introduce non-linearity so networks can learn complex functions. Common choices: ReLU (max(0,x)), Sigmoid (0–1 output), Tanh (−1 to 1), GELU (used in transformers).
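A quick sketch of these four activations in PyTorch (the same library used in the MNIST example below), applied to a handful of sample values:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)  # sample inputs from -3 to 3
print(torch.relu(x))          # max(0, x): negatives become zero
print(torch.sigmoid(x))       # squashed into (0, 1)
print(torch.tanh(x))          # squashed into (-1, 1)
print(F.gelu(x))              # smooth ReLU variant common in transformers
```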
📉 Loss Function
Measures how wrong the model's predictions are. Cross-entropy for classification, MSE for regression. The training objective is to minimise this value.
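A minimal sketch of both losses with made-up predictions and targets:

```python
import torch
import torch.nn as nn

# Classification: cross-entropy takes raw logits and integer class labels
logits = torch.tensor([[2.0, 0.5, -1.0]])    # one sample, three classes
label = torch.tensor([0])                    # the true class index
print(nn.CrossEntropyLoss()(logits, label))  # low loss: class 0 already has the largest logit

# Regression: mean squared error between predictions and targets
pred = torch.tensor([2.5, 0.0])
target = torch.tensor([3.0, -0.5])
print(nn.MSELoss()(pred, target))            # mean of squared differences = 0.25
```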
🔁 Backpropagation
Computes gradients of the loss with respect to every parameter using the chain rule of calculus. These gradients tell us which direction to adjust each weight to reduce the loss.
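PyTorch's autograd applies the chain rule for you; a tiny sketch with one trainable weight and a squared error:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # a single trainable parameter
x, y = torch.tensor(3.0), torch.tensor(10.0)

loss = (w * x - y) ** 2  # (2*3 - 10)^2 = 16
loss.backward()          # chain rule: dloss/dw = 2*(w*x - y)*x = -24
print(w.grad)            # tensor(-24.)
```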
📊 Batch Normalization
Normalises layer activations across a mini-batch. Stabilises training, allows higher learning rates, and acts as a mild regulariser. Introduced in 2015, now ubiquitous.
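In PyTorch, batch norm is simply a layer placed between a linear (or convolutional) layer and its activation; a sketch for a fully connected block with made-up sizes:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalise the 64 activations across the mini-batch
    nn.ReLU(),
)
x = torch.randn(32, 20)  # mini-batch of 32 samples with 20 features each
print(block(x).shape)    # torch.Size([32, 64])
```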
Key Architectures
Different problems call for different architectures. Here are the most important neural network families:
| Architecture | Type | Key Innovation | Used For |
|---|---|---|---|
| MLP | Feedforward | Universal approximator via depth | Tabular data, classification, regression |
| CNN | Convolutional | Local weight sharing + spatial hierarchy | Image classification, object detection, segmentation |
| RNN | Recurrent | Hidden state for sequence memory | Time series, basic NLP (largely superseded) |
| LSTM | Recurrent | Gating mechanisms to control memory | Long sequences, speech, legacy NLP |
| Transformer | Attention-based | Self-attention over full context in parallel | LLMs, vision transformers, multimodal AI |
| GAN | Generative | Adversarial generator vs. discriminator | Image generation, data augmentation |
| Diffusion | Generative | Iterative denoising from noise to signal | Image/video/audio generation |
| Autoencoder | Self-supervised | Bottleneck forces compressed representation | Dimensionality reduction, anomaly detection |
Training Process
Training a neural network is an iterative optimisation loop. Each pass through the data refines the model's weights to minimise the loss.
The Training Loop in Detail
- Sample a mini-batch of data from the training set.
- Forward pass: pass inputs through the network layer by layer to produce predictions.
- Compute the loss by comparing predictions to ground-truth labels.
- Backward pass (backprop): use the chain rule to compute the gradient of the loss with respect to every parameter.
- Optimiser step: update weights using the gradients (e.g., w ← w − lr × ∂loss/∂w); a minimal sketch of this update follows the list.
- Repeat for all batches across multiple epochs.
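A minimal sketch of this loop with a single hand-updated weight and toy values (the MNIST example at the end of this section uses an optimiser object instead):

```python
import torch

# Toy problem: fit y = 3x with one weight, applying the update rule by hand
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1
x, y = torch.tensor(2.0), torch.tensor(6.0)

for step in range(20):
    loss = (w * x - y) ** 2  # forward pass + loss
    loss.backward()          # backward pass fills in w.grad
    with torch.no_grad():
        w -= lr * w.grad     # update rule: w ← w − lr × ∂loss/∂w
    w.grad.zero_()           # clear the gradient for the next iteration

print(w.item())              # ≈ 3.0
```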
Learning Rate: The most important hyperparameter. Too large → training diverges. Too small → training is slow and may get stuck. Common heuristic: start around 1e-3 with Adam, then schedule downward.
Epoch: One full pass through the entire training dataset. Most models train for tens to thousands of epochs. Early stopping halts training when validation loss stops improving.
Optimization Techniques
Getting deep networks to train well requires more than just gradient descent. Here are the key techniques every practitioner uses:
🚀 SGD
Stochastic Gradient Descent. The classic optimiser. Updates weights using one mini-batch at a time. Can be combined with momentum to accelerate convergence.
⚡ Adam
Adaptive Moment Estimation. Tracks per-parameter first and second moment estimates to adapt the learning rate. Most popular default optimiser for deep learning.
🔧 AdamW
Adam with decoupled weight decay regularisation. Fixes a subtle flaw in Adam's L2 penalty. Preferred for training transformers and LLMs.
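Side by side, the three optimisers above differ only in how they are constructed; the hyperparameter values here are common defaults, not prescriptions:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # any model's parameters are handled the same way

sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)           # SGD with momentum
adam = optim.Adam(model.parameters(), lr=1e-3)                       # adaptive per-parameter steps
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # decoupled weight decay
```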
📅 LR Scheduling
Learning rate isn't fixed during training. Common schedules: cosine annealing, linear warmup + decay, one-cycle policy. Critical for achieving best results.
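Schedulers wrap an optimiser and adjust its learning rate as training progresses; a sketch of cosine annealing stepped once per epoch (T_max chosen arbitrarily here):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)  # anneal over 50 epochs

for epoch in range(50):
    # ... train one epoch here ...
    scheduler.step()  # learning rate follows a cosine curve from 1e-3 down towards 0
```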
💧 Dropout
Randomly zeroes a fraction of neuron activations during training. Forces the network to learn redundant representations, acting as a powerful regulariser against overfitting.
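Dropout behaves differently in training and evaluation mode, which is why the MNIST example below calls model.train() and model.eval(); a quick sketch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity at inference time: all ones pass through unchanged
```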
🛑 Early Stopping
Monitor validation loss during training and stop when it stops improving. Prevents overfitting and saves compute. Usually combined with saving the best checkpoint.
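A minimal sketch of patience-based early stopping; train_one_epoch, validate, and save_checkpoint are placeholders for your own functions, and the patience of 5 is arbitrary:

```python
best_val, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(1000):
    train_one_epoch()       # placeholder: one pass over the training data
    val_loss = validate()   # placeholder: returns current validation loss

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        save_checkpoint()   # keep the best model seen so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for `patience` epochs in a row
            print(f"Early stopping at epoch {epoch}")
            break
```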
✂️ Gradient Clipping
Cap gradient norms to a maximum value (e.g., 1.0) before the optimiser step. Prevents exploding gradients — especially important for RNNs and deep transformers.
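Clipping is a single extra call between the backward pass and the optimiser step; a self-contained sketch with a toy model and random data:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.MSELoss()(model(x), y)
loss.backward()                                          # compute gradients as usual
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # rescale if the total norm exceeds 1.0
optimizer.step()                                         # then apply the update
```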
Code Example — PyTorch Neural Network on MNIST
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# ── 1. Hyperparameters ─────────────────────────────────────────
BATCH_SIZE = 64
EPOCHS = 5
LR = 1e-3
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# ── 2. Data loading ─────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean/std
])
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    batch_size=BATCH_SIZE, shuffle=True
)
test_loader = DataLoader(
    datasets.MNIST("data", train=False, download=True, transform=transform),
    batch_size=BATCH_SIZE
)
# ── 3. Model definition ─────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                     # 28×28 → 784
            nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 10)                                # 10 digit classes
        )

    def forward(self, x):
        return self.net(x)
model = MNISTNet().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
# ── 4. Training loop ────────────────────────────────────────────
def train_epoch(epoch):
    model.train()
    total_loss = 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()    # backpropagation
        optimizer.step()   # weight update
        total_loss += loss.item()
    print(f"Epoch {epoch} loss: {total_loss/len(train_loader):.4f}")
# ── 5. Evaluation ───────────────────────────────────────────────
def evaluate():
    model.eval()
    correct = 0
    with torch.no_grad():
        for imgs, labels in test_loader:
            imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
            preds = model(imgs).argmax(dim=1)
            correct += (preds == labels).sum().item()
    acc = correct / len(test_loader.dataset)
    print(f"Test accuracy: {acc:.2%}")

for ep in range(1, EPOCHS + 1):
    train_epoch(ep)
    evaluate()
# → Epoch 5 loss: 0.0721
# → Test accuracy: 97.83%