Foundations

Deep Learning

Deep Learning uses multi-layered neural networks to learn hierarchical representations directly from raw data — eliminating the need for hand-crafted features and enabling breakthroughs in vision, language, audio, and beyond.

2012: AlexNet moment — CNN wins ImageNet
100+ layers in modern networks
100× faster GPU training vs CPU

What is Deep Learning?

Deep Learning (DL) is a subset of machine learning that uses neural networks with many layers — typically 3 or more — to model complex patterns in data. The word "deep" refers to the depth of the network: the number of successive layers through which data is transformed.

The critical difference from classical ML: deep learning learns its own features from raw data, rather than relying on domain experts to hand-engineer them. A CNN shown millions of images will discover edges, textures, shapes, and object parts on its own — without anyone telling it what to look for.

Deep Learning vs Classical ML

| Aspect           | Classical ML               | Deep Learning                                 |
| ---------------- | -------------------------- | --------------------------------------------- |
| Features         | Hand-engineered by experts | Learned automatically from data               |
| Data needed      | Works with small datasets  | Needs large datasets (10k–billions)           |
| Interpretability | Often interpretable        | Black-box (explainability is active research) |
| Hardware         | CPU is usually sufficient  | Requires GPU / TPU for large models           |
| Best domains     | Tabular, structured data   | Images, text, audio, video, sequences         |

Representation Learning Hierarchy

Each layer in a deep network learns a progressively more abstract representation. For image data, a typical hierarchy looks like:

Pixels
Edges & Corners
Textures & Shapes
Object Parts
Full Objects

This hierarchy of representations is automatically learned through gradient descent — no hand-coding required.

Neural Network Basics

Every deep learning system is built from the same fundamental components. Understanding these building blocks is essential before exploring architectures.

🧠 Neuron

Computes a weighted sum of its inputs, adds a bias, then applies an activation function: output = f(w·x + b). Inspired loosely by biological neurons.
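As a minimal sketch of that formula, here is a single neuron in plain NumPy; the input, weight, and bias values are arbitrary illustrative numbers:

```python
import numpy as np

def relu(z):
    # Activation function: pass positives through, zero out negatives
    return np.maximum(0.0, z)

x = np.array([1.0, 2.0, 3.0])    # inputs
w = np.array([0.5, -0.2, 0.1])   # learned weights
b = 0.4                          # learned bias

z = w @ x + b                    # weighted sum + bias = 0.4 + 0.4 = 0.8
out = relu(z)                    # apply activation
print(out)                       # → 0.8
```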

📐 Layers

Neurons are organised into layers: the input layer receives data, hidden layers transform representations, and the output layer produces predictions.

⚡ Activation Functions

Introduce non-linearity so networks can learn complex functions. Common choices: ReLU (max(0,x)), Sigmoid (0–1 output), Tanh (−1 to 1), GELU (used in transformers).
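The four activations listed can be compared directly on the same inputs (the sample values below are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])

print(torch.relu(x))     # tensor([0., 0., 2.])      — zeroes negatives
print(torch.sigmoid(x))  # ≈ [0.1192, 0.5, 0.8808]   — squashes into (0, 1)
print(torch.tanh(x))     # ≈ [-0.964, 0.0, 0.964]    — squashes into (−1, 1)
print(F.gelu(x))         # smooth ReLU-like curve used in transformers
```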

📉 Loss Function

Measures how wrong the model's predictions are. Cross-entropy for classification, MSE for regression. The training objective is to minimise this value.
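Both losses in one short sketch, using made-up predictions and targets:

```python
import torch
import torch.nn as nn

# Cross-entropy for a 3-class problem: raw logits vs. true class index
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])                      # true class is 0
ce = nn.CrossEntropyLoss()(logits, target)
print(ce.item())                                # ≈ 0.2413 (−log softmax prob of class 0)

# MSE for regression: mean of squared errors
pred = torch.tensor([2.5, 0.0])
true = torch.tensor([3.0, -0.5])
mse = nn.MSELoss()(pred, true)
print(mse.item())                               # → 0.25  (mean of 0.5² and 0.5²)
```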

🔁 Backpropagation

Computes gradients of the loss with respect to every parameter using the chain rule of calculus. These gradients tell us which direction to adjust each weight to reduce the loss.
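A toy example of the chain rule at work, using PyTorch's autograd (the scalar values are arbitrary):

```python
import torch

# loss = (w·x + b)²  — autograd applies the chain rule automatically
x = torch.tensor(2.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

loss = (w * x + b) ** 2          # (3·2 + 1)² = 49
loss.backward()                  # backpropagation

# Chain rule: dloss/dw = 2(wx+b)·x = 28, dloss/db = 2(wx+b) = 14
print(w.grad, b.grad)            # → tensor(28.) tensor(14.)
```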

📊 Batch Normalization

Normalises layer activations across a mini-batch. Stabilises training, allows higher learning rates, and acts as a mild regulariser. Introduced in 2015, now ubiquitous.
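The normalising effect is easy to verify numerically; here a mini-batch with exaggerated mean and scale is pushed through a BatchNorm layer:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)    # one scale/shift pair per feature
x = torch.randn(8, 4) * 10 + 5         # mini-batch: 8 samples, shifted and scaled

bn.train()                             # training mode: use batch statistics
y = bn(x)

# Per feature, activations now have ~zero mean and ~unit variance
print(y.mean(dim=0))                   # ≈ 0
print(y.var(dim=0, unbiased=False))    # ≈ 1
```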

Key Architectures

Different problems call for different architectures. Here are the most important neural network families:

ArchitectureTypeKey InnovationUsed For
MLPFeedforwardUniversal approximator via depthTabular data, classification, regression
CNNConvolutionalLocal weight sharing + spatial hierarchyImage classification, object detection, segmentation
RNNRecurrentHidden state for sequence memoryTime series, basic NLP (largely superseded)
LSTMRecurrentGating mechanisms to control memoryLong sequences, speech, legacy NLP
TransformerAttention-basedSelf-attention over full context in parallelLLMs, vision transformers, multimodal AI
GANGenerativeAdversarial generator vs. discriminatorImage generation, data augmentation
DiffusionGenerativeIterative denoising from noise to signalImage/video/audio generation
AutoencoderSelf-supervisedBottleneck forces compressed representationDimensionality reduction, anomaly detection
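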

Training Process

Training a neural network is an iterative optimisation loop. Each pass through the data refines the model's weights to minimise the loss.

Data (batch)
Forward Pass
Loss Calculation
Backpropagation
Weight Update

The Training Loop in Detail

  1. Sample a mini-batch of data from the training set.
  2. Forward pass: pass inputs through the network layer by layer to produce predictions.
  3. Compute the loss by comparing predictions to ground-truth labels.
  4. Backward pass (backprop): use the chain rule to compute the gradient of the loss with respect to every parameter.
  5. Optimiser step: update weights using the gradients (e.g., w ← w − lr × ∇w).
  6. Repeat for all batches across multiple epochs.
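The six steps above condense into a minimal runnable loop. This sketch fits a toy linear target y = 2x; the model, data, and step count are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)                          # a single linear neuron
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.tensor([[1.0], [2.0], [3.0]])          # step 1: the "batch"
y = 2 * x                                        # ground truth: y = 2x

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)                  # steps 2–3: forward pass + loss
    loss.backward()                              # step 4: backpropagation
    opt.step()                                   # step 5: w ← w − lr × ∇w

print(model.weight.item())                       # converges toward 2.0
```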

Learning Rate: The most important hyperparameter. Too large → training diverges. Too small → training is slow and may get stuck. Common heuristic: start around 1e-3 with Adam, then schedule downward.

Epoch: One full pass through the entire training dataset. Most models train for tens to thousands of epochs. Early stopping halts training when validation loss stops improving.

Optimization Techniques

Getting deep networks to train well requires more than just gradient descent. Here are the key techniques every practitioner uses:

🚀 SGD

Stochastic Gradient Descent. The classic optimiser. Updates weights using one mini-batch at a time. Can be combined with momentum to accelerate convergence.

⚡ Adam

Adaptive Moment Estimation. Tracks per-parameter first and second moment estimates to adapt the learning rate. Most popular default optimiser for deep learning.

🔧 AdamW

Adam with decoupled weight decay regularisation. Fixes a subtle flaw in Adam's L2 penalty. Preferred for training transformers and LLMs.

📅 LR Scheduling

Learning rate isn't fixed during training. Common schedules: cosine annealing, linear warmup + decay, one-cycle policy. Critical for achieving best results.
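Cosine annealing, for example, takes a few lines in PyTorch; the model, base LR, and horizon below are placeholder values:

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Decay the LR from 1e-3 toward 0 along a cosine curve over 100 steps
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for step in range(100):
    # ... forward / backward would go here ...
    opt.step()        # optimizer steps first,
    sched.step()      # then the scheduler updates the LR

print(opt.param_groups[0]["lr"])   # ≈ 0 after the full schedule
```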

💧 Dropout

Randomly zeroes a fraction of neuron activations during training. Forces the network to learn redundant representations, acting as a powerful regulariser against overfitting.
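Two details worth seeing directly: surviving activations are rescaled by 1/(1−p) during training, and dropout becomes a no-op at inference:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()                 # training mode: randomly zero activations
y_train = drop(x)
print(y_train)               # each entry is 0.0 or 2.0 (survivors scaled by 1/(1−p))

drop.eval()                  # inference mode: dropout is disabled
y_eval = drop(x)
print(y_eval)                # identical to the input
```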

🛑 Early Stopping

Monitor validation loss during training and stop when it stops improving. Prevents overfitting and saves compute. Usually combined with saving the best checkpoint.
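The patience-based policy can be sketched with simulated validation losses (the numbers below are illustrative, not from a real run):

```python
# Simulated per-epoch validation losses: improve, then plateau and worsen
val_losses = [0.9, 0.7, 0.55, 0.5, 0.51, 0.52, 0.53, 0.49, 0.6]

best, patience, bad = float("inf"), 3, 0
for epoch, vl in enumerate(val_losses):
    if vl < best:
        best, bad = vl, 0       # new best: reset counter, save checkpoint here
    else:
        bad += 1                # no improvement this epoch
        if bad >= patience:
            print(f"early stop at epoch {epoch}, best val loss {best}")
            break
```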

✂️ Gradient Clipping

Cap gradient norms to a maximum value (e.g., 1.0) before the optimiser step. Prevents exploding gradients — especially important for RNNs and deep transformers.
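A quick demonstration with `torch.nn.utils.clip_grad_norm_`; the exaggerated inputs are just a way to provoke large gradients:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x = torch.randn(4, 10) * 100            # exaggerated inputs → large gradients
loss = model(x).pow(2).mean()
loss.backward()

# Clip the global gradient norm to 1.0; returns the norm before clipping
before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"norm before clipping: {before.item():.1f}")

after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(f"norm after clipping:  {after.item():.4f}")   # at most 1.0
```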

Code Example — PyTorch Neural Network on MNIST

Python (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ── 1. Hyperparameters ─────────────────────────────────────────
BATCH_SIZE = 64
EPOCHS     = 5
LR         = 1e-3
DEVICE     = "cuda" if torch.cuda.is_available() else "cpu"

# ── 2. Data loading ─────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))   # MNIST mean/std
])

train_loader = DataLoader(
    datasets.MNIST("data", train=True,  download=True, transform=transform),
    batch_size=BATCH_SIZE, shuffle=True
)
test_loader = DataLoader(
    datasets.MNIST("data", train=False, download=True, transform=transform),
    batch_size=BATCH_SIZE
)

# ── 3. Model definition ─────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # 28×28 → 784
            nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 10)                  # 10 digit classes
        )

    def forward(self, x):
        return self.net(x)

model     = MNISTNet().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

# ── 4. Training loop ────────────────────────────────────────────
def train_epoch(epoch):
    model.train()
    total_loss = 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()                         # backpropagation
        optimizer.step()                         # weight update
        total_loss += loss.item()
    print(f"Epoch {epoch}  loss: {total_loss/len(train_loader):.4f}")

# ── 5. Evaluation ───────────────────────────────────────────────
def evaluate():
    model.eval()
    correct = 0
    with torch.no_grad():
        for imgs, labels in test_loader:
            imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
            preds = model(imgs).argmax(dim=1)
            correct += (preds == labels).sum().item()
    acc = correct / len(test_loader.dataset)
    print(f"Test accuracy: {acc:.2%}")

for ep in range(1, EPOCHS + 1):
    train_epoch(ep)
evaluate()
# → Epoch 5  loss: 0.0721
# → Test accuracy: 97.83%