Computer Vision

Computer Vision teaches machines to interpret and understand the visual world — from recognising objects in a photograph to tracking movement in real time. It is one of AI's most impactful disciplines, powering autonomous vehicles, medical imaging, and augmented reality.

1966: First CV project at MIT
60 fps: YOLOv9 real-time detection
SAM: Segment Anything Model

What is Computer Vision?

Computer Vision (CV) is the field of artificial intelligence that enables computers to extract meaningful information from images, videos, and other visual inputs. Where humans instantly perceive depth, objects, and motion, machines must learn these representations from raw pixel data.

At its core, a digital image is a 2-D grid of pixels. Each pixel stores intensity values — one number for greyscale, or three channels (R, G, B) for colour. A colour image is therefore represented as a 3-D tensor of shape [height × width × channels]. Deep learning frameworks such as PyTorch rearrange this into [batch × channels × height × width] (NCHW layout) for efficient GPU processing.

Raw Pixels → Tensor [C×H×W] → Neural Network → Predictions
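
To make the layout concrete, here is a minimal sketch that loads an image and rearranges it into the NCHW tensor a network expects (photo.jpg is a placeholder path):

Python
import numpy as np
import torch
from PIL import Image

# Load an image as an H×W×C uint8 array (HWC layout, values 0–255)
img = np.array(Image.open("photo.jpg").convert("RGB"))
print(img.shape)                      # e.g. (480, 640, 3)

# Convert to float, rearrange to C×H×W, and add a batch dimension → NCHW
tensor = torch.from_numpy(img).float() / 255.0
tensor = tensor.permute(2, 0, 1).unsqueeze(0)
print(tensor.shape)                   # torch.Size([1, 3, 480, 640])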

The evolution of image understanding

The field has gone through three distinct eras:

Hand-Crafted Features (pre-2012)

Engineers manually designed feature detectors such as HOG (Histogram of Oriented Gradients) and SIFT (Scale-Invariant Feature Transform), then fed these into SVMs or shallow classifiers. Performance on complex scenes was limited.
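
For flavour, a sketch of computing a HOG descriptor with OpenCV (person.jpg is a placeholder path; 64×128 is OpenCV's default detection window):

Python
import cv2

# Resize to the default HOG detection window (64 wide × 128 tall)
img = cv2.imread("person.jpg", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (64, 128))

hog = cv2.HOGDescriptor()             # defaults: 8×8 cells, 9 orientation bins
features = hog.compute(img)
print(features.shape)                 # 3780-dim vector with default parameters

# In the classical pipeline, descriptors like this were fed to an SVM classifier.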

Convolutional Neural Networks (2012–2020)

AlexNet's 2012 ImageNet win triggered a paradigm shift. CNNs learn feature hierarchies directly from data — edges → textures → parts → objects — via stacked convolution, pooling, and activation layers. Top-5 error on ImageNet fell by roughly ten percentage points in a single year.

Vision Transformers (2020–present)

The ViT (Vision Transformer) split images into patches, applied self-attention, and matched or surpassed CNNs at scale. Models like SAM and CLIP pushed CV into the foundation-model era with zero-shot and promptable capabilities.
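
A minimal sketch of the patch-embedding step, assuming a 224×224 input, 16×16 patches, and the 768-dim token size of ViT-Base:

Python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)     # dummy batch of one image
P = 16                                 # patch size

# Cut the image into non-overlapping 16×16 patches: 14×14 = 196 patches
patches = img.unfold(2, P, P).unfold(3, P, P)        # [1, 3, 14, 14, 16, 16]
patches = patches.permute(0, 2, 3, 1, 4, 5)          # group the patch grid first
patches = patches.flatten(1, 2).flatten(2)           # [1, 196, 768]: one row per patch

# A linear projection turns each flattened patch into a token for self-attention
to_token = nn.Linear(3 * P * P, 768)
tokens = to_token(patches)                           # [1, 196, 768] → Transformer input
print(tokens.shape)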

Key insight: CNNs exploit spatial locality — nearby pixels are related. Transformers exploit global context — any patch can attend to any other. Hybrid architectures (e.g., ConvNeXt, Swin Transformer) blend both approaches.

Key CV Tasks

Computer Vision encompasses a wide family of tasks, each defined by what information must be extracted from the image:

🖼️ Image Classification

Assign a single label to an entire image. "What is in this image?" — the foundational task behind ImageNet and CNNs.

📦 Object Detection

Locate and classify multiple objects in one pass, returning a bounding box and class label for each. "Where are the objects and what are they?"

🎨 Semantic Segmentation

Assign a class label to every pixel in the image. All pixels belonging to "road" share the same label, regardless of which road instance they come from.

🔵 Instance Segmentation

Like semantic segmentation, but distinguishes individual object instances — each separate person gets their own mask. Combines detection and segmentation.

🦴 Pose Estimation

Detect skeleton keypoints (shoulders, elbows, knees, etc.) to understand body pose. Used in sports analytics, physiotherapy, and motion capture.

🔤 OCR

Optical Character Recognition — detect and transcribe text within images. Powers document scanning, licence plate readers, and receipt parsing.

📐 Depth Estimation

Predict the distance of each pixel from the camera using a single RGB image (monocular depth) or stereo pair. Essential for robotics and autonomous driving.

👤 Face Recognition

Verify or identify individuals by comparing facial embeddings in a high-dimensional feature space. Uses metric learning (ArcFace, FaceNet) to maximise inter-class distance while minimising intra-class variation.
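
To make the embedding comparison concrete, a toy sketch; the 512-dim random vectors stand in for real model outputs, and the 0.5 threshold is purely illustrative:

Python
import torch
import torch.nn.functional as F

# Stand-ins for face embeddings produced by a model such as ArcFace or FaceNet
emb_a = F.normalize(torch.randn(512), dim=0)   # L2-normalised onto the unit sphere
emb_b = F.normalize(torch.randn(512), dim=0)

similarity = torch.dot(emb_a, emb_b).item()    # cosine similarity of unit vectors
same_person = similarity > 0.5                 # threshold tuned per deployment
print(f"similarity={similarity:.3f}, same person: {same_person}")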

Architectures

Each breakthrough architecture introduced a key idea that unlocked the next wave of progress:

| Architecture | Year | Key Innovation | Primary Use Case |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN — convolution + pooling layers, trained end-to-end with backprop | Handwritten digit recognition (MNIST) |
| AlexNet | 2012 | Deep CNN trained on GPU; introduced ReLU, dropout, and data augmentation at scale | ImageNet large-scale classification |
| ResNet | 2015 | Residual (skip) connections allow gradients to flow through 100+ layers without vanishing | Very deep networks; backbone for most downstream tasks |
| YOLO | 2016 | Single-pass detection: one network predicts boxes and classes simultaneously | Real-time object detection |
| ViT | 2020 | Treats image patches as tokens; applies Transformer self-attention across the entire image | Image classification at scale |
| SAM | 2023 | Promptable segmentation model — accepts points, boxes, or text to produce precise masks | Universal, zero-shot segmentation |

Why skip connections matter: In a plain deep network, gradients diminish exponentially as they propagate backwards through many layers (vanishing gradient). ResNet's identity shortcuts let the gradient bypass any layer that is not contributing, enabling reliable training of 152-layer networks — a feat that was impossible before 2015.
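
A minimal sketch of the idea in PyTorch: because the block returns F(x) + x, the gradient of the loss reaches x through the identity term even when the convolutional path contributes nothing.

Python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a ResNet-style basic block: output = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut: "+ x" passes gradients unchanged

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)    # torch.Size([1, 64, 56, 56])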

Object Detection — Deep Dive

Object detection must solve two problems simultaneously: classification (what?) and localisation (where?). Several foundational concepts underpin all modern detectors.

Anchor Boxes

Most detectors pre-define a set of anchor boxes — rectangles with various aspect ratios and scales placed at regular grid positions across the feature map. The network learns to predict an offset from each anchor to the true bounding box, rather than regressing absolute coordinates. This dramatically simplifies learning because anchors already approximate the rough size and shape of common objects.
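
A sketch of how a detector might generate anchors for a single grid position; the scales and aspect ratios here are illustrative, not taken from any particular detector:

Python
import itertools

def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centred on one grid position."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5      # w/h = ratio, area = scale² for every anchor
        h = scale / ratio ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# 3 scales × 3 ratios = 9 anchors per cell; the network predicts offsets from each
print(len(make_anchors(cx=112, cy=112)))   # 9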

Intersection over Union (IoU)

IoU measures how well a predicted box overlaps with the ground-truth box:

IoU = Area of Overlap / Area of Union

A prediction is considered a true positive if IoU exceeds a threshold (commonly 0.5 for PASCAL VOC, or 0.5–0.95 averaged for COCO mAP). IoU is also used as a loss component in modern detectors (CIoU, DIoU) to jointly optimise box shape and position.
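
The formula translates directly into a few lines of Python; a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

Python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # 0 if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143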

Non-Maximum Suppression (NMS)

Detectors produce many overlapping candidate boxes for the same object. NMS iteratively selects the highest-confidence box and removes all other boxes with IoU above a threshold, keeping only the best prediction per object. Soft-NMS and DIoU-NMS are improved variants that handle crowded scenes better.
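
A sketch of the greedy NMS loop, reusing the iou function from the sketch above (the 0.5 threshold is a common default, not a universal constant):

Python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop overlapping rivals, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # highest-confidence remaining box
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes  = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))                    # [0, 2]: box 1 suppressed by box 0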

One-Stage vs. Two-Stage Detectors

| | Two-Stage (Faster R-CNN) | One-Stage (YOLO, SSD) |
|---|---|---|
| Stage 1 | Region Proposal Network generates candidate regions of interest (RoIs) | — (no explicit proposal step) |
| Stage 2 | RoI features classified and boxes refined independently | Single network predicts all boxes and classes in one pass |
| Accuracy | Higher, especially on small objects | Slightly lower historically; gap closed in YOLOv8/v9 |
| Speed | Slower (5–15 fps typical) | Faster (30–160+ fps depending on model size) |
| Best for | Offline analysis, medical imaging, satellite imagery | Real-time applications, edge devices, video streams |

YOLO's trick: The image is divided into an S×S grid. Each cell predicts B bounding boxes and C class probabilities. Because everything is computed in a single forward pass, YOLO achieves real-time throughput — YOLOv9 reaches 60 fps on a modern GPU at COCO accuracy that rivals much heavier two-stage models.
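
Plugging in the numbers from the original YOLOv1 paper (S=7, B=2, C=20) shows how compact the output really is:

Python
S, B, C = 7, 2, 20                    # YOLOv1 values: 7×7 grid, 2 boxes, 20 classes

# Each box carries (x, y, w, h, confidence); each cell also predicts C class scores
per_cell = B * 5 + C                  # 30 values per cell
print((S, S, per_cell))               # (7, 7, 30): 1470 numbers in one forward pass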

Segmentation

Segmentation is the task of partitioning an image into meaningful regions. There are two main flavours, plus the cutting-edge promptable approach introduced by SAM:

Semantic Segmentation

Assign one class label per pixel. All cars are "car", all pedestrians are "person" — no distinction between instances. FCN (Fully Convolutional Network) and DeepLab are classic architectures that use encoder–decoder designs with skip connections to preserve spatial resolution.
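
As a concrete example, torchvision ships a pretrained DeepLabV3; a minimal inference sketch, where street.jpg is a placeholder path:

Python
import torch
from PIL import Image
from torchvision.models.segmentation import (
    deeplabv3_resnet50, DeepLabV3_ResNet50_Weights,
)

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()      # resizing + normalisation matching the weights

img = Image.open("street.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"]          # [1, 21, H, W]: logits for 21 VOC classes
pred = out.argmax(dim=1)               # [1, H, W]: one class label per pixel
print(pred.unique())                   # class indices present in the scene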

Instance Segmentation

Assign per-pixel masks to individual object instances. Mask R-CNN extends Faster R-CNN by adding a mask head that predicts a binary mask for each detected RoI, enabling precise contours around every separate object even when they overlap.
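
torchvision also provides a pretrained Mask R-CNN; a minimal sketch of instance-mask inference, with scene.jpg as a placeholder path and 0.8 as an arbitrary confidence cut-off:

Python
import torch
from PIL import Image
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms.functional import to_tensor

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

img = to_tensor(Image.open("scene.jpg").convert("RGB"))   # [3, H, W] in [0, 1]

with torch.no_grad():
    pred = model([img])[0]             # one dict per input image

keep = pred["scores"] > 0.8            # filter low-confidence detections
masks = pred["masks"][keep]            # [N, 1, H, W] soft masks, one per instance
boxes = pred["boxes"][keep]            # matching [N, 4] bounding boxes
print(f"{keep.sum().item()} instances found")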

Promptable Segmentation (SAM)

Meta's Segment Anything Model accepts a point, bounding box, or free-form prompt and returns a high-quality mask in milliseconds — no task-specific training required. The image encoder (ViT-H) is run once; the lightweight mask decoder is queried interactively, enabling real-time annotation tools.
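
A sketch of interactive prompting with the official segment-anything package, assuming the ViT-H checkpoint file sam_vit_h_4b8939.pth has been downloaded; the image path and click coordinates are made up:

Python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)             # heavy ViT-H encoder runs once per image

# One foreground click (label 1) at an arbitrary pixel; the decoder is fast
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,             # return 3 candidate masks for an ambiguous click
)
print(masks.shape, scores)             # (3, H, W) boolean masks with quality scores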

Code Example — Image Classification with ResNet

Load a pretrained ResNet-50 from torchvision, preprocess an image, and get the top-5 ImageNet predictions:

Python
import torch
from torchvision import models, transforms
from torchvision.models import ResNet50_Weights
from PIL import Image
import urllib.request, json

# 1. Load a pretrained ResNet-50 (ImageNet weights)
model = models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# 2. Standard ImageNet preprocessing pipeline
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# 3. Load and preprocess an image
img = Image.open("dog.jpg").convert("RGB")
input_tensor = preprocess(img).unsqueeze(0)  # add batch dimension → [1, 3, 224, 224]

# 4. Forward pass — no gradients needed for inference
with torch.no_grad():
    logits = model(input_tensor)                # shape: [1, 1000]
    probs  = torch.nn.functional.softmax(logits[0], dim=0)

# 5. Fetch ImageNet class labels
url = "https://raw.githubusercontent.com/anishathalye/imagenet-simple-labels/master/imagenet-simple-labels.json"
with urllib.request.urlopen(url) as r:
    labels = json.loads(r.read())

# 6. Print top-5 predictions
top5 = torch.topk(probs, 5)
print("Top-5 predictions:")
for prob, idx in zip(top5.values, top5.indices):
    print(f"  {labels[idx]:30s}  {prob.item():.2%}")

# Example output:
#   golden retriever                91.23%
#   Labrador retriever               5.41%
#   kuvasz                           1.02%
#   cocker spaniel                   0.61%
#   clumber                          0.28%

What is happening inside ResNet-50? The image passes through a convolutional stem followed by four stages of bottleneck residual blocks; after the first stage, each subsequent stage doubles the channel depth while halving spatial resolution via strided convolution. The final 2048-channel feature map is global-average-pooled and projected to 1000 class logits by a single linear layer.
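
A quick way to verify those shapes is to push a dummy input through the stem and the four residual stages:

Python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=None).eval()
stem = nn.Sequential(model.conv1, model.bn1, model.relu, model.maxpool)
stages = [("stem", stem), ("layer1", model.layer1), ("layer2", model.layer2),
          ("layer3", model.layer3), ("layer4", model.layer4)]

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for name, stage in stages:
        x = stage(x)
        print(f"{name:7s} -> {tuple(x.shape)}")
# stem    -> (1, 64, 56, 56)
# layer1  -> (1, 256, 56, 56)
# layer2  -> (1, 512, 28, 28)
# layer3  -> (1, 1024, 14, 14)
# layer4  -> (1, 2048, 7, 7)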

Key Tools & Libraries

OpenCV

The go-to library for classical computer vision — camera I/O, colour space conversions, geometric transforms, feature matching, optical flow, and video processing. Bindings for Python, C++, and Java.

torchvision

PyTorch's official CV extension: pretrained models (ResNet, EfficientNet, ViT), standard transforms and augmentations, and popular datasets (ImageNet, COCO, VOC) with a consistent API.

Detectron2

Facebook AI's modular detection and segmentation framework. Implements Faster R-CNN, Mask R-CNN, RetinaNet, Panoptic FPN, and DensePose. Designed for research-grade reproducibility.

Ultralytics YOLOv9

The most popular real-time object detection package. One-line training, export to ONNX / TensorRT / CoreML, and 60 fps inference. Also supports segmentation, classification, and pose estimation tasks.

Segment Anything (SAM)

Meta AI's universal segmentation model. Run inference on any image with zero fine-tuning. Supports automatic mask generation (segment everything) and interactive prompting (click a point, draw a box).

Roboflow

End-to-end dataset management platform for CV — import annotations from any format, apply augmentations, version datasets, and export directly to YOLO / COCO / TFRecord training pipelines.