Foundations

Machine Learning

Machine Learning is the field of study that gives computers the ability to learn from data without being explicitly programmed. It's the backbone of modern AI, powering everything from spam filters to self-driving cars.

1959: Arthur Samuel coined the term "machine learning"
80%+: share of AI products that use ML
$209B: projected global ML market by 2029

What is Machine Learning?

Machine Learning (ML) is a subset of Artificial Intelligence where algorithms learn patterns from data to make predictions or decisions. Instead of writing explicit rules, you train a model on examples — and it discovers the rules itself.

The key insight: given enough data and compute, ML models can learn representations that humans struggle to program explicitly — like recognising faces, translating languages, or predicting diseases.

Raw Data
Feature Engineering
Model Training
Trained Model
Predictions

Traditional programming: Rules + Data → Output
Machine Learning: Data + Output → Rules (the model)
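This inversion can be sketched in a few lines. A minimal illustration using scikit-learn's LogisticRegression; the height data and the "tall" label are made up purely for the example:

```python
from sklearn.linear_model import LogisticRegression

# Traditional programming: a human writes the rule
def is_tall_rule(height_cm):
    return height_cm > 180  # threshold chosen by hand

# Machine learning: the rule (here, a threshold) is learned from examples
heights = [[150], [160], [170], [185], [190], [200]]  # data (inputs)
labels = [0, 0, 0, 1, 1, 1]                           # output (1 = "tall")

model = LogisticRegression()
model.fit(heights, labels)        # data + output -> rules (the model)

print(model.predict([[195]]))     # the learned rule applied to a new input
```

The model was never told "180 cm"; it recovered a decision boundary from the labelled examples alone.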

Types of Machine Learning

🎓 Supervised Learning

Train on labelled data (input → correct output). The model learns to map inputs to outputs.

🔍 Unsupervised Learning

Find hidden patterns in unlabelled data. No correct answers — the model discovers structure.

🔄 Semi-Supervised

Mix of labelled and unlabelled data. Common when labelling is expensive (e.g., medical images).

🎮 Reinforcement Learning

Agent learns by taking actions in an environment and receiving rewards or penalties.

🔁 Self-Supervised

Model creates its own labels from data (e.g., predict the next word). Used to train LLMs.

📚 Transfer Learning

Use a pre-trained model as a starting point and fine-tune on a new task. Dramatically reduces data needs.
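To make the first two paradigms concrete, the sketch below runs both on the same dataset: a classifier learns from the provided labels, while k-means ignores them and invents its own clusters. The choice of the Iris dataset and of these particular estimators is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn the mapping from features to the given labels
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Unsupervised: ignore the labels and discover structure (3 clusters)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:1]))   # a label the model was taught
print(km.labels_[:5])       # cluster ids the model invented itself
```

The cluster ids have no inherent meaning; mapping them back to species names would require labels again, which is exactly the gap supervised learning fills.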

Key Algorithms

Algorithm | Type | Best For | Complexity
Linear Regression | Supervised | Continuous value prediction | ⭐ Low
Logistic Regression | Supervised | Binary classification | ⭐ Low
Decision Trees | Supervised | Interpretable rules | ⭐⭐ Medium
Random Forest | Supervised | Tabular data, robust baseline | ⭐⭐ Medium
Gradient Boosting (XGBoost) | Supervised | Tabular data, competitions | ⭐⭐⭐ High
SVM | Supervised | High-dimensional, small data | ⭐⭐ Medium
k-Nearest Neighbours | Supervised | Simple baselines | ⭐ Low
k-Means | Unsupervised | Clustering | ⭐⭐ Medium
PCA | Unsupervised | Dimensionality reduction | ⭐⭐ Medium
Neural Networks | All types | Complex patterns, large data | ⭐⭐⭐⭐ Very High
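For a rough feel, the snippet below cross-validates a few of the algorithms from the table on one built-in dataset. The dataset choice and default hyperparameters are arbitrary; the scores illustrate the workflow, not any general ranking:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation: mean accuracy over held-out folds
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```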

The ML Pipeline

A real-world ML project follows these stages — in practice you'll iterate between them many times:

1. Problem Definition

What are you predicting? What metric measures success?

2. Data Collection

Gather representative data. More high-quality data usually beats a cleverer algorithm.

3. EDA

Exploratory Data Analysis — understand distributions, outliers, correlations.

4. Preprocessing

Handle missing values, encode categoricals, normalise/scale features.

5. Feature Engineering

Create meaningful features. Domain knowledge is gold here.

6. Model Training

Split data, choose algorithm, fit the model.

7. Evaluation

Measure performance on held-out test set using the right metrics.

8. Deployment

Serve predictions via API. Monitor for data drift.
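The middle stages map naturally onto a scikit-learn Pipeline. This sketch assumes numeric tabular data; the dataset, imputer, scaler, and model choices are illustrative (this particular dataset has no missing values, so the imputer is a no-op here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)            # 2. data collection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42              # held-out test set
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),     # 4. missing values
    ("scale", StandardScaler()),                      # 4. feature scaling
    ("model", LogisticRegression(max_iter=1000)),     # 6. model training
])

pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2%}")  # 7. evaluation
```

Bundling preprocessing into the pipeline ensures the scaler is fit only on training data, avoiding leakage from the test set.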

Model Evaluation

Classification Metrics

Metric | Formula | When to use
Accuracy | (TP+TN) / Total | Balanced classes
Precision | TP / (TP+FP) | When false positives are costly
Recall | TP / (TP+FN) | When false negatives are costly
F1 Score | 2×P×R / (P+R) | Imbalanced classes
AUC-ROC | Area under ROC curve | Ranking/probability output
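The precision, recall, and F1 formulas above can be checked by hand against scikit-learn on a toy prediction vector (the labels below are made up for the example):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count confusion-matrix cells directly
TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

precision = TP / (TP + FP)                          # 3/4 = 0.75
recall = TP / (TP + FN)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

# sklearn agrees with the hand-computed formulas
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(precision, recall, f1)
```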

⚠️ The bias-variance tradeoff: A model that's too simple underfits (high bias). A model that's too complex overfits (high variance). The sweet spot is minimising total error = bias² + variance + irreducible noise.
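One way to see the tradeoff is to fit polynomials of increasing degree to noisy data: a low degree underfits (high train and test error), while a very high degree drives training error down at the cost of a growing train/test gap. The data here is synthetic and the degrees are chosen for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)  # noisy sine

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    results[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                       mean_squared_error(y_te, model.predict(X_te)))
    print(degree, results[degree])
# Degree 1 underfits (both errors high); degree 4 fits well;
# degree 15 typically shows very low train error but a larger train/test gap.
```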

Code Example — Train a Classifier

Python
# Train a Random Forest classifier on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load data
X, y = load_iris(return_X_y=True)

# 2. Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred))

# 5. Feature importances
for name, imp in zip(load_iris().feature_names, model.feature_importances_):
    print(f"{name}: {imp:.3f}")

Real-World Applications

🏥 Healthcare

Disease prediction, medical imaging diagnosis, drug discovery, genomic analysis.

💳 Finance

Fraud detection, credit scoring, algorithmic trading, risk modelling.

🛒 E-Commerce

Recommendation systems (Netflix, Amazon), demand forecasting, dynamic pricing.

🚗 Automotive

Predictive maintenance, fuel optimisation, driver behaviour analysis.

📧 Communications

Spam filtering, email categorisation, autocomplete, language detection.

🔒 Security

Intrusion detection, malware classification, anomaly detection.