## What is Machine Learning?
Machine Learning (ML) is a subset of Artificial Intelligence where algorithms learn patterns from data to make predictions or decisions. Instead of writing explicit rules, you train a model on examples — and it discovers the rules itself.
The key insight: given enough data and compute, ML models can learn representations that humans struggle to program explicitly — like recognising faces, translating languages, or predicting diseases.
- Traditional programming: Rules + Data → Output
- Machine Learning: Data + Output → Rules (the model)
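This inversion fits in a few lines of code. Below is a minimal sketch (assuming scikit-learn is installed): a hand-written conversion rule next to a linear model that recovers the same rule purely from example input/output pairs.

```python
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule is written by hand.
def c_to_f(c):
    return c * 9 / 5 + 32

# Machine learning: the rule is learned from (input, output) examples.
celsius = [[0], [10], [20], [30], [40]]
fahrenheit = [32, 50, 68, 86, 104]
model = LinearRegression().fit(celsius, fahrenheit)

# The fitted coefficients *are* the recovered rule: f ≈ 1.8·c + 32
print(model.coef_[0], model.intercept_)       # ≈ 1.8 and ≈ 32
print(model.predict([[25]])[0], c_to_f(25))   # both ≈ 77.0
```

The data here is noise-free, so the model recovers the rule exactly; real datasets only approximate it.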
## Types of Machine Learning
- **🎓 Supervised Learning**: Train on labelled data (input → correct output). The model learns to map inputs to outputs.
- **🔍 Unsupervised Learning**: Find hidden patterns in unlabelled data. No correct answers: the model discovers structure on its own.
- **🔄 Semi-Supervised Learning**: A mix of labelled and unlabelled data. Common when labelling is expensive (e.g., medical images).
- **🎮 Reinforcement Learning**: An agent learns by taking actions in an environment and receiving rewards or penalties.
- **🔁 Self-Supervised Learning**: The model creates its own labels from the data (e.g., predict the next word). Used to pre-train LLMs.
- **📚 Transfer Learning**: Use a pre-trained model as a starting point and fine-tune it on a new task. Dramatically reduces data needs.
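The first two paradigms can be contrasted on the same dataset. A sketch (assuming scikit-learn): a supervised classifier uses the labels, while k-means is shown only the inputs and still finds three groups.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y steer the fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"supervised training accuracy: {clf.score(X, y):.2%}")

# Unsupervised: only X is given; k-means partitions it into 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```

The clusters often line up roughly with the three species, but nothing forces them to; unsupervised structure need not match human labels.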
## Key Algorithms
| Algorithm | Type | Best For | Complexity |
|---|---|---|---|
| Linear Regression | Supervised | Continuous value prediction | ⭐ Low |
| Logistic Regression | Supervised | Binary classification | ⭐ Low |
| Decision Trees | Supervised | Interpretable rules | ⭐⭐ Medium |
| Random Forest | Supervised | Tabular data, robust baseline | ⭐⭐ Medium |
| Gradient Boosting (XGBoost) | Supervised | Tabular data, competitions | ⭐⭐⭐ High |
| SVM | Supervised | High-dimensional, small data | ⭐⭐ Medium |
| k-Nearest Neighbours | Supervised | Simple baselines | ⭐ Low |
| k-Means | Unsupervised | Clustering | ⭐⭐ Medium |
| PCA | Unsupervised | Dimensionality reduction | ⭐⭐ Medium |
| Neural Networks | All types | Complex patterns, large data | ⭐⭐⭐⭐ Very High |
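A convenient property of scikit-learn is that most of the supervised algorithms in the table share the same `fit`/`score` API, so they can be swapped in and out freely. An illustrative comparison on the small Iris dataset (not a benchmark; rankings vary by dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same API for every estimator: construct, fit, score.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
accs = {}
for name, m in models.items():
    accs[name] = m.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:20s} test accuracy: {accs[name]:.3f}")
```

On an easy dataset like Iris all four land close together; the differences in the table matter far more on large or messy tabular data.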
## The ML Pipeline
A real-world ML project follows these stages — in practice you'll iterate between them many times:
1. **Problem Definition**: What are you predicting? What metric measures success?
2. **Data Collection**: Gather representative data. More high-quality data often beats a fancier algorithm.
3. **Exploratory Data Analysis (EDA)**: Understand distributions, outliers, and correlations.
4. **Preprocessing**: Handle missing values, encode categorical features, normalise/scale numeric features.
5. **Feature Engineering**: Create meaningful features. Domain knowledge is gold here.
6. **Model Training**: Split the data, choose an algorithm, fit the model.
7. **Evaluation**: Measure performance on a held-out test set using the right metrics.
8. **Deployment**: Serve predictions via an API. Monitor for data drift.
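Steps 4–7 can be chained into a single object with scikit-learn's `Pipeline`, which guarantees the preprocessing is fitted on the training split only and then re-applied to the test split. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing + model in one object: the scaler learns its means and
# variances from the training split alone, avoiding test-set leakage.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.2%}")
```

Bundling the stages also simplifies deployment: the whole pipeline is one artefact to serialise and serve.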
## Model Evaluation

### Classification Metrics
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes |
| Precision | TP / (TP+FP) | When false positives are costly |
| Recall | TP / (TP+FN) | When false negatives are costly |
| F1 Score | 2×P×R / (P+R) | Imbalanced classes |
| AUC-ROC | Area under ROC curve | Ranking/probability output |
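The formulas in the table can be checked by hand against scikit-learn on a tiny made-up prediction set (one missed positive, one false alarm):

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 positives, 6 negatives (made up)
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # one FN, one FP

# confusion_matrix orders binary counts as TN, FP, FN, TP when flattened.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                          # TP / (TP+FP)
recall = tp / (tp + fn)                             # TP / (TP+FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(tp, fp, fn, tn)          # → 3 1 1 5
print(precision, recall, f1)   # → 0.75 0.75 0.75

# sklearn agrees with the hand formulas
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

Note that accuracy here would be 8/10 = 80% even though the classifier misses a quarter of the positives, which is why precision and recall matter on imbalanced data.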
⚠️ The bias-variance tradeoff: A model that's too simple underfits (high bias). A model that's too complex overfits (high variance). The sweet spot is minimising total error = bias² + variance + irreducible noise.
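The tradeoff is easy to see with polynomial fits to synthetic noisy data (a toy sketch; the sine signal and noise level are made up). Degree 1 underfits, a moderate degree tracks the signal, and a high degree memorises the noise:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.025, 0.975, 20)        # points between the training x's
signal = lambda x: np.sin(2 * np.pi * x)      # true underlying function
y_train = signal(x_train) + rng.normal(0, 0.2, 20)
y_test = signal(x_test) + rng.normal(0, 0.2, 20)

def mse(p, x, y):
    return float(np.mean((p(x) - y) ** 2))

# Higher degree always lowers TRAINING error; test error tells the real story.
results = {}
for deg in (1, 4, 15):
    p = Polynomial.fit(x_train, y_train, deg)
    results[deg] = (mse(p, x_train, y_train), mse(p, x_test, y_test))
    print(f"degree {deg:2d}: train MSE {results[deg][0]:.3f}, "
          f"test MSE {results[deg][1]:.3f}")
```

The degree-1 model has high error on both splits (bias); the degree-15 model drives training error toward zero while its test error typically grows (variance).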
## Code Example — Train a Classifier
```python
# Train a Random Forest classifier on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred))
# → Accuracy: 96.67%

# 5. Feature importances
for name, imp in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```
## Real-World Applications
- **🏥 Healthcare**: Disease prediction, medical-imaging diagnosis, drug discovery, genomic analysis.
- **💳 Finance**: Fraud detection, credit scoring, algorithmic trading, risk modelling.
- **🛒 E-Commerce**: Recommendation systems (Netflix, Amazon), demand forecasting, dynamic pricing.
- **🚗 Automotive**: Predictive maintenance, fuel optimisation, driver-behaviour analysis.
- **📧 Communications**: Spam filtering, email categorisation, autocomplete, language detection.
- **🔒 Security**: Intrusion detection, malware classification, anomaly detection.