## What is Machine Learning?
Machine Learning (ML) is a subset of Artificial Intelligence where algorithms learn patterns from data to make predictions or decisions. Instead of writing explicit rules, you train a model on examples — and it discovers the rules itself.
The key insight: given enough data and compute, ML models can learn representations that humans struggle to program explicitly — like recognising faces, translating languages, or predicting diseases.
- Traditional programming: Rules + Data → Output
- Machine Learning: Data + Output → Rules (the model)
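This inversion fits in a few lines of code. Below is a minimal sketch (assuming scikit-learn is installed): a hand-written conversion rule next to a linear model that recovers the same rule purely from example input/output pairs.

```python
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule is written by hand.
def c_to_f(c):
    return c * 9 / 5 + 32

# Machine learning: the rule is learned from (input, output) examples.
celsius = [[0], [10], [20], [30], [40]]
fahrenheit = [32, 50, 68, 86, 104]
model = LinearRegression().fit(celsius, fahrenheit)

# The fitted coefficients *are* the recovered rule: f ≈ 1.8·c + 32
print(model.coef_[0], model.intercept_)       # ≈ 1.8 and ≈ 32
print(model.predict([[25]])[0], c_to_f(25))   # both ≈ 77.0
```

The data here is noise-free, so the model recovers the rule exactly; real datasets only approximate it.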
## Types of Machine Learning
- **🎓 Supervised Learning**: Train on labelled data (input → correct output). The model learns to map inputs to outputs.
- **🔍 Unsupervised Learning**: Find hidden patterns in unlabelled data. No correct answers: the model discovers structure on its own.
- **🔄 Semi-Supervised Learning**: A mix of labelled and unlabelled data. Common when labelling is expensive (e.g., medical images).
- **🎮 Reinforcement Learning**: An agent learns by taking actions in an environment and receiving rewards or penalties.
- **🔁 Self-Supervised Learning**: The model creates its own labels from the data (e.g., predict the next word). Used to pre-train LLMs.
- **📚 Transfer Learning**: Use a pre-trained model as a starting point and fine-tune it on a new task. Dramatically reduces data needs.
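The first two paradigms can be contrasted on the same dataset. A sketch (assuming scikit-learn): a supervised classifier uses the labels, while k-means is shown only the inputs and still finds three groups.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y steer the fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"supervised training accuracy: {clf.score(X, y):.2%}")

# Unsupervised: only X is given; k-means partitions it into 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```

The clusters often line up roughly with the three species, but nothing forces them to; unsupervised structure need not match human labels.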
## Key Algorithms
| Algorithm | Type | Best For | Complexity |
|---|---|---|---|
| Linear Regression | Supervised | Continuous value prediction | ⭐ Low |
| Logistic Regression | Supervised | Binary classification | ⭐ Low |
| Decision Trees | Supervised | Interpretable rules | ⭐⭐ Medium |
| Random Forest | Supervised | Tabular data, robust baseline | ⭐⭐ Medium |
| Gradient Boosting (XGBoost) | Supervised | Tabular data, competitions | ⭐⭐⭐ High |
| SVM | Supervised | High-dimensional, small data | ⭐⭐ Medium |
| k-Nearest Neighbours | Supervised | Simple baselines | ⭐ Low |
| k-Means | Unsupervised | Clustering | ⭐⭐ Medium |
| PCA | Unsupervised | Dimensionality reduction | ⭐⭐ Medium |
| Neural Networks | All types | Complex patterns, large data | ⭐⭐⭐⭐ Very High |
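A convenient property of scikit-learn is that most of the supervised algorithms in the table share the same `fit`/`score` API, so they can be swapped in and out freely. An illustrative comparison on the small Iris dataset (not a benchmark; rankings vary by dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same API for every estimator: construct, fit, score.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
accs = {}
for name, m in models.items():
    accs[name] = m.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:20s} test accuracy: {accs[name]:.3f}")
```

On an easy dataset like Iris all four land close together; the differences in the table matter far more on large or messy tabular data.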
## The ML Pipeline
A real-world ML project follows these stages — in practice you'll iterate between them many times:
1. **Problem Definition**: What are you predicting? What metric measures success?
2. **Data Collection**: Gather representative data. More high-quality data often beats a fancier algorithm.
3. **Exploratory Data Analysis (EDA)**: Understand distributions, outliers, and correlations.
4. **Preprocessing**: Handle missing values, encode categorical features, normalise/scale numeric features.
5. **Feature Engineering**: Create meaningful features. Domain knowledge is gold here.
6. **Model Training**: Split the data, choose an algorithm, fit the model.
7. **Evaluation**: Measure performance on a held-out test set using the right metrics.
8. **Deployment**: Serve predictions via an API. Monitor for data drift.
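Steps 4–7 can be chained into a single object with scikit-learn's `Pipeline`, which guarantees the preprocessing is fitted on the training split only and then re-applied to the test split. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing + model in one object: the scaler learns its means and
# variances from the training split alone, avoiding test-set leakage.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.2%}")
```

Bundling the stages also simplifies deployment: the whole pipeline is one artefact to serialise and serve.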
## Model Evaluation

### Classification Metrics
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes |
| Precision | TP / (TP+FP) | When false positives are costly |
| Recall | TP / (TP+FN) | When false negatives are costly |
| F1 Score | 2×P×R / (P+R) | Imbalanced classes |
| AUC-ROC | Area under ROC curve | Ranking/probability output |
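The formulas in the table can be checked by hand against scikit-learn on a tiny made-up prediction set (one missed positive, one false alarm):

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 positives, 6 negatives (made up)
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # one FN, one FP

# confusion_matrix orders binary counts as TN, FP, FN, TP when flattened.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                          # TP / (TP+FP)
recall = tp / (tp + fn)                             # TP / (TP+FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(tp, fp, fn, tn)          # → 3 1 1 5
print(precision, recall, f1)   # → 0.75 0.75 0.75

# sklearn agrees with the hand formulas
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

Note that accuracy here would be 8/10 = 80% even though the classifier misses a quarter of the positives, which is why precision and recall matter on imbalanced data.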
⚠️ The bias-variance tradeoff: A model that's too simple underfits (high bias). A model that's too complex overfits (high variance). The sweet spot is minimising total error = bias² + variance + irreducible noise.
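The tradeoff is easy to see with polynomial fits to synthetic noisy data (a toy sketch; the sine signal and noise level are made up). Degree 1 underfits, a moderate degree tracks the signal, and a high degree memorises the noise:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.025, 0.975, 20)        # points between the training x's
signal = lambda x: np.sin(2 * np.pi * x)      # true underlying function
y_train = signal(x_train) + rng.normal(0, 0.2, 20)
y_test = signal(x_test) + rng.normal(0, 0.2, 20)

def mse(p, x, y):
    return float(np.mean((p(x) - y) ** 2))

# Higher degree always lowers TRAINING error; test error tells the real story.
results = {}
for deg in (1, 4, 15):
    p = Polynomial.fit(x_train, y_train, deg)
    results[deg] = (mse(p, x_train, y_train), mse(p, x_test, y_test))
    print(f"degree {deg:2d}: train MSE {results[deg][0]:.3f}, "
          f"test MSE {results[deg][1]:.3f}")
```

The degree-1 model has high error on both splits (bias); the degree-15 model drives training error toward zero while its test error typically grows (variance).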
## Code Example — Train a Classifier
```python
# Train a Random Forest classifier on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred))
# → Accuracy: 96.67%

# 5. Feature importances
for name, imp in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```
## Real-World Applications
- **🏥 Healthcare**: Disease prediction, medical-imaging diagnosis, drug discovery, genomic analysis.
- **💳 Finance**: Fraud detection, credit scoring, algorithmic trading, risk modelling.
- **🛒 E-Commerce**: Recommendation systems (Netflix, Amazon), demand forecasting, dynamic pricing.
- **🚗 Automotive**: Predictive maintenance, fuel optimisation, driver-behaviour analysis.
- **📧 Communications**: Spam filtering, email categorisation, autocomplete, language detection.
- **🔒 Security**: Intrusion detection, malware classification, anomaly detection.