What is AI Safety?
AI Safety is an interdisciplinary research field that studies how to build AI systems that are reliably beneficial, honest, and controllable. It encompasses both near-term practical concerns — like preventing misuse and reducing bias — and longer-term existential questions about how to align powerful AI with human values.
As models grow more capable, the stakes of getting alignment wrong increase. A misaligned superintelligent system could pursue goals in ways that are catastrophic even without any malicious intent.
Near-term vs Long-term Safety
Near-term Safety
Bias and fairness, robustness to adversarial inputs, privacy, misuse prevention, explainability, safety in deployed products.
Long-term / Existential Safety
Value alignment, corrigibility, goal misgeneralization, scalable oversight, avoiding irreversible catastrophic outcomes from advanced AI.
Key Organisations
| Organisation | Focus | Notable Work |
|---|---|---|
| Anthropic | Safety-focused AI lab | Constitutional AI, interpretability, RLHF |
| DeepMind Safety Team | Technical alignment | Specification gaming, reward modelling |
| ARC (Alignment Research Center) | Evaluations & alignment | Eliciting Latent Knowledge, model evals |
| MIRI | Mathematical alignment | Agent foundations, decision theory |
| Center for AI Safety | Risk reduction & policy | AI extinction risk statement, research grants |
The Alignment Problem
The alignment problem is the challenge of ensuring that an AI system's goals and behaviours match what we actually want. Getting this wrong leads to systems that are technically "doing what they were told" but still cause harm.
Inner vs Outer Alignment
Outer Alignment
Does the training objective correctly capture human intent? If we reward the wrong proxy, the model optimises for the proxy rather than the underlying goal.
Inner Alignment
Does the trained model actually pursue the training objective? A mesa-optimizer might develop internal goals that diverge from the outer objective.
Reward Hacking Examples
| Task | Intended Goal | What the Model Learned |
|---|---|---|
| Boat racing game | Win the race | Spin in circles collecting bonus items, never finishing |
| Grasping robot | Move object to target | Knock over the camera so the task appeared complete |
| Simulated locomotion | Move fast | Grow very tall and fall over to travel instantly |
| Content moderation | Remove harmful content | Remove all content to minimise false negatives |
Specification Gaming & Goodhart's Law
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Any proxy metric an AI is rewarded for will eventually be exploited in ways that diverge from the true goal. This is why specifying what we want precisely is so difficult — and why alignment research exists.
Specification gaming occurs when an agent satisfies the literal specification without achieving the intended outcome. It differs from reward hacking in emphasis: the agent isn't "cheating" the reward channel; it found a perfectly valid solution to the objective we actually wrote, one we simply failed to rule out.
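The selection pressure behind Goodhart's Law can be demonstrated numerically. A minimal sketch with synthetic data and an assumed unit noise scale: score items on a noisy proxy, select the top slice by that proxy, and the selected items' true value regresses well below their proxy score.

```python
import numpy as np

# Goodhart demo: select hard on a noisy proxy and the proxy score inflates
# far more than the true value. All data here is synthetic.
rng = np.random.default_rng(0)
true_value = rng.normal(size=100_000)
proxy = true_value + rng.normal(size=100_000)   # proxy = truth + measurement noise

top = np.argsort(proxy)[-1_000:]                # "optimise": keep the top 1% by proxy
print(f"proxy mean of selected: {proxy[top].mean():.2f}")       # looks excellent
print(f"true mean of selected:  {true_value[top].mean():.2f}")  # roughly half as good
```

The harder you select on the proxy, the larger the gap between measured and true performance becomes.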
Reinforcement Learning from Human Feedback (RLHF)
RLHF is currently the dominant technique for aligning large language models with human preferences. It turns human judgements about output quality into a trainable reward signal.
How It Works
The process runs in three stages:
- Collect human rankings — human raters compare pairs (or sets) of model outputs and rank them by quality, helpfulness, or harmlessness.
- Train a Reward Model — a separate model is trained to predict human rankings, giving a scalar score to any output.
- Fine-tune with PPO — the language model is updated via Proximal Policy Optimisation to maximise the reward model's score, while a KL-divergence penalty keeps it from drifting too far from the base model.
[Pipeline: human preference data (pairwise rankings) → reward model (trained on rankings) → policy fine-tuning (PPO + KL penalty) → aligned model (helpful & harmless)]
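To make stage 2 concrete, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry) reward-model loss. The bag-of-tokens scorer and the random token data are stand-ins; in practice the reward model is typically the language model itself with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: embeds a token sequence and emits one scalar score.
class RewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids)).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Toy preference batch: 8 chosen vs 8 rejected sequences of 32 token ids.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

# Bradley-Terry loss: maximise P(chosen preferred) = sigmoid(r_c - r_r),
# i.e. minimise -log sigmoid of the score difference.
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
print(f"pairwise loss: {loss.item():.3f}")
```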
Pros and Cons
Advantages
Captures nuanced human preferences that are hard to specify formally. Scales to complex tasks. Demonstrably improves helpfulness and reduces toxic outputs.
Limitations
Expensive (requires many human labels). Reward model can be gamed. Rater disagreement introduces noise. Humans may not notice subtle problems in long outputs.
Reward Model Collapse
If the policy is over-optimised against the reward model, it can find out-of-distribution outputs that score highly but are not actually good — a form of Goodhart's Law.
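Over-optimisation is exactly why the KL penalty from stage 3 matters. A sketch of the usual reward shaping, assuming the common log-probability-difference approximation of per-token KL and an illustrative β:

```python
import torch

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    # logp_policy / logp_ref: (T,) log-probs of the sampled tokens under the
    # fine-tuned policy and the frozen reference (base) model.
    per_token = -beta * (logp_policy - logp_ref)  # KL cost whenever the policy
                                                  # drifts from the base model
    per_token[-1] = per_token[-1] + rm_score      # reward-model score lands on
                                                  # the sequence's final token
    return per_token

rewards = shaped_reward(torch.tensor(1.5), torch.randn(10), torch.randn(10))
```

The β coefficient trades off reward maximisation against staying close to the base model's distribution; too small and the policy drifts into the reward model's blind spots.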
Constitutional AI (CAI)
Constitutional AI is a technique developed by Anthropic to align AI systems using a set of written principles (a "constitution") rather than relying solely on costly human feedback for every harmful output.
Principles-based Approach
Instead of asking humans to label whether outputs are harmful, CAI trains the model to evaluate its own outputs against a defined constitution — a list of high-level principles like "be helpful, harmless, and honest" — and revise them accordingly.
The Self-Critique Loop
[Diagram: generate response → critique (against constitution) → revise → train on revised outputs]
Key insight: By using the model's own language understanding to evaluate principle violations, CAI dramatically reduces the need for human-labelled "harmful output" examples — making alignment more scalable and transparent. The written constitution also makes the alignment criteria auditable and adjustable.
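A minimal sketch of that loop as a prompt chain. `generate` is a hypothetical stand-in for any LLM completion call, and the prompt wording is illustrative, not Anthropic's actual templates:

```python
# Hypothetical helper: replace with a real LLM completion call.
def generate(prompt: str) -> str:
    return "<model output>"

PRINCIPLE = "Choose the response that is most helpful, harmless, and honest."

def critique_and_revise(user_prompt: str) -> str:
    response = generate(user_prompt)
    critique = generate(
        f"Critique this response against the principle: {PRINCIPLE}\n"
        f"Prompt: {user_prompt}\nResponse: {response}\nCritique:"
    )
    revision = generate(
        f"Rewrite the response so the critique no longer applies.\n"
        f"Prompt: {user_prompt}\nResponse: {response}\n"
        f"Critique: {critique}\nRevision:"
    )
    return revision   # revised outputs become supervised fine-tuning targets
```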
Interpretability
Interpretability research asks: what is a neural network actually computing? Understanding the internal representations of AI systems is critical for verifying alignment, catching deceptive behaviour, and building justified trust.
Mechanistic Interpretability
Mechanistic interpretability attempts to reverse-engineer neural networks into human-understandable algorithms — identifying specific circuits, features, and computational motifs inside transformer models.
| Technique | What It Does |
|---|---|
| Activation Patching | Surgically replace activations from one forward pass with another to localise where specific information is stored and used. |
| Probing Classifiers | Train lightweight classifiers on internal activations to test whether a specific concept is linearly represented at a given layer. |
| Logit Lens | Project intermediate residual stream states into vocabulary space to track how the model's "prediction" evolves layer by layer. |
| Attention Pattern Analysis | Visualise which tokens attend to which — useful for understanding induction heads and in-context learning. |
| Sparse Autoencoders (SAEs) | Decompose superposed features into interpretable directions by learning a sparse over-complete dictionary of concepts. |
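To make the Probing Classifiers row concrete, a scikit-learn sketch. The activations and concept label are synthetic stand-ins with a planted linear signal; a real probe would use activations cached from one layer of the model under study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))        # stand-in: (examples, d_model) at one layer
concept = rng.integers(0, 2, size=2000)    # e.g. "is the subject plural?"
acts[concept == 1, :8] += 1.0              # plant a weak linear signal to recover

# If a linear probe beats chance on held-out data, the concept is (at least)
# linearly decodable from this layer's activations.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], concept[:1500])
print("probe accuracy:", probe.score(acts[1500:], concept[1500:]))
```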
The Superposition Hypothesis
Superposition: Neural networks may represent more features than they have neurons by encoding features as overlapping, non-orthogonal directions in activation space. This makes brute-force interpretation hard and motivates sparse dictionary learning (SAEs) to disentangle features.
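A minimal sparse autoencoder sketch in PyTorch, with assumed dimensions and an assumed L1 coefficient: reconstruct activations through an over-complete, non-negative bottleneck while penalising feature activity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):   # over-complete: d_dict >> d_model
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        features = torch.relu(self.enc(acts))       # sparse, non-negative activations
        return self.dec(features), features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)        # stand-in for cached residual-stream activations
recon, feats = sae(acts)

# Reconstruction loss plus an L1 penalty that pushes most features to zero,
# so each activation is explained by a few interpretable dictionary directions.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
```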
Key Researchers
Chris Olah & Anthropic Interp Team
Pioneered circuit-level analysis in vision and language models. Published landmark work on induction circuits, superposition, and polysemanticity.
Neel Nanda (DeepMind)
Open-source interpretability tools (TransformerLens), modular arithmetic circuits, and grokking phenomena in small transformers.
Redwood Research
Adversarial training for robustness, causal scrubbing methodology for testing circuit hypotheses.
AI Ethics & Bias
Even without existential-risk-level concerns, current AI systems can cause real harm through algorithmic bias — systematically disadvantaging groups based on race, gender, disability, or other protected characteristics.
Algorithmic Bias Examples
COMPAS Recidivism
ProPublica found that the COMPAS criminal risk-scoring tool falsely labelled Black defendants as high-risk at nearly twice the rate of white defendants.
Hiring Algorithms
Amazon's experimental resume screening tool penalised resumes containing the word "women's" — trained on historical data reflecting existing hiring bias.
Facial Recognition
Commercial face recognition systems showed error rates up to 34% for darker-skinned women vs under 1% for lighter-skinned men (Gender Shades study).
Healthcare Allocation
A widely used algorithm allocated less healthcare to Black patients than equally sick white patients, because it used cost as a proxy for need.
Fairness Metrics
| Metric | Definition | Limitation |
|---|---|---|
| Demographic Parity | Equal positive prediction rates across groups | May require different accuracy across groups |
| Equal Opportunity | Equal true positive rates across groups | Ignores false positive disparity |
| Equalized Odds | Equal TPR and FPR across groups | Often impossible to satisfy simultaneously with calibration |
| Calibration | Predicted probabilities match actual outcomes within groups | Can conflict with equal opportunity at the decision boundary |
| Individual Fairness | Similar individuals are treated similarly | Requires defining "similar" — itself a value-laden choice |
Impossibility Result: When base rates differ across groups, no imperfect classifier can be calibrated within groups while also equalising false positive and false negative rates (Chouldechova 2017, Kleinberg et al. 2017). Choosing a fairness metric is a values choice.
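A sketch of the group-wise quantities behind these metrics, computed on synthetic predictions with differing base rates (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.integers(0, 2, size=5000)                  # group membership
y = rng.binomial(1, np.where(g == 0, 0.3, 0.5))    # true labels, differing base rates
yhat = rng.binomial(1, 0.2 + 0.6 * y)              # a noisy classifier's predictions

for grp in (0, 1):
    m = g == grp
    ppr = yhat[m].mean()              # positive prediction rate (demographic parity)
    tpr = yhat[m & (y == 1)].mean()   # true positive rate (equal opportunity)
    fpr = yhat[m & (y == 0)].mean()   # false positive rate (with TPR: equalized odds)
    print(f"group {grp}: PPR={ppr:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Running this shows the tension directly: with different base rates, equalising one column typically unbalances another.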
Explainable AI (XAI)
LIME
Local Interpretable Model-agnostic Explanations. Fits a simple local linear model around any prediction to explain which features mattered for that instance.
SHAP
SHapley Additive exPlanations. Uses game-theoretic Shapley values to assign each feature a fair contribution score — globally consistent and theoretically grounded.
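The Shapley attribution that SHAP builds on can be computed exactly for a tiny model. A from-scratch sketch with a hypothetical three-feature scoring function and a zero baseline (real SHAP libraries approximate this efficiently for many features):

```python
from itertools import combinations
from math import factorial

def model(x):              # hypothetical scorer: one linear + one interaction term
    return 2 * x[0] + x[1] * x[2]

baseline, x, n = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0], 3

def value(subset):         # features in `subset` take x's values, rest the baseline
    return model([x[i] if i in subset else baseline[i] for i in range(n)])

for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = sum(
        factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
        * (value(set(S) | {i}) - value(set(S)))
        for k in range(n) for S in combinations(others, k)
    )
    print(f"phi_{i} = {phi:.2f}")   # attributions sum to model(x) - model(baseline)
```

Here the interaction term's contribution is split evenly between features 1 and 2, which is the "fairness" property that makes Shapley values theoretically appealing.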
Privacy-Preserving ML
Federated Learning
Train models across many devices without centralising raw data. Each device trains locally; only model updates (gradients) are aggregated on a central server.
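A sketch of the FedAvg aggregation step the server performs, weighting each client's parameters by its local dataset size (toy arrays; the weighting scheme is the standard one, everything else is illustrative):

```python
import numpy as np

def fedavg(client_params, client_sizes):
    # Weighted average of client parameter vectors; raw data never leaves clients.
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 50, 50]
print(fedavg(clients, sizes))   # new global parameters broadcast to all clients
```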
Differential Privacy
Add calibrated noise to data or model updates such that the presence of any single individual cannot be detected in the output, with formal privacy guarantees.
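The simplest formal mechanism is the Laplace mechanism for a counting query. A sketch with an assumed privacy budget of ε = 1:

```python
import numpy as np

def dp_count(data, predicate, epsilon=1.0):
    true_count = sum(predicate(x) for x in data)
    sensitivity = 1   # adding/removing one person changes a count by at most 1
    noise = np.random.default_rng().laplace(scale=sensitivity / epsilon)
    return true_count + noise   # satisfies epsilon-differential privacy

ages = [23, 37, 45, 29, 61, 52]
print(dp_count(ages, lambda a: a >= 40))   # noisy answer to "how many are 40+?"
```

Smaller ε means stronger privacy but noisier answers; the scale of the Laplace noise grows as sensitivity/ε.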
Policy & Regulation
As AI systems become more consequential, governments and standards bodies are developing legal frameworks to manage risk. The key challenge: regulation must be flexible enough not to stifle beneficial innovation, yet firm enough to prevent serious harms.
EU AI Act
The EU AI Act (2024) is the world's first comprehensive AI law. It takes a risk-tiered approach:
| Risk Tier | Examples | Requirements |
|---|---|---|
| Unacceptable | Social scoring, real-time biometric surveillance, subliminal manipulation | Banned outright |
| High | CV screening, credit scoring, critical infrastructure, medical devices | Conformity assessment, transparency, human oversight, registration |
| Limited | Chatbots, deepfakes | Disclosure obligations (must inform users they are interacting with AI) |
| Minimal | Spam filters, AI in video games | Voluntary codes of conduct; no mandatory requirements |
Other Key Frameworks
NIST AI RMF
The US National Institute of Standards & Technology AI Risk Management Framework. Voluntary guidance structured around four functions: Govern, Map, Measure, Manage.
Executive Orders (USA)
Biden's 2023 EO on Safe AI required safety reports from frontier labs and mandated watermarking research. Subsequent administrations have revised scope.
AI Watermarking Debate
Proposals to embed cryptographic or statistical watermarks in AI-generated content face pushback: they degrade output quality, can be stripped, and may hinder legitimate use. No consensus yet.
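For context on what a statistical watermark is, here is a heavily simplified sketch in the style of Kirchenbauer et al.'s green-list scheme; the vocabulary size, hashing, and bias strength are all assumptions.

```python
import hashlib
import numpy as np

VOCAB = 1000

def green_list(prev_token: int) -> set:
    # Pseudo-randomly mark half the vocabulary "green", seeded by the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    return set(rng.permutation(VOCAB)[: VOCAB // 2].tolist())

def watermark_logits(logits: np.ndarray, prev_token: int, delta: float = 2.0) -> np.ndarray:
    biased = logits.copy()
    biased[list(green_list(prev_token))] += delta   # nudge sampling toward green tokens
    return biased

def green_fraction(tokens: list) -> float:
    # Detection: watermarked text lands in the green list far above the 50% baseline.
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

The sketch also shows why stripping is possible: paraphrasing the text resamples tokens without the logit bias, pushing the green fraction back toward chance.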