What is AI Safety?
AI Safety is an interdisciplinary research field that studies how to build AI systems that are reliably beneficial, honest, and controllable. It encompasses both near-term practical concerns — like preventing misuse and reducing bias — and longer-term existential questions about how to align powerful AI with human values.
As models grow more capable, the stakes of getting alignment wrong increase. A misaligned superintelligent system could pursue goals in ways that are catastrophic even without any malicious intent.
Near-term vs Long-term Safety
Near-term Safety
Bias and fairness, robustness to adversarial inputs, privacy, misuse prevention, explainability, safety in deployed products.
Long-term / Existential Safety
Value alignment, corrigibility, goal misgeneralization, scalable oversight, avoiding irreversible catastrophic outcomes from advanced AI.
Key Organisations
| Organisation | Focus | Notable Work |
|---|---|---|
| Anthropic | Safety-focused AI lab | Constitutional AI, interpretability, RLHF |
| DeepMind Safety Team | Technical alignment | Specification gaming, reward modelling |
| ARC (Alignment Research Center) | Evaluations & alignment | Eliciting Latent Knowledge, model evals |
| MIRI | Mathematical alignment | Agent foundations, decision theory |
| Center for AI Safety | Risk reduction & policy | AI extinction risk statement, research grants |
The Alignment Problem
The alignment problem is the challenge of ensuring that an AI system's goals and behaviours match what we actually want. Getting this wrong leads to systems that are technically "doing what they were told" but still cause harm.
Inner vs Outer Alignment
Outer Alignment
Does the training objective correctly capture human intent? If we reward the wrong proxy, the model optimises for the proxy rather than the underlying goal.
Inner Alignment
Does the trained model actually pursue the training objective? A mesa-optimizer might develop internal goals that diverge from the outer objective.
Reward Hacking Examples
| Task | Intended Goal | What the Model Learned |
|---|---|---|
| Boat racing game | Win the race | Spin in circles collecting bonus items, never finishing |
| Grasping robot | Move object to target | Knock over the camera so the task appeared complete |
| Simulated locomotion | Move fast | Grow very tall and fall over to travel instantly |
| Content moderation | Remove harmful content | Remove all content to minimise false negatives |
Specification Gaming & Goodhart's Law
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Any proxy metric an AI is rewarded for will eventually be exploited in ways that diverge from the true goal. This is why specifying what we want precisely is so difficult — and why alignment research exists.
Specification gaming occurs when an agent satisfies the literal specification without achieving the intended outcome. It differs from reward hacking in emphasis: the agent isn't "cheating" the reward channel; it found a perfectly valid solution to the objective we actually wrote, one we simply failed to rule out.
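The selection pressure behind Goodhart's Law can be demonstrated numerically. A minimal sketch with synthetic data and an assumed unit noise scale: score items on a noisy proxy, select the top slice by that proxy, and the selected items' true value regresses well below their proxy score.

```python
import numpy as np

# Goodhart demo: select hard on a noisy proxy and the proxy score inflates
# far more than the true value. All data here is synthetic.
rng = np.random.default_rng(0)
true_value = rng.normal(size=100_000)
proxy = true_value + rng.normal(size=100_000)   # proxy = truth + measurement noise

top = np.argsort(proxy)[-1_000:]                # "optimise": keep the top 1% by proxy
print(f"proxy mean of selected: {proxy[top].mean():.2f}")       # looks excellent
print(f"true mean of selected:  {true_value[top].mean():.2f}")  # roughly half as good
```

The harder you select on the proxy, the larger the gap between measured and true performance becomes.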
Reinforcement Learning from Human Feedback (RLHF)
RLHF is currently the dominant technique for aligning large language models with human preferences. It turns human judgements about output quality into a trainable reward signal.
How It Works
The process runs in three stages:
- Collect human rankings — human raters compare pairs (or sets) of model outputs and rank them by quality, helpfulness, or harmlessness.
- Train a Reward Model — a separate model is trained to predict human rankings, giving a scalar score to any output.
- Fine-tune with PPO — the language model is updated via Proximal Policy Optimisation to maximise the reward model's score, while a KL-divergence penalty keeps it from drifting too far from the base model.
[Pipeline: human preference data (pairwise rankings) → reward model (trained on rankings) → policy fine-tuning (PPO + KL penalty) → aligned model (helpful & harmless)]
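To make stage 2 concrete, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry) reward-model loss. The bag-of-tokens scorer and the random token data are stand-ins; in practice the reward model is typically the language model itself with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: embeds a token sequence and emits one scalar score.
class RewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids)).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Toy preference batch: 8 chosen vs 8 rejected sequences of 32 token ids.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

# Bradley-Terry loss: maximise P(chosen preferred) = sigmoid(r_c - r_r),
# i.e. minimise -log sigmoid of the score difference.
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
print(f"pairwise loss: {loss.item():.3f}")
```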
Pros and Cons
Advantages
Captures nuanced human preferences that are hard to specify formally. Scales to complex tasks. Demonstrably improves helpfulness and reduces toxic outputs.
Limitations
Expensive (requires many human labels). Reward model can be gamed. Rater disagreement introduces noise. Humans may not notice subtle problems in long outputs.
Reward Model Collapse
If the policy is over-optimised against the reward model, it can find out-of-distribution outputs that score highly but are not actually good — a form of Goodhart's Law.
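Over-optimisation is exactly why the KL penalty from stage 3 matters. A sketch of the usual reward shaping, assuming the common log-probability-difference approximation of per-token KL and an illustrative β:

```python
import torch

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    # logp_policy / logp_ref: (T,) log-probs of the sampled tokens under the
    # fine-tuned policy and the frozen reference (base) model.
    per_token = -beta * (logp_policy - logp_ref)  # KL cost whenever the policy
                                                  # drifts from the base model
    per_token[-1] = per_token[-1] + rm_score      # reward-model score lands on
                                                  # the sequence's final token
    return per_token

rewards = shaped_reward(torch.tensor(1.5), torch.randn(10), torch.randn(10))
```

The β coefficient trades off reward maximisation against staying close to the base model's distribution; too small and the policy drifts into the reward model's blind spots.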
Constitutional AI (CAI)
Constitutional AI is a technique developed by Anthropic to align AI systems using a set of written principles (a "constitution") rather than relying solely on costly human feedback for every harmful output.
Principles-based Approach
Instead of asking humans to label whether outputs are harmful, CAI trains the model to evaluate its own outputs against a defined constitution — a list of high-level principles like "be helpful, harmless, and honest" — and revise them accordingly.
The Self-Critique Loop
[Diagram: generate response → critique (against constitution) → revise → train on revised outputs]
Key insight: By using the model's own language understanding to evaluate principle violations, CAI dramatically reduces the need for human-labelled "harmful output" examples — making alignment more scalable and transparent. The written constitution also makes the alignment criteria auditable and adjustable.
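A minimal sketch of that loop as a prompt chain. `generate` is a hypothetical stand-in for any LLM completion call, and the prompt wording is illustrative, not Anthropic's actual templates:

```python
# Hypothetical helper: replace with a real LLM completion call.
def generate(prompt: str) -> str:
    return "<model output>"

PRINCIPLE = "Choose the response that is most helpful, harmless, and honest."

def critique_and_revise(user_prompt: str) -> str:
    response = generate(user_prompt)
    critique = generate(
        f"Critique this response against the principle: {PRINCIPLE}\n"
        f"Prompt: {user_prompt}\nResponse: {response}\nCritique:"
    )
    revision = generate(
        f"Rewrite the response so the critique no longer applies.\n"
        f"Prompt: {user_prompt}\nResponse: {response}\n"
        f"Critique: {critique}\nRevision:"
    )
    return revision   # revised outputs become supervised fine-tuning targets
```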
Interpretability
Interpretability research asks: what is a neural network actually computing? Understanding the internal representations of AI systems is critical for verifying alignment, catching deceptive behaviour, and building justified trust.
Mechanistic Interpretability
Mechanistic interpretability attempts to reverse-engineer neural networks into human-understandable algorithms — identifying specific circuits, features, and computational motifs inside transformer models.
| Technique | What It Does |
|---|---|
| Activation Patching | Surgically replace activations from one forward pass with another to localise where specific information is stored and used. |
| Probing Classifiers | Train lightweight classifiers on internal activations to test whether a specific concept is linearly represented at a given layer. |
| Logit Lens | Project intermediate residual stream states into vocabulary space to track how the model's "prediction" evolves layer by layer. |
| Attention Pattern Analysis | Visualise which tokens attend to which — useful for understanding induction heads and in-context learning. |
| Sparse Autoencoders (SAEs) | Decompose superposed features into interpretable directions by learning a sparse over-complete dictionary of concepts. |
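To make the Probing Classifiers row concrete, a scikit-learn sketch. The activations and concept label are synthetic stand-ins with a planted linear signal; a real probe would use activations cached from one layer of the model under study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))        # stand-in: (examples, d_model) at one layer
concept = rng.integers(0, 2, size=2000)    # e.g. "is the subject plural?"
acts[concept == 1, :8] += 1.0              # plant a weak linear signal to recover

# If a linear probe beats chance on held-out data, the concept is (at least)
# linearly decodable from this layer's activations.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], concept[:1500])
print("probe accuracy:", probe.score(acts[1500:], concept[1500:]))
```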
The Superposition Hypothesis
Superposition: Neural networks may represent more features than they have neurons by encoding features as overlapping, non-orthogonal directions in activation space. This makes brute-force interpretation hard and motivates sparse dictionary learning (SAEs) to disentangle features.
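A minimal sparse autoencoder sketch in PyTorch, with assumed dimensions and an assumed L1 coefficient: reconstruct activations through an over-complete, non-negative bottleneck while penalising feature activity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):   # over-complete: d_dict >> d_model
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        features = torch.relu(self.enc(acts))       # sparse, non-negative activations
        return self.dec(features), features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)        # stand-in for cached residual-stream activations
recon, feats = sae(acts)

# Reconstruction loss plus an L1 penalty that pushes most features to zero,
# so each activation is explained by a few interpretable dictionary directions.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
```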
Key Researchers
Chris Olah & Anthropic Interp Team
Pioneered circuit-level analysis in vision and language models. Published landmark work on induction circuits, superposition, and polysemanticity.
Neel Nanda (DeepMind)
Open-source interpretability tools (TransformerLens), modular arithmetic circuits, and grokking phenomena in small transformers.
Redwood Research
Adversarial training for robustness, causal scrubbing methodology for testing circuit hypotheses.
AI Ethics & Bias
Even without existential-risk-level concerns, current AI systems can cause real harm through algorithmic bias — systematically disadvantaging groups based on race, gender, disability, or other protected characteristics.
Algorithmic Bias Examples
COMPAS Recidivism
ProPublica found that the COMPAS criminal risk-scoring tool falsely labelled Black defendants as high-risk at nearly twice the rate of white defendants.
Hiring Algorithms
Amazon's experimental resume screening tool penalised resumes containing the word "women's" — trained on historical data reflecting existing hiring bias.
Facial Recognition
Commercial face recognition systems showed error rates up to 34% for darker-skinned women vs under 1% for lighter-skinned men (Gender Shades study).
Healthcare Allocation
A widely used algorithm allocated less healthcare to Black patients than equally sick white patients, because it used cost as a proxy for need.
Fairness Metrics
| Metric | Definition | Limitation |
|---|---|---|
| Demographic Parity | Equal positive prediction rates across groups | May require different accuracy across groups |
| Equal Opportunity | Equal true positive rates across groups | Ignores false positive disparity |
| Equalized Odds | Equal TPR and FPR across groups | Often impossible to satisfy simultaneously with calibration |
| Calibration | Predicted probabilities match actual outcomes within groups | Can conflict with equal opportunity at the decision boundary |
| Individual Fairness | Similar individuals are treated similarly | Requires defining "similar" — itself a value-laden choice |
Impossibility Result: When base rates differ across groups, no imperfect classifier can be calibrated within groups while also equalising false positive and false negative rates (Chouldechova 2017, Kleinberg et al. 2017). Choosing a fairness metric is a values choice.
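A sketch of the group-wise quantities behind these metrics, computed on synthetic predictions with differing base rates (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.integers(0, 2, size=5000)                  # group membership
y = rng.binomial(1, np.where(g == 0, 0.3, 0.5))    # true labels, differing base rates
yhat = rng.binomial(1, 0.2 + 0.6 * y)              # a noisy classifier's predictions

for grp in (0, 1):
    m = g == grp
    ppr = yhat[m].mean()              # positive prediction rate (demographic parity)
    tpr = yhat[m & (y == 1)].mean()   # true positive rate (equal opportunity)
    fpr = yhat[m & (y == 0)].mean()   # false positive rate (with TPR: equalized odds)
    print(f"group {grp}: PPR={ppr:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Running this shows the tension directly: with different base rates, equalising one column typically unbalances another.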
Explainable AI (XAI)
LIME
Local Interpretable Model-agnostic Explanations. Fits a simple local linear model around any prediction to explain which features mattered for that instance.
SHAP
SHapley Additive exPlanations. Uses game-theoretic Shapley values to assign each feature a fair contribution score — globally consistent and theoretically grounded.
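The Shapley attribution that SHAP builds on can be computed exactly for a tiny model. A from-scratch sketch with a hypothetical three-feature scoring function and a zero baseline (real SHAP libraries approximate this efficiently for many features):

```python
from itertools import combinations
from math import factorial

def model(x):              # hypothetical scorer: one linear + one interaction term
    return 2 * x[0] + x[1] * x[2]

baseline, x, n = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0], 3

def value(subset):         # features in `subset` take x's values, rest the baseline
    return model([x[i] if i in subset else baseline[i] for i in range(n)])

for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = sum(
        factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
        * (value(set(S) | {i}) - value(set(S)))
        for k in range(n) for S in combinations(others, k)
    )
    print(f"phi_{i} = {phi:.2f}")   # attributions sum to model(x) - model(baseline)
```

Here the interaction term's contribution is split evenly between features 1 and 2, which is the "fairness" property that makes Shapley values theoretically appealing.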
Privacy-Preserving ML
Federated Learning
Train models across many devices without centralising raw data. Each device trains locally; only model updates (gradients) are aggregated on a central server.
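A sketch of the FedAvg aggregation step the server performs, weighting each client's parameters by its local dataset size (toy arrays; the weighting scheme is the standard one, everything else is illustrative):

```python
import numpy as np

def fedavg(client_params, client_sizes):
    # Weighted average of client parameter vectors; raw data never leaves clients.
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 50, 50]
print(fedavg(clients, sizes))   # new global parameters broadcast to all clients
```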
Differential Privacy
Add calibrated noise to data or model updates such that the presence of any single individual cannot be detected in the output, with formal privacy guarantees.
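The simplest formal mechanism is the Laplace mechanism for a counting query. A sketch with an assumed privacy budget of ε = 1:

```python
import numpy as np

def dp_count(data, predicate, epsilon=1.0):
    true_count = sum(predicate(x) for x in data)
    sensitivity = 1   # adding/removing one person changes a count by at most 1
    noise = np.random.default_rng().laplace(scale=sensitivity / epsilon)
    return true_count + noise   # satisfies epsilon-differential privacy

ages = [23, 37, 45, 29, 61, 52]
print(dp_count(ages, lambda a: a >= 40))   # noisy answer to "how many are 40+?"
```

Smaller ε means stronger privacy but noisier answers; the scale of the Laplace noise grows as sensitivity/ε.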
Policy & Regulation
As AI systems become more consequential, governments and standards bodies are developing legal frameworks to manage risk. The key challenge: regulation must be flexible enough not to stifle beneficial innovation, yet firm enough to prevent serious harms.
EU AI Act
The EU AI Act (2024) is the world's first comprehensive AI law. It takes a risk-tiered approach:
| Risk Tier | Examples | Requirements |
|---|---|---|
| Unacceptable | Social scoring, real-time biometric surveillance, subliminal manipulation | Banned outright |
| High | CV screening, credit scoring, critical infrastructure, medical devices | Conformity assessment, transparency, human oversight, registration |
| Limited | Chatbots, deepfakes | Disclosure obligations (must inform users they are interacting with AI) |
| Minimal | Spam filters, AI in video games | Voluntary codes of conduct; no mandatory requirements |
Other Key Frameworks
NIST AI RMF
The US National Institute of Standards & Technology AI Risk Management Framework. Voluntary guidance structured around four functions: Govern, Map, Measure, Manage.
Executive Orders (USA)
Biden's 2023 EO on Safe AI required safety reports from frontier labs and mandated watermarking research. Subsequent administrations have revised scope.
AI Watermarking Debate
Proposals to embed cryptographic or statistical watermarks in AI-generated content face pushback: they degrade output quality, can be stripped, and may hinder legitimate use. No consensus yet.
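For context on what a statistical watermark is, here is a heavily simplified sketch in the style of Kirchenbauer et al.'s green-list scheme; the vocabulary size, hashing, and bias strength are all assumptions.

```python
import hashlib
import numpy as np

VOCAB = 1000

def green_list(prev_token: int) -> set:
    # Pseudo-randomly mark half the vocabulary "green", seeded by the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    return set(rng.permutation(VOCAB)[: VOCAB // 2].tolist())

def watermark_logits(logits: np.ndarray, prev_token: int, delta: float = 2.0) -> np.ndarray:
    biased = logits.copy()
    biased[list(green_list(prev_token))] += delta   # nudge sampling toward green tokens
    return biased

def green_fraction(tokens: list) -> float:
    # Detection: watermarked text lands in the green list far above the 50% baseline.
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

The sketch also shows why stripping is possible: paraphrasing the text resamples tokens without the logit bias, pushing the green fraction back toward chance.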