How LLMs Work
LLMs are next-token predictors. Given a sequence of tokens (pieces of text), they output a probability distribution over the next token. Sample from it, append the token, and repeat: that autoregressive loop is how fluent text gets generated.
The magic is in scale: with enough parameters and training data, this simple objective produces models that appear to reason, code, translate, and converse. None of these abilities is trained for directly; they emerge from next-token prediction at scale.
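To make this concrete, here is a toy sketch of the autoregressive loop. A bigram lookup table stands in for the neural network (an obviously unrealistic assumption), but the predict-append-repeat structure is exactly what a real LLM does:

```python
import random

# Toy corpus and "tokenizer" (whitespace split) - illustrative assumptions only.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Toy "model": a bigram table mapping each token to its observed successors.
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, []).append(nxt)

def next_token(context):
    # A real LLM conditions on the whole context window;
    # this toy model only looks at the last token.
    return random.choice(bigrams.get(context[-1], corpus))

tokens = ["the"]
for _ in range(8):  # the autoregressive loop: predict, append, repeat
    tokens.append(next_token(tokens))
print(" ".join(tokens))
```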
Training Stages
1. Pre-training
Train on trillions of tokens from the web, books, and code. Objective: predict the next token (see the loss sketch after these stages). Costs $10M to $100M+ in compute.
2. Supervised Fine-tuning (SFT)
Fine-tune on high-quality (prompt, response) pairs written by humans. Teaches the model to follow instructions.
3. RLHF
Reinforcement Learning from Human Feedback. Humans rank responses → train a reward model on those rankings → use RL to align the LLM with human preferences.
4. Constitutional AI
Anthropic's approach: define principles (be helpful, harmless, honest) and train the model to critique and revise its own outputs.
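Stages 1 and 2 optimise the same thing, cross-entropy on the next token; only the data changes. A minimal sketch in PyTorch, where a toy embedding-to-logits model and random token IDs stand in for a real Transformer and real text:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 16, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for real text

# Stand-in "model": embedding -> logits. A real LLM is a deep Transformer,
# but the loss computation below is identical.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)

logits = model(tokens)  # (batch, seq_len, vocab_size)
# Shift by one: the prediction at position t is scored against token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()  # gradients for one optimiser step
print(loss.item())  # ~log(vocab_size) at initialisation
```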
Transformer Architecture
Every modern LLM is built on the Transformer architecture (Vaswani et al., 2017). Its key innovation: self-attention, which lets every token attend to every other token in the context.
Self-Attention
Each token produces Query, Key, and Value vectors. Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the query/key dimension. This is what captures long-range dependencies.
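The formula translates almost line-for-line into NumPy. Here the random Q, K, V matrices stand in for the projected token vectors a real model would compute:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq, seq) similarity matrix
    return softmax(scores) @ V                      # weighted sum of value vectors

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8): one output vector per token
```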
Multi-Head Attention
Run attention multiple times in parallel with different projections. Each "head" learns different relationships.
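In code, the head split is just a reshape. A NumPy sketch, with random matrices standing in for the learned projections W_q, W_k, W_v, W_o:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def split_heads(t):
    # (seq, d_model) -> (n_heads, seq, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # per-head attention scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                # softmax per head
heads = weights @ V                                      # (n_heads, seq, d_head)
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ Wo  # concat + project
print(out.shape)  # (5, 16)
```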
Feed-Forward Layer
Applied to each token independently after attention. Two linear layers with a non-linearity in between. Widely believed to store much of the model's factual knowledge.
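The layer itself is tiny; a sketch with random weights and ReLU (many modern models use GELU or SwiGLU instead):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64  # d_ff is typically ~4x d_model
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def ffn(x):
    # Position-wise: every token's vector goes through the same two-layer MLP.
    return np.maximum(x @ W1, 0) @ W2

x = rng.standard_normal((5, d_model))  # 5 tokens
print(ffn(x).shape)  # (5, 16)
```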
Positional Encoding
Adds position information since attention is order-invariant. Modern models use RoPE (Rotary Position Embedding).
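A minimal RoPE sketch in NumPy. This uses the half-split pairing convention found in some implementations; the exact way dimensions are paired varies between codebases:

```python
import numpy as np

def rope(x, base=10000):
    # Rotate pairs of dimensions of each token's vector by an angle that grows
    # with position, so attention scores depend on relative position.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # one rotation frequency per pair
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]             # half-split pairing convention
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.ones((4, 8))  # 4 tokens, embedding dim 8
print(rope(x)[0])    # position 0 is left unrotated
print(rope(x)[3])    # later positions are rotated progressively
```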
Layer Norm
Normalises activations before the attention and FFN sub-layers. Crucial for stable training; many recent models use the simpler RMSNorm variant.
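Layer norm is a few lines. In a real model gamma and beta are learned per-feature vectors; scalars here for brevity:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalise each token's activation vector to zero mean and unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).standard_normal((5, 16))
print(layer_norm(x).mean(-1).round(6))  # ~0 for every token
```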
Context Window
The maximum number of tokens the model can process at once. Ranges from 4K (small) to 1M+ (Gemini Ultra).
Top Models in 2026
| Model | Creator | Context | Open weights? | Best At |
|---|---|---|---|---|
| Claude 4 Opus | Anthropic | 200K | No | Reasoning, Safety, Long docs |
| GPT-4o | OpenAI | 128K | No | Multimodal, Coding, API ecosystem |
| Gemini 2.0 Ultra | Google | 1M+ | No | Context length, Search grounding |
| Llama 3.3 70B | Meta | 128K | ✅ Yes | Open-source, Fine-tuning, Local |
| DeepSeek R2 | DeepSeek | 64K | ✅ Yes | Reasoning, Math, Cost-efficiency |
| Mistral Large 2 | Mistral | 128K | Partial | Multilingual, On-premise, EU compliance |
Using LLMs in Your Projects
Key Techniques
Prompt Engineering
Craft inputs carefully. Use system prompts, few-shot examples, and chain-of-thought to get better outputs.
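For example, a few-shot, chain-of-thought prompt is just structured text. The classification task and examples below are made up for illustration:

```python
# Few-shot + chain-of-thought: demonstrate the pattern, then pose the real query.
system = "You are a sentiment classifier. Reason step by step, then answer with one word."

examples = '''Review: "The battery died in an hour."
Reasoning: complaint about battery life -> negative.
Answer: negative

Review: "Crisp screen, fast shipping, zero issues."
Reasoning: multiple positives, no complaints -> positive.
Answer: positive'''

query = 'Review: "Works fine, but the manual is useless."'
prompt = examples + "\n\n" + query  # sent as the user message, with `system` as the system prompt
print(prompt)
```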
RAG
Retrieval-Augmented Generation: retrieve relevant documents at query time and insert them into the prompt. Substantially reduces (but does not eliminate) hallucination on factual tasks.
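A minimal sketch of the retrieve-then-generate flow. Word-overlap scoring stands in for the embedding-based vector search a real system would use, and the documents are made up:

```python
docs = [
    "RoPE encodes positions by rotating query/key vector pairs.",
    "LoRA adds low-rank adapter matrices to frozen weights.",
    "RAG retrieves documents and inserts them into the prompt.",
]

def retrieve(query, docs, k=2):
    # Toy scorer: count shared lowercase words. Real systems rank by
    # embedding similarity over a vector index.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

query = "How does RAG reduce hallucination?"
context = "\n".join(retrieve(query, docs))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt is then sent to the LLM
```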
Fine-tuning
Adapt a base model to your domain with LoRA/QLoRA. Cheaper than full fine-tuning, often better than prompting alone.
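The core LoRA idea fits in a few lines of PyTorch. The rank and alpha values are typical defaults, not prescriptions:

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight and bias
            p.requires_grad_(False)
        # Low-rank update W + (alpha/r) * B @ A: only A and B are trained.
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 1024 adapter parameters, vs 4160 frozen in the base layer
```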
Function Calling
LLMs can call tools/APIs. Define a schema, the model decides when and how to call it. Foundation of AI agents.
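With the Anthropic SDK this looks like the sketch below. The get_weather tool is hypothetical; the tools parameter and tool_use response blocks follow the SDK's documented shape:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical tool: the JSON schema tells the model what it may call.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
)

# If the model chose to call the tool, the response contains a tool_use block;
# your code runs the tool and returns the result in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_weather {'city': 'Oslo'}
```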
Code Example: Calling the Claude API
```python
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

# Basic message
message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain transformers in 3 sentences."}
    ],
)
print(message.content[0].text)

# With system prompt and conversation history
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    system="You are an expert ML engineer. Be concise and use code examples.",
    messages=[
        {"role": "user", "content": "What is RAG?"},
        {"role": "assistant", "content": "RAG combines retrieval with generation..."},
        {"role": "user", "content": "Show me a Python implementation."}
    ],
)
print(response.content[0].text)
```
Limitations to Know
⚠️ Hallucination: LLMs can generate confident-sounding but incorrect statements. Always verify important claims; RAG and citations help but don't guarantee correctness.
⚠️ Knowledge cutoff: Base models only know what was in their training data. Use RAG or search-grounded models for current events.
⚠️ Context length: Even with 1M-token windows, models still degrade on very long contexts (the "lost in the middle" effect).
💡 Tip: For production systems, combine LLMs with structured databases, search, and verification steps rather than relying on the model alone.