How LLMs Work
LLMs are next-token predictors. Given a sequence of tokens (pieces of text), they output a probability distribution over the next token. Sample from it, append the token, and repeat: that autoregressive loop is how fluent text gets generated.
The magic is in scale: with enough parameters and training data, this simple objective produces models that appear to reason, code, translate, and converse. None of these abilities is trained for directly; they emerge from next-token prediction at scale.
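To make this concrete, here is a toy sketch of the autoregressive loop. A bigram lookup table stands in for the neural network (an obviously unrealistic assumption), but the predict-append-repeat structure is exactly what a real LLM does:

```python
import random

# Toy corpus and "tokenizer" (whitespace split) - illustrative assumptions only.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Toy "model": a bigram table mapping each token to its observed successors.
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, []).append(nxt)

def next_token(context):
    # A real LLM conditions on the whole context window;
    # this toy model only looks at the last token.
    return random.choice(bigrams.get(context[-1], corpus))

tokens = ["the"]
for _ in range(8):  # the autoregressive loop: predict, append, repeat
    tokens.append(next_token(tokens))
print(" ".join(tokens))
```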
Training Stages
1. Pre-training
Train on trillions of tokens from the web, books, and code. Objective: predict the next token (see the loss sketch after these stages). Costs $10M to $100M+ in compute.
2. Supervised Fine-tuning (SFT)
Fine-tune on high-quality (prompt, response) pairs written by humans. Teaches the model to follow instructions.
3. RLHF
Reinforcement Learning from Human Feedback. Humans rank responses → train a reward model on those rankings → use RL to align the LLM with human preferences.
4. Constitutional AI
Anthropic's approach: define principles (be helpful, harmless, honest) and train the model to critique and revise its own outputs.
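Stages 1 and 2 optimise the same thing, cross-entropy on the next token; only the data changes. A minimal sketch in PyTorch, where a toy embedding-to-logits model and random token IDs stand in for a real Transformer and real text:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 16, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for real text

# Stand-in "model": embedding -> logits. A real LLM is a deep Transformer,
# but the loss computation below is identical.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)

logits = model(tokens)  # (batch, seq_len, vocab_size)
# Shift by one: the prediction at position t is scored against token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()  # gradients for one optimiser step
print(loss.item())  # ~log(vocab_size) at initialisation
```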
Transformer Architecture
Every modern LLM is built on the Transformer architecture (Vaswani et al., 2017). Its key innovation: self-attention, which lets every token attend to every other token in the context.
Self-Attention
Each token produces Query, Key, and Value vectors. Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the query/key dimension. This is what captures long-range dependencies.
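The formula translates almost line-for-line into NumPy. Here the random Q, K, V matrices stand in for the projected token vectors a real model would compute:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq, seq) similarity matrix
    return softmax(scores) @ V                      # weighted sum of value vectors

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8): one output vector per token
```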
Multi-Head Attention
Run attention multiple times in parallel with different projections. Each "head" learns different relationships.
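In code, the head split is just a reshape. A NumPy sketch, with random matrices standing in for the learned projections W_q, W_k, W_v, W_o:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def split_heads(t):
    # (seq, d_model) -> (n_heads, seq, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # per-head attention scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                # softmax per head
heads = weights @ V                                      # (n_heads, seq, d_head)
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ Wo  # concat + project
print(out.shape)  # (5, 16)
```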
Feed-Forward Layer
Applied to each token independently after attention. Two linear layers with a non-linearity in between. Widely believed to store much of the model's factual knowledge.
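The layer itself is tiny; a sketch with random weights and ReLU (many modern models use GELU or SwiGLU instead):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64  # d_ff is typically ~4x d_model
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def ffn(x):
    # Position-wise: every token's vector goes through the same two-layer MLP.
    return np.maximum(x @ W1, 0) @ W2

x = rng.standard_normal((5, d_model))  # 5 tokens
print(ffn(x).shape)  # (5, 16)
```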
Positional Encoding
Adds position information since attention is order-invariant. Modern models use RoPE (Rotary Position Embedding).
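A minimal RoPE sketch in NumPy. This uses the half-split pairing convention found in some implementations; the exact way dimensions are paired varies between codebases:

```python
import numpy as np

def rope(x, base=10000):
    # Rotate pairs of dimensions of each token's vector by an angle that grows
    # with position, so attention scores depend on relative position.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # one rotation frequency per pair
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]             # half-split pairing convention
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.ones((4, 8))  # 4 tokens, embedding dim 8
print(rope(x)[0])    # position 0 is left unrotated
print(rope(x)[3])    # later positions are rotated progressively
```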
Layer Norm
Normalises activations before the attention and FFN sub-layers. Crucial for stable training; many recent models use the simpler RMSNorm variant.
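Layer norm is a few lines. In a real model gamma and beta are learned per-feature vectors; scalars here for brevity:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalise each token's activation vector to zero mean and unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).standard_normal((5, 16))
print(layer_norm(x).mean(-1).round(6))  # ~0 for every token
```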
Context Window
The maximum number of tokens the model can process at once. Ranges from 4K (small) to 1M+ (Gemini Ultra).
Top Models in 2026
| Model | Creator | Context | Open weights? | Best At |
|---|---|---|---|---|
| Claude 4 Opus | Anthropic | 200K | No | Reasoning, Safety, Long docs |
| GPT-4o | OpenAI | 128K | No | Multimodal, Coding, API ecosystem |
| Gemini 2.0 Ultra | Google | 1M+ | No | Context length, Search grounding |
| Llama 3.3 70B | Meta | 128K | ✅ Yes | Open-source, Fine-tuning, Local |
| DeepSeek R2 | DeepSeek | 64K | ✅ Yes | Reasoning, Math, Cost-efficiency |
| Mistral Large 2 | Mistral | 128K | Partial | Multilingual, On-premise, EU compliance |
Using LLMs in Your Projects
Key Techniques
Prompt Engineering
Craft inputs carefully. Use system prompts, few-shot examples, and chain-of-thought to get better outputs.
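For example, a few-shot, chain-of-thought prompt is just structured text. The classification task and examples below are made up for illustration:

```python
# Few-shot + chain-of-thought: demonstrate the pattern, then pose the real query.
system = "You are a sentiment classifier. Reason step by step, then answer with one word."

examples = '''Review: "The battery died in an hour."
Reasoning: complaint about battery life -> negative.
Answer: negative

Review: "Crisp screen, fast shipping, zero issues."
Reasoning: multiple positives, no complaints -> positive.
Answer: positive'''

query = 'Review: "Works fine, but the manual is useless."'
prompt = examples + "\n\n" + query  # sent as the user message, with `system` as the system prompt
print(prompt)
```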
RAG
Retrieval-Augmented Generation: retrieve relevant documents at query time and insert them into the prompt. Substantially reduces (but does not eliminate) hallucination on factual tasks.
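A minimal sketch of the retrieve-then-generate flow. Word-overlap scoring stands in for the embedding-based vector search a real system would use, and the documents are made up:

```python
docs = [
    "RoPE encodes positions by rotating query/key vector pairs.",
    "LoRA adds low-rank adapter matrices to frozen weights.",
    "RAG retrieves documents and inserts them into the prompt.",
]

def retrieve(query, docs, k=2):
    # Toy scorer: count shared lowercase words. Real systems rank by
    # embedding similarity over a vector index.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

query = "How does RAG reduce hallucination?"
context = "\n".join(retrieve(query, docs))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt is then sent to the LLM
```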
Fine-tuning
Adapt a base model to your domain with LoRA/QLoRA. Cheaper than full fine-tuning, often better than prompting alone.
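The core LoRA idea fits in a few lines of PyTorch. The rank and alpha values are typical defaults, not prescriptions:

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight and bias
            p.requires_grad_(False)
        # Low-rank update W + (alpha/r) * B @ A: only A and B are trained.
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 1024 adapter parameters, vs 4160 frozen in the base layer
```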
Function Calling
LLMs can call tools/APIs. Define a schema, the model decides when and how to call it. Foundation of AI agents.
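With the Anthropic SDK this looks like the sketch below. The get_weather tool is hypothetical; the tools parameter and tool_use response blocks follow the SDK's documented shape:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical tool: the JSON schema tells the model what it may call.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
)

# If the model chose to call the tool, the response contains a tool_use block;
# your code runs the tool and returns the result in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_weather {'city': 'Oslo'}
```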
Code Example: Calling the Claude API
```python
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

# Basic message
message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain transformers in 3 sentences."}
    ],
)
print(message.content[0].text)

# With system prompt and conversation history
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    system="You are an expert ML engineer. Be concise and use code examples.",
    messages=[
        {"role": "user", "content": "What is RAG?"},
        {"role": "assistant", "content": "RAG combines retrieval with generation..."},
        {"role": "user", "content": "Show me a Python implementation."}
    ],
)
print(response.content[0].text)
```
Limitations to Know
⚠️ Hallucination: LLMs can generate confident-sounding but incorrect statements. Always verify important claims; RAG and citations help but don't guarantee correctness.
⚠️ Knowledge cutoff: Base models only know what was in their training data. Use RAG or search-grounded models for current events.
⚠️ Context length: Even with 1M-token windows, models still degrade on very long contexts (the "lost in the middle" effect).
💡 Tip: For production systems, combine LLMs with structured databases, search, and verification steps rather than relying on the model alone.