Generative AI

Large Language Models

LLMs are transformer-based neural networks trained on trillions of tokens of text. They can understand, generate, translate, summarise, and reason about language, powering ChatGPT, Claude, Gemini, and the entire modern AI revolution.

- 2017: "Attention is All You Need" paper
- 1T+: parameters in the largest models
- $150B+: LLM market by 2030

How LLMs Work

LLMs are next-token predictors. Given a sequence of tokens (pieces of text), they predict the most likely next token. Do this autoregressively and you generate fluent text.

The magic is in scale: with enough parameters and training data, this simple objective produces models that appear to reason, code, translate, and converse; these abilities emerge from the next-token prediction task at scale.

"The cat sat on" → Tokenise → Transformer Layers → Softmax over vocab → "the" (P=0.73)
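The pipeline above can be sketched in code. This is a toy illustration of the autoregressive loop only: the "model" below is a hand-wired logit table with a made-up vocabulary, whereas a real LLM computes logits with stacked transformer layers.

```python
import math

# Toy autoregressive next-token prediction. The "model" is a
# hand-wired table of logits; a real LLM computes these with
# transformer layers over the whole context.
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def fake_logits(tokens):
    """Stand-in for a transformer: score each vocab entry given the context."""
    table = {
        "on":  [4.0, 0.5, 0.1, 0.1, 0.2, 0.1],  # after "on", favour "the"
        "the": [0.1, 2.0, 0.1, 0.1, 3.0, 0.1],  # after "the", favour "mat"
        "mat": [0.1, 0.1, 0.1, 0.1, 0.1, 5.0],  # after "mat", favour <eos>
    }
    return table.get(tokens[-1], [1.0] * len(VOCAB))

def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_tokens, max_new=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        probs = softmax(fake_logits(tokens))
        next_tok = VOCAB[probs.index(max(probs))]  # greedy decoding
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)
    return tokens

print(generate(["the", "cat", "sat", "on"]))
# greedy decoding appends "the", then "mat", then stops at <eos>
```

Real models sample from the softmax distribution (temperature, top-p) rather than always taking the argmax, which is why the same prompt can yield different completions.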

Training Stages

1. Pre-training

Train on hundreds of billions of tokens from the internet, books, and code. Objective: predict the next token. Costs $10M–$100M+ in compute.

2. Supervised Fine-tuning (SFT)

Fine-tune on high-quality (prompt, response) pairs written by humans. Teaches the model to follow instructions.

3. RLHF

Reinforcement Learning from Human Feedback. Humans rank responses → train a reward model → use RL to align the LLM with preferences.

4. Constitutional AI

Anthropic's approach: define principles (be helpful, harmless, honest) and train the model to critique and revise its own outputs.

Raw Text (internet) → Pre-training → Base Model → SFT + RLHF → Chat Model

Transformer Architecture

Every modern LLM is built on the Transformer architecture (Vaswani et al., 2017). Its key innovation: self-attention, which lets every token attend to every other token in the context.

Self-Attention

Each token produces Query, Key, Value vectors. Attention = softmax(QKᵀ/√d) × V. Captures long-range dependencies.

Multi-Head Attention

Run attention multiple times in parallel with different projections. Each "head" learns different relationships.

Feed-Forward Layer

Applied to each token independently after attention. Two linear layers with a non-linearity. Stores factual knowledge.

Positional Encoding

Adds position information since attention is order-invariant. Modern models use RoPE (Rotary Position Embedding).

Layer Norm

Normalises activations before attention and FFN layers. Crucial for stable training.

Context Window

The maximum number of tokens the model can process at once. Ranges from 4K (small) to 1M+ (Gemini Ultra).
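The attention formula above is compact enough to implement directly. Here is a minimal NumPy sketch of scaled dot-product attention; the random projection matrices stand in for learned weights and exist purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq, seq) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V, weights

# 4 tokens with d_model = 8; random projections stand in for learned W_q/W_k/W_v
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)             # (4, 8): one attention-mixed vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Multi-head attention runs this routine h times with different projections and concatenates the outputs; decoder-style LLMs additionally mask the scores so each token can only attend to earlier positions.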

Top Models in 2026

| Model | Creator | Context | Open? | Best At |
|---|---|---|---|---|
| Claude 4 Opus | Anthropic | 200K | No | Reasoning, Safety, Long docs |
| GPT-4o | OpenAI | 128K | No | Multimodal, Coding, API ecosystem |
| Gemini 2.0 Ultra | Google | 1M+ | No | Context length, Search grounding |
| Llama 3.3 70B | Meta | 128K | ✓ Yes | Open-source, Fine-tuning, Local |
| DeepSeek R2 | DeepSeek | 64K | ✓ Yes | Reasoning, Math, Cost-efficiency |
| Mistral Large 2 | Mistral | 128K | Partial | Multilingual, On-premise, EU compliance |

Using LLMs in Your Projects

Key Techniques

Prompt Engineering

Craft inputs carefully. Use system prompts, few-shot examples, and chain-of-thought to get better outputs.

RAG

Retrieval-Augmented Generation: give the LLM relevant documents at query time. Grounds answers in sources and sharply reduces hallucination on factual tasks.

Fine-tuning

Adapt a base model to your domain with LoRA/QLoRA. Cheaper than full fine-tuning, often better than prompting alone.

Function Calling

LLMs can call tools/APIs. Define a schema, the model decides when and how to call it. Foundation of AI agents.
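To make the retrieval step of RAG concrete, here is a toy sketch using bag-of-words cosine similarity; the document names and contents are invented for the example, and a production system would use an embedding model plus a vector database instead.

```python
import math
from collections import Counter

# Toy RAG retrieval: rank documents by bag-of-words cosine similarity,
# then build a grounded prompt. Real systems swap in embeddings and a
# vector store, but the pipeline shape is the same: retrieve, then prompt.
DOCS = {
    "rope.txt": "RoPE rotary position embedding encodes token positions",
    "rag.txt": "RAG retrieval augmented generation grounds answers in documents",
    "lora.txt": "LoRA low rank adapters make fine-tuning cheap",
}

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    if dot == 0:
        return 0.0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def retrieve(query, k=1):
    q = bow(query)
    ranked = sorted(DOCS, key=lambda name: cosine(q, bow(DOCS[name])), reverse=True)
    return ranked[:k]

query = "how does retrieval augmented generation work?"
context = "\n".join(DOCS[name] for name in retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The resulting prompt string is what you would pass as the user message in the API call shown below; instructing the model to answer "using only this context" is what ties its output to the retrieved sources.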

Code Example โ€” Call Claude API

Python
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

# Basic message
message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain transformers in 3 sentences."}
    ]
)
print(message.content[0].text)

# With system prompt and conversation history
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    system="You are an expert ML engineer. Be concise and use code examples.",
    messages=[
        {"role": "user",      "content": "What is RAG?"},
        {"role": "assistant", "content": "RAG combines retrieval with generation..."},
        {"role": "user",      "content": "Show me a Python implementation."}
    ]
)
print(response.content[0].text)

Limitations to Know

โš ๏ธ Hallucination: LLMs can generate confident-sounding incorrect facts. Always verify with RAG or citations.

โš ๏ธ Knowledge cutoff: Base models only know what was in their training data. Use RAG or search-grounded models for current events.

โš ๏ธ Context length: Even with 1M token windows, models still degrade on very long contexts ("lost in the middle" effect).

๐Ÿ’ก Tip: For production systems, combine LLMs with structured databases, search, and verification steps rather than relying on the model alone.