
Large Language Models

Predicting the next token — at civilisation scale


Overview

A Large Language Model (LLM) is a neural network, almost always a Transformer, trained on hundreds of billions of tokens of text to predict the next token in a sequence. This deceptively simple objective produces models capable of reasoning, coding, translation, summarisation, and conversation. Scale is the defining factor: more parameters, more data, and more compute reliably yield better performance and new emergent capabilities.
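The next-token objective can be illustrated with a toy example. The sketch below (the tiny corpus and all names are invented for illustration; a real LLM learns a neural predictor, not bigram counts) estimates next-token probabilities from counts and computes the cross-entropy loss for one prediction:

```python
import math

# Toy corpus; real pre-training corpora span hundreds of billions of tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each context token (a bigram model).
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Probability distribution over the next token, given the previous one."""
    following = counts[prev]
    total = sum(following.values())
    return {tok: c / total for tok, c in following.items()}

probs = next_token_probs("the")          # "cat" follows "the" 2x, "mat" 1x
loss = -math.log(probs["cat"])           # cross-entropy if the true next token is "cat"
```

Training minimises exactly this loss, averaged over every position in the corpus; everything else the model learns is a by-product of getting that prediction right.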

Key Concepts

  • Pre-training: the model predicts the next token across a massive web-scale corpus, learning grammar, facts, and reasoning patterns
  • Tokenisation: text is split into subword units (BPE, SentencePiece); GPT-4's tokeniser has a vocabulary of roughly 100 000 tokens
  • Autoregressive generation: the model generates output one token at a time, sampling from a probability distribution
  • Temperature and top-p sampling: control randomness; low temperature gives near-deterministic output, high temperature gives more varied, creative output
  • Instruction tuning (SFT): fine-tuning on curated prompt–response pairs teaches the model to follow instructions
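The sampling concepts in the list above can be sketched in a few lines. This is a minimal illustration of temperature scaling plus nucleus (top-p) filtering over a logit list, not any library's actual API; the function name and signature are assumptions made for the example:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index from raw logits with temperature and top-p filtering."""
    # Temperature rescales logits: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # Softmax to probabilities (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the smallest set of highest-probability
    # tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the kept tokens, renormalised implicitly by scaling the draw.
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With temperature near zero the function degenerates to argmax (greedy decoding); with top_p well below 1 the long tail of unlikely tokens is cut off before sampling.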

Key Facts

  • Scaling laws (Hoffmann et al., "Chinchilla", 2022) show optimal training requires ~20 tokens per parameter
  • GPT-4 is unofficially estimated to have ~1.8 trillion parameters in a mixture-of-experts architecture
  • Emergent abilities—capabilities that appear suddenly at scale—include multi-step arithmetic, chain-of-thought reasoning, and code generation
  • Llama 3 (Meta, 2024) showed that open-weight models can match closed proprietary models on many benchmarks
  • Energy cost: training GPT-3 consumed an estimated 1 287 MWh—roughly the annual electricity use of 120 US homes
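The Chinchilla rule of thumb above invites quick back-of-the-envelope arithmetic. A sketch, assuming the ~20 tokens-per-parameter ratio from the list (the helper name is invented for illustration):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget per the Chinchilla rule of thumb:
    ~20 training tokens for every model parameter."""
    return n_params * tokens_per_param

# A 70-billion-parameter model would want roughly 1.4 trillion training tokens.
budget = chinchilla_optimal_tokens(70e9)
```

The practical upshot is that, for a fixed compute budget, it is often better to train a smaller model on more data than a larger model on less.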