Welcome to Foundations. This page explains what Large Language Models are, how they learn, and why they sometimes get things wrong. Start here if you’re new to AI.

What Is a Large Language Model?

A Large Language Model (LLM) is an AI system trained on enormous volumes of text – books, websites, code, scientific papers, conversations – to learn the patterns, structure, and meaning of human language. The word “large” refers both to the volume of training data and to the number of internal parameters (adjustable numerical weights) the model uses to process information.

The defining capability of an LLM is predicting what comes next. Given any text input, the model calculates what word, phrase, or sentence would logically follow – not by looking things up in a database, but by pattern-matching against everything it absorbed during training.

Key insight: LLMs are not search engines. They don’t retrieve stored facts – they generate responses based on learned patterns. This is both their power and the root of their key limitation (hallucination).
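To make the generate-don't-retrieve idea concrete, here is a deliberately tiny sketch: a "model" that learns only which word tends to follow which in a toy corpus, then generates by picking the likeliest continuation. The corpus and function names are invented for illustration; a real LLM learns vastly richer statistics, but it likewise generates from patterns rather than looking answers up.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus standing in for "everything absorbed during training".
corpus = "the cat sat on the mat the cat ate the fish".split()

# "Training": count which word tends to follow which -- the crudest possible
# pattern statistic. No facts are stored, only statistics over observed text.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# "Generation": given a context word, emit the most likely continuation.
def predict_next(word):
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" -- it follows "the" most often in this corpus
```

Note that the model answers "cat" not because it knows anything about cats, but because that continuation was statistically dominant in what it saw.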

How LLMs Learn: The Training Pipeline

Step 1: Data Collection

The model is trained on a massive dataset of text scraped from the internet, books, academic papers, and code repositories. GPT-4-class models are trained on trillions of words – a scale that’s difficult to comprehend.
Step 2: Learning to Predict

The model is repeatedly shown chunks of text and asked to predict the next word (more precisely, the next token) at each position. Each incorrect prediction adjusts the model’s internal weights slightly. After billions of adjustments across trillions of examples, the model becomes remarkably good at predicting coherent language.
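A minimal sketch of "each incorrect prediction adjusts the weights slightly", shrunk to a single weight: the model scores how likely one pattern is, compares its guess to the data, and nudges the weight by the error. The data, labels, and learning rate here are all invented for illustration; a real model updates billions of weights by backpropagation.

```python
import math
import random

random.seed(0)
w = 0.0  # the model's single "internal weight"

# Hypothetical training signal: 1 means "token B really did follow token A"
# in the data. Here the pattern holds 80% of the time.
examples = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

for _ in range(2000):
    label = random.choice(examples)
    prob = 1 / (1 + math.exp(-w))  # model's current predicted probability
    w += 0.1 * (label - prob)      # nudge the weight to reduce the error

# After thousands of tiny nudges, the prediction settles near the
# pattern's actual frequency in the data (~0.8).
print(round(1 / (1 + math.exp(-w)), 2))
```

The same error-driven nudging, repeated across trillions of examples and billions of weights, is what "learning to predict" means in Step 2.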
Step 3: Alignment (RLHF)

Raw prediction ability alone produces a model that mirrors all patterns in training data – including harmful or misleading ones. To make the model helpful and safe, it goes through Reinforcement Learning from Human Feedback (RLHF). Humans rate responses and the model is nudged toward answers that are useful, accurate, and appropriate. This is what turns a raw language predictor into an assistant like Claude or ChatGPT.
Step 4: The Transformer Architecture

Almost all leading LLMs (GPT, Claude, Gemini, LLaMA) are built on the Transformer architecture, introduced by Google researchers in 2017. Its key innovation is the attention mechanism – a way for the model to weigh the relevance of every word against every other word. This is why LLMs can maintain coherence over long passages: each prediction is conditioned on the entire context, not just the few preceding words.

The Transformer Architecture

The Transformer is the engine underneath every modern LLM. Here’s a simplified view of how it processes your input:

```mermaid
graph TD
    A[Your Input Text] --> B[Tokenization]
    B --> C[Token Embeddings]
    C --> D[Self-Attention Layers]
    D --> E[Feed-Forward Networks]
    E --> F[Output Probabilities]
    F --> G[Generated Token]
    G -->|"Feeds back as input"| B
    style A fill:#e8f4fd,stroke:#2196F3
    style D fill:#fff3e0,stroke:#FF9800
    style G fill:#e8f5e9,stroke:#4CAF50
```
The self-attention mechanism allows each word to “look at” every other word in the input when computing its representation. This lets the model understand that in “The cat sat on the mat because it was tired,” the word “it” refers to “cat” – not “mat.” This contextual understanding is what makes LLMs so powerful.
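The weigh-every-word-against-every-other idea can be sketched directly. Below is single-head scaled dot-product attention in plain Python, with hand-picked toy vectors rather than learned weights; real models use learned projection matrices, many heads, and much larger dimensions.

```python
import math

def softmax(xs):
    # Convert raw scores into a probability distribution.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key...
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        # ...then blend all value vectors by those attention weights.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three "tokens" as hand-picked 2-dimensional vectors
# (queries = keys = values for brevity).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(x, x, x))
```

Each output row is a context-aware mix of all three inputs: this is the mechanism that lets "it" pull information from "cat" several words away.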

Parameters are the numerical weights the model learns during training. Think of them as the knobs on a massive mixing board:

| Model | Parameters | Scale |
| --- | --- | --- |
| GPT-2 (2019) | 1.5 billion | Small by today’s standards |
| LLaMA 3 | 8B – 70B | Mid-range, very capable |
| GPT-4 | ~1.8 trillion (estimated) | Frontier scale |
| Claude Opus 4 | Undisclosed | Frontier scale |

More parameters generally mean more capacity to learn patterns, but data quality and training technique matter more than raw size.
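A rough back-of-envelope helper shows where these counts come from, assuming standard GPT-style layer shapes (the formula ignores biases, LayerNorms, and position embeddings; the shapes plugged in below are GPT-2-small-like, not any specific model's exact configuration):

```python
def transformer_params(vocab, d_model, n_layers, d_ff=None):
    """Approximate parameter count for a GPT-style transformer."""
    d_ff = d_ff or 4 * d_model   # conventional feed-forward width
    embed = vocab * d_model      # token embedding matrix
    attn = 4 * d_model * d_model # Q, K, V, and output projections
    ff = 2 * d_model * d_ff      # two feed-forward matrices
    return embed + n_layers * (attn + ff)

# GPT-2-small-like shapes: ~123.5M, close to the reported 124M
# once position embeddings and LayerNorms are added back.
print(transformer_params(vocab=50257, d_model=768, n_layers=12))
```

Scaling the same formula's width and depth by a few multiples is essentially how models grow from millions to billions of parameters.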

Inference is what happens when you actually use the model – sending it a prompt and getting a response. During inference, the model:

  1. Tokenizes your input
  2. Passes it through all transformer layers
  3. Calculates probability distributions for the next token
  4. Samples a token from that distribution
  5. Repeats until the response is complete

Each generated token requires a full forward pass through the entire model, and tokens are produced one at a time; input tokens, by contrast, are processed together in a single parallel pass. This is why output tokens cost more than input tokens.
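The five inference steps above can be sketched as a loop. Here `toy_model` is a hypothetical stand-in for the full transformer forward pass, returning hard-coded probability distributions over a four-token vocabulary, and the "tokenizer" is just `split()`.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_model(tokens):
    # Stand-in for steps 2-3: a real model computes this distribution with
    # billions of weights; here it is hard-coded so the loop shape is visible.
    table = {"the": [0.05, 0.80, 0.10, 0.05],
             "cat": [0.05, 0.05, 0.80, 0.10],
             "sat": [0.10, 0.05, 0.05, 0.80]}
    return table[tokens[-1]]

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = prompt.split()                          # 1. tokenize (crudely)
    for _ in range(max_tokens):
        probs = toy_model(tokens)                    # 2-3. forward pass -> distribution
        nxt = rng.choices(VOCAB, weights=probs)[0]   # 4. sample a token
        if nxt == "<eos>":                           # 5. repeat until done
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the"))
```

Note how each iteration re-runs the model on the whole growing sequence, which is exactly the sequential cost the paragraph above describes.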


Hallucination: The Core Limitation

Hallucination is when an LLM generates confident-sounding statements that are entirely false. This is a structural feature of how LLMs work, not a bug that can be fully fixed.

Because LLMs generate text based on patterns rather than retrieving verified facts, they can produce plausible-sounding but incorrect information. This occurs more often with:

  • Obscure topics – less training data means weaker patterns
  • Recent events – anything after the training cutoff date
  • Numerical/factual tasks – LLMs are pattern matchers, not calculators
  • Requests for citations – models often fabricate realistic-looking but nonexistent references

Why Hallucination Cannot Be Fully Eliminated

Hallucination is inherent to the architecture. LLMs don’t have a “truth database” they check against – they generate the most statistically likely continuation of your prompt. When the model lacks strong patterns for a topic, it fills the gap with plausible-sounding content. This is the same mechanism that makes LLMs creative and flexible – it’s a double-edged sword.

The most effective mitigation is RAG (Retrieval-Augmented Generation) – giving the model access to verified source documents before generating a response. This is covered in the Memory & RAG page.

Other Key Limitations

Training cutoff. LLMs have a knowledge cutoff date. They have no awareness of events after that date unless given tools that access current information (like web search).

No persistent memory by default. Each conversation starts fresh. The model doesn’t remember previous sessions unless external memory systems are implemented.

Not deterministic. The same prompt given twice will often produce different outputs. LLMs operate probabilistically – each token is sampled from a probability distribution.
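A sketch of that sampling step, with invented scores (logits) for three candidate tokens: at normal temperature the draws vary from run to run, while a temperature near zero collapses to the single most likely token.

```python
import math
import random

def sample(logits, temperature, rng):
    # Temperature rescales the scores before softmax: high values flatten
    # the distribution, low values sharpen it toward the top score.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.5, 0.5]  # illustrative model scores for three candidates
rng = random.Random()

draws = [sample(logits, temperature=1.0, rng=rng) for _ in range(20)]
print(draws)  # typically a mix of indices: same scores, different picks

greedy = [sample(logits, temperature=0.01, rng=rng) for _ in range(20)]
print(greedy)  # all 0s: near-zero temperature approaches deterministic argmax
```

This is why "the same prompt given twice" diverges: the variation lives in step 4 of inference, not in the model's weights.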

Context window is finite. Even a 1-million-token context window has limits. Very long inputs can cause the model to lose coherence or “forget” earlier parts.


Key Concepts Summary


What’s Next?