Tokens are the unit of everything in LLMs – cost, limits, and performance all trace back to token count. This page explains how tokenization works, what context windows are, and how to estimate and optimize costs.

What Is a Token?

When you send text to an LLM, it doesn’t process your words the way you read them. Instead, it breaks your text into tokens – small chunks that can be whole words, parts of words, punctuation marks, or spaces. This process is called tokenization.

| Text | Tokens | Count |
|------|--------|-------|
| Hello world! | Hello / world / ! | 3 |
| unbelievable | un / believ / able | 3 |
| hamburger | ham / burger | 2 |
| Apple | Apple | 1 |
| xqzptfl | x / q / z / p / t / f / l | 7 |

Common words use fewer tokens because the model learned them as unified patterns. Rare words, technical jargon, and non-English text get broken into smaller fragments, costing more tokens per word.

The algorithm behind tokenization is called Byte Pair Encoding (BPE). It works by:

  1. Starting with individual characters as the base vocabulary
  2. Identifying the most frequently occurring character pairs in training data
  3. Merging them into single tokens
  4. Repeating until a target vocabulary size is reached

The result: common English words like “the” or “and” become single tokens, while unusual strings get split into many small pieces. You don’t need to understand BPE deeply – what matters is the practical output.
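The merge loop described above can be sketched in a few lines of Python. This is a toy illustration of the BPE training idea, not any production tokenizer; the function name and corpus format are made up for the example:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a toy word list (illustrative sketch)."""
    # Step 1: start from individual characters as the base vocabulary.
    vocab = Counter()
    for word in words:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single token everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    # Step 4 (repeat until target vocabulary size) is the num_merges loop above.
    return merges, vocab
```

On a corpus where "low" appears often, the first merges produce `lo` and then `low` as single tokens, which is exactly how frequent words end up as one token while rare strings stay fragmented.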

Tokenization is optimized for English. Other languages often require more tokens per word:

| Language | “Hello, how are you?” | Token Count |
|----------|----------------------|-------------|
| English | Hello, how are you? | ~6 tokens |
| French | Bonjour, comment allez-vous? | ~9 tokens |
| Japanese | こんにちは、お元気ですか? | ~15 tokens |
| Arabic | مرحبا، كيف حالك؟ | ~18 tokens |

This means non-English workflows cost more per word. Factor this into cost estimates for multilingual applications.


Token-to-Text Conversion Guide

These are approximations useful for quick mental math:

| Unit of text | Approximate tokens |
|--------------|--------------------|
| 1 word | ~1.3 tokens |
| 1 sentence | ~15-20 tokens |
| 1 paragraph | ~100 tokens |
| 1 page (~250 words) | ~330 tokens |
| 10-minute audio transcript | ~4,500-5,000 tokens |
| 1 full novel (~90,000 words) | ~120,000 tokens |

Real-world illustration: A 10-minute YouTube video transcript processes as roughly 4,500-5,000 tokens. A model with a 200,000-token context window could hold approximately 40 such videos worth of content in a single session.
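For quick estimates in code, the common rule of thumb is ~4 characters or ~1.3 tokens per English word. The sketch below combines both heuristics; the function name is made up for illustration, and real counts always come from the model's own tokenizer:

```python
def estimate_tokens(text):
    """Rough token estimate for English text (heuristic only).

    Real counts depend on the model's tokenizer; this is for
    quick mental-math-style budgeting, nothing more.
    """
    by_chars = len(text) / 4            # ~4 characters per token
    by_words = len(text.split()) * 1.33  # ~1.3 tokens per word
    # Average the two heuristics for a slightly more stable guess.
    return round((by_chars + by_words) / 2)
```

For non-English text, per the table above, a real tokenizer can return two to three times this estimate.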

Context Windows: The AI’s Working Memory

The context window is the maximum number of tokens an LLM can consider at one time – the model’s working memory.

graph TD
    subgraph CW["Context Window (e.g., 200K tokens)"]
        A["System Prompt / Instructions"]
        B["Conversation History"]
        C["Your Current Message (Input)"]
        D["AI's Response (Output)"]
    end
    A --> B --> C --> D
    style CW fill:#f5f5f5,stroke:#333
    style A fill:#e3f2fd,stroke:#1976D2
    style B fill:#fff8e1,stroke:#F9A825
    style C fill:#e8f5e9,stroke:#388E3C
    style D fill:#fce4ec,stroke:#C62828

All four components compete for the same limited space. In a long conversation, history alone can consume tens of thousands of tokens before you’ve typed a single new word.
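Because all four components share one budget, long-running applications typically trim history to make room. A minimal sketch of that bookkeeping, assuming you already have a token count for each component (the function name and turn format are invented for the example):

```python
def fit_history(window, system_tokens, message_tokens, max_output, history):
    """Drop the oldest history turns until everything fits in the window.

    `history` is a list of per-turn token counts, oldest first.
    Reserves `max_output` tokens so the response itself has room.
    """
    budget = window - system_tokens - message_tokens - max_output
    kept = list(history)
    while kept and sum(kept) > budget:
        kept.pop(0)  # evict the oldest turn first
    return kept
```

Real chat frameworks use more sophisticated strategies (summarizing evicted turns, keeping the system prompt pinned), but the core constraint is the same subtraction.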

Context Window Sizes (2025-2026)

| Model | Provider | Context Window | Best For |
|-------|----------|----------------|----------|
| Gemini 2.5 Flash | Google | 1,000,000 tokens | High-volume, long-document tasks |
| GPT-5 | OpenAI | 400,000 tokens | Complex reasoning, writing |
| Claude Opus 4 | Anthropic | 200,000 tokens | Nuanced reasoning, safety-critical |
| GPT-4o | OpenAI | 128,000 tokens | General purpose, coding |
| LLaMA 4 | Meta | 128,000+ tokens | Private deployments |

What Happens When the Window Fills Up

Unlike human long-term memory, the context window doesn’t accumulate across sessions – it resets. Each new conversation starts fresh.

Within a single session, as the context fills, older information gets pushed out of the model’s active consideration. This is when you notice the AI “forgetting” earlier parts of the conversation – it’s the physics of a finite working memory.

Research shows models give less attention to information in the middle of a very long context. The beginning and end are remembered more reliably – a phenomenon called the “lost in the middle” effect.


How Pricing Works

LLM pricing is structured as cost per million tokens, with separate rates for input and output.

The key rule: Output tokens cost roughly 4-8x more than input tokens. Why? Input tokens can be processed together in a single parallel pass, while each output token must be generated sequentially with a full forward pass through the model.

Pricing Comparison (2025 Rates)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Ratio |
|-------|----------------------|------------------------|-------|
| Gemini 2.5 Flash | $0.15 | $0.60 | 4x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
| Claude Haiku 3.5 | $0.80 | $4.00 | 5x |
| Claude Sonnet 4 | $3.00 | $15.00 | 5x |
| GPT-5 | $1.25 | $10.00 | 8x |
| Claude Opus 4 | $15.00 | $75.00 | 5x |

Example workflow: 1,000 requests/day, 500 input + 200 output tokens each, using Gemini 2.5 Flash:

  • Input: 1,000 x 500 = 500,000 tokens/day = $0.075/day
  • Output: 1,000 x 200 = 200,000 tokens/day = $0.12/day
  • Total: ~$0.20/day = ~$6/month

Same workflow using Claude Opus 4:

  • Input: 500,000 tokens/day = $7.50/day
  • Output: 200,000 tokens/day = $15.00/day
  • Total: ~$22.50/day = ~$675/month

That’s 112x more expensive for the same workload. Model selection is a cost-control decision.
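The arithmetic behind both estimates is a one-line formula, sketched here with the numbers from the text (function name invented for the example; prices are per million tokens, as in the table above):

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_price, output_price, days=30):
    """Estimated monthly cost in dollars; prices are per 1M tokens."""
    daily_input = requests_per_day * input_tokens / 1e6 * input_price
    daily_output = requests_per_day * output_tokens / 1e6 * output_price
    return (daily_input + daily_output) * days

# The two workflows from the text:
flash = monthly_cost(1000, 500, 200, 0.15, 0.60)    # ~$5.85/month
opus = monthly_cost(1000, 500, 200, 15.00, 75.00)   # $675/month
```

Running the same formula over a shortlist of models before committing to one is a cheap way to catch a 100x cost difference early.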


Cost Optimization Strategies

1. Be Concise in Prompts

Every unnecessary word is a paid token. Strip filler phrases. Replace “Could you please kindly help me to…” with direct instructions.

2. Limit Output Length

If you only need a short answer, instruct the model explicitly: “Respond in 2-3 sentences.” This is the single most effective cost reduction strategy because output tokens cost roughly 4-8x more than input tokens.

3. Choose the Right Model

Don’t use Claude Opus 4 ($75/M output) for tasks that Gemini 2.5 Flash ($0.60/M output) handles equally well. Prototype on a premium model, then migrate proven workflows to cheaper models.

4. Use Modular Context

Instead of injecting 10,000 tokens of context into every request, structure your docs so only what’s needed gets loaded. RAG systems do this automatically.

5. Clear History Between Tasks

Old context you no longer need is still paid for with every subsequent message. Start fresh conversations when switching topics.

6. Monitor and Benchmark

Track token usage per workflow. Identify which automations are expensive and optimize them first. Small per-request savings compound dramatically at scale.
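Per-workflow tracking doesn't need heavy tooling to start. A minimal sketch of the idea, assuming you can read input/output token counts from your provider's API responses (the class and method names here are invented for the example):

```python
from collections import defaultdict

class TokenTracker:
    """Accumulate per-workflow token usage so the most expensive
    automations can be identified and optimized first."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, workflow, input_tokens, output_tokens):
        # Call this once per API response, with the counts it reports.
        self.usage[workflow]["input"] += input_tokens
        self.usage[workflow]["output"] += output_tokens

    def cost(self, workflow, input_price, output_price):
        """Dollar cost so far; prices are per 1M tokens."""
        u = self.usage[workflow]
        return (u["input"] * input_price + u["output"] * output_price) / 1e6

    def ranked(self, input_price, output_price):
        """Workflows sorted by cost, most expensive first."""
        return sorted(self.usage,
                      key=lambda w: self.cost(w, input_price, output_price),
                      reverse=True)
```

In production you'd likely persist this to a database and tag usage by model as well as workflow, but even an in-memory ranking like this shows where optimization effort pays off first.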

Quick Reference