Tokens are the unit of everything in LLMs – cost, limits, and performance all trace back to token count. This page explains how tokenization works, what context windows are, and how to estimate and optimize costs.

What Is a Token?

When you send text to an LLM, it doesn’t process your words the way you read them. Instead, it breaks your text into tokens – small chunks that can be whole words, parts of words, punctuation marks, or spaces. This process is called tokenization.

| Text | Tokens | Count |
|------|--------|-------|
| Hello world! | Hello / world / ! | 3 |
| unbelievable | un / believ / able | 3 |
| hamburger | ham / burger | 2 |
| Apple | Apple | 1 |
| xqzptfl | x / q / z / p / t / f / l | 7 |

Common words use fewer tokens because the model learned them as unified patterns. Rare words, technical jargon, and non-English text get broken into smaller fragments, costing more tokens per word.

The algorithm behind tokenization is called Byte Pair Encoding (BPE). It works by:

  1. Starting with individual characters as the base vocabulary
  2. Identifying the most frequently occurring character pairs in training data
  3. Merging them into single tokens
  4. Repeating until a target vocabulary size is reached

The result: common English words like “the” or “and” become single tokens, while unusual strings get split into many small pieces. You don’t need to understand BPE deeply – what matters is the practical output.
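The merge loop described above can be sketched in a few lines of Python. This is a toy illustration of the BPE training idea, not any production tokenizer; the function name and corpus format are made up for the example:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a toy word list (illustrative sketch)."""
    # Step 1: start from individual characters as the base vocabulary.
    vocab = Counter()
    for word in words:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single token everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    # Step 4 (repeat until target vocabulary size) is the num_merges loop above.
    return merges, vocab
```

On a corpus where "low" appears often, the first merges produce `lo` and then `low` as single tokens, which is exactly how frequent words end up as one token while rare strings stay fragmented.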

Tokenization is optimized for English. Other languages often require more tokens per word:

| Language | “Hello, how are you?” | Token Count |
|----------|----------------------|-------------|
| English | Hello, how are you? | ~6 tokens |
| French | Bonjour, comment allez-vous? | ~9 tokens |
| Japanese | こんにちは、お元気ですか? | ~15 tokens |
| Arabic | مرحبا، كيف حالك؟ | ~18 tokens |

This means non-English workflows cost more per word. Factor this into cost estimates for multilingual applications.


Token-to-Text Conversion Guide

These are approximations useful for quick mental math:

| Unit of text | Approximate tokens |
|--------------|--------------------|
| 1 word | ~1.3 tokens |
| 1 sentence | ~15-20 tokens |
| 1 paragraph | ~100 tokens |
| 1 page (~250 words) | ~330 tokens |
| 10-minute audio transcript | ~4,500-5,000 tokens |
| 1 full novel (~90,000 words) | ~120,000 tokens |

Real-world illustration: A 10-minute YouTube video transcript processes as roughly 4,500-5,000 tokens. A model with a 200,000-token context window could hold approximately 40 such videos worth of content in a single session.
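For quick estimates in code, the common rule of thumb is ~4 characters or ~1.3 tokens per English word. The sketch below combines both heuristics; the function name is made up for illustration, and real counts always come from the model's own tokenizer:

```python
def estimate_tokens(text):
    """Rough token estimate for English text (heuristic only).

    Real counts depend on the model's tokenizer; this is for
    quick mental-math-style budgeting, nothing more.
    """
    by_chars = len(text) / 4            # ~4 characters per token
    by_words = len(text.split()) * 1.33  # ~1.3 tokens per word
    # Average the two heuristics for a slightly more stable guess.
    return round((by_chars + by_words) / 2)
```

For non-English text, per the table above, a real tokenizer can return two to three times this estimate.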

Context Windows: The AI’s Working Memory

The context window is the maximum number of tokens an LLM can consider at one time – the model’s working memory.

graph TD
    subgraph CW["Context Window (e.g., 200K tokens)"]
        A["System Prompt / Instructions"]
        B["Conversation History"]
        C["Your Current Message (Input)"]
        D["AI's Response (Output)"]
    end
    A --> B --> C --> D
    style CW fill:#f5f5f5,stroke:#333
    style A fill:#e3f2fd,stroke:#1976D2
    style B fill:#fff8e1,stroke:#F9A825
    style C fill:#e8f5e9,stroke:#388E3C
    style D fill:#fce4ec,stroke:#C62828

All four components compete for the same limited space. In a long conversation, history alone can consume tens of thousands of tokens before you’ve typed a single new word.
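Because all four components share one budget, long-running applications typically trim history to make room. A minimal sketch of that bookkeeping, assuming you already have a token count for each component (the function name and turn format are invented for the example):

```python
def fit_history(window, system_tokens, message_tokens, max_output, history):
    """Drop the oldest history turns until everything fits in the window.

    `history` is a list of per-turn token counts, oldest first.
    Reserves `max_output` tokens so the response itself has room.
    """
    budget = window - system_tokens - message_tokens - max_output
    kept = list(history)
    while kept and sum(kept) > budget:
        kept.pop(0)  # evict the oldest turn first
    return kept
```

Real chat frameworks use more sophisticated strategies (summarizing evicted turns, keeping the system prompt pinned), but the core constraint is the same subtraction.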

Context Window Sizes (2025-2026)

| Model | Provider | Context Window | Best For |
|-------|----------|----------------|----------|
| Gemini 2.5 Flash | Google | 1,000,000 tokens | High-volume, long-document tasks |
| GPT-5 | OpenAI | 400,000 tokens | Complex reasoning, writing |
| Claude Opus 4 | Anthropic | 200,000 tokens | Nuanced reasoning, safety-critical |
| GPT-4o | OpenAI | 128,000 tokens | General purpose, coding |
| LLaMA 4 | Meta | 128,000+ tokens | Private deployments |

What Happens When the Window Fills Up

Unlike human long-term memory, the context window doesn’t accumulate across sessions – it resets. Each new conversation starts fresh.

Within a single session, as the context fills, older information gets pushed out of the model’s active consideration. This is when you notice the AI “forgetting” earlier parts of the conversation – it’s the physics of a finite working memory.

Research shows models give less attention to information in the middle of a very long context. The beginning and end are remembered more reliably – a phenomenon called the “lost in the middle” effect.


How Pricing Works

LLM pricing is structured as cost per million tokens, with separate rates for input and output.

The key rule: Output tokens cost roughly 4-8x more than input tokens. Why? Input tokens can be processed together in a single parallel pass, while each output token must be generated sequentially with a full forward pass through the model.

Pricing Comparison (2025 Rates)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Ratio |
|-------|----------------------|------------------------|-------|
| Gemini 2.5 Flash | $0.15 | $0.60 | 4x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
| Claude Haiku 3.5 | $0.80 | $4.00 | 5x |
| Claude Sonnet 4 | $3.00 | $15.00 | 5x |
| GPT-5 | $1.25 | $10.00 | 8x |
| Claude Opus 4 | $15.00 | $75.00 | 5x |

Example workflow: 1,000 requests/day, 500 input + 200 output tokens each, using Gemini 2.5 Flash:

  • Input: 1,000 x 500 = 500,000 tokens/day = $0.075/day
  • Output: 1,000 x 200 = 200,000 tokens/day = $0.12/day
  • Total: ~$0.20/day = ~$6/month

Same workflow using Claude Opus 4:

  • Input: 500,000 tokens/day = $7.50/day
  • Output: 200,000 tokens/day = $15.00/day
  • Total: ~$22.50/day = ~$675/month

That’s 112x more expensive for the same workload. Model selection is a cost-control decision.
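The arithmetic behind both estimates is a one-line formula, sketched here with the numbers from the text (function name invented for the example; prices are per million tokens, as in the table above):

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_price, output_price, days=30):
    """Estimated monthly cost in dollars; prices are per 1M tokens."""
    daily_input = requests_per_day * input_tokens / 1e6 * input_price
    daily_output = requests_per_day * output_tokens / 1e6 * output_price
    return (daily_input + daily_output) * days

# The two workflows from the text:
flash = monthly_cost(1000, 500, 200, 0.15, 0.60)    # ~$5.85/month
opus = monthly_cost(1000, 500, 200, 15.00, 75.00)   # $675/month
```

Running the same formula over a shortlist of models before committing to one is a cheap way to catch a 100x cost difference early.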


Cost Optimization Strategies

1. Be Concise in Prompts

Every unnecessary word is a paid token. Strip filler phrases. Replace “Could you please kindly help me to…” with direct instructions.

2. Limit Output Length

If you only need a short answer, instruct the model explicitly: “Respond in 2-3 sentences.” This is the single most effective cost reduction strategy because output tokens cost roughly 4-8x more than input tokens.

3. Choose the Right Model

Don’t use Claude Opus 4 ($75/M output) for tasks that Gemini 2.5 Flash ($0.60/M output) handles equally well. Prototype on a premium model, then migrate proven workflows to cheaper models.

4. Use Modular Context

Instead of injecting 10,000 tokens of context into every request, structure your docs so only what’s needed gets loaded. RAG systems do this automatically.

5. Clear History Between Tasks

Old context you no longer need is still paid for with every subsequent message. Start fresh conversations when switching topics.

6. Monitor and Benchmark

Track token usage per workflow. Identify which automations are expensive and optimize them first. Small per-request savings compound dramatically at scale.
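Per-workflow tracking doesn't need heavy tooling to start. A minimal sketch of the idea, assuming you can read input/output token counts from your provider's API responses (the class and method names here are invented for the example):

```python
from collections import defaultdict

class TokenTracker:
    """Accumulate per-workflow token usage so the most expensive
    automations can be identified and optimized first."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, workflow, input_tokens, output_tokens):
        # Call this once per API response, with the counts it reports.
        self.usage[workflow]["input"] += input_tokens
        self.usage[workflow]["output"] += output_tokens

    def cost(self, workflow, input_price, output_price):
        """Dollar cost so far; prices are per 1M tokens."""
        u = self.usage[workflow]
        return (u["input"] * input_price + u["output"] * output_price) / 1e6

    def ranked(self, input_price, output_price):
        """Workflows sorted by cost, most expensive first."""
        return sorted(self.usage,
                      key=lambda w: self.cost(w, input_price, output_price),
                      reverse=True)
```

In production you'd likely persist this to a database and tag usage by model as well as workflow, but even an in-memory ranking like this shows where optimization effort pays off first.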

Quick Reference