What Is a Token?
When you send text to an LLM, it doesn’t process your words the way you read them. Instead, it breaks your text into tokens – small chunks that can be whole words, parts of words, punctuation marks, or spaces. This process is called tokenization.
| Text | Tokens | Count |
|---|---|---|
| Hello world! | Hello / world / ! | 3 |
| unbelievable | un / believ / able | 3 |
| hamburger | ham / burger | 2 |
| Apple | Apple | 1 |
| xqzptfl | x / q / z / p / t / f / l | 7 |
Common words use fewer tokens because the model learned them as unified patterns. Rare words, technical jargon, and non-English text get broken into smaller fragments, costing more tokens per word.
The algorithm behind tokenization is called Byte Pair Encoding (BPE). It works by:
- Starting with individual characters as the base vocabulary
- Identifying the most frequently occurring character pairs in training data
- Merging them into single tokens
- Repeating until a target vocabulary size is reached
The result: common English words like “the” or “and” become single tokens, while unusual strings get split into many small pieces. You don’t need to understand BPE deeply – what matters is the practical output.
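The merge loop described above can be sketched in a few lines. This is a toy trainer on a made-up four-word corpus, not a production tokenizer; real BPE implementations operate on bytes and build vocabularies of tens of thousands of tokens:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word as a sequence of individual characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        merged = best[0] + best[1]
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# "low" appears often, so its character pairs get merged first.
merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=2)
print(merges)
```

Run on this corpus, the first merges combine the characters of the frequent substring "low", which is exactly how common words end up as single tokens.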
Most tokenizers are trained primarily on English text, so tokenization is effectively optimized for English. Other languages often require more tokens per word:
| Language | “Hello, how are you?” | Token Count |
|---|---|---|
| English | Hello, how are you? | ~6 tokens |
| French | Bonjour, comment allez-vous? | ~9 tokens |
| Japanese | こんにちは、お元気ですか? | ~15 tokens |
| Arabic | مرحبا، كيف حالك؟ | ~18 tokens |
This means non-English workflows cost more per word. Factor this into cost estimates for multilingual applications.
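As a rough sketch, the approximate counts from the table translate directly into cost multipliers for an otherwise identical workload:

```python
# Approximate token counts for the same greeting (values from the table above).
tokens = {"English": 6, "French": 9, "Japanese": 15, "Arabic": 18}

# Cost multiplier relative to English for an otherwise identical workload.
baseline = tokens["English"]
for lang, count in tokens.items():
    print(f"{lang}: {count / baseline:.1f}x the token cost of English")
```

By this estimate, the Arabic version of the same message costs roughly 3x as much as the English one.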
Token-to-Text Conversion Guide
These are approximations useful for quick mental math:
| Unit of text | Approximate tokens |
|---|---|
| 1 word | ~1.33 tokens |
| 1 sentence | ~15-20 tokens |
| 1 paragraph | ~100 tokens |
| 1 page (~250 words) | ~330 tokens |
| 10-minute audio transcript (~1,500 words) | ~2,000 tokens |
| 1 full novel (~90,000 words) | ~120,000 tokens |
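A minimal estimator based on the table's rule of thumb (one token ≈ 0.75 words). Useful for back-of-envelope budgeting; real counts vary by tokenizer and content:

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate using the ~0.75 words-per-token rule of thumb."""
    return round(word_count / 0.75)

print(estimate_tokens(250))     # one page
print(estimate_tokens(90_000))  # a full novel
```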
Context Windows: The AI’s Working Memory
The context window is the maximum number of tokens an LLM can consider at one time – the model’s working memory.
```mermaid
graph TD
    subgraph CW["Context Window (e.g., 200K tokens)"]
        A["System Prompt / Instructions"]
        B["Conversation History"]
        C["Your Current Message (Input)"]
        D["AI's Response (Output)"]
    end
    A --> B --> C --> D
    style CW fill:#f5f5f5,stroke:#333
    style A fill:#e3f2fd,stroke:#1976D2
    style B fill:#fff8e1,stroke:#F9A825
    style C fill:#e8f5e9,stroke:#388E3C
    style D fill:#fce4ec,stroke:#C62828
```
All four components compete for the same limited space. In a long conversation, history alone can consume tens of thousands of tokens before you’ve typed a single new word.
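A simple way to reason about this competition is as a token budget: subtract the fixed overheads and see what is left for new input. The numbers below are illustrative, not tied to any particular model:

```python
def remaining_budget(context_window, system_tokens, history_tokens, reserved_output):
    """Tokens left for the new user message after fixed overheads are subtracted."""
    return context_window - system_tokens - history_tokens - reserved_output

# e.g. a 200K window with a 2K system prompt, 60K of history,
# and 4K reserved for the model's reply:
print(remaining_budget(200_000, 2_000, 60_000, 4_000))
```

Even with a 200K window, a long conversation can leave far less room for new input than the headline number suggests.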
Context Window Sizes (2025-2026)
| Model | Provider | Context Window | Best For |
|---|---|---|---|
| Gemini 2.5 Flash | Google | 1,000,000 tokens | High-volume, long-document tasks |
| GPT-5 | OpenAI | 400,000 tokens | Complex reasoning, writing |
| Claude Opus 4 | Anthropic | 200,000 tokens | Nuanced reasoning, safety-critical |
| GPT-4o | OpenAI | 128,000 tokens | General purpose, coding |
| LLaMA 4 | Meta | 128,000+ tokens | Private deployments |
What Happens When the Window Fills Up
Unlike human long-term memory, the context window doesn’t accumulate across sessions – it resets. Each new conversation starts fresh.
Within a single session, as the context fills, older information gets pushed out of the model’s active consideration. This is when you notice the AI “forgetting” earlier parts of the conversation – it’s the physics of a finite working memory.
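Many chat applications manage this by trimming the oldest turns first. A minimal sketch, assuming a crude ~4-characters-per-token estimate (real systems count tokens with the model's actual tokenizer):

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the history fits the token budget.
    count_tokens is any callable that estimates tokens for one message."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the oldest message falls out of the window first
    return kept

history = ["a" * 400, "b" * 400, "c" * 400]  # three ~100-token messages
trimmed = trim_history(history, max_tokens=250,
                       count_tokens=lambda m: len(m) // 4)
print(len(trimmed))
```

The oldest message is discarded to fit the budget, which is exactly the "forgetting" behavior described above.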
Research shows models give less attention to information in the middle of a very long context. The beginning and end are remembered more reliably – a phenomenon called the “lost in the middle” effect.
How Pricing Works
LLM pricing is structured as cost per million tokens, with separate rates for input and output.
Pricing Comparison (2025 Rates)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Ratio |
|---|---|---|---|
| Gemini 2.5 Flash | $0.15 | $0.60 | 4x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
| Claude Haiku 3.5 | $0.80 | $4.00 | 5x |
| Claude Sonnet 4 | $3.00 | $15.00 | 5x |
| GPT-5 | $1.25 | $10.00 | 8x |
| Claude Opus 4 | $15.00 | $75.00 | 5x |
Workflow: 1,000 requests/day, 500 input + 200 output tokens each, using Gemini 2.5 Flash:
- Input: 1,000 x 500 = 500,000 tokens/day = $0.075/day
- Output: 1,000 x 200 = 200,000 tokens/day = $0.12/day
- Total: ~$0.20/day = ~$6/month
Same workflow using Claude Opus 4:
- Input: 500,000 tokens/day = $7.50/day
- Output: 200,000 tokens/day = $15.00/day
- Total: ~$22.50/day = ~$675/month
That’s 112x more expensive for the same workload. Model selection is a cost-control decision.
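The arithmetic above can be checked with a short script. Prices are the ones from the pricing table, and `days=30` approximates a month:

```python
def monthly_cost(requests_per_day, in_tokens, out_tokens,
                 in_price_per_m, out_price_per_m, days=30):
    """Monthly cost in dollars, given per-million-token prices."""
    daily_in = requests_per_day * in_tokens / 1_000_000 * in_price_per_m
    daily_out = requests_per_day * out_tokens / 1_000_000 * out_price_per_m
    return (daily_in + daily_out) * days

flash = monthly_cost(1_000, 500, 200, 0.15, 0.60)   # Gemini 2.5 Flash
opus = monthly_cost(1_000, 500, 200, 15.00, 75.00)  # Claude Opus 4
print(f"${flash:.2f} vs ${opus:.2f} per month")
```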
Cost Optimization Strategies
- Be concise in prompts: fewer input tokens per request.
- Limit output length: output tokens cost several times more than input tokens.
- Choose the right model: match model capability (and price) to the task.
- Use modular context: send only the context each task actually needs.
- Clear history between tasks: accumulated conversation turns are re-sent as input on every request.
- Monitor and benchmark: track token usage so cost regressions are caught early.
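Monitoring can start as simple as a per-model ledger of token counts. A minimal sketch (the class name and structure are illustrative, not any particular SDK's API):

```python
class TokenLedger:
    """Minimal usage tracker: record per-request token counts, report spend.
    Prices are per million tokens."""

    def __init__(self, in_price_per_m, out_price_per_m):
        self.in_price = in_price_per_m
        self.out_price = out_price_per_m
        self.in_tokens = 0
        self.out_tokens = 0

    def record(self, in_tokens, out_tokens):
        """Accumulate the token counts reported for one request."""
        self.in_tokens += in_tokens
        self.out_tokens += out_tokens

    def cost(self):
        """Total spend in dollars so far."""
        return (self.in_tokens * self.in_price
                + self.out_tokens * self.out_price) / 1_000_000

# The 1,000-requests/day workflow from the pricing example, on Flash-tier rates.
ledger = TokenLedger(in_price_per_m=0.15, out_price_per_m=0.60)
for _ in range(1_000):
    ledger.record(500, 200)
print(f"${ledger.cost():.2f} per day")
```

Most provider APIs return input and output token counts with each response, so a ledger like this can be fed from real usage data rather than estimates.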
Quick Reference
- 100 tokens ≈ 75 words: the fundamental conversion ratio for quick mental math.
- Output costs 4-8x more than input: optimizing response length is more impactful than optimizing prompt length.
- $675 vs. ~$6/month: the same 1,000-requests/day workflow can cost 112x more depending on model choice.
- Context resets each session: no persistent memory without external memory systems; every session starts fresh.