The Stateless Problem
By default, LLMs have no memory. Every message is processed in complete isolation. The model has no idea what you discussed yesterday, last week, or even five minutes ago in a different session.
This is called being stateless. The consequences are significant:
- The agent re-introduces itself every session
- It cannot improve based on past feedback
- It cannot reference your documents or data without re-providing them
- It cannot learn from its own successes or mistakes
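Statelessness is easiest to see at the API level: the model's entire world is whatever you send on each call, so the client must re-send the full conversation history every turn. A minimal sketch, where the hypothetical `call_llm` stands in for any chat-completion API:

```python
# Statelessness in practice: the model only "knows" what is in `messages`
# for this one call, so the client re-sends the whole history each turn.

def call_llm(messages):
    # Placeholder for a real chat-completion API call (hypothetical).
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

for user_turn in ["My name is Dana.", "What is my name?"]:
    history.append({"role": "user", "content": user_turn})
    reply = call_llm(history)          # the full history goes out every time
    history.append({"role": "assistant", "content": reply})

# If the client dropped `history` between turns, the second call
# could not possibly answer "What is my name?"
```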
The Four Types of Agent Memory
```mermaid
graph TD
    subgraph Memory["Agent Memory System"]
        ST["Short-Term<br/>(Context Window)<br/>Current session only"]
        SE["Semantic Memory<br/>(Facts & Knowledge)<br/>'What things are'"]
        EP["Episodic Memory<br/>(Experiences)<br/>'What happened before'"]
        PR["Procedural Memory<br/>(Skills & Rules)<br/>'How to do things'"]
    end
    style ST fill:#e3f2fd,stroke:#1976D2
    style SE fill:#e8f5e9,stroke:#4CAF50
    style EP fill:#fff3e0,stroke:#FF9800
    style PR fill:#f3e5f5,stroke:#9C27B0
```
Short-Term Memory (The Context Window)
What it is: The context window – everything the agent can “see” right now.
Contains:
- Your current message
- System prompt (behavioral instructions)
- Conversation history from current session
- Retrieved information from external memory
- The agent’s own responses
Key facts:
- Resets between sessions
- Has a hard token limit (128K to 1M depending on model)
- Every token costs money – the full context is billed again with each message
- Quality degrades as it fills (the “lost in the middle” effect)
When it’s enough: Simple, single-session tasks – one-off questions, single-sitting document drafts.
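Because the limit is hard, production systems trim the window before each call: keep the system prompt, drop the oldest turns first. A minimal sketch, approximating tokens as words (a real system would use the model's tokenizer):

```python
# Trim conversation history to a token budget: the system prompt is always
# kept, then the newest turns are retained until the budget runs out.

def trim_history(messages, max_tokens):
    system, turns = messages[0], messages[1:]

    def n_tokens(m):
        return len(m["content"].split())   # crude word-count approximation

    budget = max_tokens - n_tokens(system)
    kept = []
    for m in reversed(turns):              # walk newest-first
        if n_tokens(m) > budget:
            break                          # oldest turns fall off the edge
        budget -= n_tokens(m)
        kept.append(m)
    return [system] + list(reversed(kept))

msgs = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven"},
]
trimmed = trim_history(msgs, 6)   # system prompt survives; oldest turn dropped
```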
Semantic Memory (Facts & Knowledge)
What it is: Facts, concepts, and knowledge. The agent’s knowledge base.
Human equivalent: Things you learned in school. Paris is the capital of France, what a balance sheet is, how photosynthesis works.
For AI agents:
- Company policies, product docs, FAQs
- Domain knowledge (medical, legal, financial)
- Customer or contact data
- Any structured reference information
How it’s stored: Typically in a vector database that supports meaning-based search. The agent can find relevant information even when the query uses different words than the stored content.
Example: User asks “Can I return this if I changed my mind?” – the agent retrieves the return policy from its knowledge base and gives a precise answer instead of hallucinating.
Episodic Memory (Experiences)
What it is: Specific past events and experiences – the history of what happened.
Human equivalent: “Last Tuesday I had that difficult conversation with my manager.”
For AI agents:
- Records of past conversations with specific users
- History of tasks completed and outcomes
- Notes about what worked or didn’t
- User preferences expressed across sessions
Key distinction from semantic: Semantic = facts (“User prefers concise responses”). Episodic = events (“On Tuesday, I gave a long response and they asked me to be brief”).
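One way to make that distinction concrete in code: semantic facts are keyed and overwritable (only the current truth matters), while episodes are append-only and timestamped. The class and field names below are illustrative, not a standard API:

```python
# Semantic vs. episodic storage: facts get replaced, episodes accumulate.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    facts: dict = field(default_factory=dict)     # semantic: current truth
    episodes: list = field(default_factory=list)  # episodic: what happened

    def remember_fact(self, key, value):
        self.facts[key] = value                   # new value overwrites old

    def log_episode(self, when, event):
        self.episodes.append({"when": when, "event": event})

store = MemoryStore()
store.log_episode("Tuesday", "gave a long response; user asked for brevity")
store.remember_fact("response_style", "concise")  # the fact distilled from that episode
```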
Procedural Memory (Skills & Rules)
What it is: Skills, workflows, and behavioral rules.
Human equivalent: Knowing how to ride a bike – you don’t think through each step consciously.
For AI agents:
- System prompts (role, tone, rules)
- Workflow definitions (step-by-step processes)
- Configuration files (like CLAUDE.md)
- Few-shot examples
Example: A support agent has rules: always greet by name, escalate refunds over $500, verify account before sharing details. These run automatically without re-reasoning from scratch.
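Part of the point of procedural memory is that deterministic rules can live in code or configuration rather than being re-reasoned by the model on every turn. A sketch of the support-agent rules above, with hypothetical names:

```python
# Procedural memory as an explicit rule table the agent applies mechanically.

RULES = {
    "greeting": "Always greet the customer by name.",
    "refund_escalation_threshold": 500,   # dollars
    "verify_before_sharing": True,
}

def route_refund(amount):
    # The escalation rule runs as code – no LLM call, no re-reasoning.
    if amount > RULES["refund_escalation_threshold"]:
        return "escalate"
    return "handle"
```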
Embeddings: Turning Words into Numbers
How Embeddings Work
A vector embedding represents any piece of content as a list of numbers that capture its meaning.
The analogy: Imagine plotting words on a map where proximity = similarity. “Cat” and “kitten” are close together. “Cat” and “automobile” are far apart. Embeddings do this in hundreds of dimensions.
When you embed “What is the return policy?”, the model produces numbers encoding that meaning. Later, “Can I send this back for a refund?” – completely different words – lands in nearly the same region. This enables semantic search.
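"Lands in nearly the same region" is usually measured with cosine similarity between the vectors. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings (which have hundreds of dimensions):

```python
# Cosine similarity: 1.0 means same direction (same meaning),
# values near 0 mean unrelated content.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hand-made toy vectors; a real embedding model would produce these.
return_policy = [0.90, 0.10, 0.20]   # "What is the return policy?"
send_back     = [0.85, 0.15, 0.25]   # "Can I send this back for a refund?"
weather       = [0.10, 0.90, 0.40]   # "Will it rain tomorrow?"

# Paraphrases land close together; unrelated queries do not.
assert cosine_similarity(return_policy, send_back) > cosine_similarity(return_policy, weather)
```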
Types of Embeddings
| Type | What It Captures | Common Uses |
|---|---|---|
| Text | Meaning of sentences, documents | Chatbots, search, knowledge bases |
| Image | Visual features | Product search, photo tagging |
| Audio | Sound patterns | Voice recognition, music recommendations |
| Multimodal | Text + images combined | Complex cross-media searches |
Choosing an Embedding Model
For most practical use cases:
| Model | Speed & Cost | Best For |
|---|---|---|
| text-embedding-3-small (OpenAI) | Fast, low cost | General purpose, most automations |
| text-embedding-3-large (OpenAI) | Slower, higher cost | Complex, nuanced documents |
| Cohere embed-v3 | Fast | Multilingual applications |
| Open-source (BAAI/bge) | Variable | Self-hosted, privacy-critical |
Vector Databases
A vector database stores embeddings and retrieves the most similar ones to a query – at massive scale, in milliseconds.
Leading Vector Databases
| Tool | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production at scale, zero ops |
| Weaviate | Open source | Hybrid search (vector + keyword) |
| Chroma | Lightweight | Local development, prototyping |
| Supabase (pgvector) | Postgres extension | Teams already using Supabase |
| Qdrant | Open source | High performance, self-hosted |
How a Search Works
1. User asks a question
2. The question is embedded
3. Similarity search runs against the vector database
4. The top matching chunks are retrieved
5. The chunks are injected into the context window
6. The model produces a grounded response
RAG: Retrieval-Augmented Generation
RAG is the technique that connects external memory to the agent’s response generation. It is the most widely used memory architecture in production AI systems.
```mermaid
graph LR
    A["User Query"] --> B["Embed Query"]
    B --> C["Vector DB Search"]
    C --> D["Retrieved Chunks"]
    D --> E["LLM Context Window"]
    F["System Prompt"] --> E
    A --> E
    E --> G["Grounded Response"]
    style A fill:#e3f2fd,stroke:#1976D2
    style C fill:#e8f5e9,stroke:#4CAF50
    style E fill:#fff3e0,stroke:#FF9800
    style G fill:#f3e5f5,stroke:#9C27B0
```
Why RAG Reduces Hallucination
Hallucination happens when the model bridges a knowledge gap with plausible-sounding content. RAG closes that gap by supplying the actual information: the model doesn’t need to guess, because the source material is sitting in its context window.
Without RAG:

User: “What’s our refund policy for international orders?”
Agent: “International orders can typically be refunded within 30 days…” (hallucinated – may be completely wrong)

The agent has no access to your actual policy and generates a plausible-sounding but potentially incorrect answer.

With RAG:

User: “What’s our refund policy for international orders?”
System retrieves: the actual policy document from the vector database
Agent: “According to our international returns policy, orders shipped outside the EU can be returned within 14 business days. A €15 return shipping fee applies…” (grounded in real data)

The agent’s response is based on your actual policy, not fabricated content.
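The "context injection" step is ordinary string assembly: the retrieved chunks are placed in the prompt together with an instruction to answer only from them. A minimal sketch (the exact instruction wording is illustrative):

```python
# Grounding in practice: retrieved text, not the model's parametric memory,
# is positioned as the source of the answer.

def build_grounded_prompt(question, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "What's our refund policy for international orders?",
    ["Orders shipped outside the EU can be returned within 14 business days. "
     "A EUR 15 return shipping fee applies."],
)
```

The "say you don't know" instruction matters: it gives the model a sanctioned alternative to inventing an answer when retrieval comes back empty or off-topic.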
RAG vs. Fine-Tuning
| | RAG | Fine-Tuning |
|---|---|---|
| What it does | Retrieves external info at runtime | Bakes knowledge into model weights |
| When to update | Add to database anytime – instant | Requires full retraining cycle |
| For changing data | Excellent | Poor – model goes stale |
| For static style/behavior | Limited | Excellent |
| Cost | Storage + retrieval per query | Significant upfront training cost |
| Transparency | You can see what was retrieved | Opaque |
The Memory Lifecycle
1. Generation: What to Store
Not everything deserves long-term storage. Effective agents selectively write:
- Key facts stated by the user (“My budget is $50,000”)
- Expressed preferences (“I prefer short summaries”)
- Important events (“User escalated on March 3rd”)
- Task outcomes (“Report completed and approved”)
Storing everything creates noise that degrades retrieval quality.
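A selective-write step can be as simple as a filter in front of the store. Production systems often ask the LLM itself to judge what is worth keeping; the keyword heuristic below is purely illustrative:

```python
# A crude write filter: persist only messages that look like durable facts,
# preferences, or task outcomes.

SIGNALS = ("my budget", "i prefer", "completed", "approved", "escalated")

def worth_storing(message):
    text = message.lower()
    return any(signal in text for signal in SIGNALS)

stored = [m for m in [
    "My budget is $50,000",
    "I prefer short summaries",
    "lol ok",                          # chit-chat: not stored
    "Report completed and approved",
] if worth_storing(m)]
```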
2. Retrieval: When to Retrieve
Common triggers:
- User asks a question requiring external knowledge
- User references a past event (“Like we discussed last time…”)
- Agent needs user-specific context for personalization
- Agent follows a procedure requiring knowledge base lookup
Quality depends heavily on how information was chunked when stored.
3. Updating: Keeping Memory Fresh
Static memory becomes stale. Effective systems include:
- Adding new documents as they’re created
- Overwriting outdated facts (updated policy, changed price)
- Promoting frequent episodic memories into semantic ones
4. Forgetting: Why Deletion Matters
Outdated information actively harms accuracy. An agent remembering a product was $49 when it’s now $79 is worse than no memory at all. Good systems include expiry mechanisms, relevance decay, and clear deletion pathways.
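The simplest expiry mechanism attaches a time-to-live to every memory and purges stale entries on read, so the $49 price cannot outlive the $79 one. A minimal sketch:

```python
# TTL-based forgetting: each entry carries an expiry timestamp; reads
# delete anything stale rather than returning outdated facts.

import time

class ExpiringMemory:
    def __init__(self):
        self._data = {}                  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._data[key]          # forget on read
            return None
        return value

mem = ExpiringMemory()
mem.set("price:widget", 49, ttl_seconds=0.01)
time.sleep(0.02)
assert mem.get("price:widget") is None   # the stale price is gone, not wrong
mem.set("price:widget", 79, ttl_seconds=3600)
```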
Chunking Strategies for RAG
How you split documents before embedding significantly affects retrieval quality:
| Strategy | How It Works | Tradeoffs |
|---|---|---|
| Fixed-size chunks | Split every N tokens | Simple, fast; loses some context |
| Recursive/paragraph | Split at natural boundaries | Better context preservation |
| Semantic chunking | Split when topic changes | Best quality; more complex |
| Token splitter | Split at token boundaries | Cost-optimized retrieval |
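The first two strategies from the table can be sketched in a few lines; tokens are approximated by words here for brevity:

```python
# Fixed-size chunking vs. splitting at natural (paragraph) boundaries.

def fixed_size_chunks(text, size):
    # Split every `size` words, regardless of meaning boundaries.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def paragraph_chunks(text):
    # Split at blank lines, so each chunk is a self-contained paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = "Returns are free.\n\nShipping takes five days.\n\nSupport is open 24/7."
```

Fixed-size chunks can cut a sentence in half, which is exactly the context loss the table warns about; paragraph chunks keep each idea intact at the cost of uneven sizes.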
Memory Architecture by Use Case
| Use Case | Memory Types Needed | Architecture |
|---|---|---|
| Simple FAQ chatbot | Semantic (RAG) | Vector DB + RAG retrieval |
| Personalized assistant | Episodic + Semantic | Per-user history + domain KB |
| Autonomous agent | All four types | Full memory stack |
| High-volume pipeline | Semantic, optimized | Fast vector DB + caching |
Real-World Memory in Action
Vector databases power applications you use daily:
| Application | How Memory Is Used |
|---|---|
| Spotify | Song embeddings + listening habits for recommendations |
| Amazon | Purchase history + product attributes for “bought together” |
| Google Search | Semantic search – understands meaning, not just keywords |
| PayPal | Transaction pattern vectors for fraud detection |
| | Photo embeddings for content recommendations |
Key Takeaways
LLMs Are Stateless
Every session starts fresh without memory systems. This is the core problem agent memory solves.
4 Memory Types
Short-term (context window), semantic (facts), episodic (events), procedural (skills). Each serves a different purpose.
RAG Is the Standard
The dominant architecture for production agents. It grounds responses in real data and sharply reduces hallucination.
Vector DBs Are Proven
Spotify, Google, Amazon, PayPal – the technology is mature infrastructure powering billions of daily interactions.