AI agents are only as useful as their ability to remember. Without memory, every interaction starts from zero. This page explains the four types of agent memory, how embeddings and vector databases work, and how RAG prevents hallucination.

The Stateless Problem

By default, LLMs have no memory. Every message is processed in complete isolation. The model has no idea what you discussed yesterday, last week, or even five minutes ago in a different session.

This is called being stateless. The consequences are significant:

  • The agent re-introduces itself every session
  • It cannot improve based on past feedback
  • It cannot reference your documents or data without re-providing them
  • It cannot learn from its own successes or mistakes

The analogy: Hiring an incredibly capable expert who develops amnesia every morning. Brilliant in the moment, but starts each day knowing nothing about you or your projects.

The Four Types of Agent Memory

graph TD
    subgraph Memory["Agent Memory System"]
        ST["Short-Term<br/>(Context Window)<br/>Current session only"]
        SE["Semantic Memory<br/>(Facts & Knowledge)<br/>'What things are'"]
        EP["Episodic Memory<br/>(Experiences)<br/>'What happened before'"]
        PR["Procedural Memory<br/>(Skills & Rules)<br/>'How to do things'"]
    end

    style ST fill:#e3f2fd,stroke:#1976D2
    style SE fill:#e8f5e9,stroke:#4CAF50
    style EP fill:#fff3e0,stroke:#FF9800
    style PR fill:#f3e5f5,stroke:#9C27B0

Short-Term Memory

What it is: The context window – everything the agent can “see” right now.

Contains:

  • Your current message
  • System prompt (behavioral instructions)
  • Conversation history from current session
  • Retrieved information from external memory
  • The agent’s own responses

Key facts:

  • Resets between sessions
  • Has a hard token limit (128K to 1M depending on model)
  • Every token costs money (re-charged with each message)
  • Quality degrades as it fills (the “lost in the middle” effect)

When it’s enough: Simple, single-session tasks – one-off questions, single-sitting document drafts.
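A minimal sketch of how short-term memory behaves: a session buffer that drops its oldest turns once a token budget is exceeded. Real systems count tokens with the model's tokenizer; here tokens are approximated by whitespace-separated words, and the budget is deliberately tiny so the trimming is visible.

```python
# Sketch of a short-term memory buffer with a hard token limit.
# Token counts are approximated by word counts for illustration;
# a real system would use the model's tokenizer.

def approx_tokens(text):
    return len(text.split())

class ShortTermMemory:
    def __init__(self, system_prompt, max_tokens=50):
        self.system_prompt = system_prompt
        self.messages = []  # (role, text) pairs for the current session only
        self.max_tokens = max_tokens

    def add(self, role, text):
        self.messages.append((role, text))
        # Drop the oldest turns until the context fits the budget
        # (the system prompt is always kept).
        while self.total_tokens() > self.max_tokens and self.messages:
            self.messages.pop(0)

    def total_tokens(self):
        total = approx_tokens(self.system_prompt)
        return total + sum(approx_tokens(t) for _, t in self.messages)

    def context(self):
        # Everything the model "sees" right now
        return [("system", self.system_prompt)] + self.messages

mem = ShortTermMemory("You are a helpful assistant.", max_tokens=20)
mem.add("user", "Hello there, can you help me draft an email?")
mem.add("assistant", "Of course. Who is the email for?")
mem.add("user", "My landlord, about the broken heater in my flat.")
print(mem.context())  # earlier turns were trimmed to fit the budget
```

Note that nothing here survives the session: discarding `mem` loses everything, which is exactly why the external memory types below exist.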

Semantic Memory

What it is: Facts, concepts, and knowledge. The agent’s knowledge base.

Human equivalent: Things you learned in school. Paris is the capital of France, what a balance sheet is, how photosynthesis works.

For AI agents:

  • Company policies, product docs, FAQs
  • Domain knowledge (medical, legal, financial)
  • Customer or contact data
  • Any structured reference information

How it’s stored: Typically in a vector database that supports meaning-based search. The agent can find relevant information even when the query uses different words than the stored content.

Example: User asks “Can I return this if I changed my mind?” – the agent retrieves the return policy from its knowledge base and gives a precise answer instead of hallucinating.

Episodic Memory

What it is: Specific past events and experiences – the history of what happened.

Human equivalent: “Last Tuesday I had that difficult conversation with my manager.”

For AI agents:

  • Records of past conversations with specific users
  • History of tasks completed and outcomes
  • Notes about what worked or didn’t
  • User preferences expressed across sessions

Key distinction from semantic: Semantic = facts (“User prefers concise responses”). Episodic = events (“On Tuesday, I gave a long response and they asked me to be brief”).
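The distinction can be made concrete with two small record shapes: semantic entries are timeless statements about a subject, while episodic entries are time-stamped events with outcomes. The field names below are illustrative, not a standard schema.

```python
# Sketch contrasting semantic memory (timeless facts) with episodic
# memory (time-stamped events). Field names are invented for illustration.
from datetime import date

semantic_memory = [
    {"subject": "user", "fact": "prefers concise responses"},
    {"subject": "return_policy", "fact": "items may be returned within 14 days"},
]

episodic_memory = [
    {
        "when": date(2025, 3, 4),
        "event": "gave a long response; user asked me to be brief",
        "outcome": "adjusted style for the rest of the session",
    },
]

def facts_about(subject):
    """Semantic lookup: what is true about this subject?"""
    return [m["fact"] for m in semantic_memory if m["subject"] == subject]

def events_since(cutoff):
    """Episodic lookup: what happened after this date?"""
    return [m["event"] for m in episodic_memory if m["when"] >= cutoff]

print(facts_about("user"))
print(events_since(date(2025, 1, 1)))
```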

Procedural Memory

What it is: Skills, workflows, and behavioral rules.

Human equivalent: Knowing how to ride a bike – you don’t think through each step consciously.

For AI agents:

  • System prompts (role, tone, rules)
  • Workflow definitions (step-by-step processes)
  • Configuration files (like CLAUDE.md)
  • Few-shot examples

Example: A support agent has rules: always greet by name, escalate refunds over $500, verify account before sharing details. These run automatically without re-reasoning from scratch.
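One way to picture such rules is as an explicit configuration checked before every action, so the agent never has to re-derive them. This is a sketch under invented rule names and thresholds, not a real framework's API.

```python
# Sketch of procedural memory as explicit behavioral rules applied
# before acting. Rule names and the $500 threshold mirror the example
# above; the structure itself is illustrative.

RULES = {
    "greet_by_name": True,
    "refund_escalation_threshold": 500,   # dollars
    "verify_account_before_details": True,
}

def handle_refund(amount, account_verified):
    """Apply stored rules without re-reasoning from scratch."""
    if RULES["verify_account_before_details"] and not account_verified:
        return "verify_account_first"
    if amount > RULES["refund_escalation_threshold"]:
        return "escalate_to_human"
    return "process_refund"

print(handle_refund(750, account_verified=True))   # over threshold
print(handle_refund(120, account_verified=True))
print(handle_refund(120, account_verified=False))
```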


Embeddings: Turning Words into Numbers

How Embeddings Work

A vector embedding represents any piece of content as a list of numbers that capture its meaning.

The analogy: Imagine plotting words on a map where proximity = similarity. “Cat” and “kitten” are close together. “Cat” and “automobile” are far apart. Embeddings do this in hundreds of dimensions.

When you embed “What is the return policy?”, the model produces numbers encoding that meaning. Later, “Can I send this back for a refund?” – completely different words – lands in nearly the same region. This enables semantic search.
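The "map" intuition can be checked numerically with cosine similarity, the standard measure of how close two embeddings point. The 3-dimensional vectors below are hand-made stand-ins; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
# Toy illustration of "proximity = similarity" using cosine similarity.
# These 3-D vectors are invented; real embedding models produce
# high-dimensional vectors that encode meaning.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.15]
automobile = [0.1, 0.2, 0.95]

print(cosine_similarity(cat, kitten))      # close to 1: similar meaning
print(cosine_similarity(cat, automobile))  # much lower: unrelated
```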

Types of Embeddings

| Type | What It Captures | Common Uses |
| --- | --- | --- |
| Text | Meaning of sentences, documents | Chatbots, search, knowledge bases |
| Image | Visual features | Product search, photo tagging |
| Audio | Sound patterns | Voice recognition, music recommendations |
| Multimodal | Text + images combined | Complex cross-media searches |

Choosing an Embedding Model

For most practical use cases:

| Model | Speed & Cost | Best For |
| --- | --- | --- |
| text-embedding-3-small (OpenAI) | Fast, low cost | General purpose, most automations |
| text-embedding-3-large (OpenAI) | Slower, higher cost | Complex, nuanced documents |
| Cohere embed-v3 | Fast | Multilingual applications |
| Open-source (BAAI/bge) | Variable | Self-hosted, privacy-critical |

Vector Databases

A vector database stores embeddings and retrieves the most similar ones to a query – at massive scale, in milliseconds.

Traditional databases search by exact matching (“return policy” only finds records with those exact words). Vector databases search by meaning – finding content most similar in concept, even with completely different wording.

Leading Vector Databases

| Tool | Type | Best For |
| --- | --- | --- |
| Pinecone | Managed cloud | Production at scale, zero ops |
| Weaviate | Open source | Hybrid search (vector + keyword) |
| Chroma | Lightweight | Local development, prototyping |
| Supabase (pgvector) | Postgres extension | Teams already using Supabase |
| Qdrant | Open source | High performance, self-hosted |
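At its core, every one of these tools answers the same question: given a query vector, which stored vectors are closest? A brute-force sketch of that operation fits in a few lines; production databases get the same result over millions of vectors in milliseconds by using approximate-nearest-neighbour indexes. The embeddings below are hand-made stand-ins for real model output.

```python
# Minimal brute-force version of what a vector database does: store
# (text, embedding) pairs and return the top-k most similar to a query.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

class TinyVectorStore:
    def __init__(self):
        self.items = []  # list of (text, embedding)

    def add(self, text, embedding):
        self.items.append((text, embedding))

    def search(self, query_embedding, k=2):
        # Score every stored item, then keep the k best matches.
        scored = [(cosine_similarity(query_embedding, emb), text)
                  for text, emb in self.items]
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
# Hand-made embeddings; a real system would call an embedding model here.
store.add("Return policy: 14 days for international orders", [0.9, 0.1, 0.2])
store.add("Shipping times for EU orders", [0.3, 0.8, 0.1])
store.add("Refund process and fees", [0.8, 0.2, 0.4])

print(store.search([0.85, 0.15, 0.3], k=2))  # the two refund-related chunks
```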

How a Search Works

1. User asks a question – “What’s the refund process for international orders?”
2. Question is embedded – the system converts the question into a vector embedding (a list of numbers capturing its meaning).
3. Similarity search – the vector database finds the stored embeddings most similar to the query embedding.
4. Top chunks retrieved – the most relevant document chunks (e.g., the international returns policy) are returned.
5. Context injection – the retrieved content is injected into the agent’s context window alongside the user’s question.
6. Grounded response – the agent generates an accurate answer based on the retrieved source material.

RAG: Retrieval-Augmented Generation

RAG is the technique that connects external memory to the agent’s response generation. It is the most widely used memory architecture in production AI systems.

graph LR
    A["User Query"] --> B["Embed Query"]
    B --> C["Vector DB Search"]
    C --> D["Retrieved Chunks"]
    D --> E["LLM Context Window"]
    F["System Prompt"] --> E
    A --> E
    E --> G["Grounded Response"]

    style A fill:#e3f2fd,stroke:#1976D2
    style C fill:#e8f5e9,stroke:#4CAF50
    style E fill:#fff3e0,stroke:#FF9800
    style G fill:#f3e5f5,stroke:#9C27B0

Why RAG Prevents Hallucination

Hallucination happens when the model bridges a knowledge gap with plausible-sounding content. RAG closes that gap by supplying the actual information: the model doesn’t need to guess – it has the source material in its context window.

Without RAG:

User: “What’s our refund policy for international orders?”
Agent: “International orders can typically be refunded within 30 days…” (hallucinated – may be completely wrong)

The agent has no access to your actual policy and generates a plausible-sounding but potentially incorrect answer.

With RAG:

User: “What’s our refund policy for international orders?”
System retrieves: the actual policy document from the vector database
Agent: “According to our international returns policy, orders shipped outside the EU can be returned within 14 business days. A €15 return shipping fee applies…” (grounded in real data)

The agent’s response is based on your actual policy, not fabricated content.
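The grounding step above can be sketched as prompt assembly: retrieved chunks are placed in the context ahead of the user's question, with an instruction to answer only from them. Here `retrieve` is a stand-in for a real embed-then-search call against a vector database.

```python
# Sketch of RAG's context-injection step. `retrieve` is a stub standing
# in for: embed(query) -> vector DB top-k search.

def retrieve(query):
    return [
        "International returns policy: orders shipped outside the EU "
        "can be returned within 14 business days.",
        "A €15 return shipping fee applies to international returns.",
    ]

def build_prompt(query):
    chunks = retrieve(query)
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("What's our refund policy for international orders?"))
```

The "say so" instruction matters: it gives the model a sanctioned alternative to guessing when retrieval comes back empty.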

RAG vs. Fine-Tuning

| | RAG | Fine-Tuning |
| --- | --- | --- |
| What it does | Retrieves external info at runtime | Bakes knowledge into model weights |
| When to update | Add to database anytime – instant | Requires full retraining cycle |
| For changing data | Excellent | Poor – model goes stale |
| For static style/behavior | Limited | Excellent |
| Cost | Storage + retrieval per query | Significant upfront training cost |
| Transparency | You can see what was retrieved | Opaque |

For most production use cases involving dynamic business knowledge (policies, catalogs, customer data), RAG is the right choice. Fine-tuning is better suited for teaching specific style, tone, or narrow domain behavior.

The Memory Lifecycle

1. Generation: What to Store

Not everything deserves long-term storage. Effective agents selectively write:

  • Key facts stated by the user (“My budget is $50,000”)
  • Expressed preferences (“I prefer short summaries”)
  • Important events (“User escalated on March 3rd”)
  • Task outcomes (“Report completed and approved”)

Storing everything creates noise that degrades retrieval quality.
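A selective write policy can be sketched as a predicate applied before anything enters long-term storage. The trigger phrases below are invented for illustration; production systems more often delegate this decision to an LLM call or a trained classifier.

```python
# Sketch of a selective write policy: only statements matching certain
# patterns are promoted to long-term memory. Trigger phrases are
# illustrative stand-ins for a classifier or LLM-based judgment.

WORTH_STORING = ("my budget is", "i prefer", "completed", "approved", "escalated")

def should_store(statement):
    s = statement.lower()
    return any(trigger in s for trigger in WORTH_STORING)

long_term = []
for statement in [
    "My budget is $50,000",
    "I prefer short summaries",
    "Nice weather today",
    "Report completed and approved",
]:
    if should_store(statement):
        long_term.append(statement)

print(long_term)  # small talk is filtered out
```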

2. Retrieval: When to Retrieve

Common triggers:

  • User asks a question requiring external knowledge
  • User references a past event (“Like we discussed last time…”)
  • Agent needs user-specific context for personalization
  • Agent follows a procedure requiring knowledge base lookup

Quality depends heavily on how information was chunked when stored.

3. Updating: Keeping Memory Fresh

Static memory becomes stale. Effective systems include:

  • Adding new documents as they’re created
  • Overwriting outdated facts (updated policy, changed price)
  • Promoting frequent episodic memories into semantic ones

4. Forgetting: Why Deletion Matters

Outdated information actively harms accuracy. An agent remembering a product was $49 when it’s now $79 is worse than no memory at all. Good systems include expiry mechanisms, relevance decay, and clear deletion pathways.
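Expiry and relevance decay can be sketched in a few lines: entries past their time-to-live are pruned outright, and an age-based decay down-weights older memories at retrieval time. The field names, day-number timestamps, and 30-day half-life are all invented for illustration.

```python
# Sketch of forgetting mechanisms: TTL-based pruning plus exponential
# relevance decay. Timestamps are plain day numbers for simplicity;
# the half-life value is an invented example.

def prune_expired(memories, now):
    """Drop entries whose age exceeds their time-to-live."""
    return [m for m in memories if now - m["stored_at"] <= m["ttl_days"]]

def decayed_score(base_score, age_days, half_life_days=30):
    """Relevance halves every `half_life_days`."""
    return base_score * 0.5 ** (age_days / half_life_days)

memories = [
    {"fact": "Product X costs $49", "stored_at": 0, "ttl_days": 90},
    {"fact": "Product X costs $79", "stored_at": 200, "ttl_days": 90},
]

fresh = prune_expired(memories, now=210)
print([m["fact"] for m in fresh])        # only the current price survives
print(decayed_score(1.0, age_days=60))   # two half-lives old
```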


Chunking Strategies for RAG

How you split documents before embedding significantly affects retrieval quality:

| Strategy | How It Works | Trade-offs |
| --- | --- | --- |
| Fixed-size chunks | Split every N tokens | Simple, fast; loses some context |
| Recursive/paragraph | Split at natural boundaries | Better context preservation |
| Semantic chunking | Split when topic changes | Best quality; more complex |
| Token splitter | Split at token boundaries | Cost-optimized retrieval |
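The first two strategies are simple enough to sketch directly: a fixed-size splitter that windows over words (approximating tokens), and a paragraph splitter that cuts at blank lines. The sample document is invented for illustration.

```python
# Sketches of two chunking strategies: fixed-size windows (approximating
# tokens by words) and paragraph-boundary splitting.

def fixed_size_chunks(text, size=50):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def paragraph_chunks(text):
    # Split at blank lines, the natural boundaries of the document
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = (
    "Our return policy allows refunds within 14 days.\n\n"
    "International orders incur a €15 shipping fee.\n\n"
    "Contact support for damaged items."
)

print(paragraph_chunks(doc))           # three natural-boundary chunks
print(fixed_size_chunks(doc, size=5))  # fixed windows may split sentences
```

Notice how the 5-word windows cut mid-sentence: that lost context is exactly the trade-off the table above attributes to fixed-size chunking.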

Memory Architecture by Use Case

| Use Case | Memory Types Needed | Architecture |
| --- | --- | --- |
| Simple FAQ chatbot | Semantic (RAG) | Vector DB + RAG retrieval |
| Personalized assistant | Episodic + Semantic | Per-user history + domain KB |
| Autonomous agent | All four types | Full memory stack |
| High-volume pipeline | Semantic, optimized | Fast vector DB + caching |

Real-World Memory in Action

Vector databases power applications you use daily:

| Application | How Memory Is Used |
| --- | --- |
| Spotify | Song embeddings + listening habits for recommendations |
| Amazon | Purchase history + product attributes for “bought together” |
| Google Search | Semantic search – understands meaning, not just keywords |
| PayPal | Transaction pattern vectors for fraud detection |
| Instagram | Photo embeddings for content recommendations |

Key Takeaways