AI agents are only as useful as their ability to remember. Without memory, every interaction starts from zero. This page explains the four types of agent memory, how embeddings and vector databases work, and how RAG prevents hallucination.

The Stateless Problem

By default, LLMs have no memory. Every message is processed in complete isolation. The model has no idea what you discussed yesterday, last week, or even five minutes ago in a different session.

This is called being stateless. The consequences are significant:

  • The agent re-introduces itself every session
  • It cannot improve based on past feedback
  • It cannot reference your documents or data without re-providing them
  • It cannot learn from its own successes or mistakes

The analogy: Hiring an incredibly capable expert who develops amnesia every morning. Brilliant in the moment, but starts each day knowing nothing about you or your projects.

The Four Types of Agent Memory

graph TD
    subgraph Memory["Agent Memory System"]
        ST["Short-Term<br/>(Context Window)<br/>Current session only"]
        SE["Semantic Memory<br/>(Facts & Knowledge)<br/>'What things are'"]
        EP["Episodic Memory<br/>(Experiences)<br/>'What happened before'"]
        PR["Procedural Memory<br/>(Skills & Rules)<br/>'How to do things'"]
    end

    style ST fill:#e3f2fd,stroke:#1976D2
    style SE fill:#e8f5e9,stroke:#4CAF50
    style EP fill:#fff3e0,stroke:#FF9800
    style PR fill:#f3e5f5,stroke:#9C27B0

Short-Term Memory

What it is: The context window – everything the agent can “see” right now.

Contains:

  • Your current message
  • System prompt (behavioral instructions)
  • Conversation history from current session
  • Retrieved information from external memory
  • The agent’s own responses

Key facts:

  • Resets between sessions
  • Has a hard token limit (128K to 1M depending on model)
  • Every token costs money (re-charged with each message)
  • Quality degrades as it fills (the “lost in the middle” effect)

When it’s enough: Simple, single-session tasks – one-off questions, single-sitting document drafts.
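A minimal sketch of how short-term memory behaves: a session buffer that drops its oldest turns once a token budget is exceeded. Real systems count tokens with the model's tokenizer; here tokens are approximated by whitespace-separated words, and the budget is deliberately tiny so the trimming is visible.

```python
# Sketch of a short-term memory buffer with a hard token limit.
# Token counts are approximated by word counts for illustration;
# a real system would use the model's tokenizer.

def approx_tokens(text):
    return len(text.split())

class ShortTermMemory:
    def __init__(self, system_prompt, max_tokens=50):
        self.system_prompt = system_prompt
        self.messages = []  # (role, text) pairs for the current session only
        self.max_tokens = max_tokens

    def add(self, role, text):
        self.messages.append((role, text))
        # Drop the oldest turns until the context fits the budget
        # (the system prompt is always kept).
        while self.total_tokens() > self.max_tokens and self.messages:
            self.messages.pop(0)

    def total_tokens(self):
        total = approx_tokens(self.system_prompt)
        return total + sum(approx_tokens(t) for _, t in self.messages)

    def context(self):
        # Everything the model "sees" right now
        return [("system", self.system_prompt)] + self.messages

mem = ShortTermMemory("You are a helpful assistant.", max_tokens=20)
mem.add("user", "Hello there, can you help me draft an email?")
mem.add("assistant", "Of course. Who is the email for?")
mem.add("user", "My landlord, about the broken heater in my flat.")
print(mem.context())  # earlier turns were trimmed to fit the budget
```

Note that nothing here survives the session: discarding `mem` loses everything, which is exactly why the external memory types below exist.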

Semantic Memory

What it is: Facts, concepts, and knowledge. The agent’s knowledge base.

Human equivalent: Things you learned in school. Paris is the capital of France, what a balance sheet is, how photosynthesis works.

For AI agents:

  • Company policies, product docs, FAQs
  • Domain knowledge (medical, legal, financial)
  • Customer or contact data
  • Any structured reference information

How it’s stored: Typically in a vector database that supports meaning-based search. The agent can find relevant information even when the query uses different words than the stored content.

Example: User asks “Can I return this if I changed my mind?” – the agent retrieves the return policy from its knowledge base and gives a precise answer instead of hallucinating.

Episodic Memory

What it is: Specific past events and experiences – the history of what happened.

Human equivalent: “Last Tuesday I had that difficult conversation with my manager.”

For AI agents:

  • Records of past conversations with specific users
  • History of tasks completed and outcomes
  • Notes about what worked or didn’t
  • User preferences expressed across sessions

Key distinction from semantic: Semantic = facts (“User prefers concise responses”). Episodic = events (“On Tuesday, I gave a long response and they asked me to be brief”).
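The distinction can be made concrete with two small record shapes: semantic entries are timeless statements about a subject, while episodic entries are time-stamped events with outcomes. The field names below are illustrative, not a standard schema.

```python
# Sketch contrasting semantic memory (timeless facts) with episodic
# memory (time-stamped events). Field names are invented for illustration.
from datetime import date

semantic_memory = [
    {"subject": "user", "fact": "prefers concise responses"},
    {"subject": "return_policy", "fact": "items may be returned within 14 days"},
]

episodic_memory = [
    {
        "when": date(2025, 3, 4),
        "event": "gave a long response; user asked me to be brief",
        "outcome": "adjusted style for the rest of the session",
    },
]

def facts_about(subject):
    """Semantic lookup: what is true about this subject?"""
    return [m["fact"] for m in semantic_memory if m["subject"] == subject]

def events_since(cutoff):
    """Episodic lookup: what happened after this date?"""
    return [m["event"] for m in episodic_memory if m["when"] >= cutoff]

print(facts_about("user"))
print(events_since(date(2025, 1, 1)))
```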

Procedural Memory

What it is: Skills, workflows, and behavioral rules.

Human equivalent: Knowing how to ride a bike – you don’t think through each step consciously.

For AI agents:

  • System prompts (role, tone, rules)
  • Workflow definitions (step-by-step processes)
  • Configuration files (like CLAUDE.md)
  • Few-shot examples

Example: A support agent has rules: always greet by name, escalate refunds over $500, verify account before sharing details. These run automatically without re-reasoning from scratch.
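One way to picture such rules is as an explicit configuration checked before every action, so the agent never has to re-derive them. This is a sketch under invented rule names and thresholds, not a real framework's API.

```python
# Sketch of procedural memory as explicit behavioral rules applied
# before acting. Rule names and the $500 threshold mirror the example
# above; the structure itself is illustrative.

RULES = {
    "greet_by_name": True,
    "refund_escalation_threshold": 500,   # dollars
    "verify_account_before_details": True,
}

def handle_refund(amount, account_verified):
    """Apply stored rules without re-reasoning from scratch."""
    if RULES["verify_account_before_details"] and not account_verified:
        return "verify_account_first"
    if amount > RULES["refund_escalation_threshold"]:
        return "escalate_to_human"
    return "process_refund"

print(handle_refund(750, account_verified=True))   # over threshold
print(handle_refund(120, account_verified=True))
print(handle_refund(120, account_verified=False))
```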


Embeddings: Turning Words into Numbers

How Embeddings Work

A vector embedding represents any piece of content as a list of numbers that capture its meaning.

The analogy: Imagine plotting words on a map where proximity = similarity. “Cat” and “kitten” are close together. “Cat” and “automobile” are far apart. Embeddings do this in hundreds of dimensions.

When you embed “What is the return policy?”, the model produces numbers encoding that meaning. Later, “Can I send this back for a refund?” – completely different words – lands in nearly the same region. This enables semantic search.
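The "map" intuition can be checked numerically with cosine similarity, the standard measure of how close two embeddings point. The 3-dimensional vectors below are hand-made stand-ins; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
# Toy illustration of "proximity = similarity" using cosine similarity.
# These 3-D vectors are invented; real embedding models produce
# high-dimensional vectors that encode meaning.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.15]
automobile = [0.1, 0.2, 0.95]

print(cosine_similarity(cat, kitten))      # close to 1: similar meaning
print(cosine_similarity(cat, automobile))  # much lower: unrelated
```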

Types of Embeddings

| Type | What It Captures | Common Uses |
| --- | --- | --- |
| Text | Meaning of sentences, documents | Chatbots, search, knowledge bases |
| Image | Visual features | Product search, photo tagging |
| Audio | Sound patterns | Voice recognition, music recommendations |
| Multimodal | Text + images combined | Complex cross-media searches |

Choosing an Embedding Model

For most practical use cases:

| Model | Speed & Cost | Best For |
| --- | --- | --- |
| text-embedding-3-small (OpenAI) | Fast, low cost | General purpose, most automations |
| text-embedding-3-large (OpenAI) | Slower, higher cost | Complex, nuanced documents |
| Cohere embed-v3 | Fast | Multilingual applications |
| Open-source (BAAI/bge) | Variable | Self-hosted, privacy-critical |

Vector Databases

A vector database stores embeddings and retrieves the most similar ones to a query – at massive scale, in milliseconds.

Traditional databases search by exact matching (“return policy” only finds records with those exact words). Vector databases search by meaning – finding content most similar in concept, even with completely different wording.

Leading Vector Databases

| Tool | Type | Best For |
| --- | --- | --- |
| Pinecone | Managed cloud | Production at scale, zero ops |
| Weaviate | Open source | Hybrid search (vector + keyword) |
| Chroma | Lightweight | Local development, prototyping |
| Supabase (pgvector) | Postgres extension | Teams already using Supabase |
| Qdrant | Open source | High performance, self-hosted |
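At its core, every one of these tools answers the same question: given a query vector, which stored vectors are closest? A brute-force sketch of that operation fits in a few lines; production databases get the same result over millions of vectors in milliseconds by using approximate-nearest-neighbour indexes. The embeddings below are hand-made stand-ins for real model output.

```python
# Minimal brute-force version of what a vector database does: store
# (text, embedding) pairs and return the top-k most similar to a query.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

class TinyVectorStore:
    def __init__(self):
        self.items = []  # list of (text, embedding)

    def add(self, text, embedding):
        self.items.append((text, embedding))

    def search(self, query_embedding, k=2):
        # Score every stored item, then keep the k best matches.
        scored = [(cosine_similarity(query_embedding, emb), text)
                  for text, emb in self.items]
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
# Hand-made embeddings; a real system would call an embedding model here.
store.add("Return policy: 14 days for international orders", [0.9, 0.1, 0.2])
store.add("Shipping times for EU orders", [0.3, 0.8, 0.1])
store.add("Refund process and fees", [0.8, 0.2, 0.4])

print(store.search([0.85, 0.15, 0.3], k=2))  # the two refund-related chunks
```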

How a Search Works

1. User asks a question – “What’s the refund process for international orders?”
2. Question is embedded – the system converts the question into a vector embedding (a list of numbers capturing its meaning).
3. Similarity search – the vector database finds the stored embeddings most similar to the query embedding.
4. Top chunks retrieved – the most relevant document chunks (e.g., the international returns policy) are returned.
5. Context injection – the retrieved content is injected into the agent’s context window alongside the user’s question.
6. Grounded response – the agent generates an accurate answer based on the retrieved source material.

RAG: Retrieval-Augmented Generation

RAG is the technique that connects external memory to the agent’s response generation. It is the most widely used memory architecture in production AI systems.

graph LR
    A["User Query"] --> B["Embed Query"]
    B --> C["Vector DB Search"]
    C --> D["Retrieved Chunks"]
    D --> E["LLM Context Window"]
    F["System Prompt"] --> E
    A --> E
    E --> G["Grounded Response"]

    style A fill:#e3f2fd,stroke:#1976D2
    style C fill:#e8f5e9,stroke:#4CAF50
    style E fill:#fff3e0,stroke:#FF9800
    style G fill:#f3e5f5,stroke:#9C27B0

Why RAG Prevents Hallucination

Hallucination happens when the model bridges a knowledge gap with plausible-sounding content. RAG closes that gap by supplying the actual information: the model doesn’t need to guess – it has the source material in its context window.

Without RAG:

User: “What’s our refund policy for international orders?”
Agent: “International orders can typically be refunded within 30 days…” (hallucinated – may be completely wrong)

The agent has no access to your actual policy and generates a plausible-sounding but potentially incorrect answer.

With RAG:

User: “What’s our refund policy for international orders?”
System retrieves: the actual policy document from the vector database
Agent: “According to our international returns policy, orders shipped outside the EU can be returned within 14 business days. A €15 return shipping fee applies…” (grounded in real data)

The agent’s response is based on your actual policy, not fabricated content.
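The grounding step above can be sketched as prompt assembly: retrieved chunks are placed in the context ahead of the user's question, with an instruction to answer only from them. Here `retrieve` is a stand-in for a real embed-then-search call against a vector database.

```python
# Sketch of RAG's context-injection step. `retrieve` is a stub standing
# in for: embed(query) -> vector DB top-k search.

def retrieve(query):
    return [
        "International returns policy: orders shipped outside the EU "
        "can be returned within 14 business days.",
        "A €15 return shipping fee applies to international returns.",
    ]

def build_prompt(query):
    chunks = retrieve(query)
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("What's our refund policy for international orders?"))
```

The "say so" instruction matters: it gives the model a sanctioned alternative to guessing when retrieval comes back empty.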

RAG vs. Fine-Tuning

| | RAG | Fine-Tuning |
| --- | --- | --- |
| What it does | Retrieves external info at runtime | Bakes knowledge into model weights |
| When to update | Add to database anytime – instant | Requires full retraining cycle |
| For changing data | Excellent | Poor – model goes stale |
| For static style/behavior | Limited | Excellent |
| Cost | Storage + retrieval per query | Significant upfront training cost |
| Transparency | You can see what was retrieved | Opaque |

For most production use cases involving dynamic business knowledge (policies, catalogs, customer data), RAG is the right choice. Fine-tuning is better suited for teaching specific style, tone, or narrow domain behavior.

The Memory Lifecycle

1. Generation: What to Store

Not everything deserves long-term storage. Effective agents selectively write:

  • Key facts stated by the user (“My budget is $50,000”)
  • Expressed preferences (“I prefer short summaries”)
  • Important events (“User escalated on March 3rd”)
  • Task outcomes (“Report completed and approved”)

Storing everything creates noise that degrades retrieval quality.
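A selective write policy can be sketched as a predicate applied before anything enters long-term storage. The trigger phrases below are invented for illustration; production systems more often delegate this decision to an LLM call or a trained classifier.

```python
# Sketch of a selective write policy: only statements matching certain
# patterns are promoted to long-term memory. Trigger phrases are
# illustrative stand-ins for a classifier or LLM-based judgment.

WORTH_STORING = ("my budget is", "i prefer", "completed", "approved", "escalated")

def should_store(statement):
    s = statement.lower()
    return any(trigger in s for trigger in WORTH_STORING)

long_term = []
for statement in [
    "My budget is $50,000",
    "I prefer short summaries",
    "Nice weather today",
    "Report completed and approved",
]:
    if should_store(statement):
        long_term.append(statement)

print(long_term)  # small talk is filtered out
```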

2. Retrieval: When to Retrieve

Common triggers:

  • User asks a question requiring external knowledge
  • User references a past event (“Like we discussed last time…”)
  • Agent needs user-specific context for personalization
  • Agent follows a procedure requiring knowledge base lookup

Quality depends heavily on how information was chunked when stored.

3. Updating: Keeping Memory Fresh

Static memory becomes stale. Effective systems include:

  • Adding new documents as they’re created
  • Overwriting outdated facts (updated policy, changed price)
  • Promoting frequent episodic memories into semantic ones

4. Forgetting: Why Deletion Matters

Outdated information actively harms accuracy. An agent remembering a product was $49 when it’s now $79 is worse than no memory at all. Good systems include expiry mechanisms, relevance decay, and clear deletion pathways.
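Expiry and relevance decay can be sketched in a few lines: entries past their time-to-live are pruned outright, and an age-based decay down-weights older memories at retrieval time. The field names, day-number timestamps, and 30-day half-life are all invented for illustration.

```python
# Sketch of forgetting mechanisms: TTL-based pruning plus exponential
# relevance decay. Timestamps are plain day numbers for simplicity;
# the half-life value is an invented example.

def prune_expired(memories, now):
    """Drop entries whose age exceeds their time-to-live."""
    return [m for m in memories if now - m["stored_at"] <= m["ttl_days"]]

def decayed_score(base_score, age_days, half_life_days=30):
    """Relevance halves every `half_life_days`."""
    return base_score * 0.5 ** (age_days / half_life_days)

memories = [
    {"fact": "Product X costs $49", "stored_at": 0, "ttl_days": 90},
    {"fact": "Product X costs $79", "stored_at": 200, "ttl_days": 90},
]

fresh = prune_expired(memories, now=210)
print([m["fact"] for m in fresh])        # only the current price survives
print(decayed_score(1.0, age_days=60))   # two half-lives old
```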


Chunking Strategies for RAG

How you split documents before embedding significantly affects retrieval quality:

| Strategy | How It Works | Trade-offs |
| --- | --- | --- |
| Fixed-size chunks | Split every N tokens | Simple, fast; loses some context |
| Recursive/paragraph | Split at natural boundaries | Better context preservation |
| Semantic chunking | Split when topic changes | Best quality; more complex |
| Token splitter | Split at token boundaries | Cost-optimized retrieval |
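The first two strategies are simple enough to sketch directly: a fixed-size splitter that windows over words (approximating tokens), and a paragraph splitter that cuts at blank lines. The sample document is invented for illustration.

```python
# Sketches of two chunking strategies: fixed-size windows (approximating
# tokens by words) and paragraph-boundary splitting.

def fixed_size_chunks(text, size=50):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def paragraph_chunks(text):
    # Split at blank lines, the natural boundaries of the document
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = (
    "Our return policy allows refunds within 14 days.\n\n"
    "International orders incur a €15 shipping fee.\n\n"
    "Contact support for damaged items."
)

print(paragraph_chunks(doc))           # three natural-boundary chunks
print(fixed_size_chunks(doc, size=5))  # fixed windows may split sentences
```

Notice how the 5-word windows cut mid-sentence: that lost context is exactly the trade-off the table above attributes to fixed-size chunking.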

Memory Architecture by Use Case

| Use Case | Memory Types Needed | Architecture |
| --- | --- | --- |
| Simple FAQ chatbot | Semantic (RAG) | Vector DB + RAG retrieval |
| Personalized assistant | Episodic + Semantic | Per-user history + domain KB |
| Autonomous agent | All four types | Full memory stack |
| High-volume pipeline | Semantic, optimized | Fast vector DB + caching |

Real-World Memory in Action

Vector databases power applications you use daily:

| Application | How Memory Is Used |
| --- | --- |
| Spotify | Song embeddings + listening habits for recommendations |
| Amazon | Purchase history + product attributes for “bought together” |
| Google Search | Semantic search – understands meaning, not just keywords |
| PayPal | Transaction pattern vectors for fraud detection |
| Instagram | Photo embeddings for content recommendations |

Key Takeaways