© 2026 DEV BAK - TECH BLOG. All rights reserved.
DEV BAK - TECH BLOG
AI

Comparing Long-Term Memory for AI Agents: Mem0 vs Letta vs Zep — Three Philosophies and How to Choose

If you've ever built an LLM-based app, you've hit this wall. "How do I make it remember past conversations?" You might think you can just shove the entire conversation into the context window, but reality doesn't cooperate. Token costs explode, the LLM's attention drifts as conversations grow longer, and when a session ends, everything disappears. Long-term memory systems for agents emerged to solve this problem, and between 2025 and 2026 this space matured rapidly, with the options crystallizing into three distinct paths.

Mem0, Letta, and Zep differ in GitHub stars and community adoption, but most importantly in architectural philosophy. Mem0 is a memory layer you bolt onto your existing stack with minimal changes, Letta is an agent platform that manages memory autonomously like an OS, and Zep is a temporal knowledge graph that records how facts change over time. Drawing on hands-on experience, I want to walk through how these three philosophies actually shape technology choices. By the end of this article, you should be able to narrow down the right memory system for your service in under 30 minutes. We'll cover how each system works, then runnable code examples, then common beginner mistakes.


Core Concepts

Why External Memory Is Needed

An LLM context window is like RAM. Turn off the power and it's gone, and capacity has limits. Even if GPT-4o supports 128k tokens, storing the full history of dozens of sessions makes costs unsustainable. Long-term memory systems solve this problem with external storage. They selectively store only important facts, and when a new conversation starts, they retrieve only the relevant memories and inject them into the context.
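The store-selectively, retrieve-what-is-relevant, inject loop can be sketched in plain Python. This is a minimal illustration, not how any of the three systems is implemented: the `TinyMemory` class, its importance threshold, and the word-overlap scoring are all assumptions made for the example, whereas real systems rely on LLM-based extraction and embedding search.

```python
import re
from dataclasses import dataclass, field


def _words(text: str) -> set[str]:
    """Lowercased word set, used as a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class TinyMemory:
    """Hypothetical in-process memory store keyed by user."""
    min_importance: float = 0.5
    _facts: dict[str, list[str]] = field(default_factory=dict)

    def add(self, user_id: str, fact: str, importance: float) -> bool:
        # Selective storage: drop facts below the importance threshold.
        if importance < self.min_importance:
            return False
        self._facts.setdefault(user_id, []).append(fact)
        return True

    def search(self, user_id: str, query: str, limit: int = 3) -> list[str]:
        # Rank stored facts by word overlap with the query; keep only matches.
        q = _words(query)
        scored = [(len(q & _words(f)), f) for f in self._facts.get(user_id, [])]
        scored = [(s, f) for s, f in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [f for _, f in scored[:limit]]


def build_prompt(memory: TinyMemory, user_id: str, message: str) -> str:
    # Inject only the retrieved memories, not the whole history.
    hits = memory.search(user_id, message)
    context = "\n".join(f"- {h}" for h in hits)
    header = f"Known facts:\n{context}\n\n" if hits else ""
    return f"{header}User: {message}"


memory = TinyMemory()
memory.add("u1", "User is on the premium plan", importance=0.9)
memory.add("u1", "User said hello", importance=0.1)  # filtered out
memory.add("u1", "User prefers email contact", importance=0.8)
print(build_prompt(memory, "u1", "Which plan am I on?"))
```

Only the plan fact reaches the prompt here: the greeting never gets stored, and the contact preference is stored but not retrieved because it shares no words with the question.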

Long-Term Memory: An infrastructure layer that enables AI agents to continuously remember and update user information, preferences, and context across sessions and time. It supplements the LLM's context window limitations with external storage, and since the memory system itself involves LLM calls, it's worth keeping in mind upfront that additional latency and cost will be incurred.

Mem0 — A Memory Layer You Bolt Onto Your Existing Stack

Mem0's approach is "minimal invasion." Instead of replacing the agent framework you're already using (LangGraph, CrewAI, AutoGen), it attaches a memory layer on top as a bolt-on.

Bolt-on: An integration pattern where functionality is added externally without modifying the existing system. Like a plugin, it can be attached and detached, which makes it favorable for gradual adoption.

Internally, it uses a mixture of three storage types: vector, graph, and key-value. Roughly speaking, facts where natural language semantics matter (preferences, emotions, descriptions) go into the vector store; relationships between people, organizations, and objects go into the graph; and frequently referenced key-based data goes into key-value storage. When extracting facts from conversations, it uses an LLM to deduplicate, and when conflicts arise, it overwrites with the latest information.
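The deduplicate-then-overwrite behavior can be approximated with a keyed store where the newest value wins. This is a sketch under stated assumptions: Mem0 relies on an LLM to decide whether two facts are duplicates or in conflict, whereas the hypothetical `FactStore` below assumes each fact already arrives with an explicit (subject, attribute) key.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    observed_at: datetime


class FactStore:
    """Hypothetical store holding one current value per (subject, attribute)."""

    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], Fact] = {}

    def upsert(self, fact: Fact) -> str:
        key = (fact.subject, fact.attribute)
        current = self._facts.get(key)
        if current is None:
            self._facts[key] = fact
            return "added"
        if current.value == fact.value:
            return "duplicate"  # same fact restated: nothing to do
        if fact.observed_at >= current.observed_at:
            self._facts[key] = fact  # conflict: latest information wins
            return "overwritten"
        return "stale"  # older than what is already stored

    def get(self, subject: str, attribute: str) -> Optional[str]:
        fact = self._facts.get((subject, attribute))
        return fact.value if fact else None


store = FactStore()
print(store.upsert(Fact("user_123", "plan", "basic", datetime(2025, 1, 10))))
print(store.upsert(Fact("user_123", "plan", "basic", datetime(2025, 2, 1))))
print(store.upsert(Fact("user_123", "plan", "premium", datetime(2025, 3, 5))))
print(store.get("user_123", "plan"))
```

Note that overwriting discards the earlier value. That is acceptable for preferences, but it is exactly what the temporal approach in the Zep section is designed to avoid.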

An April 2026 update introduced a single-pass layered extraction algorithm. Where the previous approach extracted and classified facts across multiple stages, this algorithm handles extraction and classification in a single LLM call. Mem0's reported benchmark results show improvements of +29.6pp in temporal query accuracy and +23.1pp in multi-hop reasoning. Numbers like 48,000+ GitHub stars and $24M in funding (seed plus Series A, announced in 2025) speak to the community response. Honestly, when I first saw those numbers I thought "can this actually be real?", but after trying it myself, the difference was tangible in a simple customer support scenario.

Letta — An Agent Platform That Manages Memory Like an OS

Letta (formerly MemGPT) originated from the MemGPT paper out of UC Berkeley. The core idea is to layer memory the way a computer operating system does.

  • Core Memory: Essential information always loaded in context, like RAM (name, persona, key facts)
  • Recall Memory: A disk cache that stores recent conversation history in searchable form
  • Archival Memory: Cold storage for vast long-term knowledge

The most distinctive aspect is that the agent itself moves and edits data between these layers via function calls, managing memory autonomously without human intervention. I personally lost half a day getting the initial memory function design right: if you don't first define what belongs in Core versus what should go to Archival, the agent starts storing things in the wrong layer. That design cost is paid upfront, but in return every memory operation is an explicit function call, so you can trace the agent's behavior transparently. Letta is fully open source (Apache 2.0), and in January 2026 a Conversations API was released that supports shared memory across parallel agents.
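To make the layering concrete, here is a toy model of a size-limited core tier that spills into an archive. It is not Letta's implementation or API: the `TieredMemory` class, its character budget, and the `remember` / `search_archive` method names are hypothetical stand-ins for the memory-editing functions an agent would call.

```python
class TieredMemory:
    """Toy two-tier memory: a small always-in-context core plus an archive."""

    def __init__(self, core_char_limit: int = 80) -> None:
        self.core_char_limit = core_char_limit
        self.core: list[str] = []      # always injected into the prompt
        self.archival: list[str] = []  # searched only on demand

    def _core_size(self) -> int:
        return sum(len(item) for item in self.core)

    def remember(self, text: str, pin_to_core: bool = False) -> str:
        """Store text and return the tier it landed in."""
        fits = self._core_size() + len(text) <= self.core_char_limit
        if pin_to_core and fits:
            self.core.append(text)
            return "core"
        # Anything not pinned, or too large for the core budget, is archived.
        self.archival.append(text)
        return "archival"

    def search_archive(self, keyword: str) -> list[str]:
        needle = keyword.lower()
        return [item for item in self.archival if needle in item.lower()]

    def context_block(self) -> str:
        return "\n".join(self.core)


mem = TieredMemory(core_char_limit=80)
print(mem.remember("User: Python/TypeScript developer", pin_to_core=True))
print(mem.remember("Auth changed from JWT to OAuth2", pin_to_core=True))
print(mem.remember("Full transcript of the 2025-06-01 design review ...", pin_to_core=True))
print(mem.search_archive("design review"))
```

The third call asks for the core tier but lands in the archive because the budget is exhausted. A real agent makes this placement decision itself, which is why the criteria for Core versus Archival need to be written down before you start.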

Zep — A Knowledge Graph That Remembers How Things Change Over Time

Zep's core engine, Graphiti, takes a fundamentally different approach. It assigns a validity window to each fact. For example, if the fact "the user's address is Gangnam-gu, Seoul" later changes to "Haeundae-gu, Busan," the old record isn't deleted — it's invalidated. This invalidation isn't rule-based; the LLM interprets the meaning of the new episode and determines whether it conflicts with existing facts. This means you can accurately answer questions like "what was the address as of March 2025?"

Temporal Knowledge Graph: A knowledge graph that preserves the change history of facts along with timestamps. Instead of deletion, it uses invalidation to track past and current states at the same time. It runs on top of Neo4j, and each fact (stored as an edge between entity nodes) carries valid_at and invalid_at properties that bound its validity period.
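The validity-window idea can be shown without a graph database. The sketch below is a simplification under two assumptions: a conflict is defined as "same subject and attribute" (Graphiti uses an LLM to judge this), and facts are asserted in chronological order. The `TemporalStore` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TemporalFact:
    subject: str
    attribute: str
    value: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means still valid


class TemporalStore:
    """Toy store that invalidates superseded facts instead of deleting them."""

    def __init__(self) -> None:
        self.facts: list[TemporalFact] = []

    def assert_fact(self, subject: str, attribute: str, value: str, at: datetime) -> None:
        # Close the validity window of the currently valid fact, if any.
        for fact in self.facts:
            if (fact.subject, fact.attribute) == (subject, attribute) and fact.invalid_at is None:
                fact.invalid_at = at
        self.facts.append(TemporalFact(subject, attribute, value, valid_at=at))

    def as_of(self, subject: str, attribute: str, when: datetime) -> Optional[str]:
        # A fact holds at `when` if valid_at <= when < invalid_at.
        for fact in self.facts:
            if (fact.subject, fact.attribute) != (subject, attribute):
                continue
            if fact.valid_at <= when and (fact.invalid_at is None or when < fact.invalid_at):
                return fact.value
        return None


store = TemporalStore()
store.assert_fact("user", "address", "Gangnam-gu, Seoul", datetime(2024, 1, 15))
store.assert_fact("user", "address", "Haeundae-gu, Busan", datetime(2025, 6, 1))

print(store.as_of("user", "address", datetime(2025, 3, 1)))
print(store.as_of("user", "address", datetime(2025, 7, 1)))
```

The March 2025 query returns the Seoul address and the July 2025 query returns the Busan one, because the first record was closed rather than removed.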

Multi-hop Reasoning: The ability to reason toward an answer that requires connecting multiple relationships — for example, "the project managed by the team of B, who is A's supervisor." Simple vector search doesn't handle these chained relationships well; a graph structure is needed for effective traversal.
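A chained question like the one in this definition maps naturally onto edge traversal. The following sketch uses a plain dictionary as the graph; the entity names, relation labels, and the `follow` helper are made up for illustration and do not correspond to any library's API.

```python
from typing import Optional

# Edges are stored as (subject, relation) -> object.
GRAPH = {
    ("A", "supervisor"): "B",
    ("B", "team"): "Platform Team",
    ("Platform Team", "manages"): "Billing Migration",
}


def follow(graph: dict, start: str, relations: list[str]) -> Optional[str]:
    """Walk one edge per relation; return None if any hop is missing."""
    node = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:
            return None
    return node


# "The project managed by the team of B, who is A's supervisor"
print(follow(GRAPH, "A", ["supervisor", "team", "manages"]))
```

Each hop depends on the result of the previous one, which is why a single similarity search over isolated text chunks tends to struggle with this kind of question.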

Publishing the Graphiti engine as open source under Apache 2.0 led to rapid community adoption. Zep reports 94.8% accuracy on the DMR (Deep Memory Retrieval) benchmark, which measures how accurately facts can be retrieved from conversations, and with SOC 2 Type 2, HIPAA, and GDPR certifications, it has a strong presence in enterprise environments. Neo4j setup may feel like a barrier at first, but running Graphiti on its own against a single Neo4j container is lighter to get started with than you'd expect.


Practical Application

Example 1: Adding Memory to a Customer Support Chatbot with Mem0

Adding Mem0 to an existing OpenAI chatbot is simpler than you'd think. I kept thinking "can it really be this easy?" — and yes, it can. One caveat though: if you don't explicitly handle the case where relevant_memories is empty (like a new user's first conversation), you end up with a blank line in the system prompt, which produces slightly awkward responses. The code below shows how to handle that too.

```python
from mem0 import Memory
from openai import OpenAI

# If you don't have Qdrant, you can spin it up instantly with: docker run -p 6333:6333 qdrant/qdrant
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "customer_support",
            "host": "localhost",
            "port": 6333,
        }
    },
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-4o", "temperature": 0}
    }
}

memory = Memory.from_config(config)
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chat_with_memory(user_id: str, user_message: str) -> str:
    found = memory.search(user_message, user_id=user_id)
    # Recent Mem0 releases return {"results": [...]}; older ones returned a plain list.
    # Both shapes are handled here; check the version you have installed.
    relevant_memories = found.get("results", []) if isinstance(found, dict) else found

    # New user's first conversation: skip the memory section entirely if nothing was found
    if relevant_memories:
        memory_context = "\n".join(m["memory"] for m in relevant_memories)
        memory_section = f"\nWhat I know about this user:\n{memory_context}"
    else:
        memory_section = ""

    system_prompt = f"You are a friendly customer support agent.{memory_section}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    assistant_message = response.choices[0].message.content

    # Automatically extract and store important facts from the conversation.
    # Storage failures are handled separately so they don't block the reply itself.
    # In production, add retry logic and failure alerts here.
    try:
        memory.add(
            [
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": assistant_message}
            ],
            user_id=user_id
        )
    except Exception as exc:
        print(f"memory.add failed for {user_id}: {exc}")

    return assistant_message


# Usage example
print(chat_with_memory("user_123", "I'm a premium plan user and I'm having a billing issue"))
# In the next session, the context "premium user who had a billing issue" is automatically injected
print(chat_with_memory("user_123", "Was the issue I mentioned earlier resolved?"))
```

| Code Point | Description |
| --- | --- |
| memory.search() | Vector search scoped to the user ID, returning only relevant memories |
| memory_section branch | Prevents an empty context block from entering the prompt on a new user's first conversation |
| memory.add() | Pass a conversation pair and the LLM automatically extracts and stores important facts |
| user_id parameter | Memory isolation per user; prevents memories from different users from mixing |
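The example's comments recommend retry logic around memory.add(). One way to do that with only the standard library is a small helper with exponential backoff. The helper name `call_with_retry`, its parameters, and the simulated `flaky_add` function are assumptions for this sketch and are not part of Mem0.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**n and retry, up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("attempts must be at least 1")


# Simulated memory write that fails twice before succeeding.
calls = {"count": 0}


def flaky_add() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("vector store unavailable")
    return "stored"


print(call_with_retry(flaky_add, base_delay=0.01))
```

In the chatbot, the same helper would wrap the write as `call_with_retry(lambda: memory.add(messages, user_id=user_id))`, where `messages` is the conversation pair from the example. In practice you would likely narrow the caught exception type to transient errors only.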

Example 2: Setting Up a Long-Running Coding Agent with Letta

For a simple customer support chatbot, Mem0 is sufficient. You choose Letta when an agent needs to work autonomously over multiple days. The key is the autonomy: a coding agent that remembers the "authentication method change" it learned today and still recalls it a week later, deciding on its own which memory layer to store it in.

```python
# Note: create_client is Letta's legacy Python client. Newer releases ship a separate
# letta-client SDK with a different interface, so check which one your installed version provides.
from letta import create_client
from letta.schemas.memory import ChatMemory

client = create_client()

# Set up initial information in Core Memory, the baseline always included in context
# The human and persona design is the foundation of agent behavior, so it's worth taking time to get this right initially
agent = client.create_agent(
    name="coding-assistant",
    memory=ChatMemory(
        human="Developer, primarily uses Python/TypeScript, working on a FastAPI project",
        persona="Experienced senior developer assistant. Helps with code review and debugging."
    ),
)

# The agent decides on its own whether to store in Core, Recall, or Archival
response = client.send_message(
    agent_id=agent.id,
    role="user",
    message="Remember that our project's API authentication changed from JWT to OAuth2"
)

print(response.messages[-1].text)

# Memory persists across new sessions days later: the same agent_id keeps its state on the server
response2 = client.send_message(
    agent_id=agent.id,
    role="user",
    message="What were the things to watch out for when writing auth-related code?"
)
print(response2.messages[-1].text)
# Responds with awareness that the switch to OAuth2 happened
```

| Code Point | Description |
| --- | --- |
| ChatMemory(human, persona) | Core Memory initialization; baseline information always included in every context |
| agent_id reuse | Loads agent state stored on the server; maintains continuity even after session restarts |
| Autonomous agent judgment | Which layer to store in is decided by the agent through internal function calls; traceable in logs |

Example 3: Managing Temporal Facts with Graphiti (Zep's Engine)

Graphiti shines in domains like finance or CRM where you need to clearly distinguish "previous state" from "current state." Mem0 and Letta overwrite or update with the latest information, but Graphiti invalidates old facts while preserving them. The code below must run inside an async function, so I've included the asyncio.run() wrapper.

```python
import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType


async def main():
    # If you don't have Neo4j: docker run -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j
    # Graphiti also calls an LLM for extraction (OpenAI by default, via OPENAI_API_KEY).
    # Connection details are passed positionally: URI, user, password.
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")

    try:
        # Initialize indices and constraints; only needs to run once
        await graphiti.build_indices_and_constraints()

        # Add a fact, stored as an episode with temporal information
        await graphiti.add_episode(
            name="Customer address change",
            episode_body="Customer Kim Cheolsoo's address changed from Gangnam-gu, Seoul to Haeundae-gu, Busan",
            source=EpisodeType.text,
            # The actual time the change occurred, separate from storage time
            reference_time=datetime(2025, 6, 1, tzinfo=timezone.utc),
            source_description="CRM system update",
        )

        # Search returns facts (edges) with their validity windows, so the caller can tell
        # the current address apart from the address at an earlier point in time
        results = await graphiti.search("Kim Cheolsoo's address", num_results=5)

        for edge in results:
            print(f"Fact: {edge.fact}")
            print(f"Valid from: {edge.valid_at}")
            print(f"Valid until: {edge.invalid_at or 'still valid'}")
            print("---")

        # Illustrative output (dates abbreviated). It assumes an earlier episode recorded the
        # Gangnam-gu address from 2024-01-15; exact wording and dates depend on LLM extraction.
        # Fact: Kim Cheolsoo's address is Gangnam-gu, Seoul
        # Valid from: 2024-01-15
        # Valid until: 2025-06-01
        # ---
        # Fact: Kim Cheolsoo's address is Haeundae-gu, Busan
        # Valid from: 2025-06-01
        # Valid until: still valid
    finally:
        # Close the driver even if one of the calls above fails
        await graphiti.close()


asyncio.run(main())
```

| Code Point | Description |
| --- | --- |
| async def main() + asyncio.run() | The entire Graphiti API is async; you cannot call await directly at the top level |
| reference_time | The time the fact actually occurred; separated from storage time for accurate temporal tracking |
| invalid_at | The time the fact was invalidated; None means still currently valid |
| add_episode() | Adds a new episode without deletion; the LLM evaluates conflicts with existing facts and automatically invalidates the superseded ones |

What to Choose and When

The matrix below is the fastest way to narrow down your choice. Find the rows that describe your situation; a ✅ in one of those rows tells you which system to explore first.

| Situation | Mem0 | Letta | Zep / Graphiti |
| --- | --- | --- | --- |
| Want to keep existing LangChain/CrewAI stack | ✅ | ❌ | △ |
| Need a fast prototype, want to plug it in today | ✅ | △ | △ |
| Agent works autonomously over multiple days | △ | ✅ | △ |
| Open source self-hosting is required | △ | ✅ | ✅ (Graphiti) |
| Temporal state changes (address changes, contract renewals, etc.) are core | ❌ | △ | ✅ |
| HIPAA / SOC 2 / GDPR certification required | △ | ❌ | ✅ |
| Multi-hop relationship reasoning needed frequently | △ | △ | ✅ |
| Want fine-grained code-level control over memory behavior | ❌ | ✅ | △ |

△ = Possible but limited or requires additional configuration
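If you want to apply the matrix to several requirements at once, it can be encoded as data and tallied. This is only a convenience sketch: the ratings are copied from the matrix above, but the 2/1/0 scoring, the requirement keys, and the `rank` and `blockers` helpers are assumptions introduced here, and a simple sum should not replace judgment about which requirement matters most.

```python
# Ratings copied from the matrix above: 2 = ✅, 1 = △, 0 = ❌.
SYSTEMS = ("Mem0", "Letta", "Zep/Graphiti")
MATRIX = {
    "keep_existing_stack": {"Mem0": 2, "Letta": 0, "Zep/Graphiti": 1},
    "fast_prototype": {"Mem0": 2, "Letta": 1, "Zep/Graphiti": 1},
    "multi_day_autonomy": {"Mem0": 1, "Letta": 2, "Zep/Graphiti": 1},
    "self_hosting_required": {"Mem0": 1, "Letta": 2, "Zep/Graphiti": 2},
    "temporal_state_changes": {"Mem0": 0, "Letta": 1, "Zep/Graphiti": 2},
    "compliance_required": {"Mem0": 1, "Letta": 0, "Zep/Graphiti": 2},
    "multi_hop_reasoning": {"Mem0": 1, "Letta": 1, "Zep/Graphiti": 2},
    "code_level_control": {"Mem0": 0, "Letta": 2, "Zep/Graphiti": 1},
}


def rank(requirements: list[str]) -> list[tuple[str, int]]:
    """Sum ratings over the stated requirements, highest total first."""
    totals = {name: 0 for name in SYSTEMS}
    for requirement in requirements:
        for name, rating in MATRIX[requirement].items():
            totals[name] += rating
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def blockers(requirements: list[str]) -> list[str]:
    """Systems rated ❌ on at least one stated requirement."""
    return sorted(
        {name for r in requirements for name, rating in MATRIX[r].items() if rating == 0}
    )


needs = ["temporal_state_changes", "compliance_required"]
print(rank(needs))
print(blockers(needs))
```

Checking `blockers` first is usually more informative than the ranking, since a single ❌ on a hard requirement rules a system out regardless of its total.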


Pros and Cons

Strengths

| System | Core Strengths |
| --- | --- |
| Mem0 | Connects to existing frameworks without modification; reported p95 search latency of 200 ms and 91% token reduction; supports 20+ vector store backends; active community (GitHub 48k+) |
| Letta | Fully open source (Apache 2.0); achieves effectively infinite context through autonomous agent memory management; writes are readable immediately (transactional consistency); memory behavior is traceable via logs |
| Zep | Best-in-class temporal queries with temporal fact management; SOC 2 / HIPAA / GDPR certified; hybrid search (semantic + keyword + graph) achieves 94.8% on DMR Benchmark; Graphiti engine is an independent Apache 2.0 open source project |

Weaknesses and Caveats

One trap that's easy to miss: Mem0's graph memory is exclusive to the cloud Pro plan ($249/month). Self-hosting gives you only vector search — relational multi-hop reasoning is unavailable. If multi-hop reasoning is a core requirement, directly integrating Graphiti from the start may be the better choice.

Letta has high memory function design complexity. The mistake our team made early on was ignoring Core Memory's capacity limits and trying to push everything into Core. Since Core is always loaded into context, it has size constraints, and if you don't nail that design upfront, the agent starts storing data in unexpected layers. For small-scale prototypes, it's recommended to start with Letta Cloud to get a feel for it, then migrate to self-hosting.

Honestly, Zep's token consumption can be higher than expected. Because the LLM evaluates fact conflicts every time, it's safer to run a cost test first for high-frequency conversation services. If cost is a concern, you can also use only the Graphiti engine independently instead of the full Zep platform.

| System | Weakness | Mitigation |
| --- | --- | --- |
| Mem0 | Graph memory is cloud Pro only; self-hosting limited to pure vector search | Consider direct Graphiti integration if multi-hop reasoning is required |
| Letta | Self-hosting operational overhead; Core Memory capacity design required; initial learning curve | Start with Letta Cloud for small prototypes, then migrate |
| Zep | Token consumption can be high due to LLM-based fact evaluation; full platform is SaaS-centric | For cost-sensitive cases, use only the Graphiti engine independently |
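Before running a real cost test, a back-of-envelope estimate helps decide whether memory-side LLM calls are even in the right order of magnitude. The function below is a rough sketch: it assumes one memory-related LLM call per conversation turn, and every input in the example is a placeholder rather than a measured figure or a real price.

```python
def monthly_memory_cost(
    conversations_per_day: int,
    turns_per_conversation: int,
    tokens_per_memory_call: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Estimated monthly USD spent on memory-side LLM calls (one call per turn)."""
    calls = conversations_per_day * turns_per_conversation * days
    total_tokens = calls * tokens_per_memory_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: replace every number with figures measured on your own traffic.
estimate = monthly_memory_cost(
    conversations_per_day=1_000,
    turns_per_conversation=10,
    tokens_per_memory_call=1_500,
    usd_per_million_tokens=2.0,
)
print(f"${estimate:,.2f} per month")
```

The useful part is the shape of the formula: cost scales linearly with turn count, so batching several turns into one memory write reduces it proportionally.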

The Most Common Mistakes in Practice

  1. Trying to store every piece of conversation content unconditionally — our team made this mistake early on. Memory systems involve LLM calls, which means additional latency and cost. Designing criteria upfront for "what information to store and why" (importance thresholds, information type filters) is what saves money later.

  2. Choosing based solely on benchmark scores — LOCOMO (conversational memory accuracy) or LongMemEval (comprehensive long-term memory reasoning, accepted at ICLR 2025) scores don't fully represent real-world service quality. Running your own evaluation with 10–20 scenarios tailored to your workload characteristics (proportion of temporal queries, multi-hop reasoning needs, session frequency) is important to do alongside benchmarks.

  3. Deciding between self-hosting and managed SaaS based on cost alone — data sovereignty, compliance requirements (HIPAA, GDPR), and your operations team's capabilities all need to be factored in. In particular, it's worth confirming upfront that Mem0's graph memory is cloud-only, since the feature gap with self-hosting is significant.
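For the first mistake in this list, the storage criteria can be written down as an explicit policy that runs before anything reaches the memory system. The sketch below is illustrative: the allowed types, the threshold value, and the `Candidate` and `should_store` names are hypothetical, and how `importance` gets scored (classifier, heuristic, or LLM) is left open.

```python
from dataclasses import dataclass

# Hypothetical policy values; tune them against your own data.
ALLOWED_TYPES = {"preference", "profile", "decision"}
MIN_IMPORTANCE = 0.6


@dataclass
class Candidate:
    text: str
    kind: str          # e.g. "preference", "smalltalk"
    importance: float  # 0.0 to 1.0


def should_store(candidate: Candidate) -> bool:
    """Store only allowed fact types that clear the importance threshold."""
    return candidate.kind in ALLOWED_TYPES and candidate.importance >= MIN_IMPORTANCE


candidates = [
    Candidate("Prefers invoices by email", "preference", 0.9),
    Candidate("Said thanks", "smalltalk", 0.2),
    Candidate("Might like dark mode", "preference", 0.3),
]
to_store = [c.text for c in candidates if should_store(c)]
print(to_store)
```

Two of the three candidates are dropped before any storage call is made, which is where the latency and cost savings come from.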


Closing Thoughts

Personally, if I were starting a new project, I'd probably prototype quickly with Mem0 first, then switch to Graphiti or Letta at the point where I answer "yes" to either "are temporal state changes important?" or "is the agent working autonomously over multiple days?" Mem0 for fast integration, Letta for autonomous agent control, Zep for enterprise environments where you need to track changes over time — the right choice depends on the nature of your service and your team's operational capabilities.

Three steps you can take right now:

  1. 30-minute prototyping with Mem0 — Run pip install mem0ai, then add just two lines — memory.add() and memory.search() — to your existing chatbot code to quickly get a feel for long-term memory. The official documentation has a well-organized Getting Started example.

  2. Explore Letta or Graphiti: if the answer to "does the agent work autonomously over multiple days?" is yes, try Letta. If the answer to "are temporal state changes (address changes, contract renewals, etc.) core?" is yes, spin up Graphiti via Docker and run the example code above. If both answers are "no," digging deeper into Mem0 may be the more efficient path.

  3. Build your own evaluation set — Beyond public benchmarks, write 10–20 real service scenarios yourself and measure retrieval accuracy and latency for each system. Letta also provides an open source evaluation framework, Letta Evals, for this purpose.
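A do-it-yourself evaluation set does not need a framework to get started. The harness below is a minimal sketch: the scenario format, the substring-based hit check, and the `evaluate` and `fake_search` names are assumptions made for illustration. To compare real systems, wrap each one's search call in a function that takes a query string and returns a list of memory strings.

```python
import time
from statistics import mean
from typing import Callable

# Each scenario pairs a query with a substring the retrieved memories must contain.
# These two are placeholders; write your own from real service traffic.
SCENARIOS = [
    {"query": "What plan is the user on?", "expect": "premium"},
    {"query": "Where does the user live now?", "expect": "Busan"},
]


def evaluate(search: Callable[[str], list[str]], scenarios: list[dict]) -> dict:
    """Run each query through `search` and report hit rate and mean latency."""
    hits, latencies = 0, []
    for scenario in scenarios:
        started = time.perf_counter()
        results = search(scenario["query"])
        latencies.append(time.perf_counter() - started)
        if any(scenario["expect"].lower() in r.lower() for r in results):
            hits += 1
    return {
        "hit_rate": hits / len(scenarios),
        "mean_latency_ms": mean(latencies) * 1000,
    }


# Stand-in for a real memory system, used only to show the harness running.
def fake_search(query: str) -> list[str]:
    return ["User is on the premium plan"] if "plan" in query else []


report = evaluate(fake_search, SCENARIOS)
print(report["hit_rate"])
```

Substring matching is a blunt check. Once the scenario list stabilizes, it is worth replacing it with whatever correctness criterion actually matters for your service.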


References

  • Mem0 Open Source Overview | docs.mem0.ai — Getting Started and official API reference
  • Mem0 GitHub | mem0ai/mem0 — Source code and issue tracker
  • Mem0 Paper | arXiv:2504.19413 — Original single-pass layered extraction algorithm paper
  • State of AI Agent Memory 2026 | Mem0 Blog — Benchmark numbers and market overview
  • Letta Official Site | letta.com — Letta Cloud and Evals framework
  • Letta GitHub | letta-ai/letta — Full open source codebase (Apache 2.0)
  • Letta Memory Management Docs | docs.letta.com — In-depth explanation of Core/Recall/Archival layers
  • Agent Memory: How to Build Agents that Learn and Remember | Letta Blog — Agent memory design architecture
  • Zep Official Site | getzep.com — Enterprise plans and certification information
  • Zep Paper | arXiv:2501.13956 — Original Temporal Knowledge Graph architecture paper
  • Graphiti GitHub | getzep/graphiti — Apache 2.0 standalone open source engine
  • Graphiti: Knowledge Graph Memory for an Agentic World | Neo4j Blog — Deep dive on Neo4j integration architecture
  • Mem0 + AWS Reference Architecture | AWS Blog — Fully managed setup with ElastiCache + Neptune Analytics
  • Mem0 vs Zep vs Letta Comparison | HydraDB — Feature comparison table for all three systems
  • Agent Memory at Scale 2026 Comparison | AgentMarketCap — Market vendor landscape and comprehensive benchmarks
  • Zep vs Mem0: Benchmarks and Pricing | Atlan — Detailed performance and cost comparison