Benchmark results

Engram sets a new state of the art.

Evaluated on LOCOMO (Long-term Conversational Memory), the standard benchmark for agent memory systems: 10 conversations of ~418 turns each, with 1,540 questions across 4 categories. Engram achieves a 19.6% relative improvement over Mem0.

The same benchmark Mem0 used to claim state of the art.

Overall Accuracy

LLM-as-a-Judge score on LOCOMO benchmark

Engram: 80.0%
Mem0: 66.9%
MEMORY.md: 28.8%

Each system evaluated using its preferred/published LLM. MEMORY.md baseline uses a manually maintained memory file.

Key insight

"Recall-based beats extraction-based."

Engram invests intelligence at read time, when the query is known, rather than at write time, when you can't yet know what will matter. This is the fundamental architectural difference.
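The contrast can be sketched in a few lines. This is an illustrative toy, not Engram's actual API; all names here are hypothetical:

```typescript
type Turn = { speaker: string; text: string };

// Extraction-based (write-time): decide NOW which facts to keep.
// Anything discarded here is lost, even if a later query needed it.
function extractionWrite(turn: Turn, store: string[]): void {
  if (turn.text.length > 20) store.push(turn.text); // crude "is this a fact?" guess
}

// Recall-based (read-time): store cheaply, judge relevance only once
// the query is known, when relevance can actually be evaluated.
function recallRead(query: string, store: string[]): string[] {
  return store.filter((t) => t.toLowerCase().includes(query.toLowerCase()));
}
```

The write-time heuristic must predict future relevance blindly; the read-time filter has the query in hand.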

How we stack up

Engram vs Mem0 vs Zep vs Letta

Comparing the leading AI agent memory solutions. Engram is a persistent memory MCP server built for Claude Code, Cursor, and any AI coding agent.

LOCOMO scores from arXiv:2402.17753. DMR scores from arXiv:2310.08560 (MemGPT/MSC-Self-Instruct, 500 conversations). Engram tested with Gemini 2.5 Flash; Mem0 scores from their published results with OpenAI. Zep LOCOMO corrected per Mem0 replication. Zep and MemGPT DMR scores use GPT-4 Turbo.

| Feature | Engram | Mem0 | Zep / Graphiti | Letta / MemGPT |
| --- | --- | --- | --- | --- |
| LOCOMO Benchmark | 80.0% | 66.9% | ~58.4% (corrected by Mem0) | Not published |
| DMR Benchmark | 92.0% | Not published | 94.8% | 93.4% |
| Token Efficiency | 776 tokens/query | ~2,000+ tokens/query | High (graph traversal) | Variable (agent-managed) |
| MCP Support | Native MCP server | Community MCP wrapper | No native MCP | No native MCP |
| Setup | `npm install -g engram-sdk && engram init` | `pip install mem0ai` + Qdrant setup | Docker Compose + Neo4j + config | `pip install letta` + server setup |
| Language | TypeScript / Node.js | Python | Python | Python |
| Architecture | SQLite + sqlite-vec, single binary | Python + Qdrant / PostgreSQL | Python + Neo4j + PostgreSQL | Python agent framework + PostgreSQL |
| Consolidation | Automatic LLM consolidation with spreading activation | Basic deduplication | Graph-based temporal reasoning | Agent-managed memory editing |
| Temporal Memory | Bi-temporal (valid_from / valid_until) | No | Full bi-temporal graph | No |
| Knowledge Graph | Knowledge graph with entity extraction | Limited entity extraction | Neo4j knowledge graph (core feature) | No |
| Self-Hosted | Yes, zero dependencies | Yes, requires Qdrant | Requires Neo4j + Docker | Yes, requires PostgreSQL |
| LLM Support | Gemini, OpenAI, Groq, Ollama, any OpenAI-compatible | OpenAI, Anthropic, others via LiteLLM | OpenAI primarily | OpenAI, Anthropic, others |
| Pricing | Free (personal use) + hosted plans from $29/mo | Open source + hosted API | Open source + Zep Cloud | Open source + Letta Cloud |
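Bi-temporal memory (the valid_from / valid_until row above) means each fact records both when it became true and when it stopped being true, so the system can answer "what was true as of time T" instead of only "what is true now". A minimal sketch of that filter, with field names mirroring the table but illustrative types:

```typescript
interface Fact {
  text: string;
  validFrom: number;         // epoch ms when the fact became true
  validUntil: number | null; // epoch ms when it stopped being true; null = still true
}

// Return the facts that were valid at a given point in time.
function factsAsOf(facts: Fact[], at: number): Fact[] {
  return facts.filter(
    (f) => f.validFrom <= at && (f.validUntil === null || at < f.validUntil)
  );
}
```

This is how an updated fact ("Alice changed jobs") can supersede an old one without deleting it: the old fact's validUntil is closed, and both remain queryable at their respective times.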

Why developers choose Engram over Mem0, Zep, and Letta

Most AI agent memory solutions require heavy infrastructure. Mem0 needs a Qdrant vector database. Zep and Graphiti require Neo4j and Docker. Letta (formerly MemGPT) needs PostgreSQL and a separate server process. Engram runs as a single binary with SQLite, installs via npm, and works as a native MCP server for Claude Code and Cursor.
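For Claude Code, registering an MCP server means an entry in `.mcp.json`. The `mcpServers` shape below is Claude Code's standard format; the command and args shown for Engram are an assumption for illustration (check what `engram init` generates for the exact entry):

```json
{
  "mcpServers": {
    "engram": {
      "command": "engram",
      "args": ["mcp"]
    }
  }
}
```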

On the LOCOMO benchmark (1,540 questions across 10 conversations), Engram achieves 80.0% accuracy while using 96.6% fewer tokens than full-context approaches. The key architectural insight: invest intelligence at read time (when the query is known), not write time (when you don't know what will matter).

Engram supports any OpenAI-compatible LLM provider, including Gemini, Groq, Cerebras, Ollama, and Together AI, via a single environment variable. No vendor lock-in, no API key requirements beyond your existing LLM provider.

Token efficiency

93.6% fewer tokens than full context

Engram: 1,504 tokens per query
Full Context: 23,423 tokens per query
Token reduction: 93.6%

Better accuracy with 15x fewer tokens than stuffing the full conversation into context.
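The figures above can be checked directly from the two per-query token counts:

```typescript
// Verify the token-reduction figures from the chart above.
const engramTokens = 1_504;
const fullContextTokens = 23_423;

const reductionPct = (1 - engramTokens / fullContextTokens) * 100; // ≈ 93.58
const ratio = fullContextTokens / engramTokens;                    // ≈ 15.57

console.log(reductionPct.toFixed(1)); // "93.6"
console.log(Math.floor(ratio));       // 15 (the "15x" above)
```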

Methodology

How we tested

Benchmark: LOCOMO (arXiv:2402.17753)
Conversations evaluated: 10 of 10
Questions: 1,540
Scoring: LLM-as-a-Judge
Engram LLM: Gemini 2.0 Flash
Mem0 LLM: GPT-4o-mini (published)
Mem0 source: arXiv:2504.19413

Mem0 scores are from their published paper (10 conversations, 10 runs averaged). Engram was evaluated on all 10 LOCOMO conversations (1,540 questions), using the same LLM-as-a-Judge methodology as Mem0's paper.
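LLM-as-a-Judge scoring reduces to a simple loop: for each question, a judge model compares the system's answer against the gold answer, and accuracy is the fraction judged correct. A structural sketch, where `judge` stands in for the real LLM call (not Engram's actual harness):

```typescript
type QA = { question: string; gold: string; answer: string };

// Score a benchmark run: `judge` decides whether each answer matches its
// gold answer (in practice, an LLM call; here, any async predicate).
async function scoreRun(
  items: QA[],
  judge: (item: QA) => Promise<boolean>
): Promise<number> {
  let correct = 0;
  for (const item of items) {
    if (await judge(item)) correct++;
  }
  return correct / items.length; // accuracy, e.g. 0.800 for 80.0%
}
```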