
Six Memory Systems, One Benchmark: What We Learned

We ran every major open-source AI memory system through LongMemEval's 500-question gauntlet. The results reveal three tiers of performance — and a surprising baseline that most systems can't beat.

LongMemEval v1.0 — All Systems (500 Questions)

System        Score   GitHub stars
LlamaIndex    59.0%   38K
LangChain     59.0%   98K
LLM Baseline  57.6%   -
AutoGPT       47.4%   170K
Mem0 OSS      32.4%   24.8K
Cognee        20.0%   3.8K
Graphiti      0.0%    4.2K

LLM Baseline (57.6%): the bar to beat

Three Tiers Emerged

The results fall into three clear groups:

Tier 1: Above Baseline (59%)

LlamaIndex and LangChain both hit 59.0%. These frameworks add enough structure to conversation memory to slightly outperform raw context. The margin is thin: just 1.4 percentage points above the baseline, or 7 of 500 questions.
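To put that thin margin in perspective, here is a rough back-of-the-envelope sketch. It converts the gap into a question count and estimates the sampling noise of a 59.0% score on 500 questions. The binomial standard error assumes the questions are independent, which is a simplifying assumption of this sketch and not part of the benchmark's methodology.

```python
import math

N_QUESTIONS = 500        # LongMemEval question count
TIER1_SCORE = 0.590      # LlamaIndex and LangChain
BASELINE_SCORE = 0.576   # LLM baseline


def margin_in_questions(score: float, baseline: float, n: int) -> int:
    """Convert a score gap into a whole number of questions."""
    return round((score - baseline) * n)


def binomial_standard_error(score: float, n: int) -> float:
    """Standard error of an accuracy estimate, assuming independent questions."""
    return math.sqrt(score * (1.0 - score) / n)


gap_questions = margin_in_questions(TIER1_SCORE, BASELINE_SCORE, N_QUESTIONS)
std_error = binomial_standard_error(TIER1_SCORE, N_QUESTIONS)

print(f"Gap: {gap_questions} questions")
print(f"Standard error: {std_error * 100:.1f} percentage points")
```

Under that assumption the standard error comes out at roughly 2.2 percentage points, so a 1.4-point lead is best read as "on par with or slightly above the baseline" and not as a decisive win.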

Tier 2: Below Baseline (32-47%)

AutoGPT (47.4%) and Mem0 OSS (32.4%) fall below the baseline. These systems are actively losing information compared to just using the raw LLM context window. Vector retrieval alone doesn't work for conversational memory.

Tier 3: Near Zero (0-20%)

Cognee (20.0%) and Graphiti (0.0%) sit at the bottom. Knowledge graph systems built for document indexing don't map well to conversational memory recall. These systems may excel at different tasks, but LongMemEval isn't one of them.
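The tiering can be written down as a small sketch. The scores come from the LongMemEval results above; the `tier` function and its 20-point "near zero" cutoff are illustrative assumptions that simply mirror the ranges in the tier headings.

```python
BASELINE = 57.6  # LLM baseline score (%) on LongMemEval

# Scores (%) from the LongMemEval results reported in this post.
SCORES = {
    "LlamaIndex": 59.0,
    "LangChain": 59.0,
    "AutoGPT": 47.4,
    "Mem0 OSS": 32.4,
    "Cognee": 20.0,
    "Graphiti": 0.0,
}


def tier(score: float, baseline: float = BASELINE, near_zero_cutoff: float = 20.0) -> int:
    """Return 1 (above baseline), 2 (below baseline), or 3 (near zero).

    The cutoff is an assumption chosen to match the 0-20% range of Tier 3.
    """
    if score > baseline:
        return 1
    if score > near_zero_cutoff:
        return 2
    return 3


tiers = {name: tier(score) for name, score in SCORES.items()}
for name, t in tiers.items():
    print(f"{name}: tier {t} ({SCORES[name] - BASELINE:+.1f} points vs. baseline)")
```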

Why GitHub Stars Don't Predict Performance

AutoGPT has 170K stars, more than 4x LlamaIndex's 38K, yet it scored about 10 points below the LLM baseline (47.4% vs. 57.6%). Mem0 has 24.8K stars and an active community but scored 32.4%.

Community size correlates with usefulness, ecosystem maturity, and marketing — not memory quality. The only way to know if a memory system works is to benchmark it.

The Temporal Reasoning Gap

Every single system scored near zero on temporal reasoning questions. These ask things like “How many days between event X and event Y?” or “Which happened first?”

No system tested — including the LLM baseline — can answer these reliably. This represents the biggest opportunity in AI memory: a system that actually indexes temporal relationships would have a 40% advantage over everything else tested.
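As a minimal sketch of what "indexing temporal relationships" could mean, the snippet below stores a date for each remembered event and answers the two question types mentioned above. The `TemporalIndex` class, its methods, and the example events are all hypothetical and invented for illustration; none of the tested systems is known to work this way.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TemporalIndex:
    """Hypothetical store mapping event names to the dates they occurred."""

    events: dict[str, date] = field(default_factory=dict)

    def add(self, name: str, when: date) -> None:
        self.events[name] = when

    def days_between(self, first: str, second: str) -> int:
        """Absolute number of days separating two indexed events."""
        return abs((self.events[second] - self.events[first]).days)

    def happened_first(self, a: str, b: str) -> str:
        """Name of whichever event has the earlier date (a wins ties)."""
        return a if self.events[a] <= self.events[b] else b


# Made-up events, standing in for facts extracted from a conversation.
index = TemporalIndex()
index.add("moved to Berlin", date(2023, 3, 1))
index.add("started new job", date(2023, 4, 15))

print(index.days_between("moved to Berlin", "started new job"))
print(index.happened_first("started new job", "moved to Berlin"))
```

The hard part in practice is extracting reliable dates from conversation text in the first place; once they are stored explicitly, the arithmetic itself is trivial.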

LoCoMo Results: Multi-Session Memory

We also ran LoCoMo (1,540 questions) on three memory systems plus the LLM baseline:

System        LoCoMo Score  Questions
LlamaIndex    54.8%         1,540
LangChain     51.9%         1,540
LLM Baseline  50.4%         1,540
Mem0 OSS      0.0%          1,540

Mem0 OSS scored 0.0% on LoCoMo: it did not answer a single question correctly. This isn't a bug in our adapter; the open-source version simply doesn't handle multi-session memory at the scale LoCoMo requires.

What This Means for Builders

If you're building an AI agent that needs to remember past conversations:

  1. Start with LlamaIndex or LangChain: they're the only ones that beat the baseline
  2. Always compare against the LLM baseline: if your memory system scores below 57.6%, you'd be better off without it
  3. Don't trust star counts: AutoGPT has 170K stars but underperforms the baseline by about 10 points
  4. Run your own benchmark: pip install benchd-harness and test your system in minutes
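The "compare against the baseline" advice can be sketched as a tiny evaluation loop. This is not the benchd-harness API; the question format, the substring-match grading, and the toy answer functions are all assumptions made for illustration, and a real benchmark would typically grade answers more carefully.

```python
from typing import Callable, Sequence

# Each question is a (prompt, expected_answer) pair; this format is hypothetical.
Question = tuple[str, str]


def accuracy(answer_fn: Callable[[str], str], questions: Sequence[Question]) -> float:
    """Percentage of questions whose answer contains the expected string."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(
        expected.lower() in answer_fn(prompt).lower()
        for prompt, expected in questions
    )
    return 100.0 * correct / len(questions)


def beats_baseline(system_score: float, baseline_score: float) -> bool:
    """True only if the memory system strictly outperforms the baseline."""
    return system_score > baseline_score


# Toy stand-ins for a memory system and a raw-context baseline.
def memory_system(prompt: str) -> str:
    return "Ana lives in Lisbon." if "live" in prompt else "I don't know."


def baseline(prompt: str) -> str:
    return "Lisbon" if "live" in prompt else "Her dog is Pixel."


questions = [("Where does Ana live?", "Lisbon"), ("What is Ana's dog called?", "Pixel")]

memory_score = accuracy(memory_system, questions)
baseline_score = accuracy(baseline, questions)
print(memory_score, baseline_score, beats_baseline(memory_score, baseline_score))
```

Running both the memory system and the raw-context baseline over the same questions is the key design choice: a score only means something relative to what the LLM manages without any memory layer.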
