A Raw LLM Beats Most Memory Systems on LongMemEval
We ran four systems through the full 500-question LongMemEval benchmark under identical conditions. The result that surprised us: a plain GPT-4o-mini with no memory layer outperformed most dedicated memory systems.
LongMemEval v1.0 — Nuance Scores (LLM-Judged)
All runs used GPT-4o-mini via OpenRouter. Full methodology at /methodology.
What We Tested
LongMemEval is a 500-question benchmark designed to test how well systems remember information across long conversations. It covers three dimensions:
- •Recall — Can you retrieve specific facts from past conversations?
- •Temporal reasoning — Can you understand when things happened and their order?
- •Knowledge update — When facts change, do you track the latest version?
Each system received the exact same conversation histories and questions. The “LLM Baseline” system is a plain GPT-4o-mini that receives the full conversation history in its context window with no memory layer, no vector store, and no summarization.
The Surprising Result
The LLM baseline scored 57.6%— a score that most memory systems failed to beat. Only LlamaIndex's memory module (59.0%) managed to edge it out.
This tells us something important: most memory systems are destroying information faster than they're organizing it. When you summarize, compress, or selectively store conversation turns, you lose the raw signal that the LLM could have used to answer correctly.
The baseline wins on recall-heavy questions because it literally has the full conversation. Memory systems lose when their extraction or compression drops the specific detail being asked about.
Where Memory Systems Should Win
Memory systems have the potentialto beat the baseline on temporal reasoning and knowledge updates — these require understanding structure that raw context doesn't encode well. But in practice, most current implementations don't.
Every system scored poorly on temporal reasoning. The baseline scored near-zero because it has no temporal index. But memory systems with timestamped storage alsoscored near-zero — suggesting they store timestamps but don't use them during recall.
This is the opportunity. A memory system that actually indexes temporal relationships and change events should dramatically outperform the baseline on 40% of LongMemEval questions.
A Note on Mem0's Self-Reported 93.4%
Mem0's managed platform claims 93.4% on LongMemEval. Our test of the open-source edition scored 32.4%. These are different products— the managed platform has proprietary extraction, ranking, and retrieval that the OSS library doesn't include.
We haven't verified the managed platform's score yet. That's on the roadmap. When we do, it will be an independent, signed run — not a self-report. Until then, the 93.4% claim is labeled “Self-Reported” on our leaderboard.
What's Next
We're expanding coverage. Next up: LoCoMo benchmark results (1,540 questions, different evaluation dimensions), Cognee with direct OpenAI embeddings, and a re-run of LangChain with our updated adapter that no longer crashes at question 380.
The harness is open source. If you maintain a memory system and want to verify your own scores, run the harness yourself or claim your profile for an official vendor-verified run.
| System | Recall | Temporal | Overall | Benchmark |
|---|---|---|---|---|
| LlamaIndex | 68.8 | 23.8 | 59.0 | LongMemEval (500q) |
| LLM Baseline | 72.5 | ~0 | 57.6 | LongMemEval (500q) |
| LangChain | 59.0 | ~0 | 59.0 | LongMemEval (500q) |
| Mem0 OSS | 40.2 | ~0 | 32.4 | LongMemEval (500q) |
Stay in the loop
New benchmark results, methodology updates, and memory system rankings. No spam.
Unsubscribe anytime. We respect your inbox.