We Stopped Comparing Filing Cabinets to Chatbots

gbrain scored 0% on LongMemEval — our conversational memory benchmark. That's like testing a filing cabinet on a pop quiz. It doesn't mean the filing cabinet is broken. It means we were measuring the wrong thing. So we built separate tracks.

The Problem: One Benchmark Can't Rule Them All

LongMemEval tests conversational memory — multi-session chat history, temporal reasoning, user preference tracking across dialogue. It's the right test for systems like LlamaIndex, LangChain, Mem0, and Letta that integrate into chat agents.

But gbrain isn't a chat memory system. It's a knowledge brain — a structured store for documents, facts, and domain knowledge. Asking it “what did the user say in session 3?” is a category error. It was never designed to track conversation history.

Publishing gbrain at 0% on a conversational benchmark isn't honest evaluation. It's a misleading comparison that punishes a system for not being something it never claimed to be.

The Fix: Separate Tracks

Bench'd now runs two tracks:

Conversational Memory Track

Multi-session dialogue, temporal reasoning, preference updates, entity tracking across conversations. Benchmarks: LongMemEval, Reliability, Poisoning Resistance. For: LlamaIndex, LangChain, Mem0, Letta, AutoGPT, CrewAI.

Knowledge Retrieval Track

Document storage, semantic search, knowledge updates, multi-page reasoning. Purpose-built for knowledge brains that store and retrieve structured information. For: gbrain, Cognee, Graphiti, Quivr.
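
One way to picture the split is as a small lookup from each system to the track it is listed under, and from each track to its benchmarks. The sketch below is illustrative only: `Track`, `TRACK_BENCHMARKS`, `TRACK_SYSTEMS`, `primary_track`, and `benchmarks_for` are hypothetical names, not Bench'd's actual code. The system and benchmark lists are copied from the two track descriptions above.

```python
from enum import Enum


class Track(Enum):
    CONVERSATIONAL_MEMORY = "Conversational Memory"
    KNOWLEDGE_RETRIEVAL = "Knowledge Retrieval"


# Benchmarks run on each track, as named in the post.
TRACK_BENCHMARKS = {
    Track.CONVERSATIONAL_MEMORY: ["LongMemEval", "Reliability", "Poisoning Resistance"],
    Track.KNOWLEDGE_RETRIEVAL: ["Knowledge Retrieval Benchmark"],
}

# Systems each track is intended for, as listed in the post.
TRACK_SYSTEMS = {
    Track.CONVERSATIONAL_MEMORY: [
        "LlamaIndex", "LangChain", "Mem0", "Letta", "AutoGPT", "CrewAI",
    ],
    Track.KNOWLEDGE_RETRIEVAL: ["gbrain", "Cognee", "Graphiti", "Quivr"],
}


def primary_track(system: str) -> Track:
    """Return the track a system is listed under (hypothetical helper)."""
    for track, systems in TRACK_SYSTEMS.items():
        if system in systems:
            return track
    raise KeyError(f"no track assigned for {system!r}")


def benchmarks_for(system: str) -> list[str]:
    """Return the benchmarks of the system's primary track."""
    return TRACK_BENCHMARKS[primary_track(system)]


print(benchmarks_for("gbrain"))
print(benchmarks_for("Mem0"))
```

Under this sketch, gbrain is routed to the Knowledge Retrieval benchmark and never to LongMemEval, which is the category error the tracks are meant to avoid.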

Knowledge Retrieval Benchmark: What It Tests

We built 20 test cases across four dimensions:

  • Document Storage — Can the system ingest and faithfully store multi-page documents?
  • Semantic Search — Can it find relevant content from natural language queries, not just keyword matches?
  • Knowledge Updates — When a document is updated, does the system reflect the new version?
  • Multi-Page Reasoning — Can it synthesize answers from information spread across multiple documents?
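
The four dimensions above can be sketched as a tag on each test case, with a score reported overall and per dimension. This is a minimal sketch, assuming every case is pass/fail and equally weighted. `CaseResult`, `pass_rates`, and the sample results are hypothetical and invented for illustration; they are not real benchmark data or Bench'd's scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass

DIMENSIONS = (
    "Document Storage",
    "Semantic Search",
    "Knowledge Updates",
    "Multi-Page Reasoning",
)


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    dimension: str
    passed: bool


def pass_rates(results):
    """Return (overall, per_dimension) pass rates as fractions in [0, 1]."""
    if not results:
        raise ValueError("no results to score")
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {r.dimension!r}")
        totals[r.dimension] += 1
        passes[r.dimension] += int(r.passed)
    per_dimension = {d: passes[d] / totals[d] for d in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_dimension


# Invented sample: one case per dimension, one failure.
sample = [
    CaseResult("kr-01", "Document Storage", True),
    CaseResult("kr-02", "Semantic Search", True),
    CaseResult("kr-03", "Knowledge Updates", False),
    CaseResult("kr-04", "Multi-Page Reasoning", True),
]

overall, per_dimension = pass_rates(sample)
print(f"overall: {overall:.0%}")
for dimension, rate in per_dimension.items():
    print(f"{dimension}: {rate:.0%}")
```

Reporting the per-dimension rates alongside the overall number shows where a system fails, not just how often.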

Knowledge Retrieval Benchmark v1.0 — 20 Test Cases

  • gbrain: 100%
  • LLM Baseline: 95%
  • LlamaIndex: 95%
  • Letta: 80%
  • Cognee: 0%
  • Graphiti: 0%
  • Quivr: 0%

Why Tracks Make Us More Honest, Not Less Rigorous

Separating tracks doesn't lower the bar — it puts the bar in the right place. gbrain at 100% on Knowledge Retrieval and 0% on Conversational Memory tells a clear story: this system excels at document-based knowledge work but doesn't do chat memory. That's useful information for someone choosing a system.

A single leaderboard mixing both would either unfairly penalize knowledge brains or unfairly reward chat systems that can't handle documents. Neither serves the people trying to pick the right tool.
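
A quick arithmetic check makes the point concrete. The sketch below assumes a naive blended leaderboard that averages the two track scores with equal weight. The gbrain figures are the ones quoted above; `hypothetical_generalist` and the `blended` helper are invented for illustration.

```python
def blended(scores: dict[str, float]) -> float:
    """Equal-weight mean of per-track scores (an assumed blending rule)."""
    return sum(scores.values()) / len(scores)


# Scores in percent. gbrain's figures are from this post.
gbrain = {"Knowledge Retrieval": 100.0, "Conversational Memory": 0.0}

# Invented system that is mediocre at both tasks.
hypothetical_generalist = {"Knowledge Retrieval": 50.0, "Conversational Memory": 50.0}

print(blended(gbrain))                   # 50.0
print(blended(hypothetical_generalist))  # 50.0
```

Both systems land on the same blended number even though they behave in opposite ways, which is exactly the information a per-track report preserves.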

Cognee, Graphiti, and Quivr score 0% today — but their adapters are early. We expect these scores to change as implementations mature. The benchmark is ready. The systems need to catch up.

Explore the Methodology

Full details on the Knowledge Retrieval benchmark design, scoring criteria, and track assignment rules are on our methodology page.
