We Stopped Comparing Filing Cabinets to Chatbots
gbrain scored 0% on LongMemEval — our conversational memory benchmark. That's like testing a filing cabinet on a pop quiz. It doesn't mean the filing cabinet is broken. It means we were measuring the wrong thing. So we built separate tracks.
The Problem: One Benchmark Can't Rule Them All
LongMemEval tests conversational memory — multi-session chat history, temporal reasoning, user preference tracking across dialogue. It's the right test for systems like LlamaIndex, LangChain, Mem0, and Letta that integrate into chat agents.
But gbrain isn't a chat memory system. It's a knowledge brain — a structured store for documents, facts, and domain knowledge. Asking it “what did the user say in session 3?” is a category error. It was never designed to track conversation history.
Publishing gbrain at 0% on a conversational benchmark isn't honest evaluation. It's a misleading comparison that punishes a system for not being something it never claimed to be.
The Fix: Separate Tracks
Bench'd now runs two tracks:
Conversational Memory Track
Multi-session dialogue, temporal reasoning, preference updates, entity tracking across conversations. Benchmarks: LongMemEval, Reliability, Poisoning Resistance. For: LlamaIndex, LangChain, Mem0, Letta, AutoGPT, CrewAI.
Knowledge Retrieval Track
Document storage, semantic search, knowledge updates, multi-page reasoning. Purpose-built for knowledge brains that store and retrieve structured information. For: gbrain, Cognee, Graphiti, Quivr.
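To make the split concrete, here is a minimal sketch of how track assignment could be represented. The system and benchmark names come from the lists above; the `TRACKS` layout and the `track_for` and `benchmarks_for` helpers are hypothetical, and Bench'd's actual configuration may look quite different.

```python
from typing import Optional

# Track membership as listed in this post. The dict layout is an assumption
# made for illustration, not Bench'd's real configuration format.
TRACKS = {
    "conversational_memory": {
        "benchmarks": ["LongMemEval", "Reliability", "Poisoning Resistance"],
        "systems": ["LlamaIndex", "LangChain", "Mem0", "Letta", "AutoGPT", "CrewAI"],
    },
    "knowledge_retrieval": {
        "benchmarks": ["Knowledge Retrieval Benchmark v1.0"],
        "systems": ["gbrain", "Cognee", "Graphiti", "Quivr"],
    },
}


def track_for(system: str) -> Optional[str]:
    """Return the track a system is assigned to, or None if it is not listed."""
    for track, spec in TRACKS.items():
        if system in spec["systems"]:
            return track
    return None


def benchmarks_for(system: str) -> list:
    """Return only the benchmarks that belong to the system's own track."""
    track = track_for(system)
    return list(TRACKS[track]["benchmarks"]) if track else []


print(track_for("gbrain"))
print(benchmarks_for("Mem0"))
```

The point of the lookup is that a system is only ever run against the benchmarks of its own track, so gbrain never gets asked about session 3.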
Knowledge Retrieval Benchmark: What It Tests
We built 20 test cases across four dimensions:
- Document Storage — Can the system ingest and faithfully store multi-page documents?
- Semantic Search — Can it find relevant content from natural language queries, not just keyword matches?
- Knowledge Updates — When a document is updated, does the system reflect the new version?
- Multi-Page Reasoning — Can it synthesize answers from information spread across multiple documents?
Knowledge Retrieval Benchmark v1.0 — 20 Test Cases
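As a rough illustration of how results across the four dimensions could be rolled up, the sketch below assumes each test case is simple pass/fail and that the overall score is the percentage of cases passed. Both are assumptions, as are the `CaseResult` type, the case IDs, and the sample results; the actual scoring criteria are described on the methodology page.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four dimensions named in this post.
DIMENSIONS = (
    "document_storage",
    "semantic_search",
    "knowledge_updates",
    "multi_page_reasoning",
)


@dataclass
class CaseResult:
    case_id: str      # hypothetical identifier
    dimension: str    # one of DIMENSIONS
    passed: bool      # assumes pass/fail scoring


def score_by_dimension(results):
    """Return the percentage of passed cases for each dimension present."""
    counts = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for r in results:
        if r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {r.dimension}")
        counts[r.dimension][0] += int(r.passed)
        counts[r.dimension][1] += 1
    return {d: 100.0 * p / n for d, (p, n) in counts.items()}


def overall_score(results):
    """Return the percentage of all cases passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


# Made-up sample run, for illustration only (not real benchmark data).
sample = [
    CaseResult("kr-01", "document_storage", True),
    CaseResult("kr-02", "semantic_search", True),
    CaseResult("kr-03", "semantic_search", False),
    CaseResult("kr-04", "knowledge_updates", True),
]
print(score_by_dimension(sample))
print(overall_score(sample))
```

Reporting the per-dimension breakdown alongside the overall number shows where a system fails, not just how often.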
Why Tracks Make Us More Honest, Not Less Rigorous
Separating tracks doesn't lower the bar — it puts the bar in the right place. gbrain at 100% on Knowledge Retrieval and 0% on Conversational Memory tells a clear story: this system excels at document-based knowledge work but doesn't do chat memory. That's useful information for someone choosing a system.
A single leaderboard mixing both would either unfairly penalize knowledge brains or unfairly reward chat systems that can't handle documents. Neither serves the people trying to pick the right tool.
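A quick worked example shows what gets lost. The sketch below uses gbrain's two scores from this post and compares a per-track report with a blended number. The simple average is just one hypothetical way a single leaderboard might mix tracks, and the function names are illustrative.

```python
# gbrain's per-track scores as stated in this post, in percent.
SCORES = {
    "gbrain": {"knowledge_retrieval": 100.0, "conversational_memory": 0.0},
}


def blended_score(track_scores):
    """Hypothetical single-leaderboard number: the plain mean across tracks."""
    return sum(track_scores.values()) / len(track_scores)


def per_track_report(system):
    """Return one line per track, keeping each result separate."""
    return [f"{system} | {track}: {score:.0f}%"
            for track, score in SCORES[system].items()]


for line in per_track_report("gbrain"):
    print(line)
print(f"gbrain | blended: {blended_score(SCORES['gbrain']):.0f}%")
```

The blended figure lands at 50%, which describes neither behaviour: it hides that gbrain handles every knowledge retrieval case and none of the conversational ones.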
Cognee, Graphiti, and Quivr score 0% on Knowledge Retrieval today, but their adapters are early. We expect these scores to change as the implementations mature. The benchmark is ready. The systems need to catch up.
Explore the Methodology
Full details on the design of the Knowledge Retrieval benchmark, its scoring criteria, and the track assignment rules are on our methodology page.