llamaindex-memory 0.0 on LoCoMo | llm-baseline 0.0 on LoCoMo | mem0-local 0.0 on LongMemEval | llamaindex-memory 0.0 on LongMemEval | llm-baseline 0.0 on LongMemEval | langchain-memory 0.0 on LongMemEval | cognee 0.0 on LongMemEval
13 systems independently scored | 64 systems indexed

Blog

Benchmark findings, methodology deep-dives, and analysis of how AI memory systems actually perform.

6 min read
benchmark, results, LongMemEval

A Raw LLM Beats Most Memory Systems on LongMemEval

Our first benchmark results are in for the 6 systems tested. A plain GPT-4o-mini with no memory layer scores 57.6%, higher than LangChain (34.0%) and Mem0 OSS (32.4%). Only LlamaIndex (59.0%) beats the baseline. AutoGPT Memory lands at 47.4%.

8 min read
benchmark, results, comparison

Six Memory Systems, One Benchmark: What We Learned

We ran LlamaIndex, LangChain, AutoGPT, Mem0, Cognee, and Graphiti through 500 questions. Three tiers emerged — and most systems can't beat a plain LLM.

7 min read
benchmark, reliability, adversarial

We Built 25 Trap Questions to Test If AI Memory Systems Hallucinate

Our new Reliability benchmark plants adversarial traps: hallucination questions, changed facts, similar entities, deletion requests. The LLM baseline scores 0% on hallucination.

9 min read
protocol, methodology, fairness

Bench'd Evaluation Protocol v0.1: How We Make Memory Benchmarks Fair

Why we wrote a formal protocol, and how it handles the adapter contract, model locking, trust tiers, the BMI formula, and versioning rules. Never rewrite history.

5 min read
security, poisoning, adversarial

Zero Memory Systems Resist Injection Attacks — Except One

We built 5 adversarial injection tests. Every system fell for them except Letta, which blocked 1 out of 5. Here's what that means for production agents.

6 min read
tracks, knowledge-brain, methodology

We Stopped Comparing Filing Cabinets to Chatbots

gbrain scores 100% when tested on what it's built for. Here's why we created separate tracks for Knowledge Brains vs Conversational Memory.
