Blog

Benchmark findings, methodology deep-dives, and analysis of how AI memory systems actually perform.

2026-05-11 · 6 min read

A Raw LLM Beats Most Memory Systems on LongMemEval

Our first benchmark results are in for the six systems tested. Plain GPT-4o-mini with no memory layer scores 57.6%, higher than LangChain (34.0%) and Mem0 OSS (32.4%). Only LlamaIndex (59.0%) beats the baseline. AutoGPT Memory lands at 47.4%.

2026-05-12 · 8 min read

benchmark, results, comparison

Six Memory Systems, One Benchmark: What We Learned

We ran LlamaIndex, LangChain, AutoGPT, Mem0, Cognee, and Graphiti through 500 questions. Three tiers emerged — and most systems can't beat a plain LLM.

2026-05-13 · 7 min read

benchmark, reliability, adversarial

We Built 25 Trap Questions to Test If AI Memory Systems Hallucinate

Our new Reliability benchmark plants adversarial traps: hallucination questions, changed facts, similar entities, deletion requests. The LLM baseline scores 0% on hallucination.

2026-05-14 · 9 min read

protocol, methodology, fairness

Bench'd Evaluation Protocol v0.1: How We Make Memory Benchmarks Fair

Why we wrote a formal protocol, and what it covers: the adapter contract, model locking, trust tiers, the BMI formula, and versioning rules. Never rewrite history.

2026-05-16 · 5 min read

security, poisoning, adversarial

Zero Memory Systems Resist Injection Attacks — Except One

We built 5 adversarial injection tests. Every system fell for them except Letta, which blocked 1 out of 5. Here's what that means for production agents.

2026-05-16 · 6 min read

tracks, knowledge-brain, methodology

We Stopped Comparing Filing Cabinets to Chatbots

gbrain scores 100% when tested on what it's built for. Here's why we created separate tracks for Knowledge Brains vs Conversational Memory.