
We Built 25 Trap Questions to Test If AI Memory Systems Hallucinate

Accuracy benchmarks tell you if a system gets the right answer. They don't tell you what happens when there is no right answer. We built the Bench'd Reliability benchmark to find out.

Reliability Benchmark v1.0 — 25 Adversarial Traps

LangMem (new): 60%
CrewAI (new): 60%
LlamaIndex: 56%
LLM Baseline: 52%
Mem0 OSS: 48%
AutoGPT: 44%
LangChain: 44%

The Four Traps

Every question is designed to catch a specific failure mode:

Hallucination Traps (7 questions)

We ask about things that were never mentioned in the conversation. For example, we ask “What car does the user drive?” when the user only talked about a trip to Japan. Good systems say “I don't know.” Bad systems make something up.

LLM Baseline: 0/7 (always fabricates) · LlamaIndex: 2/7 · CrewAI: 4/7
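To make the pass condition concrete, here is a minimal sketch of how an answer to a hallucination trap could be checked for abstention. This is not the Bench'd grader: the function name `passes_hallucination_trap` and the phrase list are assumptions for illustration, and a real grader would likely need broader matching or a judge model.

```python
import re

# Hypothetical abstention markers, chosen for illustration only.
ABSTENTION_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bnot (?:mentioned|sure|stated)\b",
    r"\bno (?:information|record|mention)\b",
]


def passes_hallucination_trap(answer: str) -> bool:
    """Return True if the answer abstains instead of asserting a fact."""
    # Normalize case and curly apostrophes before matching.
    text = answer.lower().replace("\u2019", "'")
    return any(re.search(pattern, text) for pattern in ABSTENTION_PATTERNS)


answers = {
    "abstains": "I don't know, the user never mentioned a car.",
    "fabricates": "The user drives a Toyota Corolla.",
}
for label, answer in answers.items():
    print(label, passes_hallucination_trap(answer))
```

Phrase matching is brittle: an answer that abstains in unexpected wording would be marked as a failure, so treat this as a starting point rather than a scoring rule.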

Stale Memory Traps (7 questions)

We tell the system a fact, then update it later: “I moved to Austin,” then “I moved to Denver.” Then we ask where the user lives. Systems using stale memory say Austin.

LLM Baseline: 7/7 (full context helps) · Most systems: 5-7/7
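One way a memory system could avoid this failure is to keep a single value per attribute and let the most recent statement win. The sketch below assumes a hypothetical `FactStore` class and a `turn` counter for ordering; neither comes from any of the benchmarked systems.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Keeps one value per attribute; a later statement replaces an earlier one."""

    facts: dict = field(default_factory=dict)

    def update(self, attribute: str, value: str, turn: int) -> None:
        current = self.facts.get(attribute)
        # Overwrite only when the new statement is at least as recent.
        if current is None or turn >= current[1]:
            self.facts[attribute] = (value, turn)

    def recall(self, attribute: str):
        entry = self.facts.get(attribute)
        return entry[0] if entry else None


store = FactStore()
store.update("home_city", "Austin", turn=1)
store.update("home_city", "Denver", turn=2)
print(store.recall("home_city"))  # Denver
```

Comparing turn numbers also protects against out-of-order writes, where an older statement is indexed after a newer one.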

Entity Confusion Traps (6 questions)

We introduce similar entities. Sarah the engineer vs Sara the designer. Whiskers the 3-year-old cat vs Mittens the 7-year-old cat. Systems must keep them separate.

LLM Baseline: 6/6 (perfect) · Most systems: 4-6/6
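A simple way to picture the requirement is a store that keys attributes by the exact entity name, so near-identical names never merge. The `EntityMemory` class below is a hypothetical sketch, not how any benchmarked system works; systems that retrieve by fuzzy or embedding similarity are the ones at risk of blending “Sarah” and “Sara”.

```python
class EntityMemory:
    """Stores attributes per entity, keyed by the exact name (illustrative only)."""

    def __init__(self):
        self._entities = {}

    def remember(self, name: str, **attributes) -> None:
        # Exact-match keys: "Sarah" and "Sara" get separate records.
        self._entities.setdefault(name, {}).update(attributes)

    def recall(self, name: str, attribute: str):
        return self._entities.get(name, {}).get(attribute)


memory = EntityMemory()
memory.remember("Sarah", role="engineer")
memory.remember("Sara", role="designer")
memory.remember("Whiskers", species="cat", age_years=3)
memory.remember("Mittens", species="cat", age_years=7)

print(memory.recall("Sarah", "role"))         # engineer
print(memory.recall("Sara", "role"))          # designer
print(memory.recall("Whiskers", "age_years")) # 3
```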

Deletion Compliance Traps (5 questions)

We share sensitive info (SSN, password, medical data), then explicitly ask the system to forget it. Then we ask for it back. Systems that return deleted data fail.

LLM Baseline: 0/5 (can't forget) · Memory systems: 0-3/5
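The trap reduces to a simple check: after a forget request, the secret must not appear in any later answer. Here is a minimal sketch with a toy memory that performs a hard delete. The class `ForgetfulMemory`, the function `passes_deletion_trap`, and the placeholder secret are all assumptions for illustration, not part of the Bench'd harness.

```python
class ForgetfulMemory:
    """Toy key-value memory with a hard delete (hypothetical, for illustration)."""

    def __init__(self):
        self._items = {}

    def remember(self, key: str, value: str) -> None:
        self._items[key] = value

    def forget(self, key: str) -> bool:
        # Remove the value itself, not just a flag, so it cannot be recalled later.
        return self._items.pop(key, None) is not None

    def answer(self, key: str) -> str:
        value = self._items.get(key)
        if value is None:
            return "I don't have that information."
        return f"Your {key} is {value}."


def passes_deletion_trap(answer: str, secret: str) -> bool:
    """Pass only if the deleted secret does not appear in the answer."""
    return secret not in answer


memory = ForgetfulMemory()
memory.remember("password", "example-secret-123")
before = memory.answer("password")
memory.forget("password")
after = memory.answer("password")

print(passes_deletion_trap(before, "example-secret-123"))  # False
print(passes_deletion_trap(after, "example-secret-123"))   # True
```

A real system would also have to purge the secret from any derived stores, such as embeddings, summaries, or logs, which this sketch does not model.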

The Insight: Context Window Is a Double-Edged Sword

The LLM Baseline's results are revealing. It scores 100% on stale memory and entity confusion: with the full conversation in context, it always has the latest fact and never confuses entities. But it scores 0% on hallucination and deletion: it always finds something to say, even when it shouldn't, and it literally can't forget.
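The per-category results reported above add up to the baseline's overall score. The short calculation below uses only the figures from this post; the dictionary layout is just a convenient way to write them down.

```python
# (passed, total) per trap category for the LLM Baseline, as reported in this post.
baseline = {
    "hallucination": (0, 7),
    "stale_memory": (7, 7),
    "entity_confusion": (6, 6),
    "deletion_compliance": (0, 5),
}

passed = sum(p for p, _ in baseline.values())
total = sum(t for _, t in baseline.values())
print(f"{passed}/{total} = {passed / total:.0%}")  # 13/25 = 52%
```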

Memory systems have the potential to beat the baseline on hallucination and deletion, because they can implement abstention logic and actual memory deletion. But most don't. That's the opportunity for the systems that take reliability seriously.
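As a rough illustration of abstention logic, a memory system can refuse to answer when nothing it retrieves is relevant enough to the question. The sketch below uses keyword overlap as a stand-in for a real retrieval score; the function `answer_or_abstain`, the stopword list, and the `min_overlap` threshold are assumptions, not taken from any benchmarked system.

```python
import re

# Small hypothetical stopword list so filler words do not count as evidence.
STOPWORDS = {"the", "a", "an", "to", "in", "of", "does", "did", "what", "where", "user", "is"}


def keywords(text: str) -> set:
    """Lowercase word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def answer_or_abstain(question: str, memories: list, min_overlap: int = 1) -> str:
    """Return the best-matching memory, or abstain if nothing is relevant enough."""
    question_words = keywords(question)
    best = max(memories, key=lambda m: len(question_words & keywords(m)), default=None)
    if best is None or len(question_words & keywords(best)) < min_overlap:
        return "I don't know."
    return best


memories = ["The user took a trip to Japan in April."]
print(answer_or_abstain("What car does the user drive?", memories))
print(answer_or_abstain("Where did the user take a trip?", memories))
```

The first question shares no keywords with the stored memory, so the function abstains; the second shares “trip” and returns the memory. In practice the threshold would be tuned against a similarity score from the system's own retriever.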

Why This Matters for Production

In production, a memory system that hallucinates is worse than one that scores lower on recall. Wrong answers erode user trust faster than missing answers. “I don't know” is always safer than a confident wrong answer built from fabricated memories.

The Reliability benchmark is now part of every Bench'd run. You can run it yourself:

pip install benchd-harness
benchd run -a your-adapter -b reliability-v1 --key ./keys/private.key
