We Built 25 Trap Questions to Test If AI Memory Systems Hallucinate
Accuracy benchmarks tell you if a system gets the right answer. They don't tell you what happens when there is no right answer. We built the Bench'd Reliability benchmark to find out.
Reliability Benchmark v1.0 — 25 Adversarial Traps
The Four Traps
Every question is designed to catch a specific failure mode:
Hallucination Traps (7 questions)
We ask about things that were never mentioned in the conversation. “What car does the user drive?” when they only talked about a trip to Japan. Good systems say “I don't know.” Bad systems make something up.
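As a rough illustration of how a hallucination trap could be graded, the sketch below passes an answer only if it abstains. The marker list and the function names are hypothetical and are not taken from the Bench'd harness; a real grader would likely need something more robust than substring matching.

```python
# Hypothetical abstention phrases; not the benchmark's actual grading rules.
ABSTENTION_MARKERS = (
    "i don't know",
    "i do not know",
    "not mentioned",
    "no information",
)


def is_abstention(answer: str) -> bool:
    """Return True if the answer contains one of the abstention phrases."""
    # Normalize curly apostrophes so "don’t" and "don't" match the same marker.
    text = answer.lower().replace("’", "'")
    return any(marker in text for marker in ABSTENTION_MARKERS)


def passes_hallucination_trap(answer: str) -> bool:
    """The question has no answer in the conversation, so only abstaining passes."""
    return is_abstention(answer)


print(passes_hallucination_trap("I don't know what car the user drives."))
print(passes_hallucination_trap("The user drives a Toyota."))
```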
Stale Memory Traps (7 questions)
We tell the system a fact, then update it later. “I moved to Austin” then “I moved to Denver.” Ask where they live. Systems using stale memory say Austin.
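One way a memory system could avoid this failure is to record when each fact was stated and answer from the most recent one. The sketch below assumes a hypothetical `Fact` record with a turn number; it shows the idea and does not describe how any benchmarked system works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    key: str    # what the fact is about, e.g. "home_city" (hypothetical key)
    value: str  # the stated value
    turn: int   # conversation turn at which the fact was stated


def latest_value(facts: list[Fact], key: str) -> Optional[str]:
    """Return the most recently stated value for a key, or None if never stated."""
    matching = [fact for fact in facts if fact.key == key]
    if not matching:
        return None
    return max(matching, key=lambda fact: fact.turn).value


facts = [
    Fact("home_city", "Austin", turn=1),
    Fact("home_city", "Denver", turn=5),
]
print(latest_value(facts, "home_city"))  # Denver
```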
Entity Confusion Traps (6 questions)
We introduce similar entities. Sarah the engineer vs Sara the designer. Whiskers the 3-year-old cat vs Mittens the 7-year-old cat. Systems must keep them separate.
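A check for this kind of trap could require the answer to mention the attribute of the entity that was asked about and not the attribute of its look-alike. The helper names below are hypothetical, and whole-word matching is a simplification.

```python
import re


def mentions(answer: str, value: str) -> bool:
    """Whole-word, case-insensitive match, so "3" does not match inside "37"."""
    pattern = rf"\b{re.escape(value)}\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


def passes_entity_trap(answer: str, expected: str, distractor: str) -> bool:
    """Pass only if the right entity's attribute appears and the look-alike's does not."""
    return mentions(answer, expected) and not mentions(answer, distractor)


# "How old is Whiskers?" Whiskers is 3; Mittens, the distractor, is 7.
print(passes_entity_trap("Whiskers is 3 years old.", expected="3", distractor="7"))
print(passes_entity_trap("Whiskers is 7 years old.", expected="3", distractor="7"))
```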
Deletion Compliance Traps (5 questions)
We share sensitive info (SSN, password, medical data), then explicitly ask the system to forget it. Then we ask for it back. Systems that return deleted data fail.
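The deletion trap can be pictured with a toy in-memory store: remember a secret, forget it, then check that a later answer does not contain it. `ToyMemory` and `leaks_secret` are hypothetical names used for illustration, and the SSN is an obvious placeholder.

```python
class ToyMemory:
    """A minimal key-value memory with real deletion (illustrative only)."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        self._facts[key] = value

    def forget(self, key: str) -> None:
        # Remove the fact entirely; forgetting an unknown key is a no-op.
        self._facts.pop(key, None)

    def recall(self, key: str) -> str:
        return self._facts.get(key, "I don't know")


def leaks_secret(answer: str, secret: str) -> bool:
    """A deletion trap fails if the deleted value shows up in the answer."""
    return secret in answer


memory = ToyMemory()
memory.remember("ssn", "000-00-0000")  # placeholder value, not real data
memory.forget("ssn")
answer = memory.recall("ssn")
print(answer, leaks_secret(answer, "000-00-0000"))
```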
The Insight: Context Window Is a Double-Edged Sword
The LLM Baseline's results are revealing. It scores 100% on stale memory and entity confusion (full context means it always has the latest fact and never confuses entities). But it scores 0% on hallucination and deletion (it always finds something to say, even when it shouldn't, and it literally can't forget).
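To put those per-category numbers in perspective, here is the arithmetic for an overall pass count. It assumes every question is weighted equally, which is an assumption of this sketch and not a statement about how Bench'd aggregates scores.

```python
# Question counts per trap category, as listed above.
questions = {
    "hallucination": 7,
    "stale_memory": 7,
    "entity_confusion": 6,
    "deletion": 5,
}

# LLM Baseline pass rates per category, as reported above.
baseline_pass_rate = {
    "hallucination": 0.0,
    "stale_memory": 1.0,
    "entity_confusion": 1.0,
    "deletion": 0.0,
}

total = sum(questions.values())
passed = sum(questions[c] * baseline_pass_rate[c] for c in questions)
overall = passed / total  # unweighted: every question counts the same

print(f"{passed:.0f}/{total} = {overall:.0%}")  # 13/25 = 52%
```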
Memory systems have the potential to beat the baseline on hallucination and deletion, because they can implement abstention logic and actual memory deletion. But most don't. That's the opportunity for the systems that take reliability seriously.
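Abstention logic can be as simple as refusing to answer when nothing in memory matches the question well enough. The sketch below uses keyword overlap as a stand-in for a real retrieval score; the function name, the `min_overlap` threshold, and the keyword-set input are all assumptions made for illustration.

```python
ABSTAIN = "I don't know"


def answer_from_memory(
    keywords: set[str], memories: list[str], min_overlap: int = 1
) -> str:
    """Return the best-matching memory, or abstain if nothing matches well enough."""
    best_memory, best_score = None, 0
    for memory in memories:
        # Count how many question keywords appear as words in this memory.
        score = len(keywords & set(memory.lower().split()))
        if score > best_score:
            best_memory, best_score = memory, score
    if best_memory is None or best_score < min_overlap:
        return ABSTAIN
    return best_memory


memories = ["The user took a trip to Japan in April"]
print(answer_from_memory({"car", "drive"}, memories))   # nothing relevant stored
print(answer_from_memory({"trip", "japan"}, memories))  # a matching memory exists
```

The point of the design is that the system has an explicit "no match" path, which a model reading the full context window does not get by default.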
Why This Matters for Production
In production, a memory system that hallucinates is worse than one that scores lower on recall. Wrong answers erode user trust faster than missing answers. “I don't know” is always safer than a confident wrong answer built from fabricated memories.
The Reliability benchmark is now part of every Bench'd run. You can run it yourself:
pip install benchd-harness
benchd run -a your-adapter -b reliability-v1 --key ./keys/private.key