9 min read

Bench'd Evaluation Protocol v0.1: How We Make Memory Benchmarks Fair

Memory benchmarks are only useful if they're reproducible, comparable, and resistant to gaming. We wrote a formal evaluation protocol so every system is tested the same way. Here's what's in it and why each rule exists.

Why We Wrote a Formal Protocol

After publishing our first round of results, we saw two problems emerge immediately. First, vendors started asking “can we tune our system before you test it?” — effectively requesting the right to game the benchmark. Second, other evaluation projects were publishing numbers that couldn't be compared to ours because they used different models, different prompts, and different scoring rubrics.

A benchmark without a protocol is just vibes. If two teams test the same system and get different numbers, neither result is useful. The protocol exists to make Bench'd results deterministic, comparable, and auditable.

The Adapter Contract: reset / ingest / recall

Every memory system must implement exactly three operations through a thin adapter layer:

reset()

Wipe all stored memory. Called before each test run to ensure a clean slate. The system must return to a state indistinguishable from a fresh install.

ingest(conversations)

Feed the system a list of conversation transcripts. This is the “memory formation” phase. The system can index, summarize, embed, or graph the data however it likes — as long as it does so through its normal pipeline, not a test-specific shortcut.

recall(question) → answer

Given a question, retrieve relevant memories and produce an answer. The system must use its own retrieval pipeline. The only thing Bench'd controls is the answerer LLM (see Model Locking below).

This three-function contract keeps the benchmark surface small. We don't care how your system stores data internally — we only care what goes in and what comes out.
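As a rough illustration, the contract could be written down as a Python Protocol. This is a sketch under assumptions, not the published interface: the names MemoryAdapter, KeywordAdapter, and run_case are hypothetical, transcripts are assumed to be plain strings, and the toy recall simply returns the best-matching transcript instead of passing retrieved memories to the locked answerer model.

```python
from typing import Protocol


class MemoryAdapter(Protocol):
    """Hypothetical shape of the reset / ingest / recall contract."""

    def reset(self) -> None: ...
    def ingest(self, conversations: list[str]) -> None: ...
    def recall(self, question: str) -> str: ...


class KeywordAdapter:
    """Toy adapter: keeps raw transcripts and matches on shared words."""

    def __init__(self) -> None:
        self._store: list[str] = []

    def reset(self) -> None:
        # Return to an empty state, as if freshly installed.
        self._store.clear()

    def ingest(self, conversations: list[str]) -> None:
        self._store.extend(conversations)

    def recall(self, question: str) -> str:
        # Pick the transcript sharing the most words with the question.
        words = set(question.lower().split())
        return max(
            self._store,
            key=lambda t: len(words & set(t.lower().split())),
            default="",
        )


def run_case(adapter: MemoryAdapter, conversations: list[str], question: str) -> str:
    """One test run: clean slate, memory formation, then a single question."""
    adapter.reset()
    adapter.ingest(conversations)
    return adapter.recall(question)
```

Because run_case only touches the three methods, any storage strategy (vector index, graph, summaries) can sit behind the same surface.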

Model Locking: Same Answerer, Same Judge

The single biggest confounder in LLM benchmarks is the model itself. A system using GPT-4o will usually beat one using GPT-3.5-turbo, whatever memory architecture sits underneath. To eliminate this variable, we lock two models across all runs:

  • Answerer model: the LLM that generates the final answer from retrieved memories. Currently gpt-4o-mini-2024-07-18.
  • Judge model: the LLM that scores open-ended (nuance) answers. Currently gpt-4o-2024-08-06.

When we upgrade models, we re-run every system on the new model before publishing any results. No system ever gets an unfair advantage from a better answerer.
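One way to enforce the lock is to pin both model identifiers in a single frozen config and reject any run that reports something else. The sketch below assumes a run records its models in a dict with "answerer" and "judge" keys; LockedModels and check_run are hypothetical names, and only the two model strings come from the protocol text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedModels:
    """Pinned model identifiers; frozen so a run cannot mutate them."""

    answerer: str
    judge: str


LOCKED = LockedModels(
    answerer="gpt-4o-mini-2024-07-18",
    judge="gpt-4o-2024-08-06",
)


def check_run(run_models: dict[str, str], locked: LockedModels = LOCKED) -> list[str]:
    """Return a list of mismatches; an empty list means the run is compliant."""
    problems = []
    for role in ("answerer", "judge"):
        expected = getattr(locked, role)
        actual = run_models.get(role)
        if actual != expected:
            problems.append(f"{role}: expected {expected}, got {actual}")
    return problems
```

Pinning dated snapshots rather than floating aliases matters here: an alias can silently start pointing at a newer model, which would break comparability between runs.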

Trust Tiers: Who Ran the Test?

Not all results carry the same weight. Our trust tier system makes the provenance of every score visible:

Tier | Who runs it | Verification
Partner-Audited | Bench'd team + vendor | Full audit trail, co-signed results
Vendor-Verified | Vendor, with Bench'd adapter | Bench'd reviews adapter + spot-checks
Community-Verified | Community contributor | Adapter reviewed, results reproducible
Unclaimed Self-Reported | Unknown / vendor claim | Not verified; shown with warning
Listed | Nobody yet | Awaiting adapter submission

Self-reported scores are hidden by default on the leaderboard. We show them only when a user explicitly opts in, and they render in red to signal lower confidence.
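The default-hidden rule can be sketched as a filter over leaderboard rows. The tier names below are taken from the table; everything else (the TrustTier enum, the row shape, the visible_rows function, and the system names in the usage example) is assumed for illustration and is not the leaderboard's actual code.

```python
from enum import Enum


class TrustTier(Enum):
    PARTNER_AUDITED = "Partner-Audited"
    VENDOR_VERIFIED = "Vendor-Verified"
    COMMUNITY_VERIFIED = "Community-Verified"
    UNCLAIMED_SELF_REPORTED = "Unclaimed Self-Reported"
    LISTED = "Listed"


def visible_rows(rows: list[dict], show_self_reported: bool = False) -> list[dict]:
    """Drop self-reported rows unless the user explicitly opted in."""
    return [
        row
        for row in rows
        if show_self_reported or row["tier"] is not TrustTier.UNCLAIMED_SELF_REPORTED
    ]
```

For example, with rows for a hypothetical "system-a" (Partner-Audited) and "system-b" (Unclaimed Self-Reported), visible_rows(rows) keeps only "system-a", while visible_rows(rows, show_self_reported=True) keeps both.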

The BMI Formula

The Bench'd Memory Index (BMI) is the single overall score displayed on the leaderboard. It's a weighted composite of our dimension scores:

BMI = 0.35 * recall_verified
    + 0.25 * temporal_verified
    + 0.25 * reasoning_verified
    + 0.15 * reliability_verified

Recall gets the highest weight because it's the most fundamental capability: if a memory system can't retrieve facts, nothing else matters. Reliability gets a lower weight for now because that part of the benchmark is newer, but we expect to increase it as the trap set matures.

The nuance score (LLM-judged open-ended quality) is tracked separately and not included in the BMI. It's shown in detailed view for users who want richer signal.
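The composite is a plain weighted sum, which a few lines of Python can make concrete. The weights come from the formula above; the function name bmi, the dict keys, and the 0 to 100 score scale used in the example are assumptions.

```python
# Weights from the BMI formula; they sum to 1.0.
WEIGHTS = {
    "recall": 0.35,
    "temporal": 0.25,
    "reasoning": 0.25,
    "reliability": 0.15,
}


def bmi(scores: dict[str, float]) -> float:
    """Weighted composite of the four verified dimension scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    # Extra keys such as a nuance score are ignored: nuance is not part of BMI.
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

With illustrative scores of recall 80, temporal 60, reasoning 70, and reliability 50, the terms are 28 + 15 + 17.5 + 7.5, so the BMI is 68 (up to floating-point rounding). A system with the same recall but zero reliability would land at 60.5, which shows how much headroom the lower reliability weight leaves.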

Versioning Rules: Never Rewrite History

Once a score is published under a protocol version, it is immutable. We never go back and change historical results. When the benchmark changes, we bump the protocol version, and for anything beyond a scoring bug fix we re-run every system:

  • Patch (v0.1 → v0.1.1): bug fixes in scoring code, no question changes. Old scores remain valid.
  • Minor (v0.1 → v0.2): new questions added, weights adjusted. All systems re-run before publishing.
  • Major (v0 → v1): fundamental methodology change. Previous version scores archived, not deleted.

Every result on the leaderboard is tagged with its protocol version. You can always see which version produced which score.
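The bump rules can be sketched as a small classifier over version strings. The helper names (parse_version, bump_kind, requires_rerun) are hypothetical, missing components are assumed to be zero so that "v0" reads as 0.0.0, and treating a major bump as requiring a full re-run is an assumption drawn from the general rule rather than something the list states explicitly.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Turn 'v0.1' or 'v0.1.1' into a (major, minor, patch) tuple."""
    parts = [int(p) for p in version.lstrip("v").split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch = parts + [0] * (3 - len(parts))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify a protocol version change as 'major', 'minor', or 'patch'."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        raise ValueError(f"{new} is not newer than {old}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


def requires_rerun(kind: str) -> bool:
    """Patch bumps keep old scores valid; larger bumps trigger a full re-run."""
    return kind in {"minor", "major"}
```

Tagging each stored result with the version string it was produced under, and never editing that record, is what keeps the history immutable; a bump only ever adds new records.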

Read the Full Protocol

The complete Bench'd Evaluation Protocol v0.1 is published on GitHub. It includes the exact prompt templates, scoring rubrics, question bank versioning rules, and adapter interface specification.

github.com/benchd-ai/protocol

If you're building a memory system and want to be listed on Bench'd, start by implementing the three-function adapter. We'll handle the rest.
