
Methodology

How Bench'd works

Every number on this site was produced by an open harness, scored by a locked protocol, and signed with a cryptographic receipt you can verify yourself. This page explains each layer in detail.

Why These Benchmarks

AI memory systems make bold claims. Vendors publish recall numbers measured on their own test sets, using their own definitions of success, without independent verification. Customers cannot compare systems because every vendor defines accuracy differently.

Bench'd exists to fix that. We run every system against the same question set, under the same conditions, and publish the raw receipts. The harness is open source. The scoring protocol is frozen between versions. The results are signed.

We measure three dimensions that matter in practice: recall (can the system retrieve the right information?), temporal reasoning (can it understand when things happened and how they changed?), and multi-hop reasoning (can it connect facts across separate conversations to answer complex questions?).

These are not synthetic benchmarks. Every question is derived from real conversational patterns observed in production memory workloads. If a system scores well here, it works well in practice. If it doesn't, it doesn't.

How We Score

Each benchmark run consists of a series of question-answer pairs. First, we ingest a set of conversations into the system under test. Then we query the system and compare its responses against known-correct answers.

Scoring happens at the individual question level. Each question is tagged with a scoringMethod that determines how correctness is evaluated:

exact

Response must match the expected answer character-for-character after normalization. Used for IDs, dates, and numeric values.

regex

Response is tested against a regular expression pattern. Allows flexible formatting while requiring specific content.

llm

A locked LLM judge evaluates semantic correctness. Required for open-ended synthesis and multi-hop questions.
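
As a rough sketch, per-question scoring can be thought of as a dispatch on scoringMethod. The field names, normalization rules, and judge hook below are illustrative assumptions, not the harness's actual code:

scoring-sketch.py
import re
from typing import Callable

def normalize(text: str) -> str:
    # Illustrative normalization: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()

def score_question(question: dict, response: str,
                   llm_judge: Callable[[dict, str], bool]) -> bool:
    """Return True if the response counts as correct for this question."""
    method = question["scoringMethod"]
    if method == "exact":
        # Character-for-character match after normalization.
        return normalize(response) == normalize(question["expected"])
    if method == "regex":
        # Flexible formatting, but the required content must match the pattern.
        return re.search(question["pattern"], response) is not None
    if method == "llm":
        # Semantic correctness is delegated to the locked judge.
        return llm_judge(question, response)
    raise ValueError(f"unknown scoringMethod: {method}")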

Scores within each dimension are aggregated by taking the percentage of questions answered correctly. The overall score is a weighted average: 40% recall + 30% temporal + 30% reasoning. These weights reflect how users actually depend on memory systems in production.
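
In sketch form, the aggregation works like this (the dimension names and weights come from the text above; the code itself is illustrative):

aggregation-sketch.py
WEIGHTS = {"recall": 0.40, "temporal": 0.30, "reasoning": 0.30}

def dimension_score(results: list[bool]) -> float:
    # Percentage of questions answered correctly within one dimension.
    return 100.0 * sum(results) / len(results)

def overall_score(per_dimension: dict[str, float]) -> float:
    # Weighted average: 40% recall + 30% temporal + 30% reasoning.
    return sum(WEIGHTS[d] * per_dimension[d] for d in WEIGHTS)

# Example: recall 90, temporal 80, reasoning 70
#   0.4 * 90 + 0.3 * 80 + 0.3 * 70 = 81.0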

The Two-Score Model

Most benchmarks report a single number. We report two. This is deliberate, and it is the most important design decision in Bench'd.

Some questions have objectively correct answers that can be verified by a machine: an exact string, a regex pattern, a specific ID. Other questions require judgment: did the system correctly synthesize information from three different conversations? Did it capture the nuance of a temporal relationship? These require an LLM judge.

Mixing these into a single score creates a false sense of precision. A score of 84.7 that blends deterministic and LLM-judged results implies a level of stability that doesn't exist. The deterministic portion is rock-solid. The LLM-judged portion may shift slightly between judge versions.

So we separate them. Always.

Verified (Deterministic)

Example: 87.3

  • Scoring & verification: Mathematical truth
  • Includes: Exact-match, regex, ID retrieval
  • Consistency: Deterministic and cryptographically verifiable
  • In short: Pure math. Does not change.

Nuance (LLM Judge)

Example: 81.2

  • Scoring & verification: LLM judge panel
  • Includes: Open-ended, synthesis, multi-hop
  • Consistency: May shift with judge updates
  • In short: Contextual. May change.
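
One way to picture the separation: a run's result record keeps the two scores as distinct fields that are never averaged together. This is an illustrative sketch, not the actual receipt schema:

run-scores-sketch.py
from dataclasses import dataclass

@dataclass(frozen=True)
class RunScores:
    verified: float  # deterministic portion: exact-match, regex, ID retrieval
    nuance: float    # judge-evaluated portion: open-ended, synthesis, multi-hop
    # The two numbers are always reported side by side, never blended.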

Judge Protocol

The judge protocol is designed around one principle: deterministic where possible, locked where not.

For questions that have a single correct answer (an ID, a date, a number), we use deterministic scoring. No LLM is involved. The harness compares the response against the expected answer using exact-match or regex. This is fast, free, and perfectly reproducible.

For questions that require semantic judgment, we use a locked LLM judge. The judge configuration is frozen for the duration of a benchmark version:

judge-protocol.json
// JudgeProtocol — frozen per benchmark version
{
  "model": "claude-sonnet-4-20250514",
  "temperature": 0.0,
  "promptVersion": "v2.4.1"
}

Frozen temperature

The judge always runs at temperature 0.0. This minimizes stochastic variation between runs. In practice, we observe less than 0.3% variation on repeated evaluations.

Locked model version

We pin to a specific model snapshot (e.g., claude-sonnet-4-20250514). When the model provider releases a new version, we do not silently switch. We create a new benchmark version.

Version-bumping policy

Any change to the judge model, prompt, or scoring logic triggers a new benchmark version. All systems are re-run on the new version to maintain comparability. Historical results under previous versions are preserved and clearly labeled.
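
For concreteness, a judge call under the frozen configuration might look like the sketch below, assuming the judge is invoked through the Anthropic Python SDK. The prompt contents, verdict format, and helper names are assumptions, not the actual Bench'd judge:

judge-call-sketch.py
import anthropic

JUDGE_MODEL = "claude-sonnet-4-20250514"  # pinned snapshot from the protocol
JUDGE_TEMPERATURE = 0.0                   # frozen for the benchmark version

def judge_is_correct(client: anthropic.Anthropic, judge_prompt: str) -> bool:
    """Ask the locked judge for a PASS/FAIL verdict (illustrative)."""
    message = client.messages.create(
        model=JUDGE_MODEL,
        temperature=JUDGE_TEMPERATURE,
        max_tokens=16,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    # Assumes the frozen prompt instructs the judge to answer PASS or FAIL.
    return message.content[0].text.strip().upper().startswith("PASS")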

Versioning Policy

Bench'd uses semantic versioning for benchmarks. The version number is embedded in every signed receipt, making it impossible to compare results from different versions without being explicit about it.

A patch version bump (e.g., 2.4.0 to 2.4.1) means bug fixes to the harness that do not affect scoring. A minor version bump means new questions were added or the judge prompt was revised. A major version bump means the scoring model changed, dimensions were added or removed, or weights were adjusted.
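
Classifying a bump under this policy is mechanical. A minimal sketch, assuming plain semantic version strings:

version-bump-sketch.py
def bump_kind(old: str, new: str) -> str:
    """Classify a benchmark version bump per the policy above (illustrative)."""
    old_major, old_minor, _ = (int(x) for x in old.split("."))
    new_major, new_minor, _ = (int(x) for x in new.split("."))
    if new_major != old_major:
        return "major"  # scoring model, dimensions, or weights changed
    if new_minor != old_minor:
        return "minor"  # new questions or a revised judge prompt
    return "patch"      # harness bug fixes that do not affect scoring

# bump_kind("2.4.0", "2.4.1") -> "patch"; bump_kind("2.4.1", "2.5.0") -> "minor"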

When a minor or major version bump occurs, we re-run all actively maintained systems against the new version within 72 hours. Leaderboard rankings always reflect the latest version. Historical runs are archived and remain verifiable.

CHANGELOG
// Version history (excerpt)
v2.4.1  2026-04-15  Prompt clarification for temporal boundary questions
v2.4.0  2026-03-01  Added 12 multi-hop reasoning questions
v2.3.0  2026-01-20  Judge model updated to claude-sonnet-4-20250514
v2.0.0  2025-11-01  Added temporal dimension, reweighted overall score
v1.0.0  2025-08-15  Initial release: recall + reasoning only

How Signing Works

Every completed benchmark run produces a signed receipt. The receipt contains the full run manifest — system identity, benchmark version, harness version, judge configuration, all scores, and timing data — hashed into a Merkle tree.

The signing process works as follows:

  1. Each question-answer pair is hashed individually (SHA-256). These form the leaves of the Merkle tree.
  2. The leaves are combined pairwise until a single root hash remains. This is the merkleRoot.
  3. The manifest (including the Merkle root) is signed with Bench'd's Ed25519 signing key.
  4. The signature and public key fingerprint are embedded in the receipt. Anyone can verify the signature using the published public key.

If any single answer in the run is modified — even by one character — the Merkle root changes and the signature becomes invalid. This makes tampering with individual results cryptographically detectable.
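
A minimal sketch of that Merkle construction, using SHA-256 from the Python standard library. The leaf serialization and the handling of odd-sized levels are assumptions (shown here as one common convention), not the harness's canonical rules:

merkle-sketch.py
import hashlib
import json

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(qa_pairs: list[dict]) -> str:
    """Compute a Merkle root over question-answer pairs (illustrative serialization)."""
    # 1. Hash each question-answer pair individually to form the leaves.
    level = [sha256(json.dumps(pair, sort_keys=True).encode()) for pair in qa_pairs]
    # 2. Combine pairwise until a single root hash remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()

# Changing any single answer changes its leaf hash, and therefore the root.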

Verify a Receipt Yourself

You don't need to trust us. Every receipt can be independently verified. Download the signed receipt JSON from any run page, then verify it locally:

verify-receipt.sh
# Download a receipt
curl -sL https://benchd.dev/api/receipt/run_abc123.json -o receipt.json

# Verify the signature (requires the benchd public key)
benchd verify receipt.json

# Or verify manually with openssl (Ed25519 signatures need pkeyutl, not dgst)
jq -r '.manifest' receipt.json > manifest.json
jq -r '.signature' receipt.json | base64 -d > signature.bin
openssl pkeyutl -verify -pubin -inkey benchd-public.pem \
  -rawin -in manifest.json -sigfile signature.bin

The verification checks two things: that the Merkle root matches the individual question hashes (data integrity), and that the signature is valid for Bench'd's published public key (authenticity).
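
The same signature check can be scripted. A sketch using the cryptography package, assuming the receipt stores the manifest as JSON and the signature as base64 (field names and the manifest serialization are assumptions):

verify-sketch.py
import base64
import json
from cryptography.hazmat.primitives.serialization import load_pem_public_key

def verify_receipt(receipt_path: str, pubkey_path: str) -> None:
    """Verify the Ed25519 signature on a receipt's manifest; raises if invalid."""
    with open(receipt_path) as f:
        receipt = json.load(f)
    with open(pubkey_path, "rb") as f:
        public_key = load_pem_public_key(f.read())
    # The bytes verified here must match the exact serialization signed by Bench'd.
    manifest_bytes = json.dumps(receipt["manifest"], sort_keys=True).encode()
    signature = base64.b64decode(receipt["signature"])
    public_key.verify(signature, manifest_bytes)  # raises InvalidSignature on tampering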

For the full verification walkthrough, including Merkle proof validation and key rotation history, see the Trust & Verification page.

Known Controversies

We document cases where Bench'd scores diverge significantly from vendor-reported metrics. These are not accusations — they are explanations of methodological differences that produce different numbers.

MemPalace: Metric Mismatch

High Impact

What was claimed

MemPalace published a blog post claiming "state-of-the-art Retrieval Recall of 98.6%" on their internal benchmark. This number was widely cited in social media and investor materials as evidence of superior accuracy.

What actually happened

MemPalace's 98.6% measures retrieval recall: whether the system retrieves the correct chunk from its vector store. Bench'd measures end-to-end QA accuracy: whether the system actually answers the question correctly given the retrieved context. These are fundamentally different metrics.

A system can retrieve the right chunk 98.6% of the time and still fail to answer correctly because it misinterprets the context, conflates entities, or truncates relevant detail. On Bench'd's end-to-end measure, MemPalace scored 72.4 verified.

Why it matters

Retrieval recall is a component metric, not an outcome metric. Users care about whether the system gives them the right answer, not whether it found the right paragraph internally. Reporting component metrics as if they were outcome metrics inflates perceived performance and misleads buyers. This is the most common pattern we see in vendor benchmarks across the industry.

Sources

  • MemPalace blog, "Setting the Standard for Memory Recall", March 2026
  • Bench'd run receipt: run_mp_20260402
  • Discussion thread on the Bench'd GitHub repository (#247)

Frequently Asked Questions

Can vendors run the benchmark themselves?

Yes. The harness is open source. Vendors can run it locally and submit results, but self-reported runs are tagged as unclaimed-self-reported and carry a lower trust tier. For results to appear as vendor-verified, the vendor must connect their production endpoint and allow Bench'd to run the harness directly against it.

How often are systems re-benchmarked?

Actively maintained systems are re-run whenever a new benchmark version is released (typically every 4-8 weeks) and whenever a vendor ships a significant update to their system. Vendors can request a re-run at any time by opening a pull request against the harness repository.

Why might the nuance score change between runs?

The nuance score uses an LLM judge, which is inherently non-deterministic. Even with temperature 0.0, model providers may update their infrastructure in ways that cause subtle output shifts. We mitigate this by pinning model versions and re-running all systems when the judge changes, but small variations (typically less than 0.5%) are expected. This is why we separate it from the verified score.

What happens if I find a bug in the benchmark?

Open an issue on GitHub. If the bug affects scoring, we will issue a patch version bump, re-run affected systems, and publish a postmortem. Every correction is documented in the version history and linked from the affected receipts.

Do you accept sponsorship from vendors?

No. Bench'd is funded independently. We do not accept payment from any vendor whose system appears on the leaderboard. Our funding sources are disclosed on the About page. If this ever changes, it will be announced publicly before any sponsored content appears.

Last updated: May 2026. This document is versioned alongside the benchmark. View revision history on GitHub.
