Methodology
Metric v1.0 · 25 questions

Reliability

Adversarial robustness benchmark testing stale memory handling, entity separation, hallucination resistance, and deletion compliance.

What it measures

Robustness under adversarial conditions: does the system handle edge cases that trip up real-world memory systems?

How it works

  1. Run 25 adversarial trap questions across 4 sub-dimensions:
     • Stale Memory Handling: does the system return outdated info after updates?
     • Entity Separation: does the system confuse similar entities?
     • Hallucination Resistance: does the system abstain when it has no relevant memory?
     • Deletion Compliance: does the system honor explicit forget/delete requests?
  2. Score using the reliability trap method: the response must contain the expected behavioral indicators.
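The two steps above can be sketched as a small harness that asks each trap question and reports a pass rate per sub-dimension. This is a minimal illustration, not Bench'd's implementation: `TrapQuestion`, `run_traps`, the sub-dimension labels, and the `ask` and `passes` callables are all hypothetical names, and the per-sub-dimension pass rate is an assumed way to summarize results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class TrapQuestion:
    sub_dimension: str            # illustrative label, e.g. "stale_memory"
    prompt: str                   # the adversarial question sent to the system
    indicators: Tuple[str, ...]   # expected behavioral indicators (hypothetical)


def run_traps(
    questions: Iterable[TrapQuestion],
    ask: Callable[[str], str],
    passes: Callable[[str, TrapQuestion], bool],
) -> Dict[str, float]:
    """Ask every trap question and return the pass rate per sub-dimension."""
    passed, total = Counter(), Counter()
    for q in questions:
        total[q.sub_dimension] += 1
        if passes(ask(q.prompt), q):
            passed[q.sub_dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}


# Usage with a stub system that always gives the same answer.
questions = [
    TrapQuestion("stale_memory", "Where does Ana live now?", ("berlin",)),
    TrapQuestion("hallucination", "What is Ana's dog called?", ("don't know",)),
]
stub = lambda prompt: "Ana lives in Berlin."
contains = lambda response, q: any(i in response.lower() for i in q.indicators)
print(run_traps(questions, stub, contains))
# {'stale_memory': 1.0, 'hallucination': 0.0}
```

Passing the pass/fail check in as a callable keeps the harness independent of how indicators are matched.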

Scoring method

Deterministic (reliability trap). Keyword-based pass/fail for behavioral indicators.
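A keyword-based pass/fail check of this kind might look like the sketch below. The function name `passes_trap`, the example indicator phrases, and the rule that any single indicator is enough to pass are assumptions for illustration; the source does not specify how indicators are combined or normalized.

```python
from typing import Sequence


def passes_trap(response: str, indicators: Sequence[str]) -> bool:
    """Return True if the response contains at least one expected indicator.

    Matching is a case-insensitive substring test, so the result is
    deterministic for a given response and indicator list.
    """
    text = response.casefold()
    return any(indicator.casefold() in text for indicator in indicators)


# Hypothetical hallucination-resistance trap: the system should abstain.
abstain_indicators = ["don't know", "no record", "no information"]
print(passes_trap("I have no record of that trip.", abstain_indicators))  # True
print(passes_trap("You went to Lisbon in May.", abstain_indicators))      # False
```

Substring matching is cheap and reproducible, but it cannot tell "I have no record" apart from a response that quotes the phrase while still answering, which is one reason a sketch like this should not be read as the exact production check.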

Dimensions tested: recall, temporal

Purpose alignment

How this metric relates to each track (v1.0):

Track             Alignment
conversational    core
knowledge-brain   adjacent
graph             adjacent
agent-memory      core
baseline          core

Expected failure modes

  • STALE_MEMORY — returns outdated information
  • WRONG_ENTITY — confuses similar entities
  • HALLUCINATION — generates response when memory is empty
  • DELETION_FAILURE — does not honor delete/forget requests
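As an illustration, the four reason codes above can be modeled as an enumeration and attached to failed traps. The one-to-one mapping from sub-dimension to reason code, the sub-dimension keys, and the helper `reason_for` are assumptions made for this sketch; the full taxonomy contains more codes than the four listed here.

```python
from enum import Enum
from typing import Optional


class ReasonCode(Enum):
    STALE_MEMORY = "returns outdated information"
    WRONG_ENTITY = "confuses similar entities"
    HALLUCINATION = "generates response when memory is empty"
    DELETION_FAILURE = "does not honor delete/forget requests"


# Assumed mapping: each sub-dimension has exactly one expected failure mode.
FAILURE_BY_SUB_DIMENSION = {
    "stale_memory_handling": ReasonCode.STALE_MEMORY,
    "entity_separation": ReasonCode.WRONG_ENTITY,
    "hallucination_resistance": ReasonCode.HALLUCINATION,
    "deletion_compliance": ReasonCode.DELETION_FAILURE,
}


def reason_for(sub_dimension: str, passed: bool) -> Optional[str]:
    """Return the reason code name for a failed trap, or None on a pass."""
    if passed:
        return None
    return FAILURE_BY_SUB_DIMENSION[sub_dimension].name


print(reason_for("deletion_compliance", passed=False))  # DELETION_FAILURE
print(reason_for("entity_separation", passed=True))     # None
```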

See the full failure taxonomy for all 20+ reason codes.

Dataset source

Bench'd adversarial dataset, hand-crafted robustness scenarios.

Known limitations

  • Sub-dimension scoring can produce low overall scores when a system doesn't support certain capabilities (e.g., no delete API). The interpretation system accounts for this with the 'capability_limited' label.
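To make the limitation concrete, the sketch below shows how an unsupported capability can pull down an overall score and how a 'capability_limited' label could flag it. Only the label name comes from the text above; the function `interpret`, the unweighted mean, the sub-dimension keys, and the `unsupported` input are assumptions, since the actual interpretation logic is not described here.

```python
from typing import Dict, Set


def interpret(sub_scores: Dict[str, float], unsupported: Set[str]) -> dict:
    """Average sub-dimension scores and flag results limited by missing capabilities."""
    # Assumption: the overall score is an unweighted mean of sub-dimension scores.
    overall = sum(sub_scores.values()) / len(sub_scores)
    limited = sorted(dim for dim in sub_scores if dim in unsupported)
    return {
        "overall": overall,
        "label": "capability_limited" if limited else None,
        "limited_by": limited,
    }


# A system with no delete API fails every deletion trap by construction.
scores = {
    "stale_memory_handling": 1.0,
    "entity_separation": 1.0,
    "hallucination_resistance": 1.0,
    "deletion_compliance": 0.0,
}
print(interpret(scores, unsupported={"deletion_compliance"}))
# {'overall': 0.75, 'label': 'capability_limited', 'limited_by': ['deletion_compliance']}
```

In this example the system is perfect on three sub-dimensions yet scores 0.75 overall, and the label signals that the gap comes from a missing capability, not from incorrect behavior.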

Stable URL: benchd.ai/methodology/metrics/reliability
This URL is referenced in signed manifests. It will not change.
