Methodology
Metric v1.0 · 20 questions

Knowledge Retrieval

Measures whether a memory system can store and accurately retrieve factual information from conversational history.

What it measures

Core retrieval accuracy: given a conversation history containing specific facts, can the system find and return the correct answer when queried?

How it works

  1. Ingest a conversation history containing 5-10 turns with embedded facts (names, dates, preferences, events).
  2. Query the system with questions that require retrieving specific facts from the ingested history.
  3. Score each response using exact match with containment fallback (normalized, case-insensitive).
  4. Report the percentage of questions answered correctly.
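The four steps above can be sketched as a small evaluation loop. This is a minimal illustration, not Bench'd's actual harness: the `MemorySystem` interface, the `Question` type, `run_metric`, and the toy `KeywordMemory` system are all hypothetical names introduced here, and the scoring rule is passed in as a plain function.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class MemorySystem(Protocol):
    """Hypothetical adapter interface (an assumption, not a real API)."""

    def ingest(self, turns: list[str]) -> None: ...
    def query(self, question: str) -> str: ...


@dataclass
class Question:
    text: str
    expected: str


def run_metric(
    system: MemorySystem,
    turns: list[str],
    questions: list[Question],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Ingest the history, ask every question, return percent correct."""
    if not questions:
        raise ValueError("at least one question is required")
    system.ingest(turns)  # step 1: ingest the conversation history
    hits = sum(
        is_correct(system.query(q.text), q.expected)  # steps 2 and 3
        for q in questions
    )
    return 100.0 * hits / len(questions)  # step 4: percentage correct


def _words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip("?.,!").lower() for w in text.split()}


class KeywordMemory:
    """Toy system: returns the stored turn sharing the most words with the question."""

    def __init__(self) -> None:
        self.turns: list[str] = []

    def ingest(self, turns: list[str]) -> None:
        self.turns = list(turns)

    def query(self, question: str) -> str:
        q = _words(question)
        best = max(self.turns, key=lambda t: len(q & _words(t)), default="")
        return best if q & _words(best) else ""


turns = ["My dog is named Biscuit.", "I moved to Lisbon in 2021."]
questions = [
    Question("What is the dog named?", "Biscuit"),
    Question("Which city did I move to?", "Lisbon"),
    Question("What is my favorite color?", "blue"),  # never stated in the history
]
# Stand-in scoring rule: case-insensitive containment only.
score = run_metric(
    KeywordMemory(), turns, questions,
    lambda response, expected: expected.lower() in response.lower(),
)
print(f"{score:.1f}%")  # 66.7%
```

The third question has no supporting fact in the history, so the toy system returns an unrelated turn and the question is scored as incorrect.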

Scoring method

Deterministic (exact match + containment). No LLM judge required.
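A deterministic scorer of this kind might look like the sketch below. The page does not specify the normalization rules, so the ones used here (Unicode NFKC, case folding, stripping ASCII punctuation, collapsing whitespace) are assumptions, as is the choice to apply containment at word boundaries rather than on raw substrings. The names `normalize` and `score_response` are illustrative.

```python
import string
import unicodedata

# Translation table that deletes ASCII punctuation (an assumed normalization step).
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, case folding, no punctuation, single spaces."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = text.translate(_PUNCT)
    return " ".join(text.split())


def score_response(response: str, expected: str) -> bool:
    """Exact match first; fall back to containment of the expected answer."""
    r, e = normalize(response), normalize(expected)
    if not e:
        return False  # an empty expected answer cannot be matched meaningfully
    if r == e:
        return True  # exact match after normalization
    # Containment fallback, restricted to whole words (an assumption) so that
    # a short answer like "Ann" does not match inside "annual".
    return f" {e} " in f" {r} "


print(score_response("Her dog is named Biscuit.", "Biscuit"))  # True
print(score_response("The annual report", "Ann"))              # False
```

Because the function depends only on its two string inputs, repeated runs over the same responses give the same score, which is what makes an LLM judge unnecessary.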

Dimensions tested: recall

Purpose alignment

How this metric relates to each track (v1.0):

Track            Alignment
conversational   adjacent
knowledge-brain  core
graph            core
agent-memory     core
baseline         core

Expected failure modes

  • RETRIEVAL_MISS — system returns no relevant context
  • WRONG_ENTITY — confuses entities mentioned in the same conversation
  • PARTIAL_ANSWER — returns incomplete information
  • HALLUCINATION — generates an answer not grounded in stored memories

See the full failure taxonomy for all 20+ reason codes.
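As an illustration of how these reason codes could be reported alongside the score, the sketch below tallies per-question labels. It covers only the four codes listed above, and it assumes the labels have already been assigned; how a code is chosen for a given failure is not described here, so no classification logic is shown. `FailureCode` and `tally_failures` are hypothetical names.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional


class FailureCode(str, Enum):
    """The four reason codes listed for this metric."""

    RETRIEVAL_MISS = "RETRIEVAL_MISS"
    WRONG_ENTITY = "WRONG_ENTITY"
    PARTIAL_ANSWER = "PARTIAL_ANSWER"
    HALLUCINATION = "HALLUCINATION"


def tally_failures(labels: Iterable[Optional[FailureCode]]) -> Counter:
    """Count failures per code; None marks a correctly answered question."""
    return Counter(code for code in labels if code is not None)


labels = [
    None,
    FailureCode.RETRIEVAL_MISS,
    FailureCode.WRONG_ENTITY,
    FailureCode.RETRIEVAL_MISS,
]
counts = tally_failures(labels)
print(counts[FailureCode.RETRIEVAL_MISS])  # 2
```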

Dataset source

Bench'd internal dataset of hand-crafted conversational scenarios.

Known limitations

  • Tests single-conversation retrieval only; does not test cross-conversation recall.
  • 20 questions may not capture long-tail failure modes.
  • Exact match scoring may miss semantically correct but differently worded answers.
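To put the 20-question limitation in numbers: each question is worth 5 percentage points, and the uncertainty around any single score is wide. The sketch below uses the Wilson score interval, a standard statistic for a proportion that is brought in here purely for illustration and is not part of the metric definition. The function name `wilson_interval` is an assumption.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


points_per_question = 100 / 20  # 5.0 percentage points per question
low, high = wilson_interval(15, 20)  # a system scoring 75%
print(f"{low:.3f} to {high:.3f}")  # roughly 0.531 to 0.888
```

Under these assumptions, a system that scores 75% on 20 questions is statistically compatible with a true accuracy anywhere from about 53% to 89%, so small score differences between systems should be read with care.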

Stable URL: benchd.ai/methodology/metrics/knowledge-retrieval
This URL is referenced in signed manifests. It will not change.
