Methodology
Metric v1.0, 20 questions
Knowledge Retrieval
Measures whether a memory system can store and accurately retrieve factual information from conversational history.
What it measures
Core retrieval accuracy: given a conversation history containing specific facts, can the system find and return the correct answer when queried?
How it works
- Ingest a conversation history containing 5-10 turns with embedded facts (names, dates, preferences, events).
- Query the system with questions that require retrieving specific facts from the ingested history.
- Score each response using exact match with containment fallback (normalized, case-insensitive).
- Report the percentage of questions answered correctly.
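The scoring and reporting steps above can be sketched roughly as follows. This is a minimal illustration, not Bench'd's actual scorer. The function names (`normalize`, `is_correct`, `accuracy`) are hypothetical. The exact normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the containment direction (expected answer found inside the response, on word boundaries) are assumptions, since the source only says "normalized, case-insensitive".

```python
import string

# Assumption: normalization = lowercase + strip ASCII punctuation + collapse whitespace.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Return a case-insensitive, punctuation-free, single-spaced form of text."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def is_correct(response: str, expected: str) -> bool:
    """Exact match first, then fall back to containment (hypothetical helper)."""
    got, want = normalize(response), normalize(expected)
    if not want:
        return False  # an empty expected answer can never be matched
    if got == want:
        return True  # exact match after normalization
    # Containment fallback. Padding with spaces makes the check respect word
    # boundaries, so "Ann" does not match inside "Joanna" (an assumption).
    return f" {want} " in f" {got} "


def accuracy(pairs) -> float:
    """Percentage of (response, expected) pairs scored as correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(is_correct(response, expected) for response, expected in pairs)
    return 100.0 * correct / len(pairs)
```

Under these assumptions, `is_correct("She moved to Paris in 2019.", "Paris")` passes through the containment fallback, while `is_correct("3 March", "March 3")` fails even though a human reader would accept it. Because every step is a pure string operation, the same inputs always yield the same score.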
Scoring method
Deterministic (exact match + containment). No LLM judge required.
Dimensions tested: recall
Purpose alignment
How this metric relates to each track (v1.0):
| Track | Alignment |
|---|---|
| conversational | adjacent |
| knowledge-brain | core |
| graph | core |
| agent-memory | core |
| baseline | core |
Expected failure modes
- RETRIEVAL_MISS — system returns no relevant context
- WRONG_ENTITY — confuses entities mentioned in the same conversation
- PARTIAL_ANSWER — returns incomplete information
- HALLUCINATION — generates an answer not grounded in stored memories
See the full failure taxonomy for all 20+ reason codes.
Dataset source
Bench'd internal dataset of hand-crafted conversational scenarios.
Known limitations
- Tests single-conversation retrieval only; does not test cross-conversation recall.
- The 20-question set is small and may not capture long-tail failure modes.
- Exact-match scoring, even with the containment fallback, may mark semantically correct but differently worded answers as incorrect.
Stable URL: benchd.ai/methodology/metrics/knowledge-retrieval
This URL is referenced in signed manifests. It will not change.