Methodology
Metric · v1.0 · 500 questions
LongMemEval
Multi-session conversational memory benchmark testing recall, temporal reasoning, and knowledge updates across extended dialogue.
What it measures
Long-horizon memory: can the system remember facts from conversations that happened many sessions ago, reason about temporal ordering, and handle knowledge updates?
How it works
- Present multi-session conversation histories spanning diverse topics.
- Query the system with questions requiring recall from specific sessions, temporal reasoning, or tracking of updated information.
- Score deterministically where possible; use an LLM judge for open-ended reasoning questions.
- Report per-dimension scores (recall, temporal, reasoning) and an overall composite.
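The reporting step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual aggregation: the function name `aggregate`, the `(dimension, score)` input shape, and the choice of an unweighted mean for the composite are all assumptions, since this page does not state how the composite is computed.

```python
from collections import defaultdict
from statistics import mean


def aggregate(results):
    """Aggregate per-question scores into per-dimension means and a composite.

    `results` is an iterable of (dimension, score) pairs, where score is
    in [0, 1]. Both the input shape and this function are hypothetical.
    """
    by_dimension = defaultdict(list)
    for dimension, score in results:
        by_dimension[dimension].append(score)

    # Mean score within each dimension (e.g. recall, temporal, reasoning).
    per_dimension = {dim: mean(scores) for dim, scores in by_dimension.items()}

    # Assumption: the composite is the unweighted mean of the dimension means,
    # so a dimension with few questions counts as much as a large one.
    composite = mean(list(per_dimension.values()))
    return per_dimension, composite


sample = [
    ("recall", 1.0),
    ("recall", 0.0),
    ("temporal", 1.0),
    ("reasoning", 0.5),
]
per_dimension, composite = aggregate(sample)
```

Averaging dimension means rather than pooling all questions keeps a small dimension from being drowned out by a large one; a pooled mean would be an equally plausible choice.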
Scoring method
Mixed: exact or regex matching for factual questions, an LLM judge for reasoning questions.
Dimensions tested: recall, temporal, reasoning
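A rough sketch of how such mixed scoring could be dispatched is shown below. The function `score_answer`, the `kind` labels, the case-insensitive normalization, and the judge signature are assumptions made for illustration; the page does not specify them. The judge is passed in as a plain callable so the example stays independent of any particular LLM client.

```python
import re
from typing import Callable, Optional


def score_answer(
    answer: str,
    expected: str,
    kind: str,
    pattern: Optional[str] = None,
    judge: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Return a score in [0, 1] for one question (hypothetical helper)."""
    if kind == "factual":
        if pattern is not None:
            # Regex path: any match anywhere in the answer counts as correct.
            return 1.0 if re.search(pattern, answer, re.IGNORECASE) else 0.0
        # Exact path, assuming whitespace and case are not significant.
        same = answer.strip().lower() == expected.strip().lower()
        return 1.0 if same else 0.0
    if kind == "reasoning":
        if judge is None:
            raise ValueError("reasoning questions need a judge callable")
        # The judge (e.g. a wrapper around an LLM call) returns the score.
        return judge(answer, expected)
    raise ValueError(f"unknown question kind: {kind!r}")


exact = score_answer("Paris", " paris ", "factual")
regex = score_answer("She moved in March 2021.", "", "factual",
                     pattern=r"march\s+2021")
judged = score_answer("free text", "reference", "reasoning",
                      judge=lambda answer, expected: 0.5)
```

Only the factual path is deterministic; scores from the reasoning path depend on the judge and can vary between runs.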
Purpose alignment
How this metric relates to each track (v1.0):
| Track | Alignment |
|---|---|
| conversational | core |
| knowledge-brain | orthogonal |
| graph | orthogonal |
| agent-memory | adjacent |
| baseline | core |
Expected failure modes
- TEMPORAL_CONFUSION — mixes up when events occurred
- STALE_MEMORY — returns outdated information that was later corrected
- RETRIEVAL_MISS — cannot locate information from earlier sessions
- WRONG_ENTITY — confuses people or topics across sessions
See the full failure taxonomy for all 20+ reason codes.
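For tooling that consumes these reason codes, one possible representation is an enum plus a tally. This sketch covers only the four codes listed above, not the full taxonomy; the names `FailureCode` and `tally` are hypothetical.

```python
from collections import Counter
from enum import Enum


class FailureCode(Enum):
    """The four expected failure modes listed for this metric."""

    TEMPORAL_CONFUSION = "mixes up when events occurred"
    STALE_MEMORY = "returns outdated information that was later corrected"
    RETRIEVAL_MISS = "cannot locate information from earlier sessions"
    WRONG_ENTITY = "confuses people or topics across sessions"


def tally(codes):
    """Count failures by code; raises KeyError for a code not listed here."""
    return Counter(FailureCode[name] for name in codes)


counts = tally(["STALE_MEMORY", "RETRIEVAL_MISS", "STALE_MEMORY"])
```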
Dataset source
LongMemEval academic benchmark (Wu et al., 2024).
Known limitations
- As an academic benchmark, it may not reflect real-world conversation patterns.
- With 500 questions, a full run is expensive for LLM-based adapters.
Stable URL: benchd.ai/methodology/metrics/longmemeval
This URL is referenced in signed manifests. It will not change.