Metric · v1.0 · 500 questions

LongMemEval

Multi-session conversational memory benchmark testing recall, temporal reasoning, and knowledge updates across extended dialogue.

What it measures

Long-horizon memory: can the system remember facts from conversations that happened many sessions ago, reason about temporal ordering, and handle knowledge updates?

How it works

Present multi-session conversation histories spanning diverse topics.
Query the system with questions requiring recall from specific sessions, temporal reasoning, or tracking of updated information.
Score deterministically where possible; use an LLM judge for open-ended reasoning questions.
Report per-dimension scores (recall, temporal, reasoning) and an overall composite.
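
The reporting step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual aggregation: the `aggregate` function name, the (dimension, score) input shape, and the choice of an unweighted mean for the composite are all assumptions, since the weighting is not specified here.

```python
from collections import defaultdict
from statistics import mean


def aggregate(results):
    """Aggregate per-question scores into per-dimension scores and a composite.

    `results` is an iterable of (dimension, score) pairs with score in [0, 1].
    Hypothetical input shape, chosen for illustration only.
    """
    by_dimension = defaultdict(list)
    for dimension, score in results:
        by_dimension[dimension].append(score)

    # Per-dimension score: mean of the question scores in that dimension.
    per_dimension = {d: mean(scores) for d, scores in by_dimension.items()}

    # Assumption: the composite is the unweighted mean of the dimension
    # scores, so a dimension with few questions counts as much as a large one.
    composite = mean(list(per_dimension.values()))
    return per_dimension, composite


example = [
    ("recall", 1.0),
    ("recall", 0.0),
    ("temporal", 1.0),
    ("reasoning", 0.5),
]
per_dimension, composite = aggregate(example)
```

Averaging over dimensions rather than over questions keeps a heavily represented dimension from dominating the composite; a question-weighted mean would be the obvious alternative.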

Scoring method

Mixed: exact/regex for factual questions, LLM judge for reasoning questions.
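
A rough sketch of how the mixed scoring could be routed. The question fields (`kind`, `expected`, `pattern`), the whitespace and case normalization, and the `judge` callable are hypothetical and stand in for whatever the real harness uses; the LLM judge is replaced by a stub so the example runs offline.

```python
import re
from typing import Callable, Optional


def score_factual(answer: str, expected: str, pattern: Optional[str] = None) -> float:
    """Deterministic scoring: regex match if a pattern is given, else exact match."""
    if pattern is not None:
        return 1.0 if re.search(pattern, answer, flags=re.IGNORECASE) else 0.0

    # Assumption: exact match is taken after lowercasing and collapsing whitespace.
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return 1.0 if normalize(answer) == normalize(expected) else 0.0


def score_question(question: dict, answer: str,
                   judge: Callable[[dict, str], float]) -> float:
    """Send reasoning questions to the judge and everything else to exact/regex."""
    if question["kind"] == "reasoning":
        return judge(question, answer)
    return score_factual(answer, question["expected"], question.get("pattern"))


def stub_judge(question: dict, answer: str) -> float:
    # Placeholder for an LLM judge call; returns a fixed score for illustration.
    return 0.5


exact = score_question({"kind": "factual", "expected": "Blue Bottle"},
                       "  blue  bottle ", stub_judge)
regex = score_question({"kind": "factual", "expected": "2019", "pattern": r"\b2019\b"},
                       "It was in 2019.", stub_judge)
judged = score_question({"kind": "reasoning"}, "Because the move came first.", stub_judge)
```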

Dimensions tested: recall, temporal, reasoning

Purpose alignment

How this metric relates to each track (v1.0):

Track	Alignment
conversational	core
knowledge-brain	orthogonal
graph	orthogonal
agent-memory	adjacent
baseline	core

Expected failure modes

TEMPORAL_CONFUSION — mixes up when events occurred
STALE_MEMORY — returns outdated information that was later corrected
RETRIEVAL_MISS — cannot locate information from earlier sessions
WRONG_ENTITY — confuses people or topics across sessions

See the full failure taxonomy for all 20+ reason codes.
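
For illustration, the four codes above can be held in an enum, with a simple heuristic for one of them. The `FailureCode` class and the `detect_stale` substring check are assumptions made for this sketch; they are not how the benchmark assigns reason codes.

```python
from enum import Enum
from typing import Optional, Sequence


class FailureCode(Enum):
    TEMPORAL_CONFUSION = "mixes up when events occurred"
    STALE_MEMORY = "returns outdated information that was later corrected"
    RETRIEVAL_MISS = "cannot locate information from earlier sessions"
    WRONG_ENTITY = "confuses people or topics across sessions"


def detect_stale(answer: str, current: str,
                 superseded: Sequence[str]) -> Optional[FailureCode]:
    """Flag STALE_MEMORY when the answer gives an old value but not the current one.

    Hypothetical heuristic based on case-insensitive substring matching.
    """
    text = answer.lower()
    if current.lower() in text:
        # The up-to-date value is present, so the answer is not treated as stale.
        return None
    if any(old.lower() in text for old in superseded):
        return FailureCode.STALE_MEMORY
    # Neither value appears; this sketch does not try to assign another code.
    return None


stale = detect_stale("She lives in Boston.", "Denver", ["Boston"])
updated = detect_stale("She moved from Boston to Denver.", "Denver", ["Boston"])
```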

Dataset source

LongMemEval academic benchmark (Wu et al., 2024).

Known limitations

As an academic benchmark, it may not reflect real-world conversation patterns.
At 500 questions, a full run is expensive with LLM-based adapters.

Stable URL: benchd.ai/methodology/metrics/longmemeval
This URL is referenced in signed manifests. It will not change.