Methodology
Metric v1.0 · 500 questions

LongMemEval

Multi-session conversational memory benchmark testing recall, temporal reasoning, and knowledge updates across extended dialogue.

What it measures

Long-horizon memory: can the system remember facts from conversations that happened many sessions ago, reason about temporal ordering, and handle knowledge updates?

How it works

  1. Present multi-session conversation histories spanning diverse topics.
  2. Query the system with questions requiring recall from specific sessions, temporal reasoning, or tracking of updated information.
  3. Score deterministically where possible; use an LLM judge for open-ended reasoning questions.
  4. Report per-dimension scores (recall, temporal, reasoning) and an overall composite.
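
The reporting step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual aggregation: the source does not say how the composite is weighted, so the unweighted mean across dimensions used here is an assumption, and the function name `summarize` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Aggregate per-question outcomes into per-dimension scores.

    `results` is an iterable of (dimension, correct) pairs, where
    `dimension` is one of "recall", "temporal", "reasoning" and
    `correct` is a bool.

    ASSUMPTION: the composite is the unweighted mean of the
    per-dimension accuracies; the real metric may weight differently.
    """
    by_dimension = defaultdict(list)
    for dimension, correct in results:
        by_dimension[dimension].append(1.0 if correct else 0.0)

    # Accuracy within each dimension.
    per_dimension = {d: mean(v) for d, v in by_dimension.items()}
    # Assumed composite: every dimension counts equally.
    composite = mean(per_dimension.values())
    return per_dimension, composite


per_dimension, composite = summarize([
    ("recall", True),
    ("recall", False),
    ("temporal", True),
    ("reasoning", False),
])
```

Averaging per dimension first keeps a dimension with few questions from being drowned out by a larger one; a per-question mean would be the other obvious choice.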

Scoring method

Mixed: exact match or regex for factual questions, an LLM judge for reasoning questions.

Dimensions tested: recall, temporal, reasoning
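
A rough sketch of what such a mixed scorer could look like. All names here (`score_answer`, `normalize`, the `question_type` labels, the `judge` callable) are hypothetical, and the normalization rules are an assumption; the source only states that factual questions are scored by exact match or regex and reasoning questions by an LLM judge.

```python
import re
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def score_answer(
    question_type: str,
    answer: str,
    expected: Optional[str] = None,
    pattern: Optional[str] = None,
    judge: Optional[Callable[[str, Optional[str]], bool]] = None,
) -> bool:
    """Return True if `answer` is accepted for this question.

    Factual questions are scored deterministically: by regex when a
    pattern is supplied, otherwise by normalized exact match.
    Anything else is delegated to `judge`, a caller-supplied callable
    standing in for an LLM judge.
    """
    if question_type == "factual":
        if pattern is not None:
            return re.search(pattern, answer, flags=re.IGNORECASE) is not None
        if expected is None:
            raise ValueError("factual questions need `expected` or `pattern`")
        return normalize(answer) == normalize(expected)

    if judge is None:
        raise ValueError("non-factual questions need a judge callable")
    return judge(answer, expected)
```

Passing the judge in as a plain callable keeps the deterministic path free of any model dependency and lets tests substitute a stub.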

Purpose alignment

How this metric relates to each track (v1.0):

Track            Alignment
conversational   core
knowledge-brain  orthogonal
graph            orthogonal
agent-memory     adjacent
baseline         core

Expected failure modes

  • TEMPORAL_CONFUSION — mixes up when events occurred
  • STALE_MEMORY — returns outdated information that was later corrected
  • RETRIEVAL_MISS — cannot locate information from earlier sessions
  • WRONG_ENTITY — confuses people or topics across sessions

See the full failure taxonomy for all 20+ reason codes.
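
To make the STALE_MEMORY code concrete, here is a small sketch of how a knowledge-update failure might be tagged. It is an illustration under assumptions, not the benchmark's classifier: the `FailureCode` enum only mirrors the four codes listed above, `classify_update_answer` is a hypothetical helper, and falling back to RETRIEVAL_MISS for any other wrong answer is a deliberate simplification.

```python
from enum import Enum
from typing import List, Optional


class FailureCode(Enum):
    # Only the four codes named in this section; the full taxonomy has more.
    TEMPORAL_CONFUSION = "TEMPORAL_CONFUSION"
    STALE_MEMORY = "STALE_MEMORY"
    RETRIEVAL_MISS = "RETRIEVAL_MISS"
    WRONG_ENTITY = "WRONG_ENTITY"


def classify_update_answer(answer: str, history: List[str]) -> Optional[FailureCode]:
    """Tag an answer to a knowledge-update question.

    `history` holds every value a fact has taken, oldest first, so the
    last element is the currently correct value.
    Returns None when the answer is correct.
    """
    if not history:
        raise ValueError("history must contain at least one value")
    if answer == history[-1]:
        return None  # matches the latest value: correct
    if answer in history[:-1]:
        # The system returned a value that was later corrected.
        return FailureCode.STALE_MEMORY
    # ASSUMPTION: any other wrong answer is treated as a retrieval miss.
    return FailureCode.RETRIEVAL_MISS


# The user first said they live in Berlin, then later moved to Lisbon.
code = classify_update_answer("Berlin", ["Berlin", "Lisbon"])
```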

Dataset source

LongMemEval academic benchmark (Wu et al., 2024).

Known limitations

  • As an academic benchmark, it may not reflect real-world conversation patterns.
  • The 500-question set makes this expensive to run with LLM-based adapters.

Stable URL: benchd.ai/methodology/metrics/longmemeval
This URL is referenced in signed manifests. It will not change.
