Methodology
Metric v1.0 · 1,540 questions

LoCoMo

Large-scale long-conversation memory benchmark with multi-hop reasoning, temporal ordering, and open-domain questions.

What it measures

Comprehensive conversational memory across diverse question types: single-hop retrieval, multi-hop reasoning, temporal ordering, and open-ended summarization.

How it works

  1. Ingest long conversation transcripts (thousands of turns).
  2. Query with questions requiring single-hop retrieval, multi-hop chaining, temporal reasoning, or summarization.
  3. Score using mixed methods: deterministic matching for factual questions, an LLM judge for open-ended and multi-hop questions.
  4. Report per-type and overall accuracy.
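The four steps above can be sketched as a small evaluation loop. This is a minimal sketch under stated assumptions, not the benchmark's actual harness: `MemorySystem`, `Question`, `run_benchmark`, and the `scorer` callback are hypothetical names, the question type labels are illustrative, and "overall accuracy" is assumed here to be the fraction of all questions answered correctly.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Protocol


class MemorySystem(Protocol):
    """Hypothetical interface for a system under test."""

    def ingest(self, turns: list[str]) -> None: ...
    def answer(self, question: str) -> str: ...


@dataclass
class Question:
    text: str
    expected: str
    qtype: str  # illustrative labels: "single-hop", "multi-hop", "temporal", ...


def run_benchmark(
    system: MemorySystem,
    turns: list[str],
    questions: list[Question],
    scorer: Callable[[Question, str], bool],
) -> dict[str, float]:
    """Ingest a transcript, query every question, and report accuracy."""
    system.ingest(turns)  # step 1: ingest the conversation

    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for q in questions:
        prediction = system.answer(q.text)  # step 2: query
        total[q.qtype] += 1
        if scorer(q, prediction):  # step 3: score (scorer supplied by caller)
            correct[q.qtype] += 1

    # step 4: per-type accuracy, plus overall accuracy across all questions
    report = {qtype: correct[qtype] / total[qtype] for qtype in total}
    n = sum(total.values())
    report["overall"] = sum(correct.values()) / n if n else 0.0
    return report
```

Passing the scorer in as a callback keeps the loop independent of whether a given question is scored deterministically or by an LLM judge.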

Scoring method

Mixed: exact or regex matching for factual questions; LLM judge for open-ended and multi-hop questions.
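One way to picture the mixed scoring is a dispatcher that routes each question to a deterministic check or to a judge. The sketch below is illustrative only: the type label `"factual"`, the `normalize` rule, and the `judge` callable (standing in for an LLM judge call) are assumptions, not the benchmark's published implementation.

```python
import re
from typing import Callable, Optional

# Assumed label for questions scored deterministically; everything else
# (open-ended, multi-hop) is routed to the judge.
DETERMINISTIC_TYPES = {"factual"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def score_deterministic(
    prediction: str, expected: str, pattern: Optional[str] = None
) -> bool:
    """Regex match if a pattern is given, otherwise normalized exact match."""
    if pattern is not None:
        return re.search(pattern, prediction, flags=re.IGNORECASE) is not None
    return normalize(prediction) == normalize(expected)


def score_answer(
    qtype: str,
    prediction: str,
    expected: str,
    judge: Callable[[str, str], bool],
    pattern: Optional[str] = None,
) -> bool:
    """Route a question to exact/regex scoring or to the judge callable."""
    if qtype in DETERMINISTIC_TYPES:
        return score_deterministic(prediction, expected, pattern)
    return judge(prediction, expected)
```

In a real run, `judge` would wrap a model call; taking it as a plain callable means the routing logic can be tested with a stub.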

Dimensions tested: recall, temporal, reasoning

Purpose alignment

How this metric relates to each track (v1.0):

Track            Alignment
conversational   core
knowledge-brain  orthogonal
graph            orthogonal
agent-memory     orthogonal
baseline         core

Expected failure modes

  • RETRIEVAL_MISS — cannot find relevant conversation segments
  • TEMPORAL_CONFUSION — fails multi-hop temporal chains
  • PARTIAL_ANSWER — gets some hops right but not all
  • HALLUCINATION — fabricates connections between unrelated conversations

See the full failure taxonomy for all 20+ reason codes.
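For analysis, the four codes listed above could be held in an enumeration and tallied across failed questions. This is a hypothetical sketch: `FailureCode` and `tally_failures` are illustrative names, only the four codes named on this page are included (the full taxonomy has more), and how a failure gets assigned its code is outside the sketch.

```python
from collections import Counter
from enum import Enum


class FailureCode(str, Enum):
    """The four reason codes listed for LoCoMo on this page."""

    RETRIEVAL_MISS = "RETRIEVAL_MISS"
    TEMPORAL_CONFUSION = "TEMPORAL_CONFUSION"
    PARTIAL_ANSWER = "PARTIAL_ANSWER"
    HALLUCINATION = "HALLUCINATION"


def tally_failures(labels: list[str]) -> Counter:
    """Count failures per code; an unknown label raises ValueError."""
    return Counter(FailureCode(label) for label in labels)
```

Rejecting unknown labels, rather than silently counting them, surfaces typos or codes from the wider taxonomy that this subset does not cover.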

Dataset source

LoCoMo academic benchmark (Maharana et al., 2024).

Known limitations

  • With 1,540 questions, this is the largest benchmark, which makes it expensive to run.
  • Some questions may have ambiguous ground truth.

Stable URL: benchd.ai/methodology/metrics/locomo
This URL is referenced in signed manifests. It will not change.
