Metric v1.0
1,540 questions

LoCoMo

Large-scale long-conversation memory benchmark with multi-hop reasoning, temporal ordering, and open-domain questions.

What it measures

Comprehensive conversational memory across diverse question types: single-hop retrieval, multi-hop reasoning, temporal ordering, and open-ended summarization.

How it works

Ingest long conversation transcripts (thousands of turns).
Query with questions requiring single-hop retrieval, multi-hop chaining, temporal reasoning, or summarization.
Score using mixed methods: deterministic matching for factual questions, an LLM judge for open-ended and multi-hop questions.
Report per-type and overall accuracy.
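
The reporting step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `accuracy_report`, the question-type labels, and the choice to compute overall accuracy as a micro-average over all questions are assumptions.

```python
from collections import defaultdict

def accuracy_report(results):
    """Aggregate (question_type, is_correct) pairs into per-type and overall accuracy.

    Assumption: overall accuracy is a micro-average over all questions;
    the source does not say whether micro or macro averaging is used.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, ok in results:
        totals[qtype] += 1
        correct[qtype] += int(ok)
    per_type = {t: correct[t] / totals[t] for t in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return {"per_type": per_type, "overall": overall}

# Hypothetical scored results, for illustration only.
results = [
    ("single_hop", True),
    ("single_hop", False),
    ("multi_hop", True),
    ("temporal", True),
]
report = accuracy_report(results)
```

Here `report["overall"]` is 0.75 and `report["per_type"]["single_hop"]` is 0.5. A macro-average (the mean of the per-type accuracies) would weight rare question types more heavily.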

Scoring method

Mixed: exact/regex for factual, LLM judge for open-ended and multi-hop.
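
One way such a mixed scorer might be routed is sketched below. The type labels, the question dictionary fields (`type`, `expected`, `pattern`), and `stub_judge` are hypothetical; in particular, `stub_judge` is only a placeholder for a real LLM judge call.

```python
import re

# Hypothetical type labels: per the text, open-ended and multi-hop go to the judge.
JUDGED_TYPES = {"open_ended", "multi_hop"}

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not count as errors."""
    return re.sub(r"\s+", " ", text.strip().lower())

def score_factual(answer, expected, pattern=None):
    """Deterministic scoring: regex search if a pattern is given, else normalized exact match."""
    if pattern is not None:
        return re.search(pattern, answer, flags=re.IGNORECASE) is not None
    return normalize(answer) == normalize(expected)

def score(question, answer, judge):
    """Route a question to the judge or to deterministic scoring based on its type."""
    if question["type"] in JUDGED_TYPES:
        return bool(judge(question, answer))
    return score_factual(answer, question["expected"], question.get("pattern"))

def stub_judge(question, answer):
    """Placeholder for an LLM judge: checks only that the expected phrase appears."""
    return question["expected"].lower() in answer.lower()
```

For example, `score({"type": "single_hop", "expected": "Paris"}, "  paris ", stub_judge)` returns `True` through the exact-match path, while a `multi_hop` question is passed to the judge callable.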

Dimensions tested: recall, temporal, reasoning

Purpose alignment

How this metric relates to each track (v1.0):

Track	Alignment
conversational	core
knowledge-brain	orthogonal
graph	orthogonal
agent-memory	orthogonal
baseline	core

Expected failure modes

RETRIEVAL_MISS — cannot find relevant conversation segments
TEMPORAL_CONFUSION — fails multi-hop temporal chains
PARTIAL_ANSWER — gets some hops right but not all
HALLUCINATION — fabricates connections between unrelated conversations
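
For tooling that consumes these codes, the four failure modes listed above could be represented as an enumeration and tallied per run. The names `FailureCode` and `tally_failures` are illustrative assumptions, and only the four codes named here are included, not the full taxonomy.

```python
from collections import Counter
from enum import Enum

class FailureCode(Enum):
    """The four failure modes listed for this metric, with their descriptions."""
    RETRIEVAL_MISS = "cannot find relevant conversation segments"
    TEMPORAL_CONFUSION = "fails multi-hop temporal chains"
    PARTIAL_ANSWER = "gets some hops right but not all"
    HALLUCINATION = "fabricates connections between unrelated conversations"

def tally_failures(codes):
    """Count occurrences of each failure code; unknown code names raise KeyError."""
    return Counter(FailureCode[c] for c in codes)

# Hypothetical codes attached to failed questions, for illustration only.
counts = tally_failures(["RETRIEVAL_MISS", "PARTIAL_ANSWER", "RETRIEVAL_MISS"])
```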

See the full failure taxonomy for all 20+ reason codes.

Dataset source

LoCoMo academic benchmark (Maharana et al., 2024).

Known limitations

With 1,540 questions, this is the largest benchmark, which makes it expensive to run.
Some questions may have ambiguous ground truth.

Stable URL: benchd.ai/methodology/metrics/locomo
This URL is referenced in signed manifests. It will not change.