Methodology
Metric v1.0 · 1,540 questions
LoCoMo
Large-scale long-conversation memory benchmark with multi-hop reasoning, temporal ordering, and open-domain questions.
What it measures
Comprehensive conversational memory across diverse question types: single-hop retrieval, multi-hop reasoning, temporal ordering, and open-ended summarization.
How it works
- Ingest long conversation transcripts (thousands of turns).
- Query with questions requiring single-hop retrieval, multi-hop chaining, temporal reasoning, or summarization.
- Score using mixed methods: deterministic matching (exact or regex) for factual questions, an LLM judge for open-ended and multi-hop questions.
- Report per-type and overall accuracy.
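The reporting step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `accuracy_report`, the question-type labels, and the choice to compute overall accuracy as a micro-average over all questions are assumptions.

```python
from collections import defaultdict


def accuracy_report(results):
    """Aggregate (question_type, is_correct) pairs into per-type and overall accuracy.

    Assumption: overall accuracy is the micro-average, i.e. correct answers
    divided by all questions, regardless of type.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)

    per_type = {t: correct[t] / totals[t] for t in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return {"per_type": per_type, "overall": overall}


if __name__ == "__main__":
    # Hypothetical graded results; the type labels are illustrative only.
    graded = [
        ("single_hop", True),
        ("single_hop", False),
        ("temporal", True),
        ("multi_hop", True),
    ]
    print(accuracy_report(graded))
```

A macro-average (the mean of the per-type accuracies) would weight rare question types more heavily; which of the two the published score uses is not stated here.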
Scoring method
Mixed: exact/regex for factual, LLM judge for open-ended and multi-hop.
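One way to picture the mixed scoring is a small dispatcher that routes each question to a deterministic check or to a judge. The sketch below is hypothetical: the type labels, the `score_answer` signature, the normalization rule, and the `judge` callable (a stand-in for an LLM judge call) are all assumptions, not the benchmark's real interface.

```python
import re

# Hypothetical labels for question types that go to the LLM judge.
JUDGED_TYPES = {"open_ended", "multi_hop"}


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def score_answer(question_type, prediction, reference, pattern=None, judge=None):
    """Return True if the prediction is accepted for this question.

    Judged types are delegated to `judge(prediction, reference)`, an assumed
    callable wrapping an LLM judge. All other types are scored
    deterministically: by regex when a pattern is supplied, otherwise by
    normalized exact match.
    """
    if question_type in JUDGED_TYPES:
        if judge is None:
            raise ValueError("a judge callable is required for this question type")
        return bool(judge(prediction, reference))
    if pattern is not None:
        return re.search(pattern, prediction, flags=re.IGNORECASE) is not None
    return normalize(prediction) == normalize(reference)
```

Keeping the judge as an injected callable means the deterministic path can be tested without any model access, and the judge can be replaced by a stub during dry runs.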
Dimensions tested: recall, temporal, reasoning
Purpose alignment
How this metric relates to each track (v1.0):
| Track | Alignment |
|---|---|
| conversational | core |
| knowledge-brain | orthogonal |
| graph | orthogonal |
| agent-memory | orthogonal |
| baseline | core |
Expected failure modes
- RETRIEVAL_MISS — cannot find relevant conversation segments
- TEMPORAL_CONFUSION — fails multi-hop temporal chains
- PARTIAL_ANSWER — gets some hops right but not all
- HALLUCINATION — fabricates connections between unrelated conversations
See the full failure taxonomy for all 20+ reason codes.
Dataset source
LoCoMo academic benchmark (Maharana et al., 2024).
Known limitations
- At 1,540 questions, this is the largest benchmark, which makes it expensive to run.
- Some questions may have ambiguous ground truth.
Stable URL: benchd.ai/methodology/metrics/locomo
This URL is referenced in signed manifests. It will not change.