
Failure Taxonomy

Every failed benchmark question is classified with a standardized reason code. These codes appear in signed manifests and failure traces, enabling systematic analysis of why systems fail, not just that they fail.

Retrieval

RETRIEVAL_MISS

System returned no relevant context for the query.

Example: Query asks about a meeting date; system returns unrelated content.

RETRIEVAL_IRRELEVANT

System returned context, but none of it is relevant to the question.

Example: Query asks about dietary preferences; system returns work schedule.

OVER_RETRIEVAL

System returned too much context, diluting the relevant information.

Example: System dumps entire conversation history instead of targeted recall.

Temporal

STALE_MEMORY

System returned outdated information that had since been corrected.

Example: User updated their address; system returns the old one.

TEMPORAL_CONFUSION

System confused the temporal ordering of events.

Example: Event A occurred before event B; system says B happened before A.

TEMPORAL_MISSING

System cannot recall when something happened.

Example: Query asks 'when did X happen?'; system has no temporal context.
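
As an illustration of how a STALE_MEMORY failure might be detected, the sketch below compares a returned value against the timestamped history of a single fact. The function name `detect_stale_memory` and the `(timestamp, value)` history format are assumptions made for this example; they are not part of the harness.

```python
from datetime import datetime
from typing import Optional


def detect_stale_memory(
    returned: str, history: list[tuple[datetime, str]]
) -> Optional[str]:
    """Return "STALE_MEMORY" if `returned` matches a superseded value.

    `history` holds (timestamp, value) pairs for one fact, in any order.
    This format is hypothetical and used only for illustration.
    """
    if not history:
        return None
    ordered = sorted(history, key=lambda item: item[0])
    latest_value = ordered[-1][1]
    if returned == latest_value:
        return None  # the system returned the current value
    superseded = {value for _, value in ordered[:-1]}
    # A value that was never stored is not stale; it belongs to another code.
    return "STALE_MEMORY" if returned in superseded else None


address_history = [
    (datetime(2024, 1, 1), "12 Oak St"),
    (datetime(2024, 3, 1), "9 Elm Ave"),
]
print(detect_stale_memory("12 Oak St", address_history))
```

A value that never appeared in the history returns `None` here, since that case would fall under a different code such as WRONG_FACT or HALLUCINATION.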

Entity

WRONG_ENTITY

System confused two different entities mentioned in the conversation.

Example: Alice likes coffee, Bob likes tea; system says Alice likes tea.

WRONG_FACT

System returned a fact that contradicts stored information.

Example: Stored: budget is $5000; returned: budget is $3000.

PARTIAL_ANSWER

System returned some but not all of the expected information.

Example: Expected 3 items; system returned 2.

Hallucination

HALLUCINATION

System generated information not grounded in any stored memory.

Example: System invents a meeting that never occurred.

UNSUPPORTED_CLAIM

System made a claim that cannot be traced to stored context.

Example: System says 'as we discussed last week' when no such discussion exists.

Memory Mgmt

DELETION_FAILURE

System did not honor an explicit delete or forget request.

Example: User says 'forget my phone number'; system still returns it.

MISSING_PROVENANCE

System cannot identify the source of a stored fact.

Example: System returns a fact but cannot say when or how it was learned.

CROSS_CONTAMINATION

Information from one context leaked into another.

Example: User A's data appears in User B's memory space.

Multi-Agent

CROSS_AGENT_LEAK

Information leaked between agents that should have isolated memory.

Example: Agent 1's private context appears in Agent 2's responses.

HANDOFF_LOSS

Information was lost during agent-to-agent handoff.

Example: User provides info to Agent 1; Agent 2 has no knowledge of it.

CONFLICT_UNRESOLVED

Conflicting information from multiple sources was not resolved.

Example: Two agents provide different answers; system doesn't choose.

System

EMPTY_RECALL

System returned an empty response.

Example: Recall returns '' or null.

RECALL_ERROR

System raised an error during recall.

Example: ConnectionError, TimeoutError, or internal exception.

TIMEOUT

System did not respond within the allowed time.

Example: Recall took >30s and was killed.

FORMAT_MISMATCH

System response was in an unexpected format.

Example: Expected text; got JSON or binary data.
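
The four System codes can be assigned mechanically from the outcome of a recall call. The following is a minimal sketch under stated assumptions: the function name `classify_system_failure`, its parameters, and the order of the checks are hypothetical, and the 30-second default is taken from the TIMEOUT example above rather than from a documented harness setting.

```python
from typing import Any, Optional


def classify_system_failure(
    response: Any,
    error: Optional[BaseException] = None,
    elapsed_s: float = 0.0,
    limit_s: float = 30.0,
) -> Optional[str]:
    """Map the outcome of one recall call to a System code, or None.

    The check order is an assumption: the time limit is tested first,
    then raised errors, then empty and malformed responses.
    """
    if elapsed_s > limit_s:
        return "TIMEOUT"
    if error is not None:
        return "RECALL_ERROR"
    if response is None or response == "":
        return "EMPTY_RECALL"
    if not isinstance(response, str):
        return "FORMAT_MISMATCH"  # text was expected
    return None


print(classify_system_failure("", elapsed_s=0.2))
print(classify_system_failure(None, error=ConnectionError("backend down")))
```

Returning `None` means no System code applies; the question may still fail for a reason in another category.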

Judge

JUDGE_DISAGREEMENT

The LLM judge scored differently than the deterministic scorer.

Example: Deterministic scorer says FAIL; LLM judge says the answer is correct (or vice versa).

ABSTENTION_WRONG

System abstained, but the answer was retrievable.

Example: System says 'I don't know' when the answer is in its memory.
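
Both Judge codes reduce to simple comparisons once the relevant signals are available. The sketch below assumes boolean inputs (`deterministic_pass`, `llm_judge_pass`, `abstained`, `answer_in_memory`) that a harness would have to compute elsewhere; these names are hypothetical and chosen only for illustration.

```python
from typing import Optional


def check_judge_disagreement(
    deterministic_pass: bool, llm_judge_pass: bool
) -> Optional[str]:
    """Flag a question when the two scorers reach different verdicts."""
    if deterministic_pass != llm_judge_pass:
        return "JUDGE_DISAGREEMENT"
    return None


def check_abstention(abstained: bool, answer_in_memory: bool) -> Optional[str]:
    """Flag an abstention only when the answer was actually retrievable."""
    if abstained and answer_in_memory:
        return "ABSTENTION_WRONG"
    return None


print(check_judge_disagreement(deterministic_pass=False, llm_judge_pass=True))
print(check_abstention(abstained=True, answer_in_memory=True))
```

An abstention when the answer is not in memory returns `None`, because declining to answer is then the correct behavior.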

Versioning

The failure taxonomy is versioned alongside the harness. When new failure codes are added, the taxonomy version is bumped and all new manifests reference the updated version. Existing manifests retain their original classification.

Current version: 1.0 | 23 codes | 8 categories
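
For tooling that consumes these codes, the taxonomy can be held as a plain mapping from category to codes. The sketch below lists the codes and categories exactly as given on this page; the variable names and the `category_of` helper are assumptions for illustration, not a published API.

```python
TAXONOMY_VERSION = "1.0"

# Category -> codes, as listed in taxonomy version 1.0.
TAXONOMY = {
    "Retrieval": ["RETRIEVAL_MISS", "RETRIEVAL_IRRELEVANT", "OVER_RETRIEVAL"],
    "Temporal": ["STALE_MEMORY", "TEMPORAL_CONFUSION", "TEMPORAL_MISSING"],
    "Entity": ["WRONG_ENTITY", "WRONG_FACT", "PARTIAL_ANSWER"],
    "Hallucination": ["HALLUCINATION", "UNSUPPORTED_CLAIM"],
    "Memory Mgmt": ["DELETION_FAILURE", "MISSING_PROVENANCE", "CROSS_CONTAMINATION"],
    "Multi-Agent": ["CROSS_AGENT_LEAK", "HANDOFF_LOSS", "CONFLICT_UNRESOLVED"],
    "System": ["EMPTY_RECALL", "RECALL_ERROR", "TIMEOUT", "FORMAT_MISMATCH"],
    "Judge": ["JUDGE_DISAGREEMENT", "ABSTENTION_WRONG"],
}

# Reverse lookup: code -> category.
CODE_TO_CATEGORY = {
    code: category for category, codes in TAXONOMY.items() for code in codes
}


def category_of(code: str) -> str:
    """Return the category for a code; raise ValueError for unknown codes."""
    try:
        return CODE_TO_CATEGORY[code]
    except KeyError:
        raise ValueError(
            f"{code!r} is not in taxonomy version {TAXONOMY_VERSION}"
        ) from None


print(len(CODE_TO_CATEGORY), "codes in", len(TAXONOMY), "categories")
print(category_of("HANDOFF_LOSS"))
```

Rejecting unknown codes with an explicit error makes a taxonomy version mismatch visible, for instance when a manifest produced under a later version is read by older tooling.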

Stable URL: benchd.ai/methodology/failure-taxonomy
