Failure Taxonomy
Every failed benchmark question is classified with a standardized reason code. These codes appear in signed manifests and failure traces, enabling systematic analysis of why systems fail, not just that they fail.
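As a rough illustration of the systematic analysis these codes allow, the sketch below tallies failures by category. The code-to-category mapping is transcribed from this taxonomy; the trace layout (a list of records with a `reason_code` field) and the names `tally_by_category` and `example_trace` are assumptions made for the example, not the harness's actual schema.

```python
from collections import Counter

# Reason codes grouped by category, transcribed from taxonomy version 1.0.
CODES_BY_CATEGORY = {
    "Retrieval": ["RETRIEVAL_MISS", "RETRIEVAL_IRRELEVANT", "OVER_RETRIEVAL"],
    "Temporal": ["STALE_MEMORY", "TEMPORAL_CONFUSION", "TEMPORAL_MISSING"],
    "Entity": ["WRONG_ENTITY", "WRONG_FACT", "PARTIAL_ANSWER"],
    "Hallucination": ["HALLUCINATION", "UNSUPPORTED_CLAIM"],
    "Memory Mgmt": ["DELETION_FAILURE", "MISSING_PROVENANCE", "CROSS_CONTAMINATION"],
    "Multi-Agent": ["CROSS_AGENT_LEAK", "HANDOFF_LOSS", "CONFLICT_UNRESOLVED"],
    "System": ["EMPTY_RECALL", "RECALL_ERROR", "TIMEOUT", "FORMAT_MISMATCH"],
    "Judge": ["JUDGE_DISAGREEMENT", "ABSTENTION_WRONG"],
}

# Reverse lookup: reason code -> category name.
CATEGORY_BY_CODE = {
    code: category
    for category, codes in CODES_BY_CATEGORY.items()
    for code in codes
}


def tally_by_category(failures):
    """Count failures per category; unrecognized codes fall under 'UNKNOWN'."""
    counts = Counter()
    for failure in failures:
        code = failure["reason_code"]  # hypothetical field name
        counts[CATEGORY_BY_CODE.get(code, "UNKNOWN")] += 1
    return counts


# Hypothetical failure trace, for illustration only.
example_trace = [
    {"question_id": "q1", "reason_code": "STALE_MEMORY"},
    {"question_id": "q2", "reason_code": "TIMEOUT"},
    {"question_id": "q3", "reason_code": "TEMPORAL_CONFUSION"},
]
print(tally_by_category(example_trace))
```

Grouping by category rather than by individual code gives a quick view of whether a system's failures cluster in one area, such as temporal handling, before drilling into specific codes.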
Retrieval
RETRIEVAL_MISS: System returned no relevant context for the query.
Example: Query asks about a meeting date; system returns unrelated content.
RETRIEVAL_IRRELEVANT: System returned context, but none of it is relevant to the question.
Example: Query asks about dietary preferences; system returns work schedule.
OVER_RETRIEVAL: System returned too much context, diluting the relevant information.
Example: System dumps entire conversation history instead of targeted recall.
Temporal
STALE_MEMORY: System returned outdated information that was later corrected.
Example: User updated their address; system returns the old one.
TEMPORAL_CONFUSION: System confused the temporal ordering of events.
Example: Events A then B occurred; system says B happened before A.
TEMPORAL_MISSING: System cannot recall when something happened.
Example: Query asks 'when did X happen?'; system has no temporal context.
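The first two temporal codes can be pictured as simple checks against ground truth. The following is only a sketch under assumed inputs: `is_stale` and `order_is_confused` are hypothetical helper names, and the idea that the harness keeps a chronological value history per fact is an assumption, not something the taxonomy specifies.

```python
def is_stale(returned_value, value_history):
    """Return True if the returned value is a superseded one (STALE_MEMORY).

    value_history holds every value the fact has had, oldest first,
    so the last element is the current (corrected) value.
    """
    if not value_history:
        return False
    current = value_history[-1]
    return returned_value != current and returned_value in value_history[:-1]


def order_is_confused(claimed_order, true_order):
    """Return True if known events are claimed in the wrong order (TEMPORAL_CONFUSION)."""
    rank = {event: i for i, event in enumerate(true_order)}
    # Ignore events the ground truth does not know about.
    claimed_ranks = [rank[event] for event in claimed_order if event in rank]
    return claimed_ranks != sorted(claimed_ranks)


# Hypothetical data mirroring the examples above.
address_history = ["12 Oak St", "98 Pine Ave"]
print(is_stale("12 Oak St", address_history))
print(order_is_confused(["B", "A"], ["A", "B"]))
```

A returned value that matches nothing in the history is deliberately not treated as stale here; under this taxonomy it would more plausibly fall under WRONG_FACT or HALLUCINATION.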
Entity
WRONG_ENTITY: System confused two different entities mentioned in the conversation.
Example: Alice likes coffee, Bob likes tea; system says Alice likes tea.
WRONG_FACT: System returned a fact that contradicts stored information.
Example: Stored: budget is $5000; returned: budget is $3000.
PARTIAL_ANSWER: System returned some but not all of the expected information.
Example: Expected 3 items; system returned 2.
Hallucination
HALLUCINATION: System generated information not grounded in any stored memory.
Example: System invents a meeting that never occurred.
UNSUPPORTED_CLAIM: System made a claim that cannot be traced to stored context.
Example: System says 'as we discussed last week' when no such discussion exists.
Memory Mgmt
DELETION_FAILURE: System did not honor an explicit delete or forget request.
Example: User says 'forget my phone number'; system still returns it.
MISSING_PROVENANCE: System cannot identify the source of a stored fact.
Example: System returns a fact but cannot say when or how it was learned.
CROSS_CONTAMINATION: Information from one context leaked into another.
Example: User A's data appears in User B's memory space.
Multi-Agent
CROSS_AGENT_LEAK: Information leaked between agents that should have isolated memory.
Example: Agent 1's private context appears in Agent 2's responses.
HANDOFF_LOSS: Information was lost during agent-to-agent handoff.
Example: User provides info to Agent 1; Agent 2 has no knowledge of it.
CONFLICT_UNRESOLVED: Conflicting information from multiple sources was not resolved.
Example: Two agents provide different answers; system doesn't choose.
System
EMPTY_RECALL: System returned an empty response.
Example: Recall returns '' or null.
RECALL_ERROR: System raised an error during recall.
Example: ConnectionError, TimeoutError, or internal exception.
TIMEOUT: System did not respond within the allowed time.
Example: Recall took >30s and was killed.
FORMAT_MISMATCH: System response was in an unexpected format.
Example: Expected text; got JSON or binary data.
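The four system-level codes lend themselves to a mechanical check on the raw outcome of a recall call. The sketch below shows one way that check might look; the function name `classify_system_failure`, its parameters, and the order in which the conditions are tested are all assumptions for illustration. The 30-second default mirrors the TIMEOUT example above.

```python
def classify_system_failure(response, error=None, elapsed_s=0.0, timeout_s=30.0):
    """Map a raw recall outcome to a system-level reason code, or None.

    response:  whatever the system returned (may be None)
    error:     exception raised during recall, if any
    elapsed_s: wall-clock duration of the call in seconds
    """
    # Check order is an assumption: an overlong call counts as TIMEOUT
    # even if it eventually produced an error or a response.
    if elapsed_s > timeout_s:
        return "TIMEOUT"
    if error is not None:
        return "RECALL_ERROR"
    if response is None or (isinstance(response, str) and not response.strip()):
        return "EMPTY_RECALL"
    # Only the Python type is checked here; text that merely contains
    # JSON would need a stricter, harness-specific test.
    if not isinstance(response, str):
        return "FORMAT_MISMATCH"
    return None


print(classify_system_failure(""))
print(classify_system_failure(None, error=ConnectionError("refused")))
print(classify_system_failure("Meeting is on Friday", elapsed_s=31.0))
print(classify_system_failure(b"\x00\x01"))
```

A return value of None means no system-level failure was detected, so the response would go on to be scored for the content-level codes in the other categories.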
Judge
JUDGE_DISAGREEMENT: The LLM judge scored differently than the deterministic scorer.
Example: Deterministic says FAIL; LLM judge says correct (or vice versa).
ABSTENTION_WRONG: System abstained but the answer was retrievable.
Example: System says 'I don't know' when the answer is in its memory.
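Both judge codes reduce to comparing a few flags per question. The sketch below assumes the harness already has those flags available as booleans; the function name `judge_failure_codes` and its parameters are hypothetical.

```python
def judge_failure_codes(deterministic_pass, judge_pass, abstained, answer_in_memory):
    """Return the judge-category reason codes that apply to one question."""
    codes = []
    # The two scorers disagree in either direction.
    if deterministic_pass != judge_pass:
        codes.append("JUDGE_DISAGREEMENT")
    # The system declined to answer although the answer was stored.
    if abstained and answer_in_memory:
        codes.append("ABSTENTION_WRONG")
    return codes


print(judge_failure_codes(False, True, abstained=False, answer_in_memory=True))
print(judge_failure_codes(False, False, abstained=True, answer_in_memory=True))
```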
Versioning
The failure taxonomy is versioned alongside the harness. When new failure codes are added, the taxonomy version is bumped and all new manifests reference the updated version. Existing manifests retain their original classification.
Current version: 1.0 | 23 codes | 8 categories
Stable URL: benchd.ai/methodology/failure-taxonomy
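Because each manifest stays tied to the taxonomy version it was classified under, a consumer can validate reason codes against that version's code set. The sketch below shows one possible check; the manifest fields `taxonomy_version` and `failures`, the `reason_code` key, and the names `TAXONOMY_VERSIONS` and `unknown_codes` are assumptions, not the actual signed-manifest schema.

```python
# Code sets per taxonomy version. Only 1.0 exists at the time of writing;
# later versions would be added as new entries, leaving 1.0 untouched.
TAXONOMY_VERSIONS = {
    "1.0": frozenset({
        "RETRIEVAL_MISS", "RETRIEVAL_IRRELEVANT", "OVER_RETRIEVAL",
        "STALE_MEMORY", "TEMPORAL_CONFUSION", "TEMPORAL_MISSING",
        "WRONG_ENTITY", "WRONG_FACT", "PARTIAL_ANSWER",
        "HALLUCINATION", "UNSUPPORTED_CLAIM",
        "DELETION_FAILURE", "MISSING_PROVENANCE", "CROSS_CONTAMINATION",
        "CROSS_AGENT_LEAK", "HANDOFF_LOSS", "CONFLICT_UNRESOLVED",
        "EMPTY_RECALL", "RECALL_ERROR", "TIMEOUT", "FORMAT_MISMATCH",
        "JUDGE_DISAGREEMENT", "ABSTENTION_WRONG",
    }),
}


def unknown_codes(manifest):
    """Return reason codes in the manifest that its taxonomy version lacks."""
    version = manifest["taxonomy_version"]  # hypothetical field name
    if version not in TAXONOMY_VERSIONS:
        raise ValueError(f"unknown taxonomy version: {version}")
    known = TAXONOMY_VERSIONS[version]
    used = {failure["reason_code"] for failure in manifest["failures"]}
    return sorted(used - known)


# Hypothetical manifest, for illustration only.
example_manifest = {
    "taxonomy_version": "1.0",
    "failures": [
        {"question_id": "q1", "reason_code": "HANDOFF_LOSS"},
        {"question_id": "q2", "reason_code": "NOT_A_REAL_CODE"},
    ],
}
print(unknown_codes(example_manifest))
```

Keying the lookup on the manifest's own version, rather than on the latest one, matches the rule that existing manifests retain their original classification.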