Methodology
Metric v1.025 questions
Budget Curves
Measures accuracy at different token budget tiers to understand the cost-accuracy tradeoff.
What it measures
Efficiency: how does accuracy change when the system is constrained to fewer tokens for retrieval?
How it works
- Run the same question set at 5 token tiers (e.g., 100, 500, 1000, 2000, 5000 tokens).
- At each tier, measure retrieval accuracy.
- Plot the accuracy-vs-tokens curve.
- Report the area under the curve and the knee point (where more tokens stop helping).
Scoring method
Deterministic at each tier. Curve analysis is computed post-hoc.
Dimensions tested: recall
Purpose alignment
How this metric relates to each track (v1.0):
| Track | Alignment |
|---|---|
| conversational | adjacent |
| knowledge-brain | core |
| graph | core |
| agent-memory | core |
| baseline | core |
Expected failure modes
- OVER_RETRIEVAL — uses full budget but returns irrelevant context
- RETRIEVAL_MISS — fails at low budgets where precision matters
See the full failure taxonomy for all 20+ reason codes.
Dataset source
Bench'd internal dataset, same questions as Knowledge Retrieval but with token constraints.
Known limitations
- Not all systems support token budget constraints; those that don't get a flat curve.
- Token counting is approximate.
Stable URL: benchd.ai/methodology/metrics/budget-curves
This URL is referenced in signed manifests. It will not change.