Metric · v1.0 · 25 questions

Budget Curves

Measures accuracy at different token budget tiers to understand the cost-accuracy tradeoff.

What it measures

Efficiency: how does accuracy change when the system is constrained to fewer tokens for retrieval?

Run the same question set at 5 token tiers (e.g., 100, 500, 1000, 2000, 5000 tokens).
At each tier, measure retrieval accuracy.
Plot the accuracy-vs-tokens curve.
Report the area under the curve and the knee point (the budget beyond which more tokens stop improving accuracy).
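
The post-hoc analysis in the steps above can be sketched as follows. This is a minimal illustration, not Bench'd's actual implementation: the page does not say how the area or the knee is computed. The sketch assumes trapezoidal integration normalized by the token range, and defines the knee as the smallest tier whose accuracy is within a tolerance of the best accuracy seen at any tier. The function names, the `min_gain` parameter, and the sample accuracies are hypothetical.

```python
from typing import List, Tuple

# One (token_budget, accuracy) pair per tier.
Curve = List[Tuple[int, float]]


def area_under_curve(points: Curve) -> float:
    """Trapezoidal area under accuracy-vs-tokens, divided by the token range.

    Normalizing by the range (an assumption) keeps the result on the same
    0..1 scale as accuracy, so a flat curve at 0.5 scores 0.5.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    span = pts[-1][0] - pts[0][0]
    return area / span if span > 0 else pts[0][1]


def knee_point(points: Curve, min_gain: float = 0.02) -> int:
    """Smallest budget whose accuracy is within min_gain of the best tier.

    This threshold rule is an assumed definition of "where more tokens
    stop helping"; other knee detectors would work as well.
    """
    pts = sorted(points)
    best = max(acc for _, acc in pts)
    for tokens, acc in pts:
        if best - acc <= min_gain:
            return tokens
    return pts[-1][0]  # unreachable in practice: the best tier always qualifies


# Made-up accuracies for the example tiers named in the text.
curve = [(100, 0.40), (500, 0.62), (1000, 0.74), (2000, 0.78), (5000, 0.79)]
auc = area_under_curve(curve)
knee = knee_point(curve)
print(f"normalized AUC = {auc:.3f}, knee at {knee} tokens")
```

Under these assumptions, a system that ignores the budget and scores the same at every tier gets a normalized area equal to its accuracy and a knee at the lowest tier.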

Scoring is deterministic at each tier. The curve analysis is computed post hoc from the per-tier results.

Dimensions tested: recall

How this metric relates to each track (v1.0):

See the full failure taxonomy for all 20+ reason codes.

Uses the Bench'd internal dataset: the same questions as Knowledge Retrieval, run under token constraints.

Not all systems support token budget constraints; systems without that support produce a flat curve.
Token counting is approximate.

Stable URL: benchd.ai/methodology/metrics/budget-curves
This URL is referenced in signed manifests. It will not change.