Methodology
Metric v1.0 · 25 questions

Budget Curves

Measures accuracy at different token budget tiers to understand the cost-accuracy tradeoff.

What it measures

Efficiency: how does accuracy change when the system is constrained to fewer tokens for retrieval?

How it works

  1. Run the same question set at 5 token tiers (e.g., 100, 500, 1000, 2000, 5000 tokens).
  2. At each tier, measure retrieval accuracy.
  3. Plot the accuracy-vs-tokens curve.
  4. Report the area under the curve and the knee point (where more tokens stop helping).
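
The post-hoc curve analysis in step 4 might look like the sketch below. The page does not say how the area or the knee is computed, so the choices here are assumptions: a trapezoidal area normalized by the budget span, and a knee defined as the smallest tier after which no larger tier improves accuracy by more than a threshold. The names `area_under_curve`, `knee_point`, and `min_gain` are hypothetical, and the accuracy values are made up for illustration.

```python
from typing import Sequence


def area_under_curve(budgets: Sequence[float], accuracies: Sequence[float]) -> float:
    """Trapezoidal area under accuracy-vs-tokens, divided by the budget span.

    Normalizing by the span is an assumption; it keeps the result in [0, 1].
    """
    if len(budgets) != len(accuracies) or len(budgets) < 2:
        raise ValueError("need at least two (budget, accuracy) points")
    area = 0.0
    for i in range(1, len(budgets)):
        width = budgets[i] - budgets[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area / (budgets[-1] - budgets[0])


def knee_point(budgets: Sequence[float], accuracies: Sequence[float],
               min_gain: float = 0.01) -> float:
    """Smallest budget after which no larger tier gains more than min_gain.

    The threshold rule is an assumed definition of "more tokens stop helping".
    """
    for i in range(len(budgets)):
        best_from_here = max(accuracies[i:])
        if best_from_here - accuracies[i] <= min_gain:
            return budgets[i]
    return budgets[-1]


# Illustrative values only: five tiers, accuracy as a fraction of 25 questions.
tiers = [100, 500, 1000, 2000, 5000]
accuracy = [5 / 25, 14 / 25, 18 / 25, 19 / 25, 19 / 25]

auc = area_under_curve(tiers, accuracy)
knee = knee_point(tiers, accuracy)
```

Under these assumed definitions, a system that ignores the budget and scores the same at every tier has an area equal to that score and a knee at the lowest tier.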

Scoring method

Deterministic at each tier. Curve analysis is computed post-hoc.

Dimensions tested: recall

Purpose alignment

How this metric relates to each track (v1.0):

Track            Alignment
conversational   adjacent
knowledge-brain  core
graph            core
agent-memory     core
baseline         core

Expected failure modes

  • OVER_RETRIEVAL — uses full budget but returns irrelevant context
  • RETRIEVAL_MISS — fails at low budgets where precision matters

See the full failure taxonomy for all 20+ reason codes.

Dataset source

Bench'd internal dataset: the same questions as the Knowledge Retrieval metric, run under token constraints.

Known limitations

  • Not all systems support token budget constraints; systems that do not support them produce a flat curve.
  • Token counting is approximate.
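
To illustrate why approximate counting matters, here is a minimal sketch of enforcing a token budget on ranked retrieval results. The page does not state which tokenizer or truncation rule is used, so everything here is an assumption: the whitespace-based estimate in `approx_token_count` stands in for a real tokenizer, and `fit_to_budget` is a hypothetical helper, not part of any named system.

```python
from typing import List


def approx_token_count(text: str) -> int:
    """Rough estimate: one token per whitespace-separated word (assumed heuristic)."""
    return len(text.split())


def fit_to_budget(ranked_chunks: List[str], budget: int) -> List[str]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = approx_token_count(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept


# Hypothetical retrieved chunks, best match first.
chunks = ["a b c", "d e", "f g h i"]
within_five = fit_to_budget(chunks, 5)
```

Because the estimate can differ from a model's real token count, a chunk near the boundary may be kept or dropped differently from one system to the next, which is one way small differences between tiers can arise.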

Stable URL: benchd.ai/methodology/metrics/budget-curves
This URL is referenced in signed manifests. It will not change.
