llamaindex-memory 0.0 on LoCoMollm-baseline 0.0 on LoCoMomem0-local 0.0 on LongMemEvalmem0-local 0.0 on LongMemEvalllamaindex-memory 0.0 on LongMemEvalllm-baseline 0.0 on LongMemEvallangchain-memory 0.0 on LongMemEvalcognee 0.0 on LongMemEval13 systems independently scored64 systems indexedllamaindex-memory 0.0 on LoCoMollm-baseline 0.0 on LoCoMomem0-local 0.0 on LongMemEvalmem0-local 0.0 on LongMemEvalllamaindex-memory 0.0 on LongMemEvalllm-baseline 0.0 on LongMemEvallangchain-memory 0.0 on LongMemEvalcognee 0.0 on LongMemEval13 systems independently scored64 systems indexed

LLM Baseline (GPT-4o-mini)

Community-Verified
No memory systemLast tested May 11, 2026

Raw LLM context window with no memory system. All conversation turns fed directly into the model. This is the baseline every memory system must beat to justify existing.

Scores from 0–100. Higher is better. LLM Baseline (no memory system) scores 57.6%. How we calculate this →

TrackLLM Baseline
Track Index
58.5/100

Based on 6 benchmarks.

Benchmark Results

BenchmarkScoreStatusReceipt
LongMemEval57.6VerifiedView
LoCoMo61.2VerifiedView
Reliability52.0VerifiedView
Truth Arbitration80.0VerifiedView
Memory Poisoning0.0VerifiedView
Budget Curves100.0VerifiedView
Other Benchmarks
Knowledge RetrievalNot applicable — outside LLM Baseline track
Knowledge ScaleNot applicable — outside LLM Baseline track

Relative Performance vs All Benchmarked Systems

vs 16 scored systems

Each dot is a system. Amber dot is LLM Baseline (GPT-4o-mini). Amber line = LLM Baseline (no memory).

Overall
No memory: 57.6%
gbrain
57.631th percentile
Recall
No memory: 57.6%
gbrain
77.656th percentile
Temporal
No memory: 57.6%
gbrain
48.344th percentile
Reasoning
No memory: 57.6%
gbrain
48.944th percentile
Bench'd Memory Index
The BMI combines accuracy (70%) and efficiency (30%) into a single production-weighted score. Formula is public and versioned.
57.6
/ 100
#3 of 8 systemsTop 37%
Accuracy (70%)57.6
Efficiency (30%)99.6

Efficiency Metrics

Avg Latency
Average time to retrieve memories and generate an answer. Lower is better.
1.4sTime per recall query
Tokens / Correct
Average tokens consumed per correctly answered question. Lower means more efficient.
43Token cost per correct answer
Recall Tokens
Average tokens returned by the memory system per query. Lower means tighter retrieval.
23Avg tokens per retrieval

Per-Capability Score Matrix

DimensionBudget CurvesKnowledge RetrievalKnowledge ScaleLoCoMoLongMemEvalMemory PoisoningReliabilitySemantic RBACTruth Arbitration
Recall------44.487.5--------
Temporal------30.038.1--------
Reasoning------76.738.5--------
Hallucination------------0.0----
Stale Memory------------100.0----
Entity Confusion------------100.0----
Deletion------------0.0----
Access control--------------0.0--
Budget 1000100.0----------------
Budget 10000100.0----------------
Budget 2000100.0----------------
Budget 500100.0----------------
Budget 5000100.0----------------
Conflict resolution----------------80.0
Document retrieval--100.0--------------
Injection resistance----------0.0------
Knowledge update--80.0--------------
Multi page--100.0--------------
Scale large----100.0------------
Scale medium----100.0------------
Scale small----100.0------------
Semantic search--100.0--------------
Overall100.095.0100.061.254.00.052.00.080.0

Per-Benchmark Breakdown

BenchmarkVerifiedNuance

Performance Over Time — LongMemEval

2026-05-11 to 2026-05-13
0255075100baseline05-1105-1205-13
LLM Baseline57.6

Add badge to your README

Show your Bench'd score on your GitHub repo.

Bench'd Verified: 57.6 BMI
Markdown
[![Bench'd Verified: 57.6 BMI](https://img.shields.io/badge/Bench'd_BMI-57.6-D9982B?style=flat&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAzMiAzMiI+PHJlY3Qgd2lkdGg9IjMyIiBoZWlnaHQ9IjMyIiByeD0iNiIgZmlsbD0iIzExMSIvPjx0ZXh0IHg9IjgiIHk9IjIyIiBmb250LXNpemU9IjIwIiBmb250LWZhbWlseT0ic2VyaWYiIGZpbGw9IiNmZmYiIGZvbnQtd2VpZ2h0PSI2MDAiPkInPC90ZXh0PjwvcHZnPg==)](https://benchd.ai/system/llm-baseline)
HTML
<a href="https://benchd.ai/system/llm-baseline"><img src="https://img.shields.io/badge/Bench'd_BMI-57.6-D9982B?style=flat" alt="Bench'd Verified: 57.6 BMI" /></a>

Command Palette

Search for a command to run...