LLM Baseline (GPT-4o-mini)
Community-Verified · No memory system · Last tested May 11, 2026
Raw LLM context window with no memory system: all conversation turns are fed directly into the model. This is the baseline every memory system must beat to justify its existence.
Scores from 0–100. Higher is better. LLM Baseline (no memory system) scores 57.6%.
Track: LLM Baseline
Track Index
58.5/100
Based on 6 benchmarks.
Benchmark Results
| Benchmark | Score | Status | Receipt |
|---|---|---|---|
| LongMemEval | 57.6 | Verified | View |
| LoCoMo | 61.2 | Verified | View |
| Reliability | 52.0 | Verified | View |
| Truth Arbitration | 80.0 | Verified | View |
| Memory Poisoning | 0.0 | Verified | View |
| Budget Curves | 100.0 | Verified | View |
| Other Benchmarks | | | |
| Knowledge Retrieval | Not applicable (outside LLM Baseline track) | | |
| Knowledge Scale | Not applicable (outside LLM Baseline track) | | |
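The Track Index of 58.5 is consistent with an unweighted mean of the six verified benchmark scores in the table above. The sketch below checks that arithmetic. The aggregation rule is an assumption inferred from the numbers, not something this page states, and `track_index` is a hypothetical helper name.

```python
from statistics import mean

# Scores copied from the Benchmark Results table on this page.
scores = {
    "LongMemEval": 57.6,
    "LoCoMo": 61.2,
    "Reliability": 52.0,
    "Truth Arbitration": 80.0,
    "Memory Poisoning": 0.0,
    "Budget Curves": 100.0,
}


def track_index(benchmark_scores):
    """Unweighted mean rounded to one decimal (assumed aggregation rule)."""
    return round(mean(benchmark_scores.values()), 1)


print(track_index(scores))  # 58.5
```

Under this assumption a single zero (Memory Poisoning) pulls the index down by as much as a perfect score (Budget Curves) lifts it, which is why the index sits close to the middle benchmarks.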
Relative Performance vs All Benchmarked Systems
Compared against 16 scored systems. In each chart, every dot is a system, the amber dot is LLM Baseline (GPT-4o-mini), and the amber line marks LLM Baseline (no memory). The chart labels are "No memory: 57.6%" and "gbrain".

| Dimension | Score | Percentile |
|---|---|---|
| Overall | 57.6 | 31st |
| Recall | 77.6 | 56th |
| Temporal | 48.3 | 44th |
| Reasoning | 48.9 | 44th |
Bench'd Memory Index
The BMI combines accuracy (70%) and efficiency (30%) into a single production-weighted score. Formula is public and versioned.
57.6 / 100
#3 of 8 systems · Top 37%
Accuracy (70%): 57.6
Efficiency (30%): 99.6
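One natural reading of the 70/30 weighting is a linear blend of the two sub-scores. The sketch below illustrates only that reading. The linear form and the name `blended_score` are assumptions, since the published formula is not reproduced on this page.

```python
def blended_score(accuracy, efficiency, w_accuracy=0.70, w_efficiency=0.30):
    """Linear blend of two 0-100 scores (assumed form, not the published formula)."""
    return w_accuracy * accuracy + w_efficiency * efficiency


# Sub-scores as shown on this page.
print(round(blended_score(57.6, 99.6), 1))  # 70.2
```

A plain linear blend of the sub-scores shown here gives 70.2, not the 57.6 displayed as the BMI. Either the versioned formula differs from this sketch or the headline figure reflects something else, so treat the code as an illustration of the weighting only.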
Efficiency Metrics
Avg Latency
Average time to retrieve memories and generate an answer. Lower is better.
Tokens / Correct
Average tokens consumed per correctly answered question. Lower means more efficient.
Recall Tokens
Average tokens returned by the memory system per query. Lower means tighter retrieval.
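As a rough illustration of the Tokens / Correct metric, the sketch below divides the total tokens consumed across all questions by the number of correct answers. That interpretation, the function name `tokens_per_correct`, and the sample numbers are all assumptions. The page does not say whether tokens spent on incorrect answers are counted.

```python
def tokens_per_correct(tokens_used, is_correct):
    """Total tokens over all questions divided by the count of correct answers.

    Assumed interpretation: tokens spent on wrong answers still count as cost.
    """
    correct = sum(is_correct)
    if correct == 0:
        return float("inf")  # no correct answers: cost per correct is unbounded
    return sum(tokens_used) / correct


# Hypothetical run: four questions, three answered correctly.
tokens = [1200, 950, 1100, 1300]
correct = [True, False, True, True]
print(tokens_per_correct(tokens, correct))
```

Counting the tokens spent on wrong answers penalizes systems that burn context without getting the answer right, which matches the "lower means more efficient" framing.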
Per-Capability Score Matrix
| Dimension | Budget Curves | Knowledge Retrieval | Knowledge Scale | LoCoMo | LongMemEval | Memory Poisoning | Reliability | Semantic RBAC | Truth Arbitration |
|---|---|---|---|---|---|---|---|---|---|
| Recall | -- | -- | -- | 44.4 | 87.5 | -- | -- | -- | -- |
| Temporal | -- | -- | -- | 30.0 | 38.1 | -- | -- | -- | -- |
| Reasoning | -- | -- | -- | 76.7 | 38.5 | -- | -- | -- | -- |
| Hallucination | -- | -- | -- | -- | -- | -- | 0.0 | -- | -- |
| Stale Memory | -- | -- | -- | -- | -- | -- | 100.0 | -- | -- |
| Entity Confusion | -- | -- | -- | -- | -- | -- | 100.0 | -- | -- |
| Deletion | -- | -- | -- | -- | -- | -- | 0.0 | -- | -- |
| Access control | -- | -- | -- | -- | -- | -- | -- | 0.0 | -- |
| Budget 500 | 100.0 | -- | -- | -- | -- | -- | -- | -- | -- |
| Budget 1000 | 100.0 | -- | -- | -- | -- | -- | -- | -- | -- |
| Budget 2000 | 100.0 | -- | -- | -- | -- | -- | -- | -- | -- |
| Budget 5000 | 100.0 | -- | -- | -- | -- | -- | -- | -- | -- |
| Budget 10000 | 100.0 | -- | -- | -- | -- | -- | -- | -- | -- |
| Conflict resolution | -- | -- | -- | -- | -- | -- | -- | -- | 80.0 |
| Document retrieval | -- | 100.0 | -- | -- | -- | -- | -- | -- | -- |
| Injection resistance | -- | -- | -- | -- | -- | 0.0 | -- | -- | -- |
| Knowledge update | -- | 80.0 | -- | -- | -- | -- | -- | -- | -- |
| Multi page | -- | 100.0 | -- | -- | -- | -- | -- | -- | -- |
| Scale large | -- | -- | 100.0 | -- | -- | -- | -- | -- | -- |
| Scale medium | -- | -- | 100.0 | -- | -- | -- | -- | -- | -- |
| Scale small | -- | -- | 100.0 | -- | -- | -- | -- | -- | -- |
| Semantic search | -- | 100.0 | -- | -- | -- | -- | -- | -- | -- |
| Overall | 100.0 | 95.0 | 100.0 | 61.2 | 54.0 | 0.0 | 52.0 | 0.0 | 80.0 |
Performance Over Time — LongMemEval
2026-05-11 to 2026-05-13: LLM Baseline scored 57.6.
Add badge to your README
Show your Bench'd score on your GitHub repo.
Markdown
[![Bench'd Verified: 57.6 BMI](https://img.shields.io/badge/Bench'd_BMI-57.6-D9982B?style=flat)](https://benchd.ai/system/llm-baseline)
HTML
<a href="https://benchd.ai/system/llm-baseline"><img src="https://img.shields.io/badge/Bench'd_BMI-57.6-D9982B?style=flat" alt="Bench'd Verified: 57.6 BMI" /></a>