LLM Baseline (GPT-4o-mini)

Community-Verified

No memory system

Last tested May 11, 2026

Raw LLM context window with no memory system: all conversation turns are fed directly into the model. This is the baseline every memory system must beat to justify existing.

Scores range from 0 to 100; higher is better. LLM Baseline (no memory system) scores 57.6.

Track: LLM Baseline

Track Index

58.5/100

Based on 6 benchmarks.

Benchmark Results

Benchmark	Score	Status	Receipt
LongMemEval	57.6	Verified	View
LoCoMo	61.2	Verified	View
Reliability	52.0	Verified	View
Truth Arbitration	80.0	Verified	View
Memory Poisoning	0.0	Verified	View
Budget Curves	100.0	Verified	View

Other Benchmarks

Benchmark	Status
Knowledge Retrieval	Not applicable (outside LLM Baseline track)
Knowledge Scale	Not applicable (outside LLM Baseline track)
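
The page does not state how the Track Index is derived from the six benchmark scores. As a minimal sketch, assuming an unweighted mean (an inference, not a documented formula), the six scores in the Benchmark Results table reproduce the displayed 58.5. The function name `track_index` is hypothetical.

```python
from statistics import mean

# Scores copied from the Benchmark Results table on this page.
benchmark_scores = {
    "LongMemEval": 57.6,
    "LoCoMo": 61.2,
    "Reliability": 52.0,
    "Truth Arbitration": 80.0,
    "Memory Poisoning": 0.0,
    "Budget Curves": 100.0,
}


def track_index(scores: dict[str, float]) -> float:
    """Unweighted mean of the benchmark scores, rounded to one decimal.

    Assumption: equal weighting is inferred from the numbers, not
    documented on this page.
    """
    return round(mean(scores.values()), 1)


print(track_index(benchmark_scores))  # 58.5
```

Under this reading, the 0.0 on Memory Poisoning and the 100.0 on Budget Curves carry the same weight as the long-conversation benchmarks, so a single failed benchmark moves the index by about 17 points.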

Relative Performance vs All Benchmarked Systems

Compared against 16 scored systems. In the plot, each dot is a system, the amber dot is LLM Baseline (GPT-4o-mini), and the amber line marks the LLM Baseline (no memory) reference.

Panel	No-memory reference	Label shown	Score	Percentile
Overall	57.6%	gbrain	57.6	31st
Recall	57.6%	gbrain	77.6	56th
Temporal	57.6%	gbrain	48.3	44th
Reasoning	57.6%	gbrain	48.9	44th
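
The page does not define how the percentiles are computed. One common reading, sketched here as an assumption, is the share of scored systems that fall strictly below a given score. The function name and the 16-system field are hypothetical; only the 57.6 value comes from this page.

```python
def percentile_rank(score: float, all_scores: list[float]) -> float:
    """Percentage of systems in all_scores that score strictly below `score`."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


# Hypothetical field of 16 systems; only 57.6 is taken from the page.
field = [
    12.0, 30.5, 41.0, 49.9, 55.0, 57.6, 60.1, 63.4,
    66.0, 70.2, 72.8, 75.5, 79.0, 83.3, 88.1, 91.7,
]

print(round(percentile_rank(57.6, field)))  # 31
```

With 16 systems, each position is worth 6.25 percentile points, which is consistent with the 31st, 44th, and 56th percentiles shown (5, 7, and 9 systems below).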

Bench'd Memory Index

The Bench'd Memory Index (BMI) combines accuracy (70%) and efficiency (30%) into a single production-weighted score. The formula is public and versioned.

57.6 / 100

#3 of 8 systems (top 37%)

Component	Weight	Score
Accuracy	70%	57.6
Efficiency	30%	99.6
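
The 70/30 split could be read as a plain weighted mean of the two components. That reading is an assumption: with the component scores shown here it gives 70.2, not the displayed 57.6, so the published, versioned formula evidently involves steps this page does not describe and should be treated as authoritative. The function name `weighted_index` is hypothetical.

```python
def weighted_index(
    accuracy: float,
    efficiency: float,
    w_accuracy: float = 0.7,
    w_efficiency: float = 0.3,
) -> float:
    """Plain weighted mean of two 0-100 component scores.

    Assumption: this is only one possible reading of "accuracy (70%)
    and efficiency (30%)"; it is not the published BMI formula.
    """
    if abs(w_accuracy + w_efficiency - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_accuracy * accuracy + w_efficiency * efficiency


# Component scores as displayed on this page.
print(round(weighted_index(57.6, 99.6), 1))  # 70.2
```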

Efficiency Metrics

Avg Latency: 1.4 s per recall query

Average time to retrieve memories and generate an answer. Lower is better.

Tokens / Correct: 43 tokens per correct answer

Average tokens consumed per correctly answered question. Lower means more efficient.

Recall Tokens: 23 tokens per retrieval (average)

Average tokens returned by the memory system per query. Lower means tighter retrieval.
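
The three metrics above can be computed from per-query logs. The sketch below assumes a hypothetical `QueryRecord` shape and reads "tokens per correct" as total tokens divided by the number of correct answers; the page does not specify the harness's exact definitions, and the sample records are illustrative, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """One evaluated query (hypothetical log format)."""
    latency_s: float      # time to retrieve and answer
    total_tokens: int     # tokens consumed answering the query
    recall_tokens: int    # tokens returned by retrieval
    correct: bool         # judged correct or not


def efficiency_metrics(records: list[QueryRecord]) -> dict[str, float]:
    """Average latency, tokens per correct answer, and average recall tokens."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    n_correct = sum(r.correct for r in records)
    total_tokens = sum(r.total_tokens for r in records)
    return {
        "avg_latency_s": sum(r.latency_s for r in records) / n,
        # Assumption: all tokens spent, divided by correct answers only.
        "tokens_per_correct": total_tokens / n_correct if n_correct else float("inf"),
        "avg_recall_tokens": sum(r.recall_tokens for r in records) / n,
    }


# Illustrative records only.
records = [
    QueryRecord(latency_s=1.0, total_tokens=30, recall_tokens=20, correct=True),
    QueryRecord(latency_s=2.0, total_tokens=40, recall_tokens=30, correct=False),
    QueryRecord(latency_s=1.5, total_tokens=50, recall_tokens=10, correct=True),
]
print(efficiency_metrics(records))
```

Charging the tokens of wrong answers to the correct ones, as done here, penalizes systems that spend heavily on queries they still get wrong.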

Per-Capability Score Matrix

Dimension	Budget Curves	Knowledge Retrieval	Knowledge Scale	LoCoMo	LongMemEval	Memory Poisoning	Reliability	Semantic RBAC	Truth Arbitration
Recall	--	--	--	44.4	87.5	--	--	--	--
Temporal	--	--	--	30.0	38.1	--	--	--	--
Reasoning	--	--	--	76.7	38.5	--	--	--	--
Hallucination	--	--	--	--	--	--	0.0	--	--
Stale Memory	--	--	--	--	--	--	100.0	--	--
Entity Confusion	--	--	--	--	--	--	100.0	--	--
Deletion	--	--	--	--	--	--	0.0	--	--
Access control	--	--	--	--	--	--	--	0.0	--
Budget 500	100.0	--	--	--	--	--	--	--	--
Budget 1000	100.0	--	--	--	--	--	--	--	--
Budget 2000	100.0	--	--	--	--	--	--	--	--
Budget 5000	100.0	--	--	--	--	--	--	--	--
Budget 10000	100.0	--	--	--	--	--	--	--	--
Conflict resolution	--	--	--	--	--	--	--	--	80.0
Document retrieval	--	100.0	--	--	--	--	--	--	--
Injection resistance	--	--	--	--	--	0.0	--	--	--
Knowledge update	--	80.0	--	--	--	--	--	--	--
Multi page	--	100.0	--	--	--	--	--	--	--
Scale large	--	--	100.0	--	--	--	--	--	--
Scale medium	--	--	100.0	--	--	--	--	--	--
Scale small	--	--	100.0	--	--	--	--	--	--
Semantic search	--	100.0	--	--	--	--	--	--	--
Overall	100.0	95.0	100.0	61.2	54.0	0.0	52.0	0.0	80.0

Per-Benchmark Breakdown

Benchmark	Harness	Judge	Verified	Nuance	Completed	Receipt

Performance Over Time — LongMemEval

2026-05-11 to 2026-05-13

LLM Baseline: 57.6

Add badge to your README

Show your Bench'd score on your GitHub repo.

Markdown

[![Bench'd Verified: 57.6 BMI](https://img.shields.io/badge/Bench'd_BMI-57.6-D9982B?style=flat&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAzMiAzMiI+PHJlY3Qgd2lkdGg9IjMyIiBoZWlnaHQ9IjMyIiByeD0iNiIgZmlsbD0iIzExMSIvPjx0ZXh0IHg9IjgiIHk9IjIyIiBmb250LXNpemU9IjIwIiBmb250LWZhbWlseT0ic2VyaWYiIGZpbGw9IiNmZmYiIGZvbnQtd2VpZ2h0PSI2MDAiPkInPC90ZXh0Pjwvc3ZnPg==)](https://benchd.ai/system/llm-baseline)

HTML

<a href="https://benchd.ai/system/llm-baseline"><img src="https://img.shields.io/badge/Bench'd_BMI-57.6-D9982B?style=flat" alt="Bench'd Verified: 57.6 BMI" /></a>
