llamaindex-memory 0.0 on LoCoMo | llm-baseline 0.0 on LoCoMo | mem0-local 0.0 on LongMemEval | llamaindex-memory 0.0 on LongMemEval | llm-baseline 0.0 on LongMemEval | langchain-memory 0.0 on LongMemEval | cognee 0.0 on LongMemEval
13 systems independently scored | 64 systems indexed

AutoGPT Memory

Community-Verified
Significant Gravitas | Website | GitHub (170.0k) | Docs | Last tested May 12, 2026

Memory subsystem within the AutoGPT autonomous agent framework. Provides file-backed and vector-store memory for persistent task context across agent execution cycles.

Scores range from 0 to 100; higher is better. The LLM Baseline (no memory system) scores 57.6%.

Track: Agent Memory
Track Index: 64.8/100

Based on 5 benchmarks.

Benchmark Results

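| Benchmark | Score | Status | Receipt |
| --- | --- | --- | --- |
| Knowledge Retrieval | 100.0 | Verified | View |
| Truth Arbitration | 80.0 | Verified | View |
| Memory Poisoning | 0.0 | Verified | View |
| Budget Curves | 100.0 | Verified | View |
| Reliability | 44.0 | Verified | View |
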
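The Track Index of 64.8 is consistent with an unweighted mean of the five benchmark scores listed above. The page does not state how the index is aggregated, so the sketch below assumes a plain mean; the function name `track_index` is hypothetical.

```python
from statistics import mean

# Scores copied from the Benchmark Results table on this page.
benchmark_scores = {
    "Knowledge Retrieval": 100.0,
    "Truth Arbitration": 80.0,
    "Memory Poisoning": 0.0,
    "Budget Curves": 100.0,
    "Reliability": 44.0,
}


def track_index(scores: dict[str, float]) -> float:
    """Unweighted mean of benchmark scores, rounded to one decimal.

    Assumption: the page only says "Based on 5 benchmarks"; equal
    weighting is a guess that happens to reproduce the published value.
    """
    return round(mean(scores.values()), 1)


print(track_index(benchmark_scores))
```

Under this assumption a single 0.0 (here, Memory Poisoning) pulls the index down by a full fifth of the scale, which is worth keeping in mind when comparing systems by Track Index alone.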
Other Benchmarks
LongMemEval: Not applicable (outside Agent Memory track)
LoCoMo: Not applicable (outside Agent Memory track)
Knowledge Scale: Not applicable (outside Agent Memory track)

Relative Performance vs All Benchmarked Systems

vs 16 scored systems

Each dot is a system. Amber dot is AutoGPT Memory. Amber line = LLM Baseline (no memory).

| Dimension | Score | Percentile | No-memory baseline |
| --- | --- | --- | --- |
| Overall | 47.4 | 25th | 57.6% |
| Recall | 73.1 | 25th | 57.6% |
| Temporal | 37.4 | 38th | 57.6% |
| Reasoning | 33.1 | 38th | 57.6% |

Bench'd Memory Index
The BMI combines accuracy (70%) and efficiency (30%) into a single production-weighted score. Formula is public and versioned.
BMI: 38.2 / 100
Rank: #4 of 8 systems (top 50%)
Accuracy (70%): 47.4
Efficiency (30%): 52.6
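As a rough check on the 70/30 weighting, the sketch below computes a plain weighted sum of the two component scores. This is an assumption, not the published formula: the naive sum comes to 48.96, not the reported 38.2, so the versioned BMI formula presumably applies some normalization or penalty that this page does not show. The function name `weighted_index` is hypothetical.

```python
# Weights as stated on the page; combining them as a plain
# weighted sum is an assumption made for illustration only.
ACCURACY_WEIGHT = 0.7
EFFICIENCY_WEIGHT = 0.3


def weighted_index(accuracy: float, efficiency: float) -> float:
    """Naive weighted sum of the two component scores (0 to 100 each)."""
    combined = ACCURACY_WEIGHT * accuracy + EFFICIENCY_WEIGHT * efficiency
    return round(combined, 2)


naive = weighted_index(accuracy=47.4, efficiency=52.6)
published = 38.2  # BMI reported on this page

print(f"naive weighted sum: {naive}")
print(f"gap to published BMI: {round(naive - published, 2)}")
```

The gap suggests reading the public formula before trying to reproduce or compare BMI values from component scores.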

Efficiency Metrics

Avg Latency
Average time to retrieve memories and generate an answer. Lower is better.
381 ms (time per recall query)
Tokens / Correct
Average tokens consumed per correctly answered question. Lower means more efficient.
4.7k (token cost per correct answer)
Recall Tokens
Average tokens returned by the memory system per query. Lower means tighter retrieval.
2.2k (avg tokens per retrieval)
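A metric like Tokens / Correct can be sketched as total tokens spent divided by the number of correct answers. That reading is an assumption (the page could equally mean the average over correct answers only), and the `QueryResult` type and sample numbers below are hypothetical, not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryResult:
    """One benchmark question (hypothetical record shape)."""
    tokens_used: int  # tokens consumed answering this question
    correct: bool     # whether the answer was judged correct


def tokens_per_correct(results: list[QueryResult]) -> Optional[float]:
    """Total tokens across all questions divided by the correct count.

    Returns None when nothing was answered correctly, since the
    ratio is undefined in that case.
    """
    n_correct = sum(1 for r in results if r.correct)
    if n_correct == 0:
        return None
    total_tokens = sum(r.tokens_used for r in results)
    return total_tokens / n_correct


# Illustrative data only.
sample = [
    QueryResult(tokens_used=4000, correct=True),
    QueryResult(tokens_used=5000, correct=False),
    QueryResult(tokens_used=5100, correct=True),
]
print(tokens_per_correct(sample))
```

Counting tokens from wrong answers in the numerator means the metric penalizes a system twice for errors: once in accuracy and again in cost.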

Per-Capability Score Matrix

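| Dimension | Budget Curves | Knowledge Retrieval | LongMemEval | Memory Poisoning | Reliability | Truth Arbitration |
| --- | --- | --- | --- | --- | --- | --- |
| Recall | -- | -- | 73.1 | -- | -- | -- |
| Temporal | -- | -- | 37.4 | -- | -- | -- |
| Reasoning | -- | -- | 33.1 | -- | -- | -- |
| Hallucination | -- | -- | -- | -- | 0.0 | -- |
| Stale Memory | -- | -- | -- | -- | 71.4 | -- |
| Entity Confusion | -- | -- | -- | -- | 100.0 | -- |
| Deletion | -- | -- | -- | -- | 0.0 | -- |
| Budget 1000 | 100.0 | -- | -- | -- | -- | -- |
| Budget 10000 | 100.0 | -- | -- | -- | -- | -- |
| Budget 2000 | 100.0 | -- | -- | -- | -- | -- |
| Budget 500 | 100.0 | -- | -- | -- | -- | -- |
| Budget 5000 | 100.0 | -- | -- | -- | -- | -- |
| Conflict resolution | -- | -- | -- | -- | -- | 80.0 |
| Document retrieval | -- | 100.0 | -- | -- | -- | -- |
| Injection resistance | -- | -- | -- | 0.0 | -- | -- |
| Knowledge update | -- | 100.0 | -- | -- | -- | -- |
| Multi page | -- | 100.0 | -- | -- | -- | -- |
| Semantic search | -- | 100.0 | -- | -- | -- | -- |
| Overall | 100.0 | 100.0 | 47.4 | 0.0 | 44.0 | 80.0 |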

Per-Benchmark Breakdown

Benchmark | Verified | Nuance

Performance Over Time — LongMemEval

2026-05-11 to 2026-05-13
[Chart: score (0 to 100) by date for 05-11, 05-12, and 05-13, with a baseline marker]
AutoGPT: 47.4

Add badge to your README

Show your Bench'd score on your GitHub repo.

Bench'd Verified: 38.2 BMI
Markdown
[![Bench'd Verified: 38.2 BMI](https://img.shields.io/badge/Bench'd_BMI-38.2-D9982B?style=flat&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAzMiAzMiI+PHJlY3Qgd2lkdGg9IjMyIiBoZWlnaHQ9IjMyIiByeD0iNiIgZmlsbD0iIzExMSIvPjx0ZXh0IHg9IjgiIHk9IjIyIiBmb250LXNpemU9IjIwIiBmb250LWZhbWlseT0ic2VyaWYiIGZpbGw9IiNmZmYiIGZvbnQtd2VpZ2h0PSI2MDAiPkInPC90ZXh0Pjwvc3ZnPg==)](https://benchd.ai/system/autogpt-memory)
HTML
<a href="https://benchd.ai/system/autogpt-memory"><img src="https://img.shields.io/badge/Bench'd_BMI-38.2-D9982B?style=flat" alt="Bench'd Verified: 38.2 BMI" /></a>
