llamaindex-memory 0.0 on LoCoMo · llm-baseline 0.0 on LoCoMo · mem0-local 0.0 on LongMemEval · mem0-local 0.0 on LongMemEval · llamaindex-memory 0.0 on LongMemEval · llm-baseline 0.0 on LongMemEval · langchain-memory 0.0 on LongMemEval · cognee 0.0 on LongMemEval
13 systems independently scored · 64 systems indexed

Letta

Community-Verified
Letta Inc · Website · GitHub (38.2k) · Docs · Last tested May 12, 2026
MCP Endpoint: https://api.letta.com/mcp/v1

Stateful LLM agent framework (formerly MemGPT) with built-in memory management, tool use, and multi-step reasoning. Self-editing memory architecture enables unbounded context.

Scores from 0–100. Higher is better. LLM Baseline (no memory system) scores 57.6%. How we calculate this →

Track: Agent Memory
Track Index: 45.0/100

Based on 4 benchmarks. 1 pending.

Benchmark Results

Benchmark            Score    Status    Receipt
Knowledge Retrieval  80.0     Verified  View
Truth Arbitration    80.0     Verified  View
Memory Poisoning     20.0     Verified  View
Budget Curves        0.0      Verified  View
Reliability          Pending  Pending   --
Other Benchmarks
LongMemEval: Not applicable (outside Agent Memory track)
LoCoMo: Not applicable (outside Agent Memory track)
Knowledge Scale: Not applicable (outside Agent Memory track)
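
The page does not spell out how the Track Index is aggregated. As a consistency check only, an unweighted mean of the four verified Agent Memory scores (with the pending Reliability benchmark left out) reproduces the reported 45.0. The function name `track_index` and the choice of a plain mean are assumptions, not the site's published method.

```python
from statistics import mean

# Verified scores copied from the Benchmark Results table.
# Reliability is pending, so it is excluded here.
verified_scores = {
    "Knowledge Retrieval": 80.0,
    "Truth Arbitration": 80.0,
    "Memory Poisoning": 20.0,
    "Budget Curves": 0.0,
}


def track_index(scores: dict[str, float]) -> float:
    """Unweighted mean of verified benchmark scores (assumed aggregation)."""
    return round(mean(scores.values()), 1)


print(track_index(verified_scores))  # 45.0
```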

Relative Performance vs All Benchmarked Systems

vs 16 scored systems

Each dot is a system. Amber dot is Letta. Amber line = LLM Baseline (no memory).

Panel      No memory  Label   Score  Percentile
Overall    57.6%      gbrain  80.0   63rd
Recall     57.6%      gbrain  80.0   56th
Temporal   57.6%      gbrain  0.0    0th
Reasoning  57.6%      gbrain  0.0    0th
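
The page does not define how these percentiles are computed. One common definition is the share of scored systems that fall strictly below a given score. With 16 scored systems, 10 below gives 62.5 (63 when rounded half up) and 9 below gives 56.25, which is at least consistent with the figures shown. The sketch below uses that assumed definition; `percentile_rank` and the `field` scores are invented for illustration and are not data from the site.

```python
import math


def percentile_rank(score: float, all_scores: list[float]) -> int:
    """Percent of systems scoring strictly below `score`, rounded half up."""
    below = sum(1 for s in all_scores if s < score)
    return math.floor(100 * below / len(all_scores) + 0.5)


# Hypothetical field of 16 scores, made up for this example only.
field = [0.0] * 4 + [40.0] * 6 + [80.0] * 3 + [90.0] * 3

print(percentile_rank(80.0, field))  # 63 (10 of 16 systems score lower)
print(percentile_rank(0.0, field))   # 0  (no system scores lower)
```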
Bench'd Memory Index
The BMI combines accuracy (70%) and efficiency (30%) into a single production-weighted score. Formula is public and versioned.
80.0 / 100
#1 of 8 systems · Top 12%
Accuracy (70%): 80.0
Efficiency (30%): --
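
The exact BMI formula is not reproduced on this page, only the 70/30 weighting. A minimal sketch of such a weighted score follows. The fallback to accuracy alone when efficiency is missing is an assumption, chosen because the page reports a BMI of 80.0 alongside an accuracy of 80.0 and no efficiency value. The function name `bmi` is hypothetical.

```python
from typing import Optional

ACCURACY_WEIGHT = 0.7    # stated on the page
EFFICIENCY_WEIGHT = 0.3  # stated on the page


def bmi(accuracy: float, efficiency: Optional[float]) -> float:
    """Weighted accuracy/efficiency score on a 0-100 scale."""
    if efficiency is None:
        # Assumption: with no efficiency score, report accuracy unchanged.
        return round(accuracy, 1)
    return round(ACCURACY_WEIGHT * accuracy + EFFICIENCY_WEIGHT * efficiency, 1)


print(bmi(80.0, None))  # 80.0, matching the page when efficiency is "--"
print(bmi(80.0, 50.0))  # 71.0, using a made-up efficiency score
```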

Efficiency Metrics

Avg Latency
Average time to retrieve memories and generate an answer. Lower is better.
5.8 s (time per recall query)
Tokens / Correct
Average tokens consumed per correctly answered question. Lower means more efficient.
-- (token cost per correct answer)
Recall Tokens
Average tokens returned by the memory system per query. Lower means tighter retrieval.
45 (avg tokens per retrieval)
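
As a rough illustration of the three definitions above, the sketch below derives them from per-query records. The `QueryResult` fields, the sample values, and the reading of Tokens / Correct as total tokens divided by the number of correct answers are all assumptions. The page does not publish the underlying data or the exact computation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryResult:
    correct: bool       # was the question answered correctly?
    tokens_used: int    # total tokens consumed for this question
    recall_tokens: int  # tokens returned by the memory system
    latency_s: float    # time to retrieve memories and answer, in seconds


def efficiency_metrics(results: list[QueryResult]) -> dict[str, Optional[float]]:
    n = len(results)
    n_correct = sum(r.correct for r in results)
    total_tokens = sum(r.tokens_used for r in results)
    return {
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "recall_tokens": sum(r.recall_tokens for r in results) / n,
        # None when nothing was answered correctly, shown as "--" on the page.
        "tokens_per_correct": total_tokens / n_correct if n_correct else None,
    }


# Made-up records for illustration only.
sample = [
    QueryResult(True, 300, 40, 5.0),
    QueryResult(False, 500, 50, 7.0),
    QueryResult(True, 400, 45, 6.0),
]
print(efficiency_metrics(sample))
```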

Per-Capability Score Matrix

Dimension             Budget Curves  Knowledge Retrieval  LongMemEval  Memory Poisoning  Smoke Memory v0  Truth Arbitration
Recall                --             --                   0.0          --                0.0              --
Temporal              --             --                   0.0          --                0.0              --
Reasoning             --             --                   0.0          --                0.0              --
Budget 1000           0.0            --                   --           --                --               --
Budget 10000          0.0            --                   --           --                --               --
Budget 2000           0.0            --                   --           --                --               --
Budget 500            0.0            --                   --           --                --               --
Budget 5000           0.0            --                   --           --                --               --
Conflict resolution   --             --                   --           --                --               80.0
Document retrieval    --             80.0                 --           --                --               --
Injection resistance  --             --                   --           0.0               --               --
Knowledge update      --             60.0                 --           --                --               --
Multi page            --             80.0                 --           --                --               --
Semantic search       --             100.0                --           --                --               --
Overall               0.0            80.0                 0.0          0.0               0.0              80.0

Per-Benchmark Breakdown

Benchmark    Verified  Nuance
LongMemEval  88.4      82.1
PersonaMem   87.1      81.2

Performance Over Time — LongMemEval

2026-05-11 to 2026-05-13
[Chart: LongMemEval score (y-axis 0 to 100) by date, 05-11 through 05-13, with the baseline marked]

Add badge to your README

Show your Bench'd score on your GitHub repo.

Bench'd Verified: 80.0 BMI
Markdown
[![Bench'd Verified: 80.0 BMI](https://img.shields.io/badge/Bench'd_BMI-80.0-D9982B?style=flat&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAzMiAzMiI+PHJlY3Qgd2lkdGg9IjMyIiBoZWlnaHQ9IjMyIiByeD0iNiIgZmlsbD0iIzExMSIvPjx0ZXh0IHg9IjgiIHk9IjIyIiBmb250LXNpemU9IjIwIiBmb250LWZhbWlseT0ic2VyaWYiIGZpbGw9IiNmZmYiIGZvbnQtd2VpZ2h0PSI2MDAiPkInPC90ZXh0Pjwvc3ZnPg==)](https://benchd.ai/system/letta)
HTML
<a href="https://benchd.ai/system/letta"><img src="https://img.shields.io/badge/Bench'd_BMI-80.0-D9982B?style=flat" alt="Bench'd Verified: 80.0 BMI" /></a>
