CrewAI Memory

Community-Verified
CrewAI Inc | Website | GitHub (25.0k) | Docs | Last tested May 12, 2026

Memory module within the CrewAI multi-agent orchestration framework. Supports short-term, long-term, and entity memory types for sharing context between cooperating agents in a crew.

Scores range from 0 to 100; higher is better. The LLM Baseline (no memory system) scores 57.6%.

Track: Agent Memory
Track Index: 66.4/100

Based on 5 benchmarks.

Benchmark Results

| Benchmark | Score | Status | Receipt |
|---|---|---|---|
| Knowledge Retrieval | 100.0 | Verified | View |
| Truth Arbitration | 80.0 | Verified | View |
| Memory Poisoning | 0.0 | Verified | View |
| Budget Curves | 100.0 | Verified | View |
| Reliability | 52.0 | Verified | View |

Other Benchmarks
LongMemEval: Not applicable (outside Agent Memory track)
LoCoMo: Not applicable (outside Agent Memory track)
Knowledge Scale: Not applicable (outside Agent Memory track)
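
The Track Index of 66.4 is consistent with an unweighted mean of the five verified scores in the Benchmark Results table above. The page does not state the formula, so the sketch below is a check under that assumption rather than the site's documented calculation; `track_index` is a hypothetical helper name.

```python
# Verified Agent Memory benchmark scores as listed on this page.
scores = {
    "Knowledge Retrieval": 100.0,
    "Truth Arbitration": 80.0,
    "Memory Poisoning": 0.0,
    "Budget Curves": 100.0,
    "Reliability": 52.0,
}


def track_index(benchmark_scores):
    """Unweighted mean of the scores (assumed formula, not confirmed by the page)."""
    return sum(benchmark_scores.values()) / len(benchmark_scores)


print(round(track_index(scores), 1))  # 66.4
```

Under this reading, the 0.0 on Memory Poisoning costs the index 20 points relative to a perfect score on that benchmark.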

Relative Performance vs All Benchmarked Systems

vs 16 scored systems

Each dot is a system. Amber dot is CrewAI Memory. Amber line = LLM Baseline (no memory).

| Dimension | Score | Percentile | No-memory baseline |
|---|---|---|---|
| Overall | 46.0 | 19th | 57.6% |
| Recall | 74.4 | 31st | 57.6% |
| Temporal | 35.5 | 31st | 57.6% |
| Reasoning | 29.3 | 31st | 57.6% |
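
The percentiles shown here line up with the "16 scored systems" count if a percentile is read as the share of systems scoring strictly lower, rounded to a whole number. That definition is an assumption; the page does not say how ties or the system itself are counted. A minimal sketch, with `percentile_rank` as a hypothetical helper:

```python
def percentile_rank(systems_below, systems_total):
    """Share of systems scoring strictly lower, as a whole-number percentile.

    Assumed definition; the site's exact method is not stated on this page.
    """
    return round(100 * systems_below / systems_total)


TOTAL = 16  # "vs 16 scored systems"

print(percentile_rank(3, TOTAL))  # 19
print(percentile_rank(5, TOTAL))  # 31
```

Under that reading, the 19th percentile corresponds to 3 of 16 systems scoring lower, and the 31st percentile to 5 of 16.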
Bench'd Memory Index
The BMI combines accuracy (70%) and efficiency (30%) into a single production-weighted score. Formula is public and versioned.
Score: 32.5 / 100
Rank: #5 of 8 systems (Top 62%)
Accuracy (70%): 46.0
Efficiency (30%): 12.0
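
As a rough check of the 70/30 weighting, the sketch below applies a plain linear blend to the accuracy and efficiency components shown on this page. This is an assumed reading of the formula, not the published one: the blend gives 35.8, not the listed 32.5, so the versioned formula presumably includes a step (normalization or rounding of inputs, for example) that is not visible here. `weighted_index` is a hypothetical name.

```python
ACCURACY_WEIGHT = 0.70
EFFICIENCY_WEIGHT = 0.30


def weighted_index(accuracy, efficiency):
    """Plain linear blend of two 0-100 components (assumed, not the official formula)."""
    return ACCURACY_WEIGHT * accuracy + EFFICIENCY_WEIGHT * efficiency


naive = weighted_index(accuracy=46.0, efficiency=12.0)
published = 32.5  # BMI as listed on this page

print(round(naive, 1))              # 35.8
print(round(naive - published, 1))  # 3.3
```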

Efficiency Metrics

Avg Latency: 4.3s (time per recall query)
Average time to retrieve memories and generate an answer. Lower is better.

Tokens / Correct: 8.8k (token cost per correct answer)
Average tokens consumed per correctly answered question. Lower means more efficient.

Recall Tokens: 4.0k (avg tokens per retrieval)
Average tokens returned by the memory system per query. Lower means tighter retrieval.
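
The three efficiency metrics can be derived from per-query logs. The sketch below assumes a hypothetical `QueryRecord` shape and treats Tokens / Correct as all tokens consumed divided by the number of correct answers; the benchmark harness may define it differently. The sample records are illustrative only, not benchmark output.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """Hypothetical per-query log entry (not the benchmark's actual schema)."""
    latency_s: float     # time to retrieve memories and generate the answer
    total_tokens: int    # tokens consumed to answer the question
    recall_tokens: int   # tokens returned by the memory system
    correct: bool        # whether the answer was judged correct


def efficiency_metrics(records):
    """Compute average latency, tokens per correct answer, and average recall tokens."""
    n = len(records)
    n_correct = sum(r.correct for r in records)
    total_tokens = sum(r.total_tokens for r in records)
    return {
        "avg_latency_s": sum(r.latency_s for r in records) / n,
        # Undefined when nothing is correct; reported as infinity here.
        "tokens_per_correct": total_tokens / n_correct if n_correct else float("inf"),
        "avg_recall_tokens": sum(r.recall_tokens for r in records) / n,
    }


# Illustrative data only.
records = [
    QueryRecord(4.0, 4500, 4000, True),
    QueryRecord(5.0, 4300, 4200, False),
    QueryRecord(3.0, 4400, 3800, True),
]
metrics = efficiency_metrics(records)
print(metrics)
```

Because incorrect answers still consume tokens, Tokens / Correct under this definition rises as accuracy falls, even when per-query token use is constant.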

Per-Capability Score Matrix

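| Dimension | Budget Curves | Knowledge Retrieval | LongMemEval | Memory Poisoning | Reliability | Truth Arbitration |
|---|---|---|---|---|---|---|
| Recall | -- | -- | 74.4 | -- | -- | -- |
| Temporal | -- | -- | 35.5 | -- | -- | -- |
| Reasoning | -- | -- | 29.3 | -- | -- | -- |
| Hallucination | -- | -- | -- | -- | 0.0 | -- |
| Stale Memory | -- | -- | -- | -- | 100.0 | -- |
| Entity Confusion | -- | -- | -- | -- | 100.0 | -- |
| Deletion | -- | -- | -- | -- | 0.0 | -- |
| Budget 500 | 100.0 | -- | -- | -- | -- | -- |
| Budget 1000 | 100.0 | -- | -- | -- | -- | -- |
| Budget 2000 | 100.0 | -- | -- | -- | -- | -- |
| Budget 5000 | 100.0 | -- | -- | -- | -- | -- |
| Budget 10000 | 100.0 | -- | -- | -- | -- | -- |
| Conflict resolution | -- | -- | -- | -- | -- | 80.0 |
| Document retrieval | -- | 100.0 | -- | -- | -- | -- |
| Injection resistance | -- | -- | -- | 0.0 | -- | -- |
| Knowledge update | -- | 100.0 | -- | -- | -- | -- |
| Multi page | -- | 100.0 | -- | -- | -- | -- |
| Semantic search | -- | 100.0 | -- | -- | -- | -- |
| Overall | 100.0 | 100.0 | 46.0 | 0.0 | 52.0 | 80.0 |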
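
The Overall row is not always the unweighted mean of the cells above it. The sketch below compares the two for three of the benchmark columns: Knowledge Retrieval matches, while LongMemEval (46.4 vs 46.0) and Reliability (50.0 vs 52.0) do not. One possible explanation is that the overall figures are weighted, for instance by question count, but the page does not say. `column_mean` is a hypothetical helper.

```python
# Per-dimension cells and Overall values copied from the matrix on this page.
matrix = {
    "Knowledge Retrieval": {
        "Document retrieval": 100.0,
        "Knowledge update": 100.0,
        "Multi page": 100.0,
        "Semantic search": 100.0,
    },
    "LongMemEval": {"Recall": 74.4, "Temporal": 35.5, "Reasoning": 29.3},
    "Reliability": {
        "Hallucination": 0.0,
        "Stale Memory": 100.0,
        "Entity Confusion": 100.0,
        "Deletion": 0.0,
    },
}
overall = {"Knowledge Retrieval": 100.0, "LongMemEval": 46.0, "Reliability": 52.0}


def column_mean(cells):
    """Unweighted mean of the populated cells in one benchmark column."""
    return sum(cells.values()) / len(cells)


for name, cells in matrix.items():
    mean = round(column_mean(cells), 1)
    print(f"{name}: mean of cells {mean}, listed overall {overall[name]}")
```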

Performance Over Time — LongMemEval

2026-05-11 to 2026-05-13 (chart: score axis 0 to 100 with a baseline marker; dates 05-11, 05-12, 05-13)
CrewAI: 60.0

Add badge to your README

Show your Bench'd score on your GitHub repo.

Bench'd Verified: 32.5 BMI
Markdown
[![Bench'd Verified: 32.5 BMI](https://img.shields.io/badge/Bench'd_BMI-32.5-D9982B?style=flat&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAzMiAzMiI+PHJlY3Qgd2lkdGg9IjMyIiBoZWlnaHQ9IjMyIiByeD0iNiIgZmlsbD0iIzExMSIvPjx0ZXh0IHg9IjgiIHk9IjIyIiBmb250LXNpemU9IjIwIiBmb250LWZhbWlseT0ic2VyaWYiIGZpbGw9IiNmZmYiIGZvbnQtd2VpZ2h0PSI2MDAiPkInPC90ZXh0Pjwvc3ZnPg==)](https://benchd.ai/system/crewai-memory)
HTML
<a href="https://benchd.ai/system/crewai-memory"><img src="https://img.shields.io/badge/Bench'd_BMI-32.5-D9982B?style=flat" alt="Bench'd Verified: 32.5 BMI" /></a>
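
If the score changes between test runs, the Markdown badge can be regenerated instead of edited by hand. The sketch below rebuilds the plain (logo-free) badge using the shields.io static badge path format `label-message-color`, where an underscore in the label renders as a space. The function name `badge_markdown` and its parameters are hypothetical, and the embedded logo parameter from the Markdown snippet above is left out for brevity.

```python
def badge_markdown(bmi, slug, color="D9982B"):
    """Build a Markdown badge linking to a Bench'd system page.

    Assumes the shields.io static badge path format: label-message-color.
    """
    label = "Bench'd_BMI"  # underscore is rendered as a space by shields.io
    image = f"https://img.shields.io/badge/{label}-{bmi:.1f}-{color}?style=flat"
    alt = f"Bench'd Verified: {bmi:.1f} BMI"
    return f"[![{alt}]({image})](https://benchd.ai/system/{slug})"


snippet = badge_markdown(32.5, "crewai-memory")
print(snippet)
```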
