A Raw LLM Beats Most Memory Systems on LongMemEval
Our first benchmark results are in, covering 6 systems. A plain GPT-4o-mini with no memory layer scores 57.6%, higher than LangChain (34.0%) and Mem0 OSS (32.4%). Only LlamaIndex (59.0%) beats the baseline. AutoGPT Memory lands at 47.4%, also below it.