13 systems independently scored · 64 systems indexed
Pricing

Independent verification your customers can trust

Self-reported scores get flagged. Bench'd-verified scores get trusted. Every run is cryptographically signed and publicly verifiable.

Starter

$299/mo

One benchmark run per month against your production system

Get Started
1 full benchmark run per month
LongMemEval + LoCoMo + Reliability suite
Cryptographically signed receipt
Detailed failure analysis report
"Vendor-Verified" badge on leaderboard
Per-dimension score breakdown
Efficiency metrics (latency, tokens, cost)
Results published within 48 hours
Continuous (Most Popular)

$699/mo

Weekly automated runs with regression monitoring

Start Monitoring
Everything in Starter
Weekly automated benchmark runs
Score regression alerts (email + webhook)
Performance-over-time dashboard
GitHub PR status checks (Bench'd CI)
README embed badge with live score
Vendor dashboard with historical data
Priority support
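
As a rough illustration of what a score regression alert could check, the sketch below compares each weekly run with the one before it and flags drops larger than a threshold. The `Run` type, the `max_drop` threshold, and the sample scores are all hypothetical; the page does not state how Bench'd actually defines a regression.

```python
from dataclasses import dataclass


@dataclass
class Run:
    week: str     # label for the weekly run (hypothetical format)
    score: float  # overall score in percent


def find_regressions(runs, max_drop=2.0):
    """Return (previous, current) pairs where the score fell by more than max_drop points."""
    alerts = []
    for prev, curr in zip(runs, runs[1:]):
        if prev.score - curr.score > max_drop:
            alerts.append((prev, curr))
    return alerts


# Made-up history for illustration only, not real benchmark results.
history = [Run("W01", 61.0), Run("W02", 60.5), Run("W03", 55.0)]

for prev, curr in find_regressions(history):
    drop = prev.score - curr.score
    print(f"{curr.week}: score dropped {drop:.1f} points since {prev.week}")
```

A real alert would then be delivered by email or webhook; the delivery step is left out here because its payload format is not described.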

Enterprise

$3,999.99/mo

Custom benchmarks and dedicated engineering support

Contact Us
Everything in Continuous
Custom benchmark suites for your use case
Dedicated adapter engineering
Priority scheduling (same-day runs)
Co-branded benchmark reports
"Partner-Audited" badge (highest trust tier)
Multi-system support (test your full stack)
Dedicated account manager
Custom SLA

Common Questions

What's the difference between community-verified and vendor-verified?

Community-verified means we ran your open-source code ourselves. Vendor-verified means you connected your production endpoint and we ran against it — co-signed with both your key and ours. Vendor-verified scores reflect your actual production system, not just the OSS version.

What benchmarks do you run?

The full suite includes three benchmarks: LongMemEval (500 questions covering recall, temporal, and reasoning), LoCoMo (1,540 questions on multi-session memory), and the Bench'd Reliability benchmark (25 adversarial trap questions covering hallucination, stale memory, entity confusion, and deletion compliance).

How long does a run take?

A full benchmark suite takes 30-90 minutes depending on your system's latency. Results are published within 48 hours of completion.

Can I dispute a score?

Yes. Every run produces a signed receipt with every input, output, and judge reasoning. If you believe a score is unfair, we review the traces together.
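
To make the idea of a reviewable receipt concrete, here is a minimal sketch of how a set of traces can be reduced to a tamper-evident digest. The field names (`question`, `answer`, `judge_reasoning`) and the use of SHA-256 over canonical JSON are assumptions for illustration; the actual receipt format and signature scheme are not specified on this page, and the signing step itself is omitted.

```python
import hashlib
import json


def receipt_digest(traces):
    """Hash the traces in a canonical JSON form so any edit changes the digest."""
    canonical = json.dumps(traces, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical trace with made-up content.
traces = [
    {
        "question": "Where did the user say they moved last year?",
        "answer": "Lisbon",
        "judge_reasoning": "Matches the reference answer.",
    }
]

original = receipt_digest(traces)

# Changing any recorded output produces a different digest.
tampered = [dict(traces[0], answer="Madrid")]
print(original == receipt_digest(tampered))
```

In a scheme like this, a signature over the digest would let anyone holding the public key confirm that the traces under review are the ones that were scored.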

What if my system scores below the LLM baseline?

That's a real result and it will be published. The LLM baseline (57.6%) represents what you'd get with no memory system at all. We include detailed failure traces to help you diagnose why.

Do you offer custom benchmarks?

Yes, on the Enterprise plan. We'll design benchmark questions for your specific use case — customer support memory, coding agent context, sales pipeline recall, etc.

Not sure which plan is right? We're happy to help.

hello@benchd.ai
