Pricing

Independent verification
your customers can trust

Self-reported scores get flagged. Bench'd-verified scores get trusted. Every run is cryptographically signed and publicly verifiable.

Starter

$299/mo

One benchmark run per month against your production system

Get Started

1 full benchmark run per month

LongMemEval + LOCOMO + Reliability suite

Cryptographically signed receipt

Detailed failure analysis report

"Vendor-Verified" badge on leaderboard

Per-dimension score breakdown

Efficiency metrics (latency, tokens, cost)

Results published within 48 hours

Continuous

$699/mo

Weekly automated runs with regression monitoring

Start Monitoring

Everything in Starter

Weekly automated benchmark runs

Score regression alerts (email + webhook)

Performance-over-time dashboard

GitHub PR status checks (Bench'd CI)

README embed badge with live score

Vendor dashboard with historical data

Priority support

Enterprise

$3,999.99/mo

Custom benchmarks and dedicated engineering support

Everything in Continuous

Custom benchmark suites for your use case

Dedicated adapter engineering

Priority scheduling (same-day runs)

Co-branded benchmark reports

"Partner-Audited" badge (highest trust tier)

Multi-system support (test your full stack)

Dedicated account manager

Custom SLA

Common Questions

What's the difference between community-verified and vendor-verified?

Community-verified means we ran your open-source code ourselves. Vendor-verified means you connected your production endpoint and we ran the benchmarks against it, with the results co-signed by both your key and ours. Vendor-verified scores reflect your actual production system, not just the OSS version.

What benchmarks do you run?

The full suite includes LongMemEval (500 questions — recall, temporal, reasoning), LOCOMO (1,540 questions — multi-session memory), and the Bench'd Reliability benchmark (25 adversarial trap questions — hallucination, stale memory, entity confusion, deletion compliance).

How long does a run take?

A full benchmark suite takes 30-90 minutes depending on your system's latency. Results are published within 48 hours of completion.
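As a rough guide to what that range means for your system, the sketch below turns the suite's published question counts into an average time budget per question. It assumes questions run one at a time and that the whole run is spent answering them; neither is something we state above, so treat the numbers as ballpark figures only.

```python
# Question counts as listed for the full suite.
QUESTIONS = {"LongMemEval": 500, "LOCOMO": 1540, "Reliability": 25}
TOTAL_QUESTIONS = sum(QUESTIONS.values())


def seconds_per_question(run_minutes: float, total: int = TOTAL_QUESTIONS) -> float:
    """Average wall-clock seconds per question, assuming sequential execution."""
    return run_minutes * 60 / total


fast = seconds_per_question(30)  # lower end of the quoted range
slow = seconds_per_question(90)  # upper end of the quoted range
print(f"{TOTAL_QUESTIONS} questions: {fast:.2f}s to {slow:.2f}s per question")
```

If your endpoint's end-to-end latency per question sits well above that upper figure, expect a run at the long end of the range or beyond.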

Can I dispute a score?

Yes. Every run produces a signed receipt that records every input, every output, and the judge's reasoning for each question. If you believe a score is unfair, we review the traces together.
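To illustrate why a receipt makes disputes tractable, here is a minimal sketch of the integrity step a signature usually covers: hashing a canonical encoding of the receipt so that any altered trace yields a different digest. The field names and the `receipt_digest` helper are hypothetical, not Bench'd's actual schema or API, and the real signature scheme is not described here.

```python
import hashlib
import json


def receipt_digest(receipt: dict) -> str:
    """SHA-256 hex digest over a canonical (sorted-key, compact) JSON encoding."""
    canonical = json.dumps(
        receipt, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical receipt shape: one trace with input, output, and judge reasoning.
receipt = {
    "run_id": "example-run",
    "traces": [
        {
            "input": "Which city did the user say they moved to?",
            "output": "Lisbon",
            "judge_reasoning": "Matches the reference answer.",
            "correct": True,
        }
    ],
}
published = receipt_digest(receipt)

# Changing any recorded output changes the digest, so tampering is detectable.
tampered = json.loads(json.dumps(receipt))
tampered["traces"][0]["output"] = "Madrid"
print(receipt_digest(tampered) == published)

# Key order does not affect the digest because keys are sorted before hashing.
reordered = {"traces": receipt["traces"], "run_id": receipt["run_id"]}
print(receipt_digest(reordered) == published)
```

In a signed receipt, a digest like this would be what each party signs, so both sides can confirm they are reviewing the same traces.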

What if my system scores below the LLM baseline?

That's a real result and it will be published. The LLM baseline (57.6%) represents what you'd get with no memory system at all. We include detailed failure traces to help you diagnose why.

Do you offer custom benchmarks?

Yes, on the Enterprise plan. We'll design benchmark questions for your specific use case, such as customer support memory, coding agent context, or sales pipeline recall.

Not sure which plan is right? We're happy to help.

hello@benchd.ai
