Independent verification
your customers can trust
Self-reported scores get flagged. Bench'd-verified scores get trusted. Every run is cryptographically signed and publicly verifiable.
Starter
One benchmark run per month against your production system
Get Started
Continuous
Weekly automated runs with regression monitoring
Start Monitoring
Enterprise
Custom benchmarks and dedicated engineering support
Contact Us
Common Questions
What's the difference between community-verified and vendor-verified?
Community-verified means we ran your open-source code ourselves. Vendor-verified means you connected your production endpoint and we ran the benchmarks against it, with the result co-signed by both your key and ours. Vendor-verified scores reflect your actual production system, not just the OSS version.
What benchmarks do you run?
The full suite includes LongMemEval (500 questions — recall, temporal, reasoning), LOCOMO (1,540 questions — multi-session memory), and the Bench'd Reliability benchmark (25 adversarial trap questions — hallucination, stale memory, entity confusion, deletion compliance).
How long does a run take?
A full benchmark suite takes 30-90 minutes depending on your system's latency. Results are published within 48 hours of completion.
Can I dispute a score?
Yes. Every run produces a signed receipt containing every input, every output, and the judge's reasoning for each answer. If you believe a score is unfair, we review the traces together.
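To make the receipt idea concrete, here is a minimal sketch of how a receipt's traces could be reduced to a single digest that a signature would cover. This is an illustration, not the actual Bench'd receipt format: the `TraceEntry` fields, the `receipt_digest` helper, the choice of SHA-256 over canonical JSON, and the sample questions are all assumptions made for this example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TraceEntry:
    """One benchmark question as it might appear in a receipt (hypothetical fields)."""
    question: str         # input sent to the system under test
    answer: str           # output the system returned
    judge_reasoning: str  # the judge's explanation for the verdict
    correct: bool         # the judge's verdict


def receipt_digest(entries: list[TraceEntry]) -> str:
    """Hash a canonical JSON encoding of all traces (assumed scheme).

    Sorting keys and fixing separators makes the encoding deterministic,
    so the same traces always produce the same digest.
    """
    canonical = json.dumps(
        [asdict(e) for e in entries],
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up sample traces, for illustration only.
entries = [
    TraceEntry("Where did I say I moved?", "Lisbon", "Matches session 3.", True),
    TraceEntry("What is my dog's name?", "Rex", "User never mentioned a dog.", False),
]
digest = receipt_digest(entries)

# Changing any judge note or verdict yields a different digest.
tampered = [entries[0], TraceEntry("What is my dog's name?", "Rex", "Looks right.", True)]
tampered_digest = receipt_digest(tampered)
```

Because the digest depends on every field of every trace, a signature over it would let both parties in a dispute confirm they are reviewing the same inputs, outputs, and judge reasoning that were originally scored.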
What if my system scores below the LLM baseline?
That's a real result and it will be published. The LLM baseline (57.6%) represents what you'd get with no memory system at all. We include detailed failure traces to help you diagnose why.
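As a rough illustration of what "below the baseline" means, the sketch below compares a run's accuracy with the 57.6% figure quoted above. It assumes the score is simply the fraction of questions judged correct; the actual Bench'd scoring method is not described here, and the helper names and sample results are hypothetical.

```python
LLM_BASELINE_PCT = 57.6  # LLM baseline quoted above


def accuracy_pct(results: list[bool]) -> float:
    """Percentage of questions judged correct (assumed scoring rule)."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


def delta_vs_baseline(results: list[bool], baseline_pct: float = LLM_BASELINE_PCT) -> float:
    """Percentage points above (positive) or below (negative) the baseline."""
    return accuracy_pct(results) - baseline_pct


# Made-up sample: 11 of 20 questions judged correct.
sample = [True] * 11 + [False] * 9
delta = delta_vs_baseline(sample)
below_baseline = delta < 0
```

A negative delta means the run scored below what the baseline achieves with no memory system at all, which is the case where the failure traces are most useful.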
Do you offer custom benchmarks?
Yes, on the Enterprise plan. We'll design benchmark questions for your specific use case — customer support memory, coding agent context, sales pipeline recall, etc.
Not sure which plan is right? We're happy to help.
hello@benchd.ai