Documentation
Benchmark your AI memory system in minutes. The harness is open source, the methodology is public, and every result is cryptographically signed.
Quick Start
1Install the harness
pip install benchd-harness
# Or from source:
git clone https://github.com/benchdai/harness.git
cd harness && pip install -e .2Generate signing keys
benchd keys generate --out ./keys
# Creates:
# keys/private.key (keep secret)
# keys/public.key (share freely)3Run a benchmark
# Set your LLM API key (for the judge)
export OPENROUTER_API_KEY=sk-or-...
# Run LongMemEval against your system
benchd run \
-a mcp \
-b longmemeval-v1 \
--judge \
--key ./keys/private.key \
--adapter-config '{"endpoint": "http://localhost:3000/mcp"}'
# Results saved to: ./runs/run_xxx/manifest.signed.json4Submit your results
benchd submit ./runs/run_xxx/manifest.signed.json
# Or upload at: https://benchd.ai/submitMCP Systems: Zero-Code Testing
If your memory system exposes an MCP server, you don't need to write any adapter code. The generic MCP adapter auto-discovers your tools and maps them to Bench'd's ingest/recall/reset interface.
Requirements
- •Your MCP server must expose at least an ingest tool and a query tool
- •Tool names are auto-detected (e.g.,
memory_ingest,memory_query) - •Override tool names with
ingest_toolandquery_toolin adapter config - •A reset tool is optional but recommended for clean benchmark runs
# Auto-discover tools
benchd run -a mcp -b longmemeval-v1 --judge \
--adapter-config '{"endpoint": "http://localhost:3000/mcp"}'
# Explicit tool names
benchd run -a mcp -b longmemeval-v1 --judge \
--adapter-config '{
"endpoint": "http://localhost:3000/mcp",
"ingest_tool": "memory_ingest",
"query_tool": "memory_query",
"reset_tool": "memory_delete"
}'Writing a Custom Adapter
If your system doesn't support MCP, write a Python adapter. It's ~50 lines:
from benchd_harness.adapters.base import BaseAdapter
from typing import Any, Dict, List, Optional
class MyMemoryAdapter(BaseAdapter):
"""Adapter for My Memory System."""
@property
def name(self) -> str:
return "my-memory-system"
@property
def version(self) -> Optional[str]:
return "1.0.0"
def setup(self) -> None:
"""Initialize your memory system client."""
self.client = MyMemoryClient()
def reset(self) -> None:
"""Clear memory between benchmark questions."""
self.client.clear()
def teardown(self) -> None:
"""Clean up resources."""
self.client.close()
def ingest(self, turns: List[Dict[str, Any]]) -> None:
"""
Feed conversation turns into your memory system.
Each turn has: role, content, timestamp (optional)
"""
for turn in turns:
self.client.add_message(
role=turn["role"],
content=turn["content"],
)
def recall(self, query: str) -> str:
"""
Query your memory system and return a plain string.
"""
results = self.client.search(query)
return results.textRegister your adapter in benchd_harness/adapters/__init__.py and run with benchd run -a my-memory-system.
System Categories
Systems are grouped by what they do. Each category has its own leaderboard and question set.
| Category | What it tests | Example Systems |
|---|---|---|
| Conversational Memory | Chat recall across sessions | Mem0, LangChain, LlamaIndex |
| Knowledge Brain | Document storage + retrieval | gbrain, Quivr, AnythingLLM |
| Agent Memory | Task/action persistence | Letta, AutoGPT, claude-mem |
| Graph/RAG | Entity graphs + retrieval | Graphiti, Cognee, GraphRAG |
Available Benchmarks
longmemeval-v1
500 questions · Recall, temporal reasoning, knowledge updates
-b longmemeval-v1
locomo-v1
1,540 questions · Multi-session conversational memory
-b locomo-v1
smoke-memory-v0
10 questions · Quick sanity check
-b smoke-memory-v0
Submitting Results
After running a benchmark, submit your signed manifest to appear on the Bench'd leaderboard. All submissions are verified before publishing.
Via CLI
benchd submit ./runs/run_xxx/manifest.signed.jsonVia Web
Upload your manifest.signed.json at benchd.ai/submit
Trust Tiers
Results on the leaderboard are categorized by how they were verified:
How Bench'd Uses VerifiedState & ProofMeter
Bench'd is a real-world example of how VerifiedState memory verification and ProofMeter spend attestation work together in production. Here's exactly how we use them.
VerifiedState — Memory verification for benchmark results
Every benchmark score on Bench'd is a claim: “System X scored 80% on Knowledge Retrieval.” That claim needs to be independently verifiable. We use VerifiedState to:
- -Ingest benchmark manifests into verified memory so the full trace of every run is queryable and auditable
- -Run verification ladders on score claims — checking that the manifest hash matches, the signature is valid, and the traces support the reported score
- -Generate signed receipts for each verified score, creating an audit trail from raw question to published leaderboard number
# VerifiedState is also benchmarked AS a memory system:
benchd run -a verifiedstate -b knowledge-retrieval-v0
# This tests VS's own memory_ingest + memory_query capabilitiesProofMeter — Spend tracking for benchmark runs
Running benchmarks costs real money — LLM judge calls, embedding API calls, model inference. ProofMeter tracks every dollar so benchmark costs are transparent and verifiable:
- -Budget authorization — before a run starts, a signed budget cap is set (e.g., $5.00 max)
- -Per-call receipts — every LLM judge call records provider, model, tokens, and cost as a signed receipt
- -Budget enforcement — if spend exceeds the budget, the run pauses automatically
- -Settlement — after the run, all receipts are Merkle-rooted into a settlement attached to the manifest
# Run reliability benchmark with $5 budget and spend tracking:
benchd run -a graphiti -b reliability-v1 --budget 5.00
# Manifest includes proofmeter section:
# {
# "proofmeter": {
# "total_spend_usd": "3.81",
# "receipt_count": 294,
# "by_model": {
# "openai/gpt-4o-mini": { "calls": 294, "cost_usd": "3.81" }
# },
# "settlement_merkle_root": "sha256:...",
# "settlement_status": "settled"
# }
# }Why this matters for AI agents and developers
When an AI agent runs a benchmark, deploys a workflow, or calls an API on your behalf, you need answers to three questions: What happened? (VerifiedState memory), What did it cost? (ProofMeter receipts), and Can I verify this without trusting the runner? (cryptographic signatures). Bench'd is the first production system that answers all three.
CI Integration (GitHub Action)
Run Bench'd on every PR to catch memory regressions before they ship.
# .github/workflows/benchd.yml
name: Bench'd Memory Benchmark
on: [pull_request]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Start your memory server
run: docker-compose up -d memory-server
- name: Run Bench'd
uses: benchdai/benchmark-action@v1
with:
adapter: mcp
benchmark: smoke-memory-v0
endpoint: http://localhost:3000/mcp
openrouter-key: ${{ secrets.OPENROUTER_API_KEY }}
- name: Upload results
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
run: benchd submit ./runs/*/manifest.signed.jsonGitHub Action coming soon. Star benchdai/harness to get notified.
Stay in the loop
New benchmark results, methodology updates, and memory system rankings. No spam.
Unsubscribe anytime. We respect your inbox.