Documentation

Benchmark your AI memory system in minutes. The harness is open source, the methodology is public, and every result is cryptographically signed.

Quick Start

1Install the harness

pip install benchd-harness

# Or from source:
git clone https://github.com/benchdai/harness.git
cd harness && pip install -e .

2Generate signing keys

benchd keys generate --out ./keys

# Creates:
#   keys/private.key  (keep secret)
#   keys/public.key   (share freely)

3Run a benchmark

# Set your LLM API key (for the judge)
export OPENROUTER_API_KEY=sk-or-...

# Run LongMemEval against your system
benchd run \
  -a mcp \
  -b longmemeval-v1 \
  --judge \
  --key ./keys/private.key \
  --adapter-config '{"endpoint": "http://localhost:3000/mcp"}'

# Results saved to: ./runs/run_xxx/manifest.signed.json

4Submit your results

benchd submit ./runs/run_xxx/manifest.signed.json

# Or upload at: https://benchd.ai/submit

MCP Systems: Zero-Code Testing

If your memory system exposes an MCP server, you don't need to write any adapter code. The generic MCP adapter auto-discovers your tools and maps them to Bench'd's ingest/recall/reset interface.

Requirements

•Your MCP server must expose at least an ingest tool and a query tool
•Tool names are auto-detected (e.g., memory_ingest, memory_query)
•Override tool names with ingest_tool and query_tool in adapter config
•A reset tool is optional but recommended for clean benchmark runs

# Auto-discover tools
benchd run -a mcp -b longmemeval-v1 --judge \
  --adapter-config '{"endpoint": "http://localhost:3000/mcp"}'

# Explicit tool names
benchd run -a mcp -b longmemeval-v1 --judge \
  --adapter-config '{
    "endpoint": "http://localhost:3000/mcp",
    "ingest_tool": "memory_ingest",
    "query_tool": "memory_query",
    "reset_tool": "memory_delete"
  }'

Writing a Custom Adapter

If your system doesn't support MCP, write a Python adapter. It's ~50 lines:

from benchd_harness.adapters.base import BaseAdapter
from typing import Any, Dict, List, Optional


class MyMemoryAdapter(BaseAdapter):
    """Adapter for My Memory System."""

    @property
    def name(self) -> str:
        return "my-memory-system"

    @property
    def version(self) -> Optional[str]:
        return "1.0.0"

    def setup(self) -> None:
        """Initialize your memory system client."""
        self.client = MyMemoryClient()

    def reset(self) -> None:
        """Clear memory between benchmark questions."""
        self.client.clear()

    def teardown(self) -> None:
        """Clean up resources."""
        self.client.close()

    def ingest(self, turns: List[Dict[str, Any]]) -> None:
        """
        Feed conversation turns into your memory system.

        Each turn has: role, content, timestamp (optional)
        """
        for turn in turns:
            self.client.add_message(
                role=turn["role"],
                content=turn["content"],
            )

    def recall(self, query: str) -> str:
        """
        Query your memory system and return a plain string.
        """
        results = self.client.search(query)
        return results.text

System Categories

Systems are grouped by what they do. Each category has its own leaderboard and question set.

Category	What it tests	Example Systems
Conversational Memory	Chat recall across sessions	Mem0, LangChain, LlamaIndex
Knowledge Brain	Document storage + retrieval	gbrain, Quivr, AnythingLLM
Agent Memory	Task/action persistence	Letta, AutoGPT, claude-mem
Graph/RAG	Entity graphs + retrieval	Graphiti, Cognee, GraphRAG

Available Benchmarks

longmemeval-v1

500 questions · Recall, temporal reasoning, knowledge updates

-b longmemeval-v1

locomo-v1

1,540 questions · Multi-session conversational memory

-b locomo-v1

smoke-memory-v0

10 questions · Quick sanity check

-b smoke-memory-v0

Submitting Results

After running a benchmark, submit your signed manifest to appear on the Bench'd leaderboard. All submissions are verified before publishing.

Via CLI

benchd submit ./runs/run_xxx/manifest.signed.json

Via Web

Upload your manifest.signed.json at benchd.ai/submit

Trust Tiers

Results on the leaderboard are categorized by how they were verified:

Community-VerifiedRun by Bench'd with our signing key. Highest trust.

Vendor-VerifiedRun by the vendor using our harness. Co-signed with vendor's key.

Self-ReportedVendor claims, not independently verified. Flagged on leaderboard.

ListedSystem indexed but not yet benchmarked.

How Bench'd Uses VerifiedState & ProofMeter

Bench'd is a real-world example of how VerifiedState memory verification and ProofMeter spend attestation work together in production. Here's exactly how we use them.

VerifiedState — Memory verification for benchmark results

Every benchmark score on Bench'd is a claim: “System X scored 80% on Knowledge Retrieval.” That claim needs to be independently verifiable. We use VerifiedState to:

-Ingest benchmark manifests into verified memory so the full trace of every run is queryable and auditable
-Run verification ladders on score claims — checking that the manifest hash matches, the signature is valid, and the traces support the reported score
-Generate signed receipts for each verified score, creating an audit trail from raw question to published leaderboard number

# VerifiedState is also benchmarked AS a memory system:
benchd run -a verifiedstate -b knowledge-retrieval-v0
# This tests VS's own memory_ingest + memory_query capabilities

ProofMeter — Spend tracking for benchmark runs

Running benchmarks costs real money — LLM judge calls, embedding API calls, model inference. ProofMeter tracks every dollar so benchmark costs are transparent and verifiable:

-Budget authorization — before a run starts, a signed budget cap is set (e.g., $5.00 max)
-Per-call receipts — every LLM judge call records provider, model, tokens, and cost as a signed receipt
-Budget enforcement — if spend exceeds the budget, the run pauses automatically
-Settlement — after the run, all receipts are Merkle-rooted into a settlement attached to the manifest

# Run reliability benchmark with $5 budget and spend tracking:
benchd run -a graphiti -b reliability-v1 --budget 5.00

# Manifest includes proofmeter section:
# {
#   "proofmeter": {
#     "total_spend_usd": "3.81",
#     "receipt_count": 294,
#     "by_model": {
#       "openai/gpt-4o-mini": { "calls": 294, "cost_usd": "3.81" }
#     },
#     "settlement_merkle_root": "sha256:...",
#     "settlement_status": "settled"
#   }
# }

Why this matters for AI agents and developers

When an AI agent runs a benchmark, deploys a workflow, or calls an API on your behalf, you need answers to three questions: What happened? (VerifiedState memory), What did it cost? (ProofMeter receipts), and Can I verify this without trusting the runner? (cryptographic signatures). Bench'd is the first production system that answers all three.

ProofMeter Spec →Trust Tiers →Full Methodology →

CI Integration (GitHub Action)

Run Bench'd on every PR to catch memory regressions before they ship.

# .github/workflows/benchd.yml
name: Bench'd Memory Benchmark
on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start your memory server
        run: docker-compose up -d memory-server

      - name: Run Bench'd
        uses: benchdai/benchmark-action@v1
        with:
          adapter: mcp
          benchmark: smoke-memory-v0
          endpoint: http://localhost:3000/mcp
          openrouter-key: ${{ secrets.OPENROUTER_API_KEY }}

      - name: Upload results
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: benchd submit ./runs/*/manifest.signed.json

GitHub Action coming soon. Star benchdai/harness to get notified.

View on GitHub Submit Results

Stay in the loop

New benchmark results, methodology updates, and memory system rankings. No spam.

Unsubscribe anytime. We respect your inbox.