
MemoryBench

A pluggable benchmarking framework for evaluating memory and context systems.

Features

  • 🔌 Interoperable: mix and match any provider with any benchmark
  • 🧩 Bring your own benchmarks: plug in custom datasets and tasks
  • ♻️ Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
  • 🆚 Multi‑provider comparison: run the same benchmark across providers side‑by‑side
  • 🧪 Judge‑agnostic: swap GPT‑4o, Claude, Gemini, etc. without code changes
  • 📊 Structured reports: export run status, failures, and metrics for analysis
  • 🖥️ Web UI: inspect runs, questions, and failures interactively, in real time!

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Benchmarks │    │  Providers  │    │   Judges    │
│  (LoCoMo,   │    │ (Supermem,  │    │  (GPT-4o,   │
│  LongMem..) │    │  Mem0, Zep) │    │  Claude..)  │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       └──────────────────┼──────────────────┘
                         ▼
             ┌───────────────────────┐
             │      MemoryBench      │
             └───────────┬───────────┘
                         ▼
   ┌────────┬─────────┬────────┬──────────┬────────┐
   │ Ingest │ Indexing│ Search │  Answer  │Evaluate│
   └────────┴─────────┴────────┴──────────┴────────┘
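
The diagram implies three pluggable contracts. A minimal sketch of what such interfaces could look like in TypeScript — all names and signatures here are illustrative assumptions, not the project's actual APIs (those are documented under src/providers/, src/benchmarks/, and src/judges/):

// Sketch of the pluggable contracts the diagram implies (assumed names).
type Message = { role: string; content: string };

interface Benchmark {
  loadSessions(): Promise<{ id: string; messages: Message[] }[]>;
  loadQuestions(): Promise<{ id: string; question: string; groundTruth: string }[]>;
}

interface MemoryProvider {
  ingest(sessionId: string, messages: Message[]): Promise<void>;
  awaitIndexing(sessionId: string): Promise<void>;              // INDEX phase
  search(sessionId: string, query: string): Promise<string[]>;  // retrieved context
}

interface Judge {
  score(question: string, answer: string, groundTruth: string): Promise<number>;
}

Because every provider, benchmark, and judge satisfies its contract, any combination of the three can be wired into the same five-stage pipeline.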

Quick Start

bun install
cp .env.example .env.local  # Add your API keys
bun run src/index.ts run -p supermemory -b locomo

Configuration

# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=

# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
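
Bun loads .env.local automatically. As a purely illustrative sanity check (not part of the MemoryBench CLI), verifying the "at least one" requirement might look like:

// Illustrative startup check: a run needs at least one provider key
// and at least one judge key to be set.
const hasAny = (keys: string[]) => keys.some((k) => !!process.env[k]);

if (!hasAny(["SUPERMEMORY_API_KEY", "MEM0_API_KEY", "ZEP_API_KEY"]))
  throw new Error("Set at least one provider API key");
if (!hasAny(["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]))
  throw new Error("Set at least one judge API key");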

Commands

Command          Description
run              Full pipeline: ingest → index → search → answer → evaluate → report
compare          Run benchmark across multiple providers simultaneously
ingest           Ingest benchmark data into provider
search           Run search phase only
test             Test single question
status           Check run progress
list-questions   Browse benchmark questions
show-failures    Debug failed questions
serve            Start web UI
help             Show help (help providers, help models, help benchmarks)

Options

-p, --provider         Memory provider (supermemory, mem0, zep)
-b, --benchmark        Benchmark (locomo, longmemeval, convomem)
-j, --judge            Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.)
-r, --run-id           Run identifier (auto-generated if omitted)
-m, --answering-model  Model for answer generation (default: gpt-4o)
-l, --limit            Limit number of questions
-q, --question-id      Specific question (for test command)
--force                Clear checkpoint and restart

Examples

# Full run
bun run src/index.ts run -p mem0 -b locomo

# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test

# Resume existing run
bun run src/index.ts run -r my-test

# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10

# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash

# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5

# Test single question
bun run src/index.ts test -r my-test -q question_42

# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test

Pipeline

1. INGEST    Load benchmark sessions → Push to provider
2. INDEX     Wait for provider indexing
3. SEARCH    Query provider → Retrieve context
4. ANSWER    Build prompt → Generate answer via LLM
5. EVALUATE  Compare to ground truth → Score via judge
6. REPORT    Aggregate scores → Output accuracy + latency

Each phase checkpoints independently. Failed runs resume from the last successful point.
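
As a minimal sketch of how phase-level resumption could work (assumed structure, not the actual implementation): each phase records its completion in checkpoint.json, so a re-run with the same run ID skips straight to the first unfinished phase.

// Illustrative phase-level checkpointing loop (assumed names and schema).
import { existsSync, readFileSync, writeFileSync } from "node:fs";

type Phase = "ingest" | "index" | "search" | "answer" | "evaluate" | "report";
const PHASES: Phase[] = ["ingest", "index", "search", "answer", "evaluate", "report"];

async function runPipeline(runDir: string, handlers: Record<Phase, () => Promise<void>>) {
  const path = `${runDir}/checkpoint.json`;
  const ckpt: { completed: Phase[] } = existsSync(path)
    ? JSON.parse(readFileSync(path, "utf8"))
    : { completed: [] };
  for (const phase of PHASES) {
    if (ckpt.completed.includes(phase)) continue;        // resume: skip finished phases
    await handlers[phase]();                             // may throw; earlier progress is kept
    ckpt.completed.push(phase);
    writeFileSync(path, JSON.stringify(ckpt, null, 2));  // persist after every phase
  }
}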

Checkpointing

Runs persist to data/runs/{runId}/:

  • checkpoint.json - Run state and progress
  • results/ - Search results per question
  • report.json - Final report

Re-running with the same run ID resumes the run. Use --force to clear the checkpoint and restart.
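
Continuing the sketch above, the persisted checkpoint.json might look like this after the search phase (shape assumed, not the project's actual schema):

{
  "completed": ["ingest", "index", "search"]
}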

Extending

Component           Guide
Add Provider        src/providers/README.md
Add Benchmark       src/benchmarks/README.md
Add Judge           src/judges/README.md
Project Structure   src/README.md
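
To get a feel for the work involved, here is a hypothetical provider skeleton following the MemoryProvider sketch from earlier — the authoritative interface and registration steps live in src/providers/README.md:

// Hypothetical provider skeleton (names are assumptions for illustration).
class MyProvider implements MemoryProvider {
  async ingest(sessionId: string, messages: Message[]) {
    // push each benchmark message into your memory backend
  }
  async awaitIndexing(sessionId: string) {
    // poll your backend until the ingested data is searchable
  }
  async search(sessionId: string, query: string): Promise<string[]> {
    // return context strings for the ANSWER phase to build its prompt from
    return [];
  }
}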

License

MIT
