TwinBench

The open benchmark for personal AI assistants

TwinBench measures whether an AI system can behave like a real personal AI assistant: remember, act, follow up, stay safe, and operate over time.

Can your personal AI assistant beat TwinBench? Start with the reference result, then run the demo or benchmark your own system.

  • 6 checked-in artifacts
  • 75.9 current top score
  • 84% reference coverage
Leaderboard

Current public board

The public board shows the current reference result and challenge-worthy artifacts. Historical and degraded runs stay available, but they do not dominate the first impression.

Assistant                    Class               Tier               Score   Coverage   Date
Nullalis Reference Runtime   Reference Runtime   Production-Grade   75.9    84%        2026-03-25
TwinBench Demo Runtime       Demo Fixture        Emerging           54.4    69%        2026-03-25

See full artifact history

Why it matters

What current benchmarks miss

Most AI benchmarks still measure chat quality, coding skill, or one-shot tasks. TwinBench focuses on the missing category: personal AI assistants that persist and operate over time.

Run it

10-minute path

Use the repo directly, run the demo runtime, or hand TwinBench to Codex, Claude Code, or Cursor with a single prompt.

make demo
make site
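
A minimal sketch of the 10-minute path, assuming a Unix shell and a local checkout of the TwinBench repo; the checkout directory name and what each target produces are assumptions, not documented on this page.

cd twinbench   # assumed name of your local checkout
make demo      # run the demo runtime end to end
make site      # rebuild the local leaderboard site from checked-in artifacts
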
Trust

Evidence-first by design

  • An unsupported capability is not a failure
  • A missing bootstrap is not poor scaling
  • Same-user contention is a diagnostic signal
  • Measured coverage matters
Challenge

Can your assistant beat this?

Every serious result should be artifact-backed, shareable, and easy to challenge in public.

TwinBench score card
Community

GitHub first

TwinBench starts with Discussions, issues, and submissions on GitHub. That keeps the benchmark legible and searchable while the ecosystem is still forming.

Open Discussions

Challenge loop

Monthly challenge shape

Reference Runtime, Top Score This Month, Most Improved, and Verified Artifact are the first public loops. The goal is recognition without hype.

Read the challenge shape