TwinBench measures whether an AI system can behave like a real personal AI assistant: remember, act, follow up, stay safe, and operate over time.
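Each of those behaviors maps to a concrete evaluation axis. As a minimal sketch of what a scenario could declare (the field names, axis labels, and example below are illustrative assumptions, not TwinBench's published schema):

```python
from dataclasses import dataclass

# Illustrative sketch only -- not TwinBench's actual scenario format.
# Each scenario exercises some subset of the five behaviors named above.
AXES = ["memory", "action", "follow_up", "safety", "longevity"]

@dataclass
class Scenario:
    scenario_id: str
    axes: list[str]          # which of the five behaviors this tests
    horizon_days: int        # span of the run ("operate over time")
    setup: str               # context seeded before the run starts
    expected_behavior: str   # what a passing assistant does

example = Scenario(
    scenario_id="follow-up-001",
    axes=["memory", "follow_up"],
    horizon_days=7,
    setup="User mentions a dentist appointment next Tuesday.",
    expected_behavior="Assistant raises the appointment unprompted on Monday.",
)
```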
Can your personal AI assistant beat TwinBench? Start with the reference result, then run the demo or benchmark your own system.
The public board shows the current reference result and challenge-worthy artifacts. Historical and degraded runs stay available, but they do not dominate the first impression.
| Assistant | Class | Tier | Score | Coverage | Date |
|---|---|---|---|---|---|
| Nullalis Reference Runtime | Reference Runtime | Production-Grade | 75.9 | 84% | 2026-03-25 |
| TwinBench Demo Runtime | Demo Fixture | Emerging | 54.4 | 69% | 2026-03-25 |
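For reference, here is one plausible serialization of a board row, mirroring the columns above; the field names and the reading of "coverage" are assumptions, not the published schema:

```python
import json

# Hypothetical record mirroring the board columns above.
result = {
    "assistant": "Nullalis Reference Runtime",
    "class": "Reference Runtime",
    "tier": "Production-Grade",
    "score": 75.9,      # aggregate score, scale as published on the board
    "coverage": 0.84,   # assumption: fraction of scenarios attempted
    "date": "2026-03-25",
}

print(json.dumps(result, indent=2))
```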
Most AI benchmarks still measure chat quality, coding skill, or one-shot tasks. TwinBench focuses on the missing category: personal AI assistants that persist and operate over time.
Use the repo, run the demo runtime, or hand TwinBench to Codex, Claude Code, or Cursor with one prompt.
```sh
make demo
make site
```
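To benchmark your own system rather than the demo fixture, the harness has to be able to drive your assistant turn by turn. A hedged sketch of what that hook could look like; the class and method names here are illustrative assumptions, not TwinBench's actual interface:

```python
# Illustrative adapter sketch -- not TwinBench's actual interface.
# The harness needs some way to send your assistant one turn and read
# its reply; whatever the real hook is, it will look roughly like this.
class MyAssistantAdapter:
    def __init__(self) -> None:
        self.memory: list[str] = []   # stands in for your persistence layer

    def handle_turn(self, user_message: str) -> str:
        """Take one user turn, return the assistant's reply."""
        self.memory.append(user_message)
        # Replace this stub with a call into your own assistant runtime.
        return f"noted: {user_message}"

adapter = MyAssistantAdapter()
print(adapter.handle_turn("Remind me about the dentist on Tuesday."))
```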
Every serious result should be artifact-backed, shareable, and easy to challenge in public.
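One concrete way to make a result artifact-backed and easy to challenge is to publish the run log alongside its content hash, so anyone can recompute the digest and confirm what the score was derived from. A minimal sketch, assuming a JSON-lines run log; the file name and manifest fields are assumptions:

```python
import hashlib
import json
from pathlib import Path

def manifest_for(run_log: Path) -> dict:
    """Build a verification manifest: anyone holding the log can
    recompute the digest and check a published score's provenance."""
    digest = hashlib.sha256(run_log.read_bytes()).hexdigest()
    return {
        "artifact": run_log.name,
        "sha256": digest,
        "bytes": run_log.stat().st_size,
    }

# Example: hash a hypothetical run log before sharing a result.
log = Path("run.jsonl")
log.write_text('{"turn": 1, "score": 0.8}\n')
print(json.dumps(manifest_for(log), indent=2))
```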
TwinBench starts with Discussions, issues, and submissions on GitHub. That keeps the benchmark legible and searchable while the ecosystem is still forming.
Reference Runtime, Top Score This Month, Most Improved, and Verified Artifact are the first public recognition loops. The goal is recognition without hype.