Methodology

How TwinBench works

TwinBench is a benchmark family for personal AI assistants. Each run reports verified evidence, a projected score, measured coverage, and dimension-level reason codes, so weak coverage stays visible.
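As a minimal sketch of what such a per-dimension report could look like, here is a hypothetical record type. The field names and the reason-code string are illustrative assumptions, not the actual TwinBench schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DimensionResult:
    """Hypothetical per-dimension TwinBench result record (names assumed)."""
    dimension: str                      # e.g. "memory", "scheduling"
    verified_score: Optional[float]     # what the run directly proved; None if unmeasured
    projected_score: float              # broader estimate with explicit assumptions
    measured: bool                      # was this dimension directly exercised?
    reason_code: Optional[str] = None   # e.g. "UNSUPPORTED_SURFACE" when not measured

# A dimension that could not be directly measured keeps its reason code
# instead of being collapsed into a failing score.
result = DimensionResult(
    dimension="memory",
    verified_score=None,
    projected_score=0.72,
    measured=False,
    reason_code="UNSUPPORTED_SURFACE",
)
```

The key design point the text implies: an unmeasured dimension carries an explicit code rather than a zero, so downstream aggregation can treat it differently from a genuine failure.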

One board, clear classes

TwinBench is intended to become a public board, but every result must retain its class, its coverage, and its evidence basis.

Coverage matters

The headline ranking number is the coverage-adjusted verified score, not the most flattering number in the artifact.

Trust over hype

Unsupported surfaces, missing bootstrap, and partial measurement are reported explicitly instead of being flattened into a false failure.

What the headline numbers mean

Verified is what the run directly proved. Projected is the broader estimate with explicit assumptions. Measured coverage tells you how much of the benchmark was directly exercised.

TwinBench uses coverage-adjusted verified score for public ranking because it rewards both strength and honest measurement.
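One plausible way to combine strength and measurement honesty is to scale the mean verified score by the fraction of dimensions directly exercised. The formula below is an assumption for illustration; the actual TwinBench aggregation is not specified in this section.

```python
from typing import Optional, Sequence

def coverage_adjusted_verified(
    verified_scores: Sequence[Optional[float]],
    total_dimensions: int,
) -> float:
    """Sketch of a coverage-adjusted verified score (assumed formula).

    Unmeasured dimensions (None) lower coverage instead of being silently
    dropped, so a high score on a thin slice cannot dominate the ranking.
    """
    measured = [s for s in verified_scores if s is not None]
    if not measured:
        return 0.0
    coverage = len(measured) / total_dimensions
    verified_mean = sum(measured) / len(measured)
    return verified_mean * coverage

# Two of four dimensions measured: mean 0.7 scaled by coverage 0.5 -> 0.35
score = coverage_adjusted_verified([0.8, 0.6, None, None], total_dimensions=4)
```

Under this formula, a system that proves 0.7 on half the benchmark ranks below one that proves 0.5 everywhere, which matches the stated goal of rewarding both strength and honest measurement.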

Why unavailable is not failure

Some systems do not expose the runtime surfaces needed for a fair direct measurement. TwinBench records that explicitly instead of pretending they cleanly failed a dimension.
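The distinction above can be sketched as a small classifier that keeps unavailability separate from failure. The code names, the pass threshold, and the function shape are all assumptions for illustration.

```python
from typing import Optional

# Hypothetical reason codes for "could not measure" (names assumed).
UNAVAILABLE_CODES = {"UNSUPPORTED_SURFACE", "MISSING_BOOTSTRAP", "PARTIAL_MEASUREMENT"}

PASS_THRESHOLD = 0.5  # illustrative cutoff, not a TwinBench constant

def classify(verified_score: Optional[float], reason_code: Optional[str]) -> str:
    """Return 'pass'/'fail' only for directly measured dimensions.

    A dimension whose runtime surface was unavailable is recorded as
    'unavailable', never counted as a clean failure.
    """
    if reason_code in UNAVAILABLE_CODES:
        return "unavailable"
    return "pass" if verified_score >= PASS_THRESHOLD else "fail"
```

This keeps the leaderboard honest in both directions: a missing surface does not drag a system down as a fake zero, and it also cannot hide inside an inflated average.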