TwinBench is meant to be understandable by experts and everyday builders alike.
No. TwinBench is about long-lived assistant behavior, not just single-turn chat quality.
No. Nullalis is the current reference runtime because it produced the first strong public artifact.
Yes. Use the demo path from the repo or run against a native runtime with one command.
Because some systems do not expose the runtime surfaces required for a fair direct measurement. TwinBench reports that gap openly rather than hiding it.
Coverage shows how much of the benchmark was truly exercised. A flattering score with weak coverage should not outrank a strong, deeply measured artifact.
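To make the coverage point concrete, here is a minimal sketch of one way a ranking could discount a score by its coverage. It is not TwinBench's actual scoring logic: the `Artifact` fields, the 0-100 score scale, the 0.0-1.0 coverage fraction, and the multiply-to-discount rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str        # hypothetical label for a benchmark run
    score: float     # hypothetical headline score, 0-100
    coverage: float  # hypothetical fraction of the benchmark exercised, 0.0-1.0


def coverage_adjusted(artifact: Artifact) -> float:
    """Discount the headline score by how much of the benchmark was exercised."""
    return artifact.score * artifact.coverage


def rank(artifacts: list[Artifact]) -> list[Artifact]:
    """Order artifacts from strongest to weakest by coverage-adjusted score."""
    return sorted(artifacts, key=coverage_adjusted, reverse=True)


# Illustrative values only, not real results.
shallow = Artifact("shallow-run", score=95.0, coverage=0.30)
deep = Artifact("deep-run", score=80.0, coverage=0.90)

ranked = rank([shallow, deep])
print([a.name for a in ranked])  # ['deep-run', 'shallow-run']
```

Under these assumed numbers the run with the higher headline score ranks second, because most of the benchmark was never exercised for it.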
That is still useful. TwinBench prefers honest partial artifacts over fake comparability.