Current public reference result. Scale interpretation is conservative relative to the later provisioning-aware scale fix.
Current public reference result. Scale interpretation is conservative relative to the later provisioning-aware scale fix.
Use this page as the canonical public result URL for quoting, screenshots, or side-by-side comparison.
Strongest dimensions: Memory Persistence, Functional Capability, Autonomous Execution
Main limitation: Scale & Cost Efficiency
Why it matters: This is the strongest public proof in the repo that the personal AI assistant category is real and measurable.
Use coverage-adjusted verified score for public comparison, verified raw for direct measurement strength, and measured coverage to understand how much of the benchmark was truly exercised.