Focused evals — explicit outcome scoring

Evals that meet the real world.

We maintain two purpose-built evaluations whose scoring systems are tied directly to the job: ROASBench measures growth-operator outcomes across a 12-month simulation, while BulletBench measures decision quality under a real clock.

Focused evaluations

Scoring designed around the task.

These are the benchmarks we operate ourselves: narrow scope, explicit mechanics, and outcome measures built for the task rather than a generic judge rubric.

New - the clock is the judge

intelligence per second

BulletBench

AI models play speed chess against a chess computer on a real clock: every second a model spends thinking drains its time, and it loses when the clock hits zero. Frontier heavyweights lose on time in positions they are winning, while fast models grind out full games at one second a move. We report chess rating, response speed, and cost per game across four time limits.
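
To make the clock rule concrete, here is a minimal sketch of how thinking time could be charged against a fixed budget. It is an illustration under stated assumptions, not the BulletBench harness: the names `Clock` and `play`, the 60-second limit, and the absence of any per-move increment are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Clock:
    """Hypothetical per-game clock: a fixed budget, no increment (assumption)."""
    remaining: float  # seconds the model has left

    def charge(self, thinking_seconds: float) -> bool:
        """Deduct thinking time; return True if the model still has time left."""
        self.remaining -= thinking_seconds
        return self.remaining > 0


def play(move_times, time_limit):
    """Replay a list of per-move thinking times against a time limit.

    Returns ("loss_on_time", n) if the clock reaches zero on move n,
    otherwise ("completed", number_of_moves).
    """
    clock = Clock(remaining=time_limit)
    for n, spent in enumerate(move_times, start=1):
        if not clock.charge(spent):
            return ("loss_on_time", n)
    return ("completed", len(move_times))


# A fast model: 40 moves at one second each fits inside a 60-second budget.
fast = play([1.0] * 40, time_limit=60.0)

# A slow model: long thinks exhaust the same budget on the third move,
# regardless of how good the position on the board is.
slow = play([20.0, 25.0, 30.0], time_limit=60.0)
```

In this sketch the position on the board never enters the result: once the budget is gone the game is lost, which is the sense in which the clock is the judge.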

Live

12-month simulation

ROASBench

A hard-mode direct-to-consumer (DTC) growth simulation where models allocate budget, write ad copy, choose audiences, react to results, and compound or destroy brand momentum month by month.

Why this format

Usefulness is easier to see in a world than in a demo.

Real tradeoffs

The model has to balance budget, quality, timing, retention, and long-term outcomes instead of only producing polished text.

Persistent memory

Past choices carry forward, so impulsive decisions, weak targeting, and repetitive copy have visible downstream consequences.

Measurable outcomes

We can compare models on score, revenue, ROI, repeat rate, and the shape of their decision-making, not just whether the answer sounded good.
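
As a rough illustration of the outcome measures named above, the sketch below computes ROI, return on ad spend, and repeat rate using their standard textbook definitions. The function names and the example figures are hypothetical, and ROASBench's own scoring formula is not reproduced here.

```python
def roi(revenue: float, spend: float) -> float:
    """Return on investment: net gain divided by spend."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    return (revenue - spend) / spend


def roas(revenue: float, spend: float) -> float:
    """Return on ad spend: revenue earned per unit of ad spend."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    return revenue / spend


def repeat_rate(repeat_customers: int, total_customers: int) -> float:
    """Share of customers who purchased more than once."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    return repeat_customers / total_customers


# Hypothetical month: 10,000 spent, 15,000 earned, 30 of 120 customers returned.
month_roi = roi(15_000, 10_000)
month_roas = roas(15_000, 10_000)
month_repeat = repeat_rate(30, 120)
```

Measures like these are computed from what happened in the simulation, so two models with equally polished copy can still end up far apart.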
