Public benchmark packs

Benchmarks for useful LLMs.

Spring Prompt turns vague model demos into measurable benchmark packs. Some are professional, some are personal, but all of them are designed to answer the same question: can this model actually help in the real world?

Professional

Benchmarks that test work, not demos.

These are the packs we use to compare models on planning, judgment, writing, adaptation, and long-horizon decision-making.

Live now

Simulation

ROASBench

A hard-mode DTC growth simulation where models allocate budget, write ad copy, choose audiences, react to results, and compound or destroy brand momentum month by month.

Pilot

One-off snapshot

PredictTheWeek

Models read a week of Guardian coverage and predict the next week’s headlines, scored against what was actually published. Weekly automation is still to come.

Planned

Scenario pack

Cold Outreach

Prospect research, message sequencing, objection handling, and follow-ups judged by realistic buyer personas rather than vibes.

Planned

Data pack

Data Analyst

Messy business data, SQL tasks, anomaly detection, and executive summaries scored on depth, accuracy, and whether the conclusions actually match the evidence.

Planned

Timeline sim

Launch Week

Announcement copy, FAQs, stakeholder updates, bug-response comms, and post-launch analysis compressed into a realistic launch timeline.

Planned

Optimization

Conversion Doctor

Diagnose underperforming pages or funnels, recommend changes, and rewrite the weak spots with persona-based conversion feedback.

Planned

Content system

Content Engine

Turn one source asset into channel-specific outputs that actually feel native to each format instead of being the same copy rearranged five ways.

Personal

Useful outside of work, too.

These packs are designed to feel immediately relatable and shareable while still testing real planning, adaptability, and practical reasoning.

Coming soon

Fridge Roulette

Random ingredients, limited time, missing staples, dietary constraints, and tomorrow's leftovers all in one cooking benchmark.

Coming soon

Dropped In

A high-agency resourcefulness test: you are stuck in a foreign city, things have gone wrong, and the model needs to get you out step by step.

Coming soon

Life Admin

Quotes, complaints, bills, paperwork, logistics, and all the useful but annoying tasks that show whether a model can actually help.

Coming soon

Learn Anything

A 30-day adaptive learning benchmark where the plan has to change based on motivation, progress, and what the learner actually retained.

Coming soon

Fitness Architect

Training plans built under real constraints like travel, injury, limited equipment, and mid-week changes that force sensible adaptation.

Why this format

Usefulness is easier to see in a world than in a demo.

Real tradeoffs

The model has to balance budget, quality, timing, retention, and long-term outcomes instead of only producing polished text.

Persistent memory

Past choices carry forward, so impulsive decisions, weak targeting, and repetitive copy have visible downstream consequences.

Measurable outcomes

We can compare models on score, revenue, ROI, repeat rate, and the shape of their decision-making, not just whether the answer sounded good.
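As a rough illustration of how two of these outcomes could be computed from a finished run, here is a minimal sketch. It is not the packs' actual scoring code: the `RunSummary` fields, the ROI formula (net revenue over spend), the repeat-rate formula (repeat customers over all customers), and the sample numbers are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical end-of-run totals for one model (not a real schema)."""
    model: str
    spend: float            # total budget spent across the run
    revenue: float          # total revenue attributed to that spend
    customers: int          # distinct customers acquired
    repeat_customers: int   # customers who purchased more than once


def roi(run: RunSummary) -> float:
    """Assumed definition: net revenue divided by spend."""
    if run.spend <= 0:
        raise ValueError("spend must be positive to compute ROI")
    return (run.revenue - run.spend) / run.spend


def repeat_rate(run: RunSummary) -> float:
    """Assumed definition: share of customers who bought again."""
    if run.customers == 0:
        return 0.0
    return run.repeat_customers / run.customers


def rank_by_roi(runs: list[RunSummary]) -> list[RunSummary]:
    """Order runs from highest to lowest ROI."""
    return sorted(runs, key=roi, reverse=True)


# Made-up numbers for illustration only; these are not benchmark results.
runs = [
    RunSummary("model_a", spend=10_000.0, revenue=25_000.0,
               customers=400, repeat_customers=100),
    RunSummary("model_b", spend=10_000.0, revenue=18_000.0,
               customers=300, repeat_customers=120),
]

for run in rank_by_roi(runs):
    print(f"{run.model}: ROI={roi(run):.2f}, repeat rate={repeat_rate(run):.2f}")
```

In this made-up data the model with the higher ROI has the lower repeat rate, which is one reason to look at several outcomes side by side rather than a single number.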
