Spring Prompt turns vague model demos into measurable benchmark packs. Some are professional, some are personal, but all of them are designed to answer the same question: can this model actually help in the real world?
Professional
These are the packs we use to compare models on planning, judgment, writing, adaptation, and long-horizon decision-making.
Launch collection
12 cases
Benchmarks for testing whether models can create clear, specific, non-generic business content that follows a brief and preserves a brand voice.
Top model: gpt-5.5-pro · 84.0
Launch collection
12 cases
Benchmarks for testing whether models can turn a product brief into a clear, persuasive, conversion-aware landing page.
Top model: claude-opus-4.8-high · 83.08
Launch collection
12 cases
Benchmarks for testing whether models can brief, prioritise, rewrite, and communicate in ways that reduce executive workload.
Top model: claude-opus-4.8-low · 82.0
Launch collection
12 cases
Benchmarks for testing whether models can improve startup pitches, critique weak claims, and anticipate investor concerns.
Top model: claude-opus-4.7 · 84.5
Launch collection
12 cases
Benchmarks for testing whether models can evaluate AI initiatives, vendor claims, implementation risks, and production readiness.
Top model: claude-opus-4.7 · 83.83
Live now
Simulation
A hard-mode DTC growth simulation where models allocate budget, write ad copy, choose audiences, react to results, and compound or destroy brand momentum month by month.
Pilot
One-off snapshot
Models read a week of Guardian coverage and predict the next week’s headlines, scored against what was actually published. Weekly automation is still to come.
Personal
These packs are designed to feel immediately relatable and shareable while still testing real planning, adaptability, and practical reasoning.
Why this format
Real tradeoffs
The model has to balance budget, quality, timing, retention, and long-term outcomes instead of only producing polished text.
Persistent memory
Past choices carry forward, so impulsive decisions, weak targeting, and repetitive copy have visible downstream consequences.
Measurable outcomes
We can compare models on score, revenue, ROI, repeat rate, and the shape of their decision-making, not just whether the answer sounded good.
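As a rough illustration of how outcome metrics like these could be computed from a month-by-month run, here is a minimal sketch. The `MonthResult` record, its field names, and the exact ROI and repeat-rate formulas are assumptions made for this example; they are not Spring Prompt's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class MonthResult:
    """One simulated month for one model (hypothetical record shape)."""
    spend: float            # budget the model chose to spend
    revenue: float          # revenue attributed to that month
    customers: int          # customers who bought that month
    repeat_customers: int   # of those, how many had bought before


def roi(months: list[MonthResult]) -> float:
    """Assumed definition: (total revenue - total spend) / total spend."""
    spend = sum(m.spend for m in months)
    revenue = sum(m.revenue for m in months)
    return (revenue - spend) / spend if spend else 0.0


def repeat_rate(months: list[MonthResult]) -> float:
    """Assumed definition: repeat customers as a share of all customers."""
    customers = sum(m.customers for m in months)
    repeats = sum(m.repeat_customers for m in months)
    return repeats / customers if customers else 0.0


def summarize(months: list[MonthResult]) -> dict[str, float]:
    """Collapse a whole run into comparable numbers for one model."""
    return {
        "revenue": sum(m.revenue for m in months),
        "roi": roi(months),
        "repeat_rate": repeat_rate(months),
    }


# Example run: two months of made-up numbers for a single model.
run = [
    MonthResult(spend=1000, revenue=1500, customers=50, repeat_customers=5),
    MonthResult(spend=1000, revenue=2500, customers=50, repeat_customers=20),
]
summary = summarize(run)
```

Summaries like this can be placed side by side for each model, so the comparison rests on what a run produced and not on how convincing the copy sounded.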