Every frontier model can produce a marketing plan that sounds smart for five minutes.
“Prioritize high-intent capture.”
“Protect premium positioning.”
“Use CRM to improve efficiency.”
“Iterate based on signal.”
None of that is hard to say. The hard part is surviving the next twelve months after you say it.
That is why ROAS Bench is worth paying attention to. Instead of grading models on whether they produce polished strategy language, it drops them into a year-long DTC performance marketing simulation and forces them to make decisions month by month. They choose budgets, channels, targeting, creative angles, discounting, remarketing intensity, and iteration strategy. Then the benchmark updates the world state and makes them live with what they just did.
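Concretely, the loop has to look something like the sketch below. Everything in it is hypothetical and invented for illustration, not ROAS Bench's actual API; the structural point is the only one that matters: each month's decision becomes part of the state the model has to operate in next.

```python
from dataclasses import dataclass

# Hypothetical sketch of a month-by-month loop; field names, numbers, and the
# decision format are invented for illustration, not ROAS Bench's actual API.

@dataclass
class WorldState:
    month: int = 1
    warm_pool: float = 50_000.0       # finite remarketable audience
    offer_fatigue: float = 0.0        # rises with repeated discounting

def decide_month(state: WorldState, history: list[dict]) -> dict:
    """Stand-in for the model: pick budgets, creative angle, discount depth, etc."""
    return {"budgets": {"paid_social": 30_000.0, "search": 10_000.0}, "discount_pct": 0.0}

def run_year() -> list[dict]:
    state, history = WorldState(), []
    for _ in range(12):
        decision = decide_month(state, history)   # the model commits to a plan
        # ...a real simulator would apply spend, seasonality, and saturation here...
        history.append(decision)                  # and the model lives with the result
        state.month += 1
    return history
```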
That design choice matters.
Most AI marketing demos reward coherence. ROAS Bench rewards consequences.
What makes this benchmark different
The setup is deliberately hostile to lazy intelligence. Each model operates one skincare brand, Northstar Skin, selling a $68 barrier repair serum at 76% gross margin. It manages six channels across twelve months, and it has to deal with seasonality, finite warm pools, learning resets, offer fatigue, and persona tradeoffs.
In other words: the benchmark behaves more like an actual growth system than a prompt engineering parlor trick.
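Strip out the narrative and the stated setup reduces to a few hard numbers. The sketch below captures them; the price, margin, channel count, and horizon come from the benchmark page, while the field names and channel labels are placeholders, since the page's channel list is not reproduced here.

```python
from dataclasses import dataclass

# Price, margin, channel count, and horizon as stated on the benchmark page;
# the field names and channel labels below are placeholders, not the
# benchmark's actual channel list.

@dataclass(frozen=True)
class BrandConfig:
    price: float = 68.0               # barrier repair serum
    gross_margin: float = 0.76
    horizon_months: int = 12
    channels: tuple[str, ...] = ("ch_1", "ch_2", "ch_3", "ch_4", "ch_5", "ch_6")

NORTHSTAR = BrandConfig()
contribution_per_order = NORTHSTAR.price * NORTHSTAR.gross_margin   # ≈ $51.68 before media
```

That roughly $51.68 of contribution per order, before a single ad dollar, is the entire budget every monthly decision is spending against.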
The model cannot just write one beautiful plan and walk away. It has to:
- pace budget over time
- decide when to scale and when to hold
- write creative that matches different personas
- avoid saturating channels
- preserve account learning
- infer what is working from business results rather than from perfect hidden labels
That last part is important. ROAS Bench does not hand the model raw persona-by-persona judge feedback. The model gets operating metrics, state summaries, market notes, and compressed working memory. It has to reason from imperfect evidence, which is much closer to real operating work.
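A hedged sketch of what that monthly input plausibly looks like as a data structure; the field names are invented and the real benchmark will format this differently, but the shape is the point: aggregate business signals, no clean labels.

```python
from typing import TypedDict

# Invented field names; the real benchmark's format will differ. The point is
# that the model gets aggregate business signals, not per-persona judge labels.

class MonthlyBriefing(TypedDict):
    operating_metrics: dict[str, float]   # spend, revenue, orders, blended ROAS, ...
    state_summary: str                    # compressed account and channel state
    market_notes: str                     # seasonality and demand context
    working_memory: str                   # the model's own compressed prior notes
```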
The most important result
The headline is brutal: across the completed runs on the live benchmark page, only one model is actually profitable after twelve months.
That model is Claude Opus 4.6.
Its reported average results:
- score: 40.61
- average profit: $506,094
- ROAS: 192.5%
- positive months: 27 / 36
- discount months: 0
That is not just “best overall.” That is a structural gap.
The benchmark page shows Gemini 3.1 Pro Preview as the closest serious challenger on score, with an average score of 27.14 and ROAS of 132.9%, but it still ends slightly unprofitable. GPT-5.4 looks even more revealing: it generates revenue, but ends with an average profit of roughly -$250k and a ROAS of 103.2%.
That is the kind of result you only get when the benchmark is testing economic behavior instead of eloquence.
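The arithmetic behind those numbers is worth spelling out. Assuming ROAS here means revenue divided by ad spend, and ignoring every cost except product cost and media (a simplification; the benchmark's accounting may differ), a 76% gross margin puts break-even ROAS at roughly 1 / 0.76 ≈ 132%.

```python
# Rough unit economics, assuming ROAS = revenue / ad spend and ignoring all
# costs other than product cost and media; the benchmark's accounting may differ.

GROSS_MARGIN = 0.76
breakeven_roas = 1 / GROSS_MARGIN            # ≈ 1.32, i.e. ~132% ROAS just to tread water

def profit_per_ad_dollar(roas: float) -> float:
    """Contribution left from $1 of spend after paying for the product it sold."""
    return roas * GROSS_MARGIN - 1

print(profit_per_ad_dollar(1.925))   # Claude Opus 4.6:  ≈ +0.46 per ad dollar
print(profit_per_ad_dollar(1.329))   # Gemini 3.1 Pro:   ≈ +0.01, essentially break-even
print(profit_per_ad_dollar(1.032))   # GPT-5.4:          ≈ -0.22 per ad dollar
```

On that crude math, Gemini sitting a hair above media break-even is consistent with it ending slightly unprofitable once anything else is counted, and GPT-5.4's 103.2% ROAS means every ad dollar destroyed roughly 22 cents while still looking like revenue.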
Why plausible models still lose money
The ROAS Bench write-up describes GPT-5.4 perfectly: it reads as strategically plausible in isolation, but the run-level data shows weak compounding.
That sentence should make a lot of builders uncomfortable.
We are entering a phase where the easiest failure mode for AI systems is not obvious stupidity. It is polished underperformance.
A model can:
- say premium-sounding things
- reference the right channels
- use marketer vocabulary correctly
- produce plans that look believable in a doc
…and still lose the business.
Why? Because growth is path-dependent.
If you overspend into a finite audience pool, you pay for it later. If you keep resetting the account structure, you destroy learning. If you lean too hard on discounting, you create fatigue and margin damage. If your creative is generic, high-value personas stop converting even when the targeting sounds correct on paper.
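A toy model makes the path dependence obvious. Everything below is invented and deliberately crude, not ROAS Bench's simulator; it just shows that a finite, slowly refilling warm pool punishes front-loaded spend even when the total budget is identical.

```python
# Toy illustration of path dependence with invented numbers; NOT ROAS Bench's
# simulator. A finite warm pool refills slowly, so spend that outruns the pool
# is wasted, and the waste shows up in later months.

PRICE, MARGIN, CVR = 68.0, 0.76, 0.05       # $68 serum, 76% margin, flat 5% conversion

def run(plan: list[float], pool: float = 6_000.0, refill: float = 1_000.0) -> float:
    profit = 0.0
    for spend in plan:
        reached = min(pool, spend / 2)      # crude: $2 per warm buyer reached
        profit += reached * CVR * PRICE * MARGIN - spend
        pool = max(0.0, pool - reached) + refill
    return profit

blitz = run([8_000.0] * 3 + [0.0] * 9)      # spend $24k in the first quarter
paced = run([2_000.0] * 12)                 # the same $24k spread across the year
print(round(blitz), round(paced))           # the blitz ends negative, the paced plan positive
```

Same $24k, opposite outcomes, purely because of when the money went out. A single-turn plan never has to answer for that.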
ROAS Bench is essentially a machine for exposing the gap between “knows the language” and “can compound a system.”
Why Claude is winning
The benchmark’s qualitative summary is revealing. Claude is not winning by doing something flashy. It is winning by doing boring operator work well:
- keep CRM and remarketing active every month
- avoid unnecessary discounting
- improve without repeatedly breaking learning
- stay more persona-specific in creative
- preserve account coherence over time
That is exactly the kind of pattern most benchmarks miss, because most benchmarks do not make models endure the second-order effects of their own decisions.
This is also why the benchmark feels useful beyond marketing. The deeper question is not “which model writes the best media plan?” It is “which model can operate in a dynamic system where today’s optimization can sabotage next month’s results?”
That question applies to growth, product, sales, finance, and operations.
What people should take away from this
If you work in AI, you should probably stop being impressed by single-turn competence in economically complex domains.
If you work in marketing, you should probably stop asking whether a model “sounds like a strategist” and start asking whether it can preserve margin, resist saturation, and iterate without self-sabotage.
And if you build evals, this is the real lesson: the future belongs to benchmarks that force models to manage tradeoffs over time.
ROAS Bench matters because it is harder to game with polish.
It asks a more uncomfortable question:
Can the model actually run the business, or can it only narrate one?
Right now, those are still very different things.