
The AI Marketing Benchmark That Punishes Plausible-Sounding Strategy

Ellis Crosby
3 min read

Every frontier model can produce a marketing plan that sounds smart for five minutes.

“Prioritize high-intent capture.”
“Protect premium positioning.”
“Use CRM to improve efficiency.”
“Iterate based on signal.”

None of that is hard to say. The hard part is surviving the next twelve months after you say it.

That is why ROAS Bench is worth paying attention to. Instead of grading models on whether they produce polished strategy language, it drops them into a year-long DTC performance marketing simulation and forces them to make decisions month by month. They choose budgets, channels, targeting, creative angles, discounting, remarketing intensity, and iteration strategy. Then the benchmark updates the world state and makes them live with what they just did.
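The decide-then-live-with-it loop can be sketched roughly like this. Everything here is illustrative (the state fields, function names, and numbers are my assumptions, not ROAS Bench's actual API); the point is only the shape: the model decides each month, the world updates, and the consequences carry forward.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    """Hypothetical slice of simulation state a month-by-month benchmark might track."""
    month: int = 1
    warm_pool: float = 100_000.0   # finite remarketable audience
    offer_fatigue: float = 0.0     # grows with heavy discounting
    account_learning: float = 1.0  # degraded by structural resets

def run_year(model_decide, apply_decision):
    """Run twelve months: the model decides, the world updates, history accumulates."""
    state = WorldState()
    history = []
    for month in range(1, 13):
        state.month = month
        decision = model_decide(state, history)        # budgets, channels, creative...
        state, results = apply_decision(state, decision)
        history.append(results)                        # the model lives with the outcome
    return history
```

A single-turn eval only ever scores `model_decide(state, [])`; the loop is what makes early mistakes compound.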

That design choice matters.

Most AI marketing demos reward coherence. ROAS Bench rewards consequences.

What makes this benchmark different

The setup is deliberately hostile to lazy intelligence. Models operate a single skincare brand, Northstar Skin, selling a $68 barrier repair serum at 76% gross margin. They manage six channels across twelve months. They have to deal with seasonality, finite warm pools, learning resets, offer fatigue, and persona tradeoffs.

In other words: the benchmark behaves more like an actual growth system than a prompt engineering parlor trick.

The model cannot just write one beautiful plan and walk away. It has to:

  • pace budget over time
  • decide when to scale and when to hold
  • write creative that matches different personas
  • avoid saturating channels
  • preserve account learning
  • infer what is working from business results rather than from perfect hidden labels

That last part is important. ROAS Bench does not hand the model raw persona-by-persona judge feedback. Instead, the model gets operating metrics, state summaries, market notes, and compressed working memory. It has to reason from imperfect evidence, which is much closer to real operating work.

The most important result

The headline is brutal: across the completed runs on the live benchmark page, only one model is actually profitable after twelve months.

That model is Claude Opus 4.6.

Its reported average results:

  • score: 40.61
  • average profit: $506,094
  • ROAS: 192.5%
  • positive months: 27 / 36
  • discount months: 0

That is not just “best overall.” That is a structural gap.

The benchmark page shows Gemini 3.1 Pro Preview as the closest serious challenger on score, with an average score of 27.14 and ROAS of 132.9%, but it still ends slightly unprofitable. GPT-5.4 looks even more revealing: it generates revenue, but ends with an average profit of roughly -$250k and a ROAS of 103.2%.
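The 76% gross margin makes this back-of-envelope arithmetic easy, and it explains why those ROAS numbers land where they do. Treating ad spend as the only cost beyond cost of goods (a deliberate simplification of the real simulation, which also models fatigue, resets, and so on), breakeven is the ROAS at which gross profit exactly covers spend:

```python
# Breakeven: each $1 of revenue yields $0.76 of gross profit,
# so profit per ad dollar = ROAS * margin - 1, and breakeven ROAS = 1 / margin.
gross_margin = 0.76

breakeven_roas = 1 / gross_margin
print(f"breakeven ROAS ≈ {breakeven_roas:.1%}")       # ≈ 131.6%

for name, roas in [("Claude Opus 4.6", 1.925),
                   ("Gemini 3.1 Pro Preview", 1.329),
                   ("GPT-5.4", 1.032)]:
    profit_per_dollar = roas * gross_margin - 1
    # Claude ≈ +$0.46, Gemini ≈ +$0.01, GPT ≈ -$0.22 per ad dollar
    print(f"{name}: {profit_per_dollar:+.2f} per ad dollar")
```

Under that simplification, Gemini's 132.9% ROAS sits essentially on the breakeven line, so any additional cost tips it negative, while GPT-5.4's 103.2% loses money on every ad dollar before other costs even enter. Only Claude's 192.5% leaves real room to compound.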

That is the kind of result you only get when the benchmark is testing economic behavior instead of eloquence.

Why plausible models still lose money

The ROAS Bench write-up describes GPT-5.4 perfectly: it reads as strategically plausible in isolation, but the run-level data shows weak compounding.

That sentence should make a lot of builders uncomfortable.

We are entering a phase where the easiest failure mode for AI systems is not obvious stupidity. It is polished underperformance.

A model can:

  • say premium-sounding things
  • reference the right channels
  • use marketer vocabulary correctly
  • produce plans that look believable in a doc

…and still lose the business.

Why? Because growth is path-dependent.

If you overspend into a finite audience pool, you pay for it later. If you keep resetting the account structure, you destroy learning. If you lean too hard on discounting, you create fatigue and margin damage. If your creative is generic, high-value personas stop converting even when the targeting sounds correct on paper.
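The first of those failure modes, overspending into a finite pool, is easy to see in a toy model. This is not ROAS Bench's actual audience mechanics (the pool size, refill rate, and response curve below are all invented for illustration); it just shows that two plans with identical total spend can end the year in very different places:

```python
import math

def yearly_buyers(monthly_spend, pool=50_000.0, refill=3_000.0, k=20_000.0):
    """Toy model: a finite warm pool with diminishing returns and a slow refill."""
    total = 0.0
    for spend in monthly_spend:
        buyers = pool * (1 - math.exp(-spend / k))  # diminishing returns on spend
        pool = pool - buyers + refill               # depletion vs. slow refill
        total += buyers
    return total

steady = yearly_buyers([10_000] * 12)                 # even pacing
blitz = yearly_buyers([60_000, 60_000] + [0] * 10)    # front-loaded, same $120k total
print(steady > blitz)                                 # steady pacing wins in this toy
```

The blitz plan exhausts the pool in two months and then starves; the steady plan keeps harvesting the refill all year. Same total budget, very different twelve-month outcome, and the mistake is invisible in any single month's plan.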

ROAS Bench is essentially a machine for exposing the gap between “knows the language” and “can compound a system.”

Why Claude is winning

The benchmark’s qualitative summary is revealing. Claude is not winning by doing something flashy. It is winning by doing boring operator work well:

  • keep CRM and remarketing active every month
  • avoid unnecessary discounting
  • improve without repeatedly breaking learning
  • stay more persona-specific in creative
  • preserve account coherence over time

That is exactly the kind of pattern most benchmarks miss, because most benchmarks do not make models endure the second-order effects of their own decisions.

This is also why the benchmark feels useful beyond marketing. The deeper question is not “which model writes the best media plan?” It is “which model can operate in a dynamic system where today’s optimization can sabotage next month’s results?”

That question applies to growth, product, sales, finance, and operations.

What people should take away from this

If you work in AI, you should probably stop being impressed by single-turn competence in economically complex domains.

If you work in marketing, you should probably stop asking whether a model “sounds like a strategist” and start asking whether it can preserve margin, resist saturation, and iterate without self-sabotage.

And if you build evals, this is the real lesson: the future belongs to benchmarks that force models to manage tradeoffs over time.

ROAS Bench matters because it is harder to game with polish.

It asks a more uncomfortable question:

Can the model actually run the business, or can it only narrate one?

Right now, those are still very different things.
