
The AI Marketing Benchmark That Punishes Plausible-Sounding Strategy

Ellis Crosby
3 min read

Every frontier model can produce a marketing plan that sounds smart for five minutes.

“Prioritize high-intent capture.”
“Protect premium positioning.”
“Use CRM to improve efficiency.”
“Iterate based on signal.”

None of that is hard to say. The hard part is surviving the next twelve months after you say it.

That is why ROAS Bench is worth paying attention to. Instead of grading models on whether they produce polished strategy language, it drops them into a year-long DTC performance marketing simulation and forces them to make decisions month by month. They choose budgets, channels, targeting, creative angles, discounting, remarketing intensity, and iteration strategy. Then the benchmark updates the world state and makes them live with what they just did.
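The decide-then-live-with-it loop can be sketched roughly like this. Everything here is illustrative (the state fields, function names, and numbers are my assumptions, not ROAS Bench's actual API); the point is only the shape: the model decides each month, the world updates, and the consequences carry forward.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    """Hypothetical slice of simulation state a month-by-month benchmark might track."""
    month: int = 1
    warm_pool: float = 100_000.0   # finite remarketable audience
    offer_fatigue: float = 0.0     # grows with heavy discounting
    account_learning: float = 1.0  # degraded by structural resets

def run_year(model_decide, apply_decision):
    """Run twelve months: the model decides, the world updates, history accumulates."""
    state = WorldState()
    history = []
    for month in range(1, 13):
        state.month = month
        decision = model_decide(state, history)        # budgets, channels, creative...
        state, results = apply_decision(state, decision)
        history.append(results)                        # the model lives with the outcome
    return history
```

A single-turn eval only ever scores `model_decide(state, [])`; the loop is what makes early mistakes compound.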

That design choice matters.

Most AI marketing demos reward coherence. ROAS Bench rewards consequences.

What makes this benchmark different

The setup is deliberately hostile to lazy intelligence. Models operate a single skincare brand, Northstar Skin, selling a $68 barrier repair serum at 76% gross margin. They manage six channels across twelve months. They have to deal with seasonality, finite warm pools, learning resets, offer fatigue, and persona tradeoffs.

In other words: the benchmark behaves more like an actual growth system than a prompt engineering parlor trick.

The model cannot just write one beautiful plan and walk away. It has to:

  • pace budget over time
  • decide when to scale and when to hold
  • write creative that matches different personas
  • avoid saturating channels
  • preserve account learning
  • infer what is working from business results rather than from perfect hidden labels

That last part is important. ROAS Bench does not hand the model raw persona-by-persona judge feedback. Instead, the model gets operating metrics, state summaries, market notes, and compressed working memory. It has to reason from imperfect evidence, which is much closer to real operating work.

The most important result

The headline is brutal: across the completed runs on the live benchmark page, only one model is actually profitable after twelve months.

That model is Claude Opus 4.6.

Its reported average results:

  • score: 40.61
  • average profit: $506,094
  • ROAS: 192.5%
  • positive months: 27 / 36
  • discount months: 0

That is not just “best overall.” That is a structural gap.

The benchmark page shows Gemini 3.1 Pro Preview as the closest serious challenger on score, with an average score of 27.14 and ROAS of 132.9%, but it still ends slightly unprofitable. GPT-5.4 looks even more revealing: it generates revenue, but ends with an average profit of roughly -$250k and a ROAS of 103.2%.
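The 76% gross margin makes this back-of-envelope arithmetic easy, and it explains why those ROAS numbers land where they do. Treating ad spend as the only cost beyond cost of goods (a deliberate simplification of the real simulation, which also models fatigue, resets, and so on), breakeven is the ROAS at which gross profit exactly covers spend:

```python
# Breakeven: each $1 of revenue yields $0.76 of gross profit,
# so profit per ad dollar = ROAS * margin - 1, and breakeven ROAS = 1 / margin.
gross_margin = 0.76

breakeven_roas = 1 / gross_margin
print(f"breakeven ROAS ≈ {breakeven_roas:.1%}")       # ≈ 131.6%

for name, roas in [("Claude Opus 4.6", 1.925),
                   ("Gemini 3.1 Pro Preview", 1.329),
                   ("GPT-5.4", 1.032)]:
    profit_per_dollar = roas * gross_margin - 1
    # Claude ≈ +$0.46, Gemini ≈ +$0.01, GPT ≈ -$0.22 per ad dollar
    print(f"{name}: {profit_per_dollar:+.2f} per ad dollar")
```

Under that simplification, Gemini's 132.9% ROAS sits essentially on the breakeven line, so any additional cost tips it negative, while GPT-5.4's 103.2% loses money on every ad dollar before other costs even enter. Only Claude's 192.5% leaves real room to compound.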

That is the kind of result you only get when the benchmark is testing economic behavior instead of eloquence.

Why plausible models still lose money

The ROAS Bench write-up describes GPT-5.4 perfectly: it reads as strategically plausible in isolation, but the run-level data shows weak compounding.

That sentence should make a lot of builders uncomfortable.

We are entering a phase where the easiest failure mode for AI systems is not obvious stupidity. It is polished underperformance.

A model can:

  • say premium-sounding things
  • reference the right channels
  • use marketer vocabulary correctly
  • produce plans that look believable in a doc

…and still lose the business.

Why? Because growth is path-dependent.

If you overspend into a finite audience pool, you pay for it later. If you keep resetting the account structure, you destroy learning. If you lean too hard on discounting, you create fatigue and margin damage. If your creative is generic, high-value personas stop converting even when the targeting sounds correct on paper.
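The first of those failure modes, overspending into a finite pool, is easy to see in a toy model. This is not ROAS Bench's actual audience mechanics (the pool size, refill rate, and response curve below are all invented for illustration); it just shows that two plans with identical total spend can end the year in very different places:

```python
import math

def yearly_buyers(monthly_spend, pool=50_000.0, refill=3_000.0, k=20_000.0):
    """Toy model: a finite warm pool with diminishing returns and a slow refill."""
    total = 0.0
    for spend in monthly_spend:
        buyers = pool * (1 - math.exp(-spend / k))  # diminishing returns on spend
        pool = pool - buyers + refill               # depletion vs. slow refill
        total += buyers
    return total

steady = yearly_buyers([10_000] * 12)                 # even pacing
blitz = yearly_buyers([60_000, 60_000] + [0] * 10)    # front-loaded, same $120k total
print(steady > blitz)                                 # steady pacing wins in this toy
```

The blitz plan exhausts the pool in two months and then starves; the steady plan keeps harvesting the refill all year. Same total budget, very different twelve-month outcome, and the mistake is invisible in any single month's plan.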

ROAS Bench is essentially a machine for exposing the gap between “knows the language” and “can compound a system.”

Why Claude is winning

The benchmark’s qualitative summary is revealing. Claude is not winning by doing something flashy. It is winning by doing boring operator work well:

  • keep CRM and remarketing active every month
  • avoid unnecessary discounting
  • improve without repeatedly breaking learning
  • stay more persona-specific in creative
  • preserve account coherence over time

That is exactly the kind of pattern most benchmarks miss, because most benchmarks do not make models endure the second-order effects of their own decisions.

This is also why the benchmark feels useful beyond marketing. The deeper question is not “which model writes the best media plan?” It is “which model can operate in a dynamic system where today’s optimization can sabotage next month’s results?”

That question applies to growth, product, sales, finance, and operations.

What people should take away from this

If you work in AI, you should probably stop being impressed by single-turn competence in economically complex domains.

If you work in marketing, you should probably stop asking whether a model “sounds like a strategist” and start asking whether it can preserve margin, resist saturation, and iterate without self-sabotage.

And if you build evals, this is the real lesson: the future belongs to benchmarks that force models to manage tradeoffs over time.

ROAS Bench matters because it is harder to game with polish.

It asks a more uncomfortable question:

Can the model actually run the business, or can it only narrate one?

Right now, those are still very different things.
