
Claude vs Gemini vs GPT in a 12-Month Marketing Simulation

Ellis Crosby
4 min read

If you want a cleaner comparison of frontier models than the usual vibe-based screenshots, ROAS Bench is a good place to look.

The benchmark puts each model in charge of a direct-to-consumer (DTC) skincare brand for twelve months and scores it on business outcomes as well as behavioral quality. The model has to allocate budget, choose channels, plan creative, manage offers, react to outcomes, and iterate over time.
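The page does not publish the exact interface the models operate through, so treat the following as a hypothetical sketch of what one month's decision payload might look like; the field names are made up for illustration, not taken from the benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class MonthlyPlan:
    """Hypothetical shape of one month's decisions in a ROAS-Bench-style run.

    Field names are illustrative; the real benchmark's interface is not
    published on the page summarized here.
    """
    total_budget: float                                            # ad spend committed this month
    channel_split: dict[str, float] = field(default_factory=dict)  # e.g. {"search": 0.4, "social": 0.35, "remarketing": 0.25}
    discount_pct: float = 0.0                                       # 0.0 means no sitewide discount this month
    creative_briefs: list[str] = field(default_factory=list)        # persona-specific copy directions
    notes: str = ""                                                 # rationale carried into next month's planning
```

The simulator applies a plan like this, reports the month's outcome (revenue, ROAS, new customers, any learning-phase resets), and the model plans the next month against that feedback.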

That means the benchmark is not asking, “Which model can write the nicest marketing memo?”

It is asking, “Which model can make a sequence of decisions that does not destroy the business?”

Right now, that distinction is where the real separation shows up.

The scoreboard

As of the most recent update to the live ROAS Bench page, the standings for the major frontier models shown there are:

Claude Opus 4.6

  • overall score: 40.61
  • average profit: $506,094
  • ROAS: 192.5%
  • positive months: 27 / 36
  • discount months: 0

Gemini 3.1 Pro Preview

  • overall score: 27.14
  • average profit: -$34,549
  • ROAS: 132.9%
  • positive months: 16 / 36

GPT-5.4

  • overall score: 18.39
  • average profit: -$250,461
  • ROAS: 103.2%
  • positive months: 8 / 36
  • discount months: 14

That spread is not subtle.

Claude is not just ahead. It is the only model on the board currently shown as profitable after the full twelve-month simulation.
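The page does not define its ROAS formula, but if it is the standard one (attributed revenue divided by ad spend), the headline percentages translate into per-dollar terms roughly like this; the spend figure below is made up for illustration, not taken from the benchmark:

```python
def implied_revenue(ad_spend: float, roas_pct: float) -> float:
    """Attributed revenue implied by a ROAS percentage, assuming ROAS = revenue / spend."""
    return ad_spend * roas_pct / 100

# Hypothetical $100k of spend, using the ROAS figures from the scoreboard above.
for name, roas in [("Claude Opus 4.6", 192.5), ("Gemini 3.1 Pro Preview", 132.9), ("GPT-5.4", 103.2)]:
    revenue = implied_revenue(100_000, roas)
    print(f"{name}: ${revenue:,.0f} revenue on $100,000 spend "
          f"(${revenue - 100_000:,.0f} left before product costs and overhead)")
```

At 103.2%, almost nothing is left once product costs, shipping, and overhead come out of the ad-attributed revenue, which is consistent with the negative average profit on the board.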

Why Claude is ahead

According to the benchmark’s qualitative summary, Claude wins by doing something deceptively simple: it compounds discipline.

The write-up highlights a few recurring strengths:

  • CRM and remarketing stay active every month
  • discounting is avoided completely
  • account learning is preserved instead of repeatedly reset
  • creative is more persona-specific
  • the operating structure remains coherent as spend scales

That combination matters because the benchmark punishes exactly the kinds of mistakes many models make when they seem “smart” in isolation. Isolated clever ideas are not enough. In a year-long environment, coherence beats occasional brilliance.

The strongest Claude example on the page reads like an experienced operator rather than a hype machine. It protects margin, feeds remarketing pools, keeps Google capture alive, and scales into the holiday period without suddenly changing its identity.

That is not sexy. It is just what winning looks like in many real systems.

Why Gemini looks closer than its score suggests

Gemini is interesting because the page explicitly notes that it looks closer to viability than GPT-5.4 does.

Its problem is not that it sees nothing. Its problem is that execution quality still leaks value.

The benchmark summary says Gemini’s budgeting is directionally smarter than its final score alone might imply, but the copy tends to be too generic and the account still triggers too many learning resets, especially around remarketing and mid-course corrections.

That is a useful distinction.

A model can understand the rough shape of the right answer and still fail because:

  • the creative lacks specificity
  • the audience logic is too generic
  • the system keeps getting reset at the wrong times
  • the benchmark environment is punishing weak follow-through

In other words, “closer to the right strategy” is not the same as “good enough to compound profitably.”

Why GPT-5.4 is the most instructive result

GPT-5.4 may be the most interesting result on the page because it illustrates a broader truth about modern LLMs: persuasive reasoning is not the same thing as profitable reasoning.

The benchmark summary says GPT-5.4 often reads as strategically plausible in isolation, but the run-level data shows weak compounding. That is a sharp diagnosis, and it should resonate with almost anyone who has evaluated AI systems seriously.

The model is not obviously clueless. It can generate plans that sound good. But across the full simulation, it does not produce the economic discipline needed to support the spend pattern it chooses.

The page points to recurring failure modes like:

  • broad demand-generation spend without durable payoff
  • repeated saturation in search, shopping, and remarketing
  • budgeting and targeting that are serviceable but still not strong enough to overcome weak compounding
  • ad copy that sounds polished but is commercially generic

That last point matters a lot. Generic marketing language is not just aesthetically weak. In this benchmark, it translates into poor persona response and weaker downstream economics.

The deeper story is not model personality

It would be easy to reduce these results to brand stereotypes:

  • Claude is more disciplined
  • Gemini is more uneven
  • GPT is more polished than grounded

There is some truth in that framing, but the deeper lesson is about eval design.

ROAS Bench creates a setting where models must:

  • reason across time
  • absorb imperfect feedback
  • manage finite resources
  • protect long-term health
  • avoid self-inflicted resets

That is exactly where real capability differences become harder to hide.
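Here is a minimal sketch of what that kind of sequential, stateful eval loop looks like, assuming a hypothetical agent.plan_month / simulator.step interface rather than ROAS Bench's real harness:

```python
def run_yearlong_eval(agent, simulator, months: int = 12) -> float:
    """Sketch of a sequential, stateful eval loop; interfaces are hypothetical, not ROAS Bench's code."""
    history = []                # imperfect feedback the agent has to interpret for itself
    cumulative_profit = 0.0
    for month in range(1, months + 1):
        plan = agent.plan_month(month=month, history=history)   # finite budget, long-term health at stake
        outcome = simulator.step(plan)                          # revenue, ROAS, learning-phase resets, etc.
        cumulative_profit += outcome["profit"]
        history.append({"plan": plan, "outcome": outcome})      # state persists, so early mistakes keep costing
    return cumulative_profit    # the run is scored on the trajectory, not on any single month's answer
```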

Single-turn tests often compress frontier models closer together than reality deserves. A sequential benchmark with economic consequences stretches them back out.

What this means for buyers and builders

If you are choosing models for real workflows, these results are a reminder to separate “sounds expert” from “stays effective once it has to carry state across many decisions.”

If you are building agents, the message is even sharper: your system should be judged on trajectory, not on isolated answers.

The best model in ROAS Bench is not the one with the prettiest one-month plan.

It is the one that can:

  • keep the account coherent
  • avoid overreacting
  • protect pricing power
  • scale without constant self-sabotage

That is a much more useful definition of intelligence.
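Concretely, judging on trajectory means aggregating the whole run into metrics like total profit, consistency, and worst drawdown rather than grading any single answer. An illustrative aggregation, not the benchmark's actual scoring formula:

```python
def trajectory_metrics(monthly_profit: list[float]) -> dict[str, float]:
    """Summarize a run the way a trajectory-level eval might; illustrative only."""
    cumulative = 0.0
    peak = 0.0
    max_drawdown = 0.0
    for profit in monthly_profit:
        cumulative += profit
        peak = max(peak, cumulative)
        max_drawdown = max(max_drawdown, peak - cumulative)
    return {
        "total_profit": cumulative,
        "positive_months": sum(1 for p in monthly_profit if p > 0),
        "max_drawdown": max_drawdown,   # how deep the worst stretch cut into the run
    }

# A run that ends slightly profitable but is erratic month to month still shows up in the metrics.
print(trajectory_metrics([12_000, -30_000, 8_000, 15_000, -5_000, 22_000]))
```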

And it is probably a preview of where the next generation of serious evals is going.

