If you want a cleaner comparison of frontier models than the usual vibe-based screenshots, ROAS Bench is a good place to look.
The benchmark puts models in charge of a DTC skincare brand for twelve months and scores them on business outcomes plus behavioral quality. They have to allocate budget, choose channels, plan creative, manage offers, react to outcomes, and iterate over time.
That means the benchmark is not asking, “Which model can write the nicest marketing memo?”
It is asking, “Which model can make a sequence of decisions that does not destroy the business?”
Right now, that distinction is where the real separation shows up.
The scoreboard
As shown on the live ROAS Bench page at the time of writing, the standings among the major frontier models are:
Claude Opus 4.6
- overall score: 40.61
- average profit: $506,094
- ROAS: 192.5%
- positive months: 27 / 36
- discount months: 0

Gemini 3.1 Pro Preview
- overall score: 27.14
- average profit: -$34,549
- ROAS: 132.9%
- positive months: 16 / 36

GPT-5.4
- overall score: 18.39
- average profit: -$250,461
- ROAS: 103.2%
- positive months: 8 / 36
- discount months: 14
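For readers unfamiliar with the headline metric, the arithmetic is worth spelling out. The sketch below uses invented monthly figures, not ROAS Bench data, and the standard ROAS definition (revenue divided by ad spend). Note that ROAS above 100% does not guarantee profit once non-ad costs like product and fulfillment are included, which is how a run like Gemini's can post 132.9% ROAS and still lose money on average.

```python
# Standard ROAS arithmetic on invented monthly figures (not ROAS Bench data).

def roas(revenue: float, ad_spend: float) -> float:
    """Return on ad spend as a percentage: revenue per dollar of ad spend."""
    return 100.0 * revenue / ad_spend

# Hypothetical three-month run, purely for illustration.
months = [
    {"revenue": 120_000, "spend": 60_000},
    {"revenue": 60_000,  "spend": 70_000},
    {"revenue": 150_000, "spend": 65_000},
]

total_rev = sum(m["revenue"] for m in months)
total_spend = sum(m["spend"] for m in months)
positive = sum(1 for m in months if m["revenue"] > m["spend"])

print(f"ROAS: {roas(total_rev, total_spend):.1f}%")    # 330,000 / 195,000 -> 169.2%
print(f"positive months: {positive} / {len(months)}")  # 2 / 3
```

Here "positive months" only compares revenue to ad spend; the benchmark's profit figures presumably also fold in costs the page does not break down.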
That spread is not subtle.
Claude is not just ahead. It is the only model on the board currently shown as profitable after the full twelve-month simulation.
Why Claude is ahead
According to the benchmark’s qualitative summary, Claude wins by doing something deceptively simple: it compounds discipline.
The write-up highlights a few recurring strengths:
- CRM and remarketing stay active every month
- discounting is avoided completely
- account learning is preserved instead of repeatedly reset
- creative is more persona-specific
- the operating structure remains coherent as spend scales
That combination matters because the benchmark punishes exactly the kinds of mistakes many models make when they seem “smart” in isolation. Sporadic flashes of cleverness are not enough. In a year-long environment, coherence beats occasional brilliance.
The strongest Claude example on the page reads like an experienced operator rather than a hype machine. It protects margin, feeds remarketing pools, keeps Google capture alive, and scales into the holiday period without suddenly changing its identity.
That is not sexy. It is just what winning looks like in many real systems.
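The "preserved learning" point is the crux of the compounding argument. Here is a toy simulation of it, with entirely invented numbers and no relation to the benchmark's actual environment: a small monthly efficiency gain compounds when it is preserved and evaporates when the account keeps getting reset.

```python
# Toy model of compounding vs. resetting account learning.
# All numbers are invented for illustration; this is not ROAS Bench's environment.

def run(months: int, reset_every: int = 0) -> float:
    """Total profit when a learning multiplier grows 3%/month but may be reset."""
    learning = 1.0          # efficiency multiplier the ad account accumulates
    base_profit = 10_000.0  # hypothetical monthly profit at baseline efficiency
    total = 0.0
    for m in range(1, months + 1):
        if reset_every and m % reset_every == 0:
            learning = 1.0  # a "learning reset" wipes the accumulated signal
        total += base_profit * learning
        learning *= 1.03    # small monthly gain from preserved learning
    return total

print(f"no resets over 36 months: ${run(36):,.0f}")
print(f"reset every 6 months:     ${run(36, reset_every=6):,.0f}")
```

The disciplined run finishes far ahead on identical monthly mechanics; the only difference is whether learning survives.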
Why Gemini looks closer than its score suggests
Gemini is interesting because the page explicitly notes that it looks closer to viability than GPT-5.4 does.
Its problem is not that it sees nothing. Its problem is that execution quality still leaks value.
The benchmark summary says Gemini’s budgeting is directionally smarter than its final score alone might imply, but the copy tends to be too generic and the account still triggers too many learning resets, especially around remarketing and mid-course corrections.
That is a useful distinction.
A model can understand the rough shape of the right answer and still fail because:
- the creative lacks specificity
- the audience logic is too generic
- the system keeps getting reset at the wrong times
- the benchmark environment punishes weak follow-through
In other words, “closer to the right strategy” is not the same as “good enough to compound profitably.”
Why GPT-5.4 is the most instructive result
GPT-5.4 may be the most interesting result on the page because it illustrates a broader truth about modern LLMs: persuasive reasoning is not the same thing as profitable reasoning.
The benchmark summary says GPT-5.4 often reads as strategically plausible in isolation, but the run-level data shows weak compounding. That is a sharp diagnosis, and it should resonate with almost anyone who has evaluated AI systems seriously.
The model is not obviously clueless. It can generate plans that sound good. But across the full simulation, it does not produce the economic discipline needed to support the spend pattern it chooses.
The page points to recurring failure modes like:
- broad demand-generation spend without durable payoff
- repeated saturation in search, shopping, and remarketing
- budgeting and targeting that are serviceable but still not strong enough to overcome weak compounding
- ad copy that sounds polished but is commercially generic
That last point matters a lot. Generic marketing language is not just aesthetically weak. In this benchmark, it translates into poor persona response and weaker downstream economics.
The deeper story is not model personality
It would be easy to reduce these results to brand stereotypes:
- Claude is more disciplined
- Gemini is more uneven
- GPT is more polished than grounded
There is some truth in that framing, but the deeper lesson is about eval design.
ROAS Bench creates a setting where models must:
- reason across time
- absorb imperfect feedback
- manage finite resources
- protect long-term health
- avoid self-inflicted resets
That is exactly where real capability differences become harder to hide.
Single-turn tests often compress frontier models closer together than reality deserves. A sequential benchmark with economic consequences stretches them back out.
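The stretching-out effect has a simple probabilistic intuition. Assuming made-up per-decision success rates (illustrative only, not benchmark data), a small single-turn gap becomes a large trajectory-level gap once every decision has to hold up in sequence:

```python
# Assumed per-decision success rates, purely for illustration.

def trajectory_success(p_per_step: float, steps: int = 36) -> float:
    """Chance that every decision in a sequential run holds up."""
    return p_per_step ** steps

for name, p in [("model A", 0.95), ("model B", 0.90)]:
    print(f"{name}: single-turn {p:.0%}, 36-step trajectory {trajectory_success(p):.1%}")
```

Under these assumed rates, a 5-point single-turn gap becomes roughly a 7x gap at the trajectory level.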
What this means for buyers and builders
If you are choosing models for real workflows, these results are a reminder to separate "sounds expert" from "stays effective once decisions accumulate state."
If you are building agents, the message is even sharper: your system should be judged on trajectory, not on isolated answers.
The best model in ROAS Bench is not the one with the prettiest one-month plan.
It is the one that can:
- keep the account coherent
- avoid overreacting
- protect pricing power
- scale without constant self-sabotage
That is a much more useful definition of intelligence.
And it is probably a preview of where the next generation of serious evals is going.