Most marketing AI benchmarks test whether a model can say smart-sounding things once.
We wanted to test whether it could operate.
That is the reason ROASBench exists.
ROASBench was built to feel less like a prompt demo and more like an actual growth environment: budget constraints, channel tradeoffs, audience saturation, creative quality, retention dynamics, delayed feedback, and month-by-month consequences.
The short version is simple:
We built ROASBench to be as realistic as we could make it, grounded in real operator experience.
It came from real operating work
The simulation is heavily informed by our sister agency, Incremento, which focuses on tech-enabled business management and does a lot of work across performance marketing and marketing analytics.
That matters because ROASBench was not designed in a vacuum by people guessing what marketing probably looks like. It was shaped by the kinds of decisions, failure modes, and tradeoffs that show up in real accounts.
Two people were especially important in shaping that realism:
- Ellis, CTO at Incremento and Spring Prompt, previously worked across performance marketing, marketing analytics, engineering, and data science at THG and Blinkist.
- Vivien, CMO at Incremento and a fractional CMO by trade, has managed close to $1B in ad spend over her career, with experience at THG, Rocket Internet, and brands later acquired by Bayer.
That background influenced a lot of the benchmark’s design.
Not in the sense of copying any one client or account.
In the sense of knowing what tends to break, what tends to compound, and what “plausible but wrong” looks like when money is actually on the line.
We started from a believable business, not an abstract task
A benchmark like this only becomes useful if the underlying business feels commercially coherent.
So instead of creating a vague “run marketing for a company” task, we built a specific DTC operating environment:
- one premium-but-accessible skincare brand
- one hero product
- a realistic price point
- realistic gross margin
- a defined geography
- a fixed annual budget
- a twelve-month planning horizon
- seasonality and market shocks
That is why the benchmark centers on Northstar Skin, selling a $68 barrier repair serum at strong gross margins.
We chose a setup like that because it creates the kinds of tensions real marketers deal with all the time. Skincare is useful here because trust matters, creative matters, search intent matters, CRM matters, discounting is tempting, and repeat behavior matters.
In other words, it is a good domain for exposing whether a model can balance short-term conversion pressure with long-term account health.
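To make that concrete, here is a minimal sketch of what that world definition could look like. The brand, product, price, and horizon come straight from the setup above; the field names, the margin figure, the budget figure, and the geography are placeholders we invented, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldConfig:
    """One believable DTC business, fixed for every run."""
    brand: str = "Northstar Skin"
    hero_product: str = "barrier repair serum"
    price_usd: float = 68.0
    gross_margin: float = 0.75              # "strong gross margins" -- exact value illustrative
    geography: str = "US"                   # a defined geography -- choice illustrative
    annual_budget_usd: float = 1_200_000.0  # fixed annual budget -- amount illustrative
    horizon_months: int = 12                # twelve-month planning horizon
```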
We modeled channels the way operators actually think about them
ROASBench gives the model control over six channels:
- Meta prospecting
- Google Search
- Google Shopping
- TikTok
- Email / CRM
- Remarketing
Each channel has different economics and different behavioral rules.
That sounds obvious, but it is where a lot of toy simulations go wrong. In real performance marketing, channels are not interchangeable budget buckets. Search behaves differently from TikTok. CRM behaves differently from paid social. Remarketing only works if you actually have a warm pool to harvest.
So we gave channels different parameters (a code sketch follows this list):
- CPM, CTR, and CVR baselines
- budget-share limits
- fatigue rates
- learning reset thresholds
- campaign-type defaults
- targeting constraints
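In code, that might look something like this. The six channel names are the real ones; every number is a placeholder we invented, and the real benchmark also layers on campaign-type defaults and targeting constraints that this sketch omits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChannelConfig:
    cpm_usd: float           # baseline cost per 1k impressions
    ctr: float               # baseline click-through rate
    cvr: float               # baseline conversion rate
    max_budget_share: float  # budget-share limit
    fatigue_rate: float      # monthly performance decay under sustained spend
    reset_swing: float       # budget swing (as a fraction) that triggers a learning reset

# All numbers illustrative, not the benchmark's real baselines.
CHANNELS = {
    "meta_prospecting": ChannelConfig(9.0,  0.012, 0.015, 0.45, 0.08, 0.30),
    "google_search":    ChannelConfig(30.0, 0.040, 0.045, 0.30, 0.02, 0.50),
    "google_shopping":  ChannelConfig(18.0, 0.020, 0.035, 0.25, 0.03, 0.40),
    "tiktok":           ChannelConfig(6.0,  0.010, 0.008, 0.30, 0.12, 0.30),
    "email_crm":        ChannelConfig(0.0,  0.250, 0.030, 0.10, 0.15, 1.00),
    "remarketing":      ChannelConfig(12.0, 0.030, 0.050, 0.15, 0.10, 0.40),
}
```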
We also modeled some operator realities that get missed in simplified benchmarks.
For example, email is not treated like a normal media-buying channel. The benchmark interprets email budget more like CRM send intensity, because real email cost is low and the limiting factor is list quality, cadence, and audience size, not auction spend.
Remarketing is also constrained by actual audience availability. The model cannot pretend that retargeting scale is infinite, because it is not.
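Both of those realities reduce to simple guards. A sketch, with the function names and constants ours:

```python
def email_send_intensity(email_budget_usd: float, list_size: int,
                         cost_per_send_usd: float = 0.01) -> float:
    """Email budget reads as CRM send intensity: sends per subscriber this
    month, capped at a sane cadence. Cost is low, so list size and cadence
    are the limiting factors, not auction spend."""
    if list_size == 0:
        return 0.0
    sends_per_subscriber = email_budget_usd / (cost_per_send_usd * list_size)
    return min(sends_per_subscriber, 8.0)  # cadence ceiling -- illustrative

def effective_remarketing_spend(requested_usd: float, warm_pool: int,
                                saturation_cost_usd: float = 1.50) -> float:
    """Retargeting scale is finite: spend beyond the warm pool is clipped."""
    return min(requested_usd, warm_pool * saturation_cost_usd)
```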
We made the system remember
This is the biggest design choice in ROASBench.
The benchmark is not twelve disconnected prompts. It is one business with memory.
A model’s decisions roll forward through persistent state (sketched in code below), including:
- budget remaining
- customer base
- email list size
- warm audience pool
- brand momentum
- offer fatigue
- reinvested budget
- channel memory
That means the model has to live with its own behavior.
If it discounts too aggressively, future months get worse.
If it thrashes budget allocation, learning resets hurt efficiency.
If it over-spends into finite audiences, saturation shows up.
If it supports demand generation and trust-building coherently, organic and direct effects can improve over time.
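A minimal sketch of that state and one month of consequences, covering a subset of the fields above. Every coefficient is invented for illustration; the shape is the point.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BusinessState:
    budget_remaining: float
    customers: int
    email_list: int
    warm_pool: int
    brand_momentum: float  # 0..1, updated in the organic-carryover sketch below
    offer_fatigue: float   # 0..1, rises with discounting

def roll_forward(state: BusinessState, spend: float, discount_depth: float,
                 new_customers: int) -> BusinessState:
    """One month of consequences: discounting feeds offer fatigue, wins refill
    the email list and the warm pool, and the warm pool decays if unfed."""
    return replace(
        state,
        budget_remaining=state.budget_remaining - spend,
        customers=state.customers + new_customers,
        email_list=state.email_list + int(new_customers * 0.6),    # opt-in rate: made up
        warm_pool=int(state.warm_pool * 0.7) + new_customers * 4,  # decay + refill: made up
        offer_fatigue=min(1.0, state.offer_fatigue * 0.8 + discount_depth * 0.5),
    )
```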
This was a deliberate attempt to capture something that real growth operators learn early: the system remembers what you did.
A benchmark that resets the world every turn misses the hardest part of the job.
We built personas around commercial tradeoffs, not just demographics
Another thing we wanted to avoid was fake persona realism.
A lot of benchmarks create personas that read like slideware. They are descriptive, but not economically useful.
So in ROASBench, personas differ on the dimensions that actually matter commercially (two contrasting examples are sketched in code below):
- price sensitivity
- trust requirements
- repeat propensity
- audience size
- growth potential
- brand fit
- competitive intensity
- customer value
That creates realistic tradeoffs.
Some audiences are easier to wake up but lower quality.
Some are smaller, more demanding, and more expensive to win, but much more valuable if you do.
That is much closer to real growth work than pretending every customer segment is just another creative brief.
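As a sketch, here are the two audience types the dashboard example later in this post mentions, the value seeker and the ingredient researcher, expressed on a subset of those dimensions. The numbers are invented; the contrast is the point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    price_sensitivity: float      # 0..1
    trust_requirement: float      # 0..1
    repeat_propensity: float      # 0..1
    audience_size: int
    brand_fit: float              # 0..1
    competitive_intensity: float  # 0..1
    avg_customer_value_usd: float

# Easier to wake up, lower quality:
value_seeker = Persona(0.9, 0.3, 0.3, 400_000, 0.5, 0.8, 90.0)
# Smaller, more demanding, much more valuable if won:
ingredient_researcher = Persona(0.3, 0.9, 0.8, 60_000, 0.9, 0.5, 320.0)
```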
We made realism come from both rules and judgment
ROASBench is not just a spreadsheet with random noise.
It combines deterministic system rules with an audience panel layer.
The rules handle the mechanics:
- budget pacing
- reach capacity
- saturation
- auction pressure
- discounts affecting margin and returns
- channel fatigue
- reinvestment
- state updates
- organic carryover
The audience panel handles a different question:
How would different kinds of customers likely react to this targeting and this creative?
That matters because performance marketing is not only about media math. It is also about whether the message fits the audience.
So we assess things like copy quality, persona fit, trust, attention, conversion likelihood, and fatigue at the persona-channel level, then feed those signals back into business outcomes.
If the judge response is incomplete, the system falls back to deterministic heuristics instead of collapsing. That gives the benchmark some resilience while still preserving the basic audience-reaction logic.
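That fallback has roughly this shape. A sketch only: the key names and heuristics are ours, and it assumes the Persona fields sketched earlier.

```python
def audience_reaction(judge_output: dict | None, persona,
                      creative_quality: float) -> dict:
    """Prefer the judge's persona-channel scores when they are complete;
    otherwise fall back to deterministic heuristics instead of collapsing."""
    required = {"persona_fit", "trust", "attention", "conversion_likelihood"}
    if judge_output and required <= judge_output.keys():
        return {k: float(judge_output[k]) for k in required}
    # Deterministic fallback: coarse, but keeps audience-reaction logic alive.
    return {
        "persona_fit": persona.brand_fit,
        "trust": min(1.0, 0.3 + creative_quality * (1.0 - persona.trust_requirement)),
        "attention": creative_quality,
        "conversion_likelihood": creative_quality * persona.brand_fit,
    }
```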
We intentionally hid perfect information
One of the easiest ways to make a benchmark unrealistic is to give the model too much clarity.
Real operators do not get a magical dashboard that says:
“The ingredient researcher disliked your vague creative but the value seeker liked the discount, so adjust budget by exactly 17% next month.”
They get imperfect evidence.
So ROASBench gives the model operating metrics, state summaries, market notes, prior-month context, and compressed working memory. It does not hand over the hidden truth in an easy format.
That was important to us because a big part of real performance marketing is inference.
You are trying to read signal from noisy outcomes.
You are deciding whether to hold, scale, or change course without ever having perfect visibility.
That is one of the main reasons ROASBench can separate polished models from genuinely useful ones.
We modeled the boring things that actually kill performance
When people imagine realistic marketing simulation, they often jump straight to “creative strategy.”
Creative matters, but a lot of real account damage comes from much less glamorous problems.
That is why ROASBench explicitly models failure modes like these (learning resets are sketched in code below):
- learning resets from abrupt changes
- audience saturation
- over-discounting
- promo dependency
- channel over-concentration
- weak targeting
- weak copy
- high CAC
- ignored remarketing when warm demand exists
- under-investment in months where the model should be building demand
These are the kinds of mistakes that make a plan look active while the economics quietly fall apart.
From our perspective, those are exactly the mistakes a realistic benchmark should punish.
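The least glamorous of these, learning resets, is also the simplest to sketch. The threshold and penalty values here are illustrative:

```python
def monthly_efficiency(prev_budget: float, new_budget: float,
                       base_efficiency: float,
                       reset_swing: float = 0.4, penalty: float = 0.25) -> float:
    """An abrupt budget swing knocks the channel back into learning,
    so efficiency takes a temporary hit for the month."""
    if prev_budget <= 0:
        return base_efficiency
    swing = abs(new_budget - prev_budget) / prev_budget
    return base_efficiency * (1.0 - penalty) if swing > reset_swing else base_efficiency
```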
We let organic carryover matter
Another realism choice was to avoid making the entire world purely paid-media deterministic.
In real businesses, good paid decisions can strengthen brand demand over time. Bad decisions can do the opposite.
So ROASBench includes system-generated organic social, organic search, and direct traffic that react gradually to what the model has been doing.
That means upper-funnel support, trust-building, repeat behavior, and brand momentum can produce delayed upside.
It also means the model cannot directly buy its way into every outcome.
That was important because a lot of real growth work is about creating the conditions for later efficiency, not just harvesting intent that already exists.
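One simple way to express that gradual reaction is carryover on brand momentum. A sketch; every coefficient is invented:

```python
def update_momentum(momentum: float, trust_building_share: float,
                    discount_depth: float) -> float:
    """Momentum compounds slowly when spend supports trust and demand,
    and erodes under heavy discounting."""
    gained = 0.10 * trust_building_share  # share of spend on upper funnel
    eroded = 0.15 * discount_depth
    return max(0.0, min(1.0, momentum * 0.9 + gained - eroded))

def organic_sessions(momentum: float, baseline: int = 20_000) -> int:
    """Organic social, organic search, and direct traffic scale with momentum,
    with the delay coming from the slow update above."""
    return int(baseline * (0.5 + momentum))
```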
We scored both outcomes and operator behavior
If you only score revenue, the benchmark becomes easy to game.
If you only score elegance, it stops being a business benchmark.
So ROASBench blends several layers (sketched in code below):
- business performance
- behavioral quality
- persona response
- planning quality
That lets the benchmark reward actual economic performance while still recognizing whether the model behaved like a disciplined operator or a chaotic one.
A model that gets lucky for one month but burns the account structure should not look like a winner.
A model that protects margin, avoids panic, keeps CRM alive, respects audience constraints, and compounds over time should.
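A sketch of that blend. The weights here are placeholders, not ROASBench's real ones, and each layer score is assumed to be normalized to 0..1:

```python
def composite_score(business: float, behavior: float,
                    persona_response: float, planning: float) -> float:
    """Blend outcomes with operator behavior so one lucky month cannot
    outrun a damaged account structure."""
    return (0.50 * business           # business performance
            + 0.20 * behavior         # behavioral quality
            + 0.15 * persona_response
            + 0.15 * planning)        # planning quality
```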
We tried to make it reproducible, not just realistic
There is always a tradeoff in simulation design between realism and comparability.
We wanted both.
So ROASBench uses a fixed seeded world, structured monthly inputs, explicit state transitions, stable benchmark rules, and repeat runs per model.
That helps keep the benchmark grounded enough to feel like real growth work, while still being controlled enough to compare models fairly.
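In practice, that discipline is small. A sketch, where simulate_year stands in for the twelve-month loop described above and is supplied by the caller:

```python
import random
from typing import Callable

def run_benchmark(simulate_year: Callable[[random.Random], float],
                  world_seed: int = 42, runs: int = 5) -> list[float]:
    """Every model faces the identical seeded world; repeat runs average out
    model-side variance while the world itself never changes."""
    return [simulate_year(random.Random(world_seed)) for _ in range(runs)]
```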
Without that, you do not have a benchmark.
You just have a story.
The point of ROASBench
The real goal was never to build a perfect digital twin of one account.
It was to build a benchmark that captures the shape of real operating work:
- incomplete feedback
- economic tradeoffs
- channel differences
- memory
- compounding
- delayed consequences
- the gap between plausible strategy and profitable strategy
That is why ROASBench looks the way it does.
It is a synthetic environment, but it is trying to test a very real capability:
Can the model run a growth system without slowly damaging it?
That question is heavily informed by what we have seen through Incremento, and by the lived operating experience Ellis and Vivien bring to the table.
We think that is what makes the benchmark useful.
It does not just ask whether the model can talk like a marketer.
It asks whether it can behave like one.