Most marketing AI benchmarks test whether a model can say smart-sounding things once.
We wanted to test whether it could operate.
That is the reason ROASBench exists.
ROASBench was built to feel less like a prompt demo and more like an actual growth environment: budget constraints, channel tradeoffs, audience saturation, creative quality, retention dynamics, delayed feedback, and month-by-month consequences.
The short version is simple:
We built ROASBench to be as realistic as we could make it, grounded in real operator experience.
It came from real operating work
The simulation is heavily informed by our sister agency, Incremento, which focuses on tech-enabled business management and does a lot of work across performance marketing and marketing analytics.
That matters because ROASBench was not designed in a vacuum by people guessing what marketing probably looks like. It was shaped by the kinds of decisions, failure modes, and tradeoffs that show up in real accounts.
Two people were especially important in shaping that realism:
- Ellis, CTO at Incremento and Spring Prompt, previously worked across performance marketing, marketing analytics, engineering, and data science at THG and Blinkist.
- Vivien, CMO at Incremento and a fractional CMO by trade, has managed close to $1B in ad spend over her career, with experience at THG, Rocket Internet, and brands later acquired by Bayer.
That background influenced a lot of the benchmark’s design.
Not in the sense of copying any one client or account.
In the sense of knowing what tends to break, what tends to compound, and what “plausible but wrong” looks like when money is actually on the line.
We started from a believable business, not an abstract task
A benchmark like this only becomes useful if the underlying business feels commercially coherent.
So instead of creating a vague “run marketing for a company” task, we built a specific DTC operating environment:
- one premium-but-accessible skincare brand
- one hero product
- a realistic price point
- realistic gross margin
- a defined geography
- a fixed annual budget
- a twelve-month planning horizon
- seasonality and market shocks
That is why the benchmark centers on Northstar Skin, selling a $68 barrier repair serum at strong gross margins.
We chose a setup like that because it creates the kinds of tensions real marketers deal with all the time. Skincare is useful here because trust matters, creative matters, search intent matters, CRM matters, discounting is tempting, and repeat behavior matters.
In other words, it is a good domain for exposing whether a model can balance short-term conversion pressure with long-term account health.
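To make that concrete, here is a minimal sketch of what that world definition could look like. The brand, product, price, and horizon come straight from the setup above; the field names, the margin figure, the budget figure, and the geography are placeholders we invented, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldConfig:
    """One believable DTC business, fixed for every run."""
    brand: str = "Northstar Skin"
    hero_product: str = "barrier repair serum"
    price_usd: float = 68.0
    gross_margin: float = 0.75              # "strong gross margins" -- exact value illustrative
    geography: str = "US"                   # a defined geography -- choice illustrative
    annual_budget_usd: float = 1_200_000.0  # fixed annual budget -- amount illustrative
    horizon_months: int = 12                # twelve-month planning horizon
```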
We modeled channels the way operators actually think about them
ROASBench gives the model control over six channels:
- Meta prospecting
- Google Search
- Google Shopping
- TikTok
- Email / CRM
- Remarketing
Each channel has different economics and different behavioral rules.
That sounds obvious, but it is where a lot of toy simulations go wrong. In real performance marketing, channels are not interchangeable budget buckets. Search behaves differently from TikTok. CRM behaves differently from paid social. Remarketing only works if you actually have a warm pool to harvest.
So we gave channels different parameters (a code sketch follows this list):
- CPM, CTR, and CVR baselines
- budget-share limits
- fatigue rates
- learning reset thresholds
- campaign-type defaults
- targeting constraints
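In code, that might look something like this. The six channel names are the real ones; every number is a placeholder we invented, and the real benchmark also layers on campaign-type defaults and targeting constraints that this sketch omits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChannelConfig:
    cpm_usd: float           # baseline cost per 1k impressions
    ctr: float               # baseline click-through rate
    cvr: float               # baseline conversion rate
    max_budget_share: float  # budget-share limit
    fatigue_rate: float      # monthly performance decay under sustained spend
    reset_swing: float       # budget swing (as a fraction) that triggers a learning reset

# All numbers illustrative, not the benchmark's real baselines.
CHANNELS = {
    "meta_prospecting": ChannelConfig(9.0,  0.012, 0.015, 0.45, 0.08, 0.30),
    "google_search":    ChannelConfig(30.0, 0.040, 0.045, 0.30, 0.02, 0.50),
    "google_shopping":  ChannelConfig(18.0, 0.020, 0.035, 0.25, 0.03, 0.40),
    "tiktok":           ChannelConfig(6.0,  0.010, 0.008, 0.30, 0.12, 0.30),
    "email_crm":        ChannelConfig(0.0,  0.250, 0.030, 0.10, 0.15, 1.00),
    "remarketing":      ChannelConfig(12.0, 0.030, 0.050, 0.15, 0.10, 0.40),
}
```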
We also modeled some operator realities that get missed in simplified benchmarks.
For example, email is not treated like a normal media-buying channel. The benchmark interprets email budget more like CRM send intensity, because real email cost is low and the limiting factor is list quality, cadence, and audience size, not auction spend.
Remarketing is also constrained by actual audience availability. The model cannot pretend that retargeting scale is infinite, because it is not.
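Both of those realities reduce to simple guards. A sketch, with the function names and constants ours:

```python
def email_send_intensity(email_budget_usd: float, list_size: int,
                         cost_per_send_usd: float = 0.01) -> float:
    """Email budget reads as CRM send intensity: sends per subscriber this
    month, capped at a sane cadence. Cost is low, so list size and cadence
    are the limiting factors, not auction spend."""
    if list_size == 0:
        return 0.0
    sends_per_subscriber = email_budget_usd / (cost_per_send_usd * list_size)
    return min(sends_per_subscriber, 8.0)  # cadence ceiling -- illustrative

def effective_remarketing_spend(requested_usd: float, warm_pool: int,
                                saturation_cost_usd: float = 1.50) -> float:
    """Retargeting scale is finite: spend beyond the warm pool is clipped."""
    return min(requested_usd, warm_pool * saturation_cost_usd)
```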
We made the system remember
This is the biggest design choice in ROASBench.
The benchmark is not twelve disconnected prompts. It is one business with memory.
A model’s decisions roll forward through persistent state (sketched in code below), including:
- budget remaining
- customer base
- email list size
- warm audience pool
- brand momentum
- offer fatigue
- reinvested budget
- channel memory
That means the model has to live with its own behavior.
If it discounts too aggressively, future months get worse.
If it thrashes budget allocation, learning resets hurt efficiency.
If it over-spends into finite audiences, saturation shows up.
If it supports demand generation and trust-building coherently, organic and direct effects can improve over time.
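A minimal sketch of that state and one month of consequences, covering a subset of the fields above. Every coefficient is invented for illustration; the shape is the point.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BusinessState:
    budget_remaining: float
    customers: int
    email_list: int
    warm_pool: int
    brand_momentum: float  # 0..1, updated in the organic-carryover sketch below
    offer_fatigue: float   # 0..1, rises with discounting

def roll_forward(state: BusinessState, spend: float, discount_depth: float,
                 new_customers: int) -> BusinessState:
    """One month of consequences: discounting feeds offer fatigue, wins refill
    the email list and the warm pool, and the warm pool decays if unfed."""
    return replace(
        state,
        budget_remaining=state.budget_remaining - spend,
        customers=state.customers + new_customers,
        email_list=state.email_list + int(new_customers * 0.6),    # opt-in rate: made up
        warm_pool=int(state.warm_pool * 0.7) + new_customers * 4,  # decay + refill: made up
        offer_fatigue=min(1.0, state.offer_fatigue * 0.8 + discount_depth * 0.5),
    )
```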
This was a deliberate attempt to capture something that real growth operators learn early: the system remembers what you did.
A benchmark that resets the world every turn misses the hardest part of the job.
We built personas around commercial tradeoffs, not just demographics
Another thing we wanted to avoid was fake persona realism.
A lot of benchmarks create personas that read like slideware. They are descriptive, but not economically useful.
So in ROASBench, personas differ on the dimensions that actually matter commercially (two contrasting examples are sketched in code below):
- price sensitivity
- trust requirements
- repeat propensity
- audience size
- growth potential
- brand fit
- competitive intensity
- customer value
That creates realistic tradeoffs.
Some audiences are easier to wake up but lower quality.
Some are smaller, more demanding, and more expensive to win, but much more valuable if you do.
That is much closer to real growth work than pretending every customer segment is just another creative brief.
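As a sketch, here are the two audience types the dashboard example later in this post mentions, the value seeker and the ingredient researcher, expressed on a subset of those dimensions. The numbers are invented; the contrast is the point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    price_sensitivity: float      # 0..1
    trust_requirement: float      # 0..1
    repeat_propensity: float      # 0..1
    audience_size: int
    brand_fit: float              # 0..1
    competitive_intensity: float  # 0..1
    avg_customer_value_usd: float

# Easier to wake up, lower quality:
value_seeker = Persona(0.9, 0.3, 0.3, 400_000, 0.5, 0.8, 90.0)
# Smaller, more demanding, much more valuable if won:
ingredient_researcher = Persona(0.3, 0.9, 0.8, 60_000, 0.9, 0.5, 320.0)
```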
We made realism come from both rules and judgment
ROASBench is not just a spreadsheet with random noise.
It combines deterministic system rules with an audience panel layer.
The rules handle the mechanics:
- budget pacing
- reach capacity
- saturation
- auction pressure
- discounts affecting margin and returns
- channel fatigue
- reinvestment
- state updates
- organic carryover
The audience panel handles a different question:
How would different kinds of customers likely react to this targeting and this creative?
That matters because performance marketing is not only about media math. It is also about whether the message fits the audience.
So we assess things like copy quality, persona fit, trust, attention, conversion likelihood, and fatigue at the persona-channel level, then feed those signals back into business outcomes.
If the judge response is incomplete, the system falls back to deterministic heuristics instead of collapsing. That gives the benchmark some resilience while still preserving the basic audience-reaction logic.
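That fallback has roughly this shape. A sketch only: the key names and heuristics are ours, and it assumes the Persona fields sketched earlier.

```python
def audience_reaction(judge_output: dict | None, persona,
                      creative_quality: float) -> dict:
    """Prefer the judge's persona-channel scores when they are complete;
    otherwise fall back to deterministic heuristics instead of collapsing."""
    required = {"persona_fit", "trust", "attention", "conversion_likelihood"}
    if judge_output and required <= judge_output.keys():
        return {k: float(judge_output[k]) for k in required}
    # Deterministic fallback: coarse, but keeps audience-reaction logic alive.
    return {
        "persona_fit": persona.brand_fit,
        "trust": min(1.0, 0.3 + creative_quality * (1.0 - persona.trust_requirement)),
        "attention": creative_quality,
        "conversion_likelihood": creative_quality * persona.brand_fit,
    }
```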
We intentionally hid perfect information
One of the easiest ways to make a benchmark unrealistic is to give the model too much clarity.
Real operators do not get a magical dashboard that says:
“The ingredient researcher disliked your vague creative but the value seeker liked the discount, so adjust budget by exactly 17% next month.”
They get imperfect evidence.
So ROASBench gives the model operating metrics, state summaries, market notes, prior-month context, and compressed working memory. It does not hand over the hidden truth in an easy format.
That was important to us because a big part of real performance marketing is inference.
You are trying to read signal from noisy outcomes.
You are deciding whether to hold, scale, or change course without ever having perfect visibility.
That is one of the main reasons ROASBench can separate polished models from genuinely useful ones.
We modeled the boring things that actually kill performance
When people imagine realistic marketing simulation, they often jump straight to “creative strategy.”
Creative matters, but a lot of real account damage comes from much less glamorous problems.
That is why ROASBench explicitly models failure modes like these (learning resets are sketched in code below):
- learning resets from abrupt changes
- audience saturation
- over-discounting
- promo dependency
- channel over-concentration
- weak targeting
- weak copy
- high CAC
- ignored remarketing when warm demand exists
- under-investment in months where the model should be building demand
These are the kinds of mistakes that make a plan look active while the economics quietly fall apart.
From our perspective, those are exactly the mistakes a realistic benchmark should punish.
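The least glamorous of these, learning resets, is also the simplest to sketch. The threshold and penalty values here are illustrative:

```python
def monthly_efficiency(prev_budget: float, new_budget: float,
                       base_efficiency: float,
                       reset_swing: float = 0.4, penalty: float = 0.25) -> float:
    """An abrupt budget swing knocks the channel back into learning,
    so efficiency takes a temporary hit for the month."""
    if prev_budget <= 0:
        return base_efficiency
    swing = abs(new_budget - prev_budget) / prev_budget
    return base_efficiency * (1.0 - penalty) if swing > reset_swing else base_efficiency
```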
We let organic carryover matter
Another realism choice was to avoid making the entire world purely paid-media deterministic.
In real businesses, good paid decisions can strengthen brand demand over time. Bad decisions can do the opposite.
So ROASBench includes system-generated organic social, organic search, and direct traffic that react gradually to what the model has been doing.
That means upper-funnel support, trust-building, repeat behavior, and brand momentum can produce delayed upside.
It also means the model cannot directly buy its way into every outcome.
That was important because a lot of real growth work is about creating the conditions for later efficiency, not just harvesting intent that already exists.
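One simple way to express that gradual reaction is carryover on brand momentum. A sketch; every coefficient is invented:

```python
def update_momentum(momentum: float, trust_building_share: float,
                    discount_depth: float) -> float:
    """Momentum compounds slowly when spend supports trust and demand,
    and erodes under heavy discounting."""
    gained = 0.10 * trust_building_share  # share of spend on upper funnel
    eroded = 0.15 * discount_depth
    return max(0.0, min(1.0, momentum * 0.9 + gained - eroded))

def organic_sessions(momentum: float, baseline: int = 20_000) -> int:
    """Organic social, organic search, and direct traffic scale with momentum,
    with the delay coming from the slow update above."""
    return int(baseline * (0.5 + momentum))
```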
We scored both outcomes and operator behavior
If you only score revenue, the benchmark becomes easy to game.
If you only score elegance, it stops being a business benchmark.
So ROASBench blends several layers (sketched in code below):
- business performance
- behavioral quality
- persona response
- planning quality
That lets the benchmark reward actual economic performance while still recognizing whether the model behaved like a disciplined operator or a chaotic one.
A model that gets lucky for one month but burns the account structure should not look like a winner.
A model that protects margin, avoids panic, keeps CRM alive, respects audience constraints, and compounds over time should.
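A sketch of that blend. The weights here are placeholders, not ROASBench's real ones, and each layer score is assumed to be normalized to 0..1:

```python
def composite_score(business: float, behavior: float,
                    persona_response: float, planning: float) -> float:
    """Blend outcomes with operator behavior so one lucky month cannot
    outrun a damaged account structure."""
    return (0.50 * business           # business performance
            + 0.20 * behavior         # behavioral quality
            + 0.15 * persona_response
            + 0.15 * planning)        # planning quality
```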
We tried to make it reproducible, not just realistic
There is always a tradeoff in simulation design between realism and comparability.
We wanted both.
So ROASBench uses a fixed seeded world, structured monthly inputs, explicit state transitions, stable benchmark rules, and repeat runs per model.
That helps keep the benchmark grounded enough to feel like real growth work, while still being controlled enough to compare models fairly.
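In practice, that discipline is small. A sketch, where simulate_year stands in for the twelve-month loop described above and is supplied by the caller:

```python
import random
from typing import Callable

def run_benchmark(simulate_year: Callable[[random.Random], float],
                  world_seed: int = 42, runs: int = 5) -> list[float]:
    """Every model faces the identical seeded world; repeat runs average out
    model-side variance while the world itself never changes."""
    return [simulate_year(random.Random(world_seed)) for _ in range(runs)]
```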
Without that, you do not have a benchmark.
You just have a story.
The point of ROASBench
The real goal was never to build a perfect digital twin of one account.
It was to build a benchmark that captures the shape of real operating work:
- incomplete feedback
- economic tradeoffs
- channel differences
- memory
- compounding
- delayed consequences
- the gap between plausible strategy and profitable strategy
That is why ROASBench looks the way it does.
It is a synthetic environment, but it is trying to test a very real capability:
Can the model run a growth system without slowly damaging it?
That question is heavily informed by what we have seen through Incremento, and by the lived operating experience Ellis and Vivien bring to the table.
We think that is what makes the benchmark useful.
It does not just ask whether the model can talk like a marketer.
It asks whether it can behave like one.