One of the most useful things about ROAS Bench is that it reveals a simple truth:
Growth is not a one-shot intelligence test.
It is a compounding system.
That sounds obvious if you have ever run paid acquisition seriously, but it gets lost in AI discourse all the time. A model gives a good answer once, so people assume it can do the job. ROAS Bench is valuable because it tests the part that actually matters: whether the model can keep making decent decisions after its earlier decisions have already changed the terrain.
That is the real work.
Growth gets harder because the system remembers
In ROAS Bench, the model is not solving twelve isolated prompts. It is operating the same business over twelve connected months.
Its choices affect:
- budget remaining
- customer base
- email list size
- warm audience pools
- brand momentum
- offer fatigue
- channel memory
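The state the model carries from month to month can be pictured as a small record. This is a hypothetical sketch only; the field names and starting values are assumptions for illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the state a ROAS Bench-style simulator might
# carry between months. Fields and numbers are illustrative only.
@dataclass
class BusinessState:
    month: int = 1
    budget_remaining: float = 120_000.0
    customers: int = 1_500
    email_list: int = 4_000
    warm_audience: int = 9_000
    brand_momentum: float = 1.0    # multiplier on organic demand
    offer_fatigue: float = 0.0     # rises with heavy discounting
    channel_memory: dict = field(default_factory=dict)  # per-channel learning

state = BusinessState()
# Month 1 decisions mutate this state; month 2 is played against the result.
state.budget_remaining -= 10_000
state.customers += 120
state.month += 1
```

The point of the sketch is the last three lines: every month's plan is evaluated against a state that earlier plans already changed.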
That design forces models to deal with something many benchmarks abstract away: history.
Real growth systems remember what you did.
If you overuse discounting, the account remembers. If you blow out remarketing against a small audience pool, the account remembers. If you slash budgets or rebuild structure too aggressively, platform learning remembers. If you use generic creative against trust-sensitive personas, customers remember.
This is why so many AI systems look smarter in slides than they do in production. Production is where the system accumulates memory.
The benchmark’s mechanics are quietly brutal
ROAS Bench includes several constraints that make it especially good at surfacing weak operators:
1. Learning resets
Abrupt reallocations can damage efficiency, so the model is punished for panicky changes that look decisive but destroy continuity.
2. Audience saturation
Warm pools and high-intent auctions are finite. You cannot just keep spending harder forever and expect the same economics.
3. Offer fatigue
Short-term promo behavior can make later months worse. The model has to decide when a conversion boost is worth the downstream cost.
4. Persona tradeoffs
Some audiences are easy but low-value. Others are lucrative but competitive and sensitive to tone, proof, and creative quality.
5. Incomplete feedback
The model does not get perfect hidden labels about why something worked. It has to infer from business outcomes and state summaries.
That combination is exactly why the benchmark feels real. It is not testing whether the model knows marketing vocabulary. It is testing whether the model can navigate delayed consequences.
Why the winning pattern looks boring
The current leader on the page, Claude Opus 4.6, is not described as winning through genius creative theatrics or wild strategic invention.
It wins by staying disciplined:
- keep CRM and remarketing on every month
- avoid discounting
- scale more coherently
- maintain account structure
- write more persona-specific creative
That is the pattern people underestimate in AI.
In complex operating environments, the edge often comes from not breaking what is already working.
The benchmark summary explicitly says Claude is “doing the boring but important things well.” That line is more profound than it sounds. Many models can generate an exciting strategic pivot. Fewer can preserve compounding.
Why GPT-style competence is not enough
The benchmark’s commentary on GPT-5.4 is especially interesting because it captures the modern AI problem perfectly: the model sounds strategically plausible, but the business results do not compound.
That gap matters a lot.
If a model feels credible in a meeting but repeatedly pushes spend into channels that do not create durable payoff, it is not an operator. It is a persuasive simulator of one.
That is not useless. It can still brainstorm, structure options, and generate first drafts. But it is not the same as trustworthy autonomous execution.
ROAS Bench suggests that, today, many models still confuse activity with progress:
- more spend without enough durable return
- broad acquisition without enough downstream payoff
- generic creative without enough persona fit
- reactive changes that trigger resets rather than learning
Those are not cosmetic mistakes. Those are system-killing mistakes.
The benchmark is really about economic judgment
Underneath the marketing wrapper, ROAS Bench is testing a more general capability: economic judgment under delayed feedback.
Can the model:
- protect margin instead of chasing vanity revenue?
- pace budget over time?
- keep retention channels alive while building future demand?
- choose when not to change things?
- trade short-term wins against long-term health?
That is the same shape of problem you see in lots of real businesses. Marketing just happens to make it vivid because the feedback loops are easier to understand.
This is why benchmarks like this are more useful than generic “agent” demos. They expose the difference between local competence and global competence.
Local competence says, “This monthly plan sounds good.”
Global competence says, “This sequence of decisions leaves the business healthier six months from now.”
Those are very different abilities.
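That local-versus-global gap can be made concrete with a toy trajectory simulation. All numbers here are invented; the only claim is the shape of the outcome:

```python
# Toy contrast between scoring one month and scoring the trajectory.
# Parameters are invented for illustration only.

def run(policy, months=6):
    fatigue, total = 0.0, 0.0
    for _ in range(months):
        discount = policy(fatigue)
        revenue = 100 * (1 + discount) * (1 - fatigue)  # promos boost now...
        fatigue = min(1.0, fatigue + discount * 0.4)    # ...and cost later
        total += revenue
    return total

aggressive = run(lambda f: 0.5)   # always discount
disciplined = run(lambda f: 0.0)  # never discount

# Month 1 in isolation: the aggressive plan "sounds good" (150 vs 100).
# Over six months: discipline wins (600 vs 450), because fatigue compounds.
```

A judge scoring only month 1 ranks the aggressive policy higher; a judge scoring the six-month trajectory reverses that ranking. That reversal is the whole argument for benchmarking sequences rather than single plans.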
What founders and operators should do with this
The practical takeaway is not “AI is bad at growth.” The practical takeaway is more specific:
Use models for leverage, but be careful about handing them systems where compounding mistakes are expensive.
Today, LLMs can be genuinely useful for:
- generating hypotheses
- framing tests
- translating operator instincts into structured plans
- drafting persona-specific messaging
- summarizing outcomes and tradeoffs
But ROAS Bench is a reminder that full-loop autonomous growth execution is still a much higher bar than polished planning.
The models that will matter most are not the ones that sound smartest on day one.
They are the ones that can avoid digging a hole by month six.
That is what growth has always rewarded.
AI is finally being tested on the same standard.