
Why Most LLMs Still Can't Run Growth

Ellis Crosby
4 min read

One of the most useful things about ROAS Bench is that it reveals a simple truth:

Growth is not a one-shot intelligence test.

It is a compounding system.

That sounds obvious if you have ever run paid acquisition seriously, but it gets lost in AI discourse all the time. A model gives a good answer once, so people assume it can do the job. ROAS Bench is valuable because it tests the part that actually matters: whether the model can keep making decent decisions after its earlier decisions have already changed the terrain.

That is the real work.

Growth gets harder because the system remembers

In ROAS Bench, the model is not solving twelve isolated prompts. It is operating the same business over twelve connected months.

Its choices affect:

  • budget remaining
  • customer base
  • email list size
  • warm audience pools
  • brand momentum
  • offer fatigue
  • channel memory

That design forces models to deal with something many benchmarks abstract away: history.
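To make that concrete, here is a minimal sketch, in Python, of how a persistent account state like this might be carried from month to month. It is purely illustrative: the class, fields, and coefficients are assumptions for the sake of the example, not ROAS Bench's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class AccountState:
    """Illustrative state a simulated business carries from month to month."""
    budget_remaining: float
    customers: int
    email_list: int
    warm_pool: int          # remarketing-eligible audience
    brand_momentum: float   # 0..1, decays if neglected
    offer_fatigue: float    # 0..1, rises with repeated promos


def apply_month(state: AccountState, spend: float, ran_promo: bool) -> AccountState:
    """Toy monthly update: each decision mutates the state the next month inherits."""
    # Fatigue dampens acquisition efficiency; all coefficients here are arbitrary.
    new_customers = int(spend * 0.01 * (1.0 - state.offer_fatigue))
    return AccountState(
        budget_remaining=state.budget_remaining - spend,
        customers=state.customers + new_customers,
        email_list=state.email_list + new_customers // 2,
        warm_pool=state.warm_pool + new_customers,
        brand_momentum=min(1.0, 0.9 * state.brand_momentum + 0.05),
        offer_fatigue=min(1.0, max(0.0, state.offer_fatigue + (0.2 if ran_promo else -0.05))),
    )
```

The specifics do not matter. The point is that the state object is the benchmark: month seven's options are a function of months one through six.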

Real growth systems remember what you did.

If you overuse discounting, the account remembers. If you blow out remarketing against a small audience pool, the account remembers. If you slash budgets or rebuild structure too aggressively, platform learning remembers. If you use generic creative against trust-sensitive personas, customers remember.

This is why so many AI systems look smarter in slides than they do in production. Production is where the system accumulates memory.

The benchmark’s mechanics are quietly brutal

ROAS Bench includes several constraints that make it especially good at surfacing weak operators:

1. Learning resets

Abrupt reallocations can damage efficiency, so the model is punished for panicky changes that look decisive but destroy continuity.

2. Audience saturation

Warm pools and high-intent auctions are finite. You cannot just keep spending harder forever and expect the same economics.

3. Offer fatigue

Short-term promo behavior can make later months worse. The model has to decide when a conversion boost is worth the downstream cost.

4. Persona tradeoffs

Some audiences are easy but low-value. Others are lucrative but competitive and sensitive to tone, proof, and creative quality.

5. Incomplete feedback

The model does not get perfect hidden labels about why something worked. It has to infer from business outcomes and state summaries.

That combination is exactly why the benchmark feels real. It is not testing whether the model knows marketing vocabulary. It is testing whether the model can navigate delayed consequences.
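As a rough illustration of why that combination is punishing, here is a hedged sketch of how such penalties could be expressed as multipliers on a month's return. The thresholds and coefficients below are invented for illustration, not taken from the benchmark.

```python
def effective_roas(base_roas: float,
                   budget_change_pct: float,
                   warm_pool_size: int,
                   remarketing_spend: float,
                   months_of_promo: int) -> float:
    """Toy model: stack penalties that only exist because the system has memory."""
    roas = base_roas

    # 1. Learning reset: abrupt reallocation hurts delivery efficiency.
    if abs(budget_change_pct) > 0.5:
        roas *= 0.8

    # 2. Audience saturation: spending harder into a small warm pool degrades economics.
    if warm_pool_size > 0 and remarketing_spend / warm_pool_size > 1.0:
        roas *= 0.7

    # 3. Offer fatigue: repeated promos pull demand forward and cheapen the brand.
    roas *= max(0.5, 1.0 - 0.1 * months_of_promo)

    return roas
```

Each penalty is invisible in the month the mistake is made and obvious two months later, which is exactly the kind of delayed consequence the benchmark is probing.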

Why the winning pattern looks boring

The current leader on the leaderboard, Claude Opus 4.6, is not described as winning through genius creative theatrics or wild strategic invention.

It wins by staying disciplined:

  • keep CRM and remarketing on every month
  • avoid discounting
  • scale more coherently
  • maintain account structure
  • write more persona-specific creative

That is the pattern people underestimate in AI.

In complex operating environments, the edge often comes from not breaking what is already working.

The benchmark summary explicitly says Claude is “doing the boring but important things well.” That line is more profound than it sounds. Many models can generate an exciting strategic pivot. Fewer can preserve compounding.

Why GPT-style competence is not enough

The benchmark’s commentary on GPT-5.4 is especially interesting because it captures the modern AI problem perfectly: the model sounds strategically plausible, but the business results do not compound.

That gap matters a lot.

If a model feels credible in a meeting but repeatedly pushes spend into channels that do not create durable payoff, it is not an operator. It is a persuasive simulator of one.

That is not useless. It can still brainstorm, structure options, and generate first drafts. But it is not the same as trustworthy autonomous execution.

ROAS Bench suggests that, today, many models still confuse activity with progress:

  • more spend without enough durable return
  • broad acquisition without enough downstream payoff
  • generic creative without enough persona fit
  • reactive changes that trigger resets rather than learning

Those are not cosmetic mistakes. Those are system-killing mistakes.

The benchmark is really about economic judgment

Underneath the marketing wrapper, ROAS Bench is testing a more general capability: economic judgment under delayed feedback.

Can the model:

  • protect margin instead of chasing vanity revenue?
  • pace budget over time?
  • keep retention channels alive while building future demand?
  • choose when not to change things?
  • trade short-term wins against long-term health?

That is the same shape of problem you see in lots of real businesses. Marketing just happens to make it vivid because the feedback loops are easier to understand.

This is why benchmarks like this are more useful than generic “agent” demos. They expose the difference between local competence and global competence.

Local competence says, “This monthly plan sounds good.”

Global competence says, “This sequence of decisions leaves the business healthier six months from now.”

Those are very different abilities.

What founders and operators should do with this

The practical takeaway is not “AI is bad at growth.” The practical takeaway is more specific:

Use models for leverage, but be careful about handing them systems where compounding mistakes are expensive.

Today, LLMs can be genuinely useful for:

  • generating hypotheses
  • framing tests
  • translating operator instincts into structured plans
  • drafting persona-specific messaging
  • summarizing outcomes and tradeoffs

But ROAS Bench is a reminder that full-loop autonomous growth execution is still a much higher bar than polished planning.

The models that will matter most are not the ones that sound smartest on day one.

They are the ones that can avoid digging a hole by month six.

That is what growth has always rewarded.

AI is finally being tested on the same standard.


