Sonnet 5 gave the sharpest marketing analysis we've ever tested. It also lost money every single time.

Ellis Crosby
Claude Sonnet 5 on ROASBench: highest business reasoning, lowest profit

We put Claude Sonnet 5 through ROASBench the day it dropped. It failed spectacularly, but in the most interesting way we've seen from any model.

Quick context: ROASBench is a 12-month simulation. The model plays the performance marketer for a DTC e-commerce brand. Every month it picks channels, sets budgets, writes the creative, and reacts to what comes back: customers, revenue, refunds, audience saturation. We don't score it on how good the plan sounds. We score it on the money it actually makes.

The paradox

Sonnet 5 posted the highest strategic-reasoning score of any model we've ever tested. Its monthly write-ups are genuinely sharp. It reads the data, spots the patterns, names the problems.

And it lost money on every single run. Thinking off, medium, high: all of them deep in the red. It's the first model we've benchmarked where reasoning quality told us essentially nothing about profit.

Every dot is a model. Sonnet 5 (the coloured dots) sits alone bottom-right: best business reasoning on the board, negative profit.

That's the whole story in one chart. The field trends up and to the right: sharper analysis, more profit. Sonnet 5's three dots sit alone in the bottom-right. Best analysis on the board, worst returns.

The pattern: on, off, on, off

Watching the months play out, the behaviour was wild. It would blow the budget across all six channels one month and lose ~$50k, then go completely dark for three or four months, then do it again. On, off, on, off, for a whole year. Nothing ever compounded.

Cumulative profit over 12 months. All three reasoning tiers slide into the red; high crashes hardest.

It diagnosed the problem perfectly, then did the opposite

Here's the part that blew my mind. In month 9 of one run, it wrote (correctly) that Search and Remarketing were the only channels consistently above breakeven, and that Meta and TikTok were dragging everything negative. Textbook analysis:

"Google Search and Remarketing were the only channels with ROI consistently above 1.0 ... while Meta/TikTok/Shopping dragged blended ROI negative."

Then, in that same month, it funded Meta and TikTok anyway. It identified the winning move and did the exact opposite of it.

Why the stop-start is fatal

Anyone who's run paid knows consistency is the whole game. Every time you go dark and restart, the ad platforms' learning resets. You're paying to re-train the algorithm from scratch. Steady spend on what works compounds. Sonnet 5 never let anything run long enough to work, and kept resetting its own progress.
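To make the compounding argument concrete, here is a toy sketch in Python. It is not ROASBench's actual mechanics: the function name `simulate_channel`, the linear ramp, and every number in it are assumptions chosen purely for illustration. It models a single channel whose return per dollar climbs with each consecutive active month and resets whenever spend goes to zero, then compares steady spend against bursts with the same total budget.

```python
def simulate_channel(spend_plan, base_roi=0.6, ramp_per_month=0.15, max_roi=1.5):
    """Return cumulative profit for a list of monthly spends on one channel.

    Assumed toy model (not ROASBench's real dynamics): revenue per dollar
    starts at base_roi, rises by ramp_per_month for every consecutive active
    month, is capped at max_roi, and resets after any month with zero spend.
    """
    profit = 0.0
    streak = 0  # consecutive months with non-zero spend so far
    for spend in spend_plan:
        if spend == 0:
            streak = 0  # going dark discards the accumulated learning
            continue
        roi = min(max_roi, base_roi + ramp_per_month * streak)
        profit += spend * (roi - 1.0)  # revenue minus the spend itself
        streak += 1
    return profit


# Same hypothetical 120k annual budget, spent two different ways.
steady_plan = [10_000] * 12
stop_start_plan = [40_000, 0, 0, 0, 40_000, 0, 0, 0, 40_000, 0, 0, 0]

steady_profit = simulate_channel(steady_plan)
stop_start_profit = simulate_channel(stop_start_plan)

print(f"steady:     {steady_profit:,.0f}")
print(f"stop-start: {stop_start_profit:,.0f}")
```

Under these assumed parameters the steady plan ends in profit, while the stop-start plan only ever spends at the cold-start rate and ends in the red. The exact figures mean nothing; the point is the shape: any model in which efficiency builds with continuity punishes a budget that keeps switching off.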

More reasoning made it worse

This is the bit I can't stop thinking about. More thinking didn't help; it hurt. Thinking off lost $127k. Thinking high lost $333k, roughly 2.6 times as much. The extra reasoning didn't find a better strategy; it produced more elaborate, more confident course-corrections, and talked the model out of the boring, correct answer.

Average profit by reasoning tier. More thinking, deeper losses.

The actual lesson

Older, less flashy models did fine here. Opus 4.6 just committed to a couple of channels and stayed comfortably profitable. So the takeaway isn't "Sonnet 5 is bad". The model is excellent at plenty of things, and this is one narrow simulation, not a verdict.

The takeaway is this: Sonnet 5 sounds completely correct the entire way down. It's confident, well-structured, and cites the right numbers, while quietly making rookie mistakes underneath. Fluent isn't the same as right. It's a good reminder to verify what a model actually did, not how sharp it sounds doing it.
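One way to check what a model did rather than what it said is to compare its stated diagnosis against its actual allocation. The sketch below is a minimal Python illustration, not part of ROASBench: the function name `funded_below_breakeven` is hypothetical, and the ROI and budget figures are placeholders rather than real run data. It flags any channel that received budget even though its ROI sat below breakeven, using the same "ROI above 1.0" convention as the month-9 quote.

```python
def funded_below_breakeven(roi_by_channel, budget_by_channel, breakeven=1.0):
    """List channels that got budget despite a known ROI below breakeven.

    Channels with no ROI figure are skipped rather than guessed at.
    """
    return sorted(
        channel
        for channel, budget in budget_by_channel.items()
        if budget > 0
        and channel in roi_by_channel
        and roi_by_channel[channel] < breakeven
    )


# Placeholder numbers for illustration only, not ROASBench output.
reported_roi = {
    "Search": 1.3,
    "Remarketing": 1.2,
    "Meta": 0.7,
    "TikTok": 0.6,
    "Shopping": 0.8,
}
next_month_budget = {
    "Search": 5_000,
    "Remarketing": 3_000,
    "Meta": 20_000,
    "TikTok": 15_000,
    "Shopping": 0,
}

contradictions = funded_below_breakeven(reported_roi, next_month_budget)
print(contradictions)  # ['Meta', 'TikTok']
```

A check like this says nothing about whether the write-up was insightful. It only asks whether the action matched the analysis, which is exactly the gap described above.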

See it for yourself

ROASBench is one of the evals we run at Spring Prompt. The whole idea is to score models on real outcomes instead of how good their answers sound, because as this run shows, those two things can point in completely opposite directions.

Dig into the full ROASBench leaderboard (all 29 models, every reasoning tier): springprompt.com/evals/roas-bench

Want outcome-based evals for your own use case? We're building exactly that. Join the waitlist and we'll be in touch.
