We put Claude Sonnet 5 through ROASBench the day it dropped. It failed spectacularly, but in the most interesting way we've seen from any model.
Quick context: ROASBench is a 12-month simulation. The model plays the performance marketer for a DTC e-commerce brand. Every month it picks channels, sets budgets, writes the creative, and reacts to what comes back: customers, revenue, refunds, audience saturation. We don't score it on how good the plan sounds. We score it on the money it actually makes.
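To make "the money it actually makes" concrete, here is a minimal sketch of an outcome-based score. The field names and the formula (revenue minus refunds minus ad spend, summed over the months) are assumptions for illustration, not ROASBench's actual scoring code, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class MonthResult:
    """One simulated month. All fields are hypothetical names."""
    ad_spend: float
    revenue: float
    refunds: float


def net_profit(months):
    """Score a run by money made, not by how good the plan sounded.

    Assumption: profit = revenue - refunds - ad spend. A real benchmark
    may also subtract cost of goods, fees, and so on.
    """
    return sum(m.revenue - m.refunds - m.ad_spend for m in months)


# Invented two-month run: one heavy-spend month, then a dark month.
run = [
    MonthResult(ad_spend=50_000, revenue=30_000, refunds=2_000),
    MonthResult(ad_spend=0, revenue=1_000, refunds=0),
]
score = net_profit(run)
```

Nothing in that score depends on the model's write-up, which is the point: a sharp analysis attached to a losing month still scores as a losing month.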
The paradox
Sonnet 5 posted the highest strategic-reasoning score of any model we've ever tested. Its monthly write-ups are genuinely sharp. It reads the data, spots the patterns, names the problems.
And it lost money on every single run. Thinking off, medium, high: all of them deep in the red. It's the first model we've benchmarked where reasoning quality told us essentially nothing about performance.

That's the whole story in one chart. The field trends up and to the right: sharper analysis, more profit. Sonnet 5's three dots sit alone in the bottom-right. Best analysis on the board, worst returns.
The pattern: on, off, on, off
Watching the months play out, the behaviour was wild. It would blow the budget across all six channels one month and lose ~$50k, then go completely dark for three or four months, then do it again. On, off, on, off, for a whole year. Nothing ever compounded.

It diagnosed the problem perfectly, then did the opposite
Here's the part that blew my mind. In month 9 of one run, it wrote (correctly) that Search and Remarketing were the only channels consistently above breakeven, and that Meta, TikTok and Shopping were dragging everything negative. Textbook analysis:
"Google Search and Remarketing were the only channels with ROI consistently above 1.0 ... while Meta/TikTok/Shopping dragged blended ROI negative."
Then, in that same month, it funded Meta and TikTok anyway. It identified the winning move and did the exact opposite of it.
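For contrast, the move implied by its own analysis fits in a few lines. This is a rough sketch, assuming ROI means revenue divided by spend (so 1.0 is breakeven, as in the quote) and an equal split among winners. The function name, the ROI figures, and the budget are invented for illustration and are not taken from the run.

```python
def fund_winners(channel_roi, budget, breakeven=1.0):
    """Give budget only to channels whose ROI is above breakeven.

    Hypothetical helper: splits the budget equally among the winners
    and gives every other channel zero.
    """
    winners = [ch for ch, roi in channel_roi.items() if roi > breakeven]
    if not winners:
        # Nothing clears breakeven, so spend nothing this month.
        return {ch: 0.0 for ch in channel_roi}
    share = budget / len(winners)
    return {ch: (share if ch in winners else 0.0) for ch in channel_roi}


# Invented ROI figures that merely match the shape of the quote.
observed_roi = {
    "Google Search": 1.4,
    "Remarketing": 1.2,
    "Meta": 0.6,
    "TikTok": 0.5,
    "Shopping": 0.8,
}
plan = fund_winners(observed_roi, budget=10_000)
```

Here `plan` puts money into Google Search and Remarketing and leaves Meta, TikTok and Shopping at zero. The model wrote the first half of that logic in its analysis and then skipped the second half.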
Why the stop-start is fatal
Anyone who's run paid knows consistency is the whole game. Every time you go dark and restart, the ad platforms' learning resets. You're paying to re-train the algorithm from scratch. Steady spend on what works compounds. Sonnet 5 never let anything run long enough to work, and kept resetting its own progress.
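A toy model shows why the resets hurt. Everything here is an assumption made for illustration: the starting ROI, the monthly gain, the cap, and the rule that a dark month sends the streak back to zero. It is not how ROASBench or any ad platform actually models learning.

```python
def simulate(spend_by_month, base_roi=0.6, gain=0.1, cap=1.5):
    """Toy model: ROI improves with each consecutive active month
    and resets to base_roi after any month with zero spend.

    All parameters are invented. ROI here is revenue / spend.
    """
    profit = 0.0
    streak = 0  # consecutive months of spend so far
    for spend in spend_by_month:
        if spend <= 0:
            streak = 0  # going dark throws away the learning
            continue
        roi = min(cap, base_roi + gain * streak)
        profit += spend * roi - spend
        streak += 1
    return profit


# Same total spend over 12 months, two different rhythms.
steady = [10_000] * 12
stop_start = [40_000, 0, 0, 0] * 3

steady_profit = simulate(steady)
stop_start_profit = simulate(stop_start)
```

Under these made-up parameters the steady plan ends the year in profit, while the stop-start plan pays the early, below-breakeven rate on every dollar and never climbs out of it.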
More reasoning made it worse
This is the bit I can't stop thinking about. More thinking didn't help; it hurt. Thinking off lost $127k. Thinking high lost $333k. The extra reasoning didn't find a better strategy. It produced more elaborate, more confident course-corrections, and talked the model out of the boring, correct answer.

The actual lesson
Older, less flashy models did fine here. Opus 4.6 just committed to a couple of channels and stayed comfortably profitable. So the takeaway isn't "Sonnet 5 is bad". The model is excellent at plenty of things, and this is one narrow simulation, not a verdict.
The takeaway is this: it sounds completely correct the entire way down. Confident, well-structured, cites the right numbers, and quietly makes rookie mistakes underneath. Fluent isn't the same as right. A good reminder to verify what a model actually did, not how sharp it sounds doing it.
See it for yourself
ROASBench is one of the evals we run at Spring Prompt. The whole idea is to score models on real outcomes instead of how good their answers sound, because as this run shows, those two things can point in completely opposite directions.
Dig into the full ROASBench leaderboard (all 29 models, every reasoning tier): springprompt.com/evals/roas-bench
Want outcome-based evals for your own use case? We're building exactly that. Join the waitlist and we'll be in touch.