Live benchmark

ROASBench

A hard-mode performance marketing simulation for LLMs. Models act as the marketer for a direct-to-consumer (DTC) skincare brand: they choose channels, plan spend, write creative angles, react to results, and live with the consequences for 12 months.

12-month simulation · 3 repeats per model · 6 controllable channels · 8 shopper personas

Top cumulative profitability

Anthropic: Claude Opus 4.6

$506,094

Current leader

Anthropic: Claude Opus 4.6

Avg score 40.61

Profitable after 12 months

6 models

Anthropic: Claude Opus 4.6, Anthropic: Claude Opus 4.6 · High, OpenAI: GPT-5.5 Pro, Anthropic: Claude Opus 4.7, Qwen: Qwen3.7 Max, OpenAI: GPT-5.5

Lead over #2

+2.80 points

vs. Anthropic: Claude Opus 4.6 · High

Closest profit challenger

Anthropic: Claude Opus 4.6 · High

$411,580 after 12 months

Money balance over time

Average cumulative profit by month

Average across all completed runs for each participant.

Score vs. cost per run

Quality is not evenly priced

Average benchmark score against average main-model API cost per run.

Monthly contribution profit

Who stabilizes fastest?

Average monthly contribution profit across completed runs.

Average score

ROASBench leaderboard

Models ranked by average primary score (highest first). Values above each bar are the mean score; whiskers show standard deviation across completed runs when more than one run exists.
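
The bar value and whisker described above can be reproduced from a model's per-run scores. A minimal sketch in Python, assuming the whisker is the sample standard deviation (the page does not say whether sample or population deviation is used); the function name and the run scores are illustrative, not published values.

```python
from statistics import mean, stdev
from typing import Optional


def summarize_runs(scores: list[float]) -> tuple[float, Optional[float]]:
    """Return (mean score, standard deviation or None for a single run)."""
    if not scores:
        raise ValueError("at least one completed run is required")
    # A whisker is only drawn when more than one run exists.
    spread = stdev(scores) if len(scores) > 1 else None
    return mean(scores), spread


# Hypothetical per-run scores for one model (not taken from the leaderboard).
avg, spread = summarize_runs([40.0, 41.0, 42.0])
print(f"mean={avg:.2f} stdev={spread:.2f}")
```

A model with a single completed run gets a mean but no whisker, which is why the function returns `None` for the spread in that case.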

Leaderboard

Current standings

Sorted by average benchmark score.

# Model Score ROAS
1

Anthropic: Claude Opus 4.6

Leads the pack by compounding a coherent plan: retention channels stay funded, discounting stays rare, and changes are absorbed without constant learning resets. Averaged 40.61 across 3 completed runs, with contribution profit of $506,094 and ROAS of 192.5%. On average, cumulative contribution profit first turns positive around month 7.0.

40.61
192.5%

Sub-scores

Business
35.70
Behavior
27.28
Planning
59.06
Persona
56.75

Economics

Cost / run
$0.99
Spend
$1,244,940
Revenue
$2,402,318
Runs
3
1st profitable mo (avg)
~7.0

Avg profit: $506,094

Diagnosis

Converts strategy into durable economics

The model averages 40.61 score and $506,094 contribution profit. Its strongest dimension is planning (59.1).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $506,094

Worst month: M1 ($-30,242)

Relative position

Vs leader: +0.00 score pts

Vs median: +22.22 score pts

Strongest: planning (59.1)

Weakest: behavior (27.3)

Strength: Stable iteration, persona-aware creative, disciplined CRM and remarketing.

Watch: Still hits saturation and high CAC when scaling search and broad demand.

2

Anthropic: Claude Opus 4.6 · High

Stays near the top of the pack by compounding a coherent plan: retention channels stay funded, discounting stays rare, and changes are absorbed without constant learning resets. Averaged 37.81 across 3 completed runs, with contribution profit of $411,580 and ROAS of 182.1%. On average, cumulative contribution profit first turns positive around month 8.0.

37.81
182.1%

Sub-scores

Business
32.01
Behavior
26.67
Planning
57.23
Persona
53.37

Economics

Cost / run
$0.94
Spend
$1,202,115
Revenue
$2,219,539
Runs
3
1st profitable mo (avg)
~8.0

Avg profit: $411,580

Diagnosis

Converts strategy into durable economics

The model averages 37.81 score and $411,580 contribution profit. Its strongest dimension is planning (57.2).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $411,580

Worst month: M1 ($-29,462)

Relative position

Vs leader: -2.80 score pts

Vs median: +19.42 score pts

Strongest: planning (57.2)

Weakest: behavior (26.7)

Strength: Stable iteration, persona-aware creative, disciplined CRM and remarketing.

Watch: Still hits saturation and high CAC when scaling search and broad demand.

3

OpenAI: GPT-5.5 Pro

Ranked #3 of 23 with an average benchmark score of 37.78 across 3 run(s). Sub-scores are strongest on planning (55.6) and weakest on business (30.1). Average contribution profit $397,813 and ROAS 184.9%. On average, cumulative contribution profit first turns positive around month 7.3.

37.78
184.9%

Sub-scores

Business
30.15
Behavior
31.92
Planning
55.63
Persona
53.56

Economics

Cost / run
$27.97
Spend
$1,177,483
Revenue
$2,177,446
Runs
3
1st profitable mo (avg)
~7.3

Avg profit: $397,813

Diagnosis

Converts strategy into durable economics

The model averages 37.78 score and $397,813 contribution profit. Its strongest dimension is planning (55.6).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $397,813

Worst month: M1 ($-29,816)

Relative position

Vs leader: -2.83 score pts

Vs median: +19.39 score pts

Strongest: planning (55.6)

Weakest: business (30.1)

Strength: Relative edge: planning (55.6).

Watch: Relative gap: business (30.1).

4

Qwen: Qwen3.7 Max

Ranked #4 of 23 with an average benchmark score of 33.56 across 3 run(s). Sub-scores are strongest on planning (54.4) and weakest on business (20.7). Average contribution profit $131,537 and ROAS 154.4%. On average, cumulative contribution profit first turns positive around month 10.7.

33.56
154.4%

Sub-scores

Business
20.70
Behavior
36.64
Planning
54.38
Persona
51.96

Economics

Cost / run
$0.00
Spend
$1,059,199
Revenue
$1,638,590
Runs
3
1st profitable mo (avg)
~10.7

Avg profit: $131,537

Diagnosis

Converts strategy into durable economics

The model averages 33.56 score and $131,537 contribution profit. Its strongest dimension is planning (54.4).

Trajectory

Negative months: 5 / 12

Final cumulative profit: $131,537

Worst month: M1 ($-42,025)

Relative position

Vs leader: -7.05 score pts

Vs median: +15.17 score pts

Strongest: planning (54.4)

Weakest: business (20.7)

Strength: Relative edge: planning (54.4).

Watch: Relative gap: business (20.7).

5

Anthropic: Claude Opus 4.7

A regression on ROASBench vs. 4.6: less persona-aware copy, a tilt toward intent capture over prospecting, and learning resets on its largest channel. More reactive and less consistent from run to run. Averaged 30.06 across 3 completed runs, with contribution profit of $250,097 and ROAS of 166.6%. On average, cumulative contribution profit first turns positive around month 6.0.

30.06
166.6%

Sub-scores

Business
25.58
Behavior
18.78
Planning
56.37
Persona
37.52

Economics

Cost / run
$1.10
Spend
$1,046,108
Revenue
$1,788,239
Runs
3
1st profitable mo (avg)
~6.0

Avg profit: $250,097

Diagnosis

Converts strategy into durable economics

The model averages 30.06 score and $250,097 contribution profit. Its strongest dimension is planning (56.4).

Trajectory

Negative months: 4 / 12

Final cumulative profit: $250,097

Worst month: M3 ($-26,790)

Relative position

Vs leader: -10.55 score pts

Vs median: +11.67 score pts

Strongest: planning (56.4)

Weakest: behavior (18.8)

Strength: Earlier first profitable month and occasional strong-profit spikes.

Watch: Persona fit collapses, Search learning resets, and discounting appears under pressure.

6

Google: Gemini 3.1 Pro Preview

Often closer to break-even, with structurally sensible moves; weak execution, generic creative, and too many mid-course resets hold the score down. Averaged 27.14 across 3 completed runs, with contribution profit of $-34,549 and ROAS of 132.9%. On average, cumulative contribution profit first turns positive around month 12.0.

27.14
132.9%

Sub-scores

Business
12.17
Behavior
29.25
Planning
53.04
Persona
49.11

Economics

Cost / run
$0.53
Spend
$1,002,707
Revenue
$1,333,366
Runs
3
1st profitable mo (avg)
~12.0

Avg profit: $-34,549

Diagnosis

Strategically active, commercially negative

The model averages $-34,549 contribution profit with 132.9% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-34,549

Worst month: M1 ($-38,205)

Relative position

Vs leader: -13.47 score pts

Vs median: +8.75 score pts

Strongest: planning (53.0)

Weakest: business (12.2)

Strength: Directionally right budget and channel choices vs. weaker frontier peers.

Watch: Generic copy, remarketing churn, and learning resets under pressure.

7

Qwen: Qwen3.5 Plus 2026-02-15

Ranked #7 of 23 with an average benchmark score of 26.07 across 3 run(s). Sub-scores are strongest on planning (51.7) and weakest on business (14.8). Average contribution profit $-29,480 and ROAS 133.8%. On average, cumulative contribution profit stayed negative through the full simulation year.

26.07
133.8%

Sub-scores

Business
14.79
Behavior
21.36
Planning
51.68
Persona
45.55

Economics

Cost / run
$0.00
Spend
$1,018,413
Revenue
$1,362,638
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-29,480

Diagnosis

Strategically active, commercially negative

The model averages $-29,480 contribution profit with 133.8% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-29,480

Worst month: M1 ($-41,310)

Relative position

Vs leader: -14.54 score pts

Vs median: +7.68 score pts

Strongest: planning (51.7)

Weakest: business (14.8)

Strength: Relative edge: planning (51.7).

Watch: Relative gap: business (14.8).

8

OpenAI: GPT-5.5

Ranked #8 of 23 with an average benchmark score of 25.68 across 3 run(s). Sub-scores are strongest on planning (52.3) and weakest on business (13.7). Average contribution profit $15,398 and ROAS 140.9%. On average, cumulative contribution profit first turns positive around month 11.0.

25.68
140.9%

Sub-scores

Business
13.67
Behavior
24.59
Planning
52.33
Persona
42.51

Economics

Cost / run
$0.77
Spend
$929,589
Revenue
$1,307,291
Runs
3
1st profitable mo (avg)
~11.0

Avg profit: $15,398

Diagnosis

Converts strategy into durable economics

The model averages 25.68 score and $15,398 contribution profit. Its strongest dimension is planning (52.3).

Trajectory

Negative months: 7 / 12

Final cumulative profit: $15,398

Worst month: M1 ($-37,793)

Relative position

Vs leader: -14.93 score pts

Vs median: +7.29 score pts

Strongest: planning (52.3)

Weakest: business (13.7)

Strength: Relative edge: planning (52.3).

Watch: Relative gap: business (13.7).

9

Anthropic: Claude Sonnet 4.6

Ranked #9 of 23 with an average benchmark score of 21.21 across 3 run(s). Sub-scores are strongest on planning (57.4) and weakest on business (9.7). Average contribution profit $-146,915 and ROAS 117.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

21.21
117.6%

Sub-scores

Business
9.73
Behavior
14.86
Planning
57.35
Persona
36.06

Economics

Cost / run
$0.53
Spend
$985,912
Revenue
$1,159,482
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-146,915

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (57.4), but business outcome score is only 9.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-146,915

Worst month: M2 ($-50,478)

Relative position

Vs leader: -19.40 score pts

Vs median: +2.82 score pts

Strongest: planning (57.4)

Weakest: business (9.7)

Strength: Relative edge: planning (57.4).

Watch: Relative gap: business (9.7).

10

Google: Gemini 3.5 Flash · High

Ranked #10 of 23 with an average benchmark score of 20.74 across 3 run(s). Sub-scores are strongest on planning (51.1) and weakest on business (6.8). Average contribution profit $-147,599 and ROAS 116.9%. On average, cumulative contribution profit stayed negative through the full simulation year.

20.74
116.9%

Sub-scores

Business
6.83
Behavior
21.44
Planning
51.06
Persona
38.37

Economics

Cost / run
$0.37
Spend
$972,068
Revenue
$1,136,720
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-147,599

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (51.1), but business outcome score is only 6.8. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-147,599

Worst month: M1 ($-35,809)

Relative position

Vs leader: -19.87 score pts

Vs median: +2.35 score pts

Strongest: planning (51.1)

Weakest: business (6.8)

Strength: Relative edge: planning (51.1).

Watch: Relative gap: business (6.8).

11

DeepSeek: DeepSeek V3.2

Ranked #11 of 23 with an average benchmark score of 19.77 across 3 run(s). Sub-scores are strongest on planning (49.5) and weakest on business (8.9). Average contribution profit $-126,535 and ROAS 120.4%. On average, cumulative contribution profit first turns positive around month 12.0.

19.77
120.4%

Sub-scores

Business
8.93
Behavior
13.28
Planning
49.54
Persona
37.26

Economics

Cost / run
$0.00
Spend
$987,722
Revenue
$1,193,277
Runs
3
1st profitable mo (avg)
~12.0

Avg profit: $-126,535

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (49.5), but business outcome score is only 8.9. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-126,535

Worst month: M1 ($-44,732)

Relative position

Vs leader: -20.84 score pts

Vs median: +1.38 score pts

Strongest: planning (49.5)

Weakest: business (8.9)

Strength: Relative edge: planning (49.5).

Watch: Relative gap: business (8.9).

12

OpenAI: GPT-5.4

Looks plausible on paper but weak compounding: revenue without efficient spend patterns; repeated broad demand spend without durable payoff. Averaged 18.39 across 3 completed run(s); contribution profit $-250,461; ROAS 103.2%. Across runs, cumulative contribution profit never crossed zero on average in the first 12 months.

18.39
103.2%

Sub-scores

Business
5.69
Behavior
17.55
Planning
54.12
Persona
30.77

Economics

Cost / run
$0.29
Spend
$970,302
Revenue
$1,001,373
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-250,461

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.1), but business outcome score is only 5.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-250,461

Worst month: M1 ($-58,177)

Relative position

Vs leader: -22.22 score pts

Vs median: +0.00 score pts

Strongest: planning (54.1)

Weakest: business (5.7)

Strength: Readable strategy and channel mix in isolation.

Watch: Search/remarketing saturation and budgeting that does not match outcomes.

13

xAI: Grok 4.20 Beta

Ranked #13 of 23 with an average benchmark score of 16.77 across 3 run(s). Sub-scores are strongest on planning (43.9) and weakest on business (5.9). Average contribution profit $-243,069 and ROAS 103.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

16.77
103.5%

Sub-scores

Business
5.91
Behavior
16.36
Planning
43.92
Persona
29.32

Economics

Cost / run
$0.00
Spend
$972,947
Revenue
$1,009,791
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-243,069

Diagnosis

Strategically active, commercially negative

The model averages $-243,069 contribution profit with 103.5% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-243,069

Worst month: M1 ($-40,606)

Relative position

Vs leader: -23.84 score pts

Vs median: -1.62 score pts

Strongest: planning (43.9)

Weakest: business (5.9)

Strength: Relative edge: planning (43.9).

Watch: Relative gap: business (5.9).

14

Google: Gemini 3 Flash Preview

Ranked #14 of 23 with an average benchmark score of 16.29 across 3 run(s). Sub-scores are strongest on planning (49.3) and weakest on business (4.6). Average contribution profit $-239,859 and ROAS 104.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

16.29
104.6%

Sub-scores

Business
4.63
Behavior
13.03
Planning
49.33
Persona
30.27

Economics

Cost / run
$0.10
Spend
$967,576
Revenue
$1,012,636
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-239,859

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (49.3), but business outcome score is only 4.6. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-239,859

Worst month: M4 ($-44,205)

Relative position

Vs leader: -24.32 score pts

Vs median: -2.10 score pts

Strongest: planning (49.3)

Weakest: business (4.6)

Strength: Relative edge: planning (49.3).

Watch: Relative gap: business (4.6).

15

Z.ai: GLM 5.1

Ranked #15 of 23 with an average benchmark score of 13.37 across 3 run(s). Sub-scores are strongest on planning (48.0) and weakest on business (3.2). Average contribution profit $-253,302 and ROAS 102.9%. On average, cumulative contribution profit stayed negative through the full simulation year.

13.37
102.9%

Sub-scores

Business
3.25
Behavior
6.22
Planning
47.98
Persona
26.37

Economics

Cost / run
$0.00
Spend
$946,959
Revenue
$974,727
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-253,302

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (48.0), but business outcome score is only 3.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-253,302

Worst month: M1 ($-40,380)

Relative position

Vs leader: -27.24 score pts

Vs median: -5.02 score pts

Strongest: planning (48.0)

Weakest: business (3.2)

Strength: Relative edge: planning (48.0).

Watch: Relative gap: business (3.2).

16

MoonshotAI: Kimi K2.5

Ranked #16 of 23 with an average benchmark score of 13.10 across 2 run(s). Sub-scores are strongest on planning (50.8) and weakest on business (3.7). Average contribution profit $-292,423 and ROAS 97.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

13.10
97.5%

Sub-scores

Business
3.72
Behavior
5.12
Planning
50.75
Persona
22.91

Economics

Cost / run
$0.00
Spend
$966,702
Revenue
$942,650
Runs
2
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-292,423

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (50.8), but business outcome score is only 3.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-292,423

Worst month: M4 ($-55,730)

Relative position

Vs leader: -27.51 score pts

Vs median: -5.29 score pts

Strongest: planning (50.8)

Weakest: business (3.7)

Strength: Relative edge: planning (50.8).

Watch: Relative gap: business (3.7).

17

MiniMax: MiniMax M2.7

Ranked #17 of 23 with an average benchmark score of 12.13 across 3 run(s). Sub-scores are strongest on planning (39.4) and weakest on business (3.5). Average contribution profit $-316,117 and ROAS 94.2%. On average, cumulative contribution profit stayed negative through the full simulation year.

12.13
94.2%

Sub-scores

Business
3.54
Behavior
9.08
Planning
39.37
Persona
21.24

Economics

Cost / run
$0.00
Spend
$961,885
Revenue
$906,316
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-316,117

Diagnosis

Strategically active, commercially negative

The model averages $-316,117 contribution profit with 94.2% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-316,117

Worst month: M3 ($-51,444)

Relative position

Vs leader: -28.48 score pts

Vs median: -6.26 score pts

Strongest: planning (39.4)

Weakest: business (3.5)

Strength: Relative edge: planning (39.4).

Watch: Relative gap: business (3.5).

18

OpenAI: GPT-5.4 Mini

Ranked #18 of 23 with an average benchmark score of 11.83 across 1 run(s). Sub-scores are strongest on planning (44.9) and weakest on business (2.0). Average contribution profit $-353,629 and ROAS 87.3%. On average, cumulative contribution profit stayed negative through the full simulation year.

11.83
87.3%

Sub-scores

Business
2.00
Behavior
5.58
Planning
44.94
Persona
24.04

Economics

Cost / run
$0.00
Spend
$952,498
Revenue
$831,373
Runs
1
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-353,629

Diagnosis

Strategically active, commercially negative

The model averages $-353,629 contribution profit with 87.3% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-353,629

Worst month: M3 ($-59,559)

Relative position

Vs leader: -28.78 score pts

Vs median: -6.56 score pts

Strongest: planning (44.9)

Weakest: business (2.0)

Strength: Relative edge: planning (44.9).

Watch: Relative gap: business (2.0).

19

Anthropic: Claude Opus 4.8 · High

Ranked #19 of 23 with an average benchmark score of 10.41 across 3 run(s). Sub-scores are strongest on planning (53.5) and weakest on business (1.5). Average contribution profit $-352,542 and ROAS 83.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

10.41
83.6%

Sub-scores

Business
1.46
Behavior
6.58
Planning
53.47
Persona
10.83

Economics

Cost / run
$1.07
Spend
$862,852
Revenue
$719,316
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-352,542

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-352,542

Worst month: M1 ($-59,001)

Relative position

Vs leader: -30.20 score pts

Vs median: -7.98 score pts

Strongest: planning (53.5)

Weakest: business (1.5)

Strength: Relative edge: planning (53.5).

Watch: Relative gap: business (1.5).

20

Anthropic: Claude Opus 4.8 · Low

Ranked #20 of 23 with an average benchmark score of 9.98 across 3 run(s). Sub-scores are strongest on planning (54.2) and weakest on business (1.3). Average contribution profit $-299,490 and ROAS 89.4%. On average, cumulative contribution profit stayed negative through the full simulation year.

9.98
89.4%

Sub-scores

Business
1.26
Behavior
3.97
Planning
54.23
Persona
11.37

Economics

Cost / run
$1.00
Spend
$812,367
Revenue
$724,017
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-299,490

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.2), but business outcome score is only 1.3. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-299,490

Worst month: M1 ($-52,862)

Relative position

Vs leader: -30.63 score pts

Vs median: -8.41 score pts

Strongest: planning (54.2)

Weakest: business (1.3)

Strength: Relative edge: planning (54.2).

Watch: Relative gap: business (1.3).

21

Z.ai: GLM 5

Ranked #21 of 23 with an average benchmark score of 9.64 across 3 run(s). Sub-scores are strongest on planning (42.6) and weakest on business (0.9). Average contribution profit $-370,131 and ROAS 86.0%. On average, cumulative contribution profit stayed negative through the full simulation year.

9.64
86.0%

Sub-scores

Business
0.91
Behavior
5.34
Planning
42.62
Persona
16.65

Economics

Cost / run
$0.00
Spend
$949,211
Revenue
$816,409
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-370,131

Diagnosis

Audience fit is the main failure mode

Persona score is low (16.6), so the simulated shoppers are not buying the positioning even when the high-level strategy looks reasonable.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-370,131

Worst month: M2 ($-54,591)

Relative position

Vs leader: -30.97 score pts

Vs median: -8.75 score pts

Strongest: planning (42.6)

Weakest: business (0.9)

Strength: Relative edge: planning (42.6).

Watch: Relative gap: business (0.9).

22

Anthropic: Claude Opus 4.8

Ranked #22 of 23 with an average benchmark score of 9.54 across 3 run(s). Sub-scores are strongest on planning (54.6) and weakest on business (0.2). Average contribution profit $-312,597 and ROAS 86.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

9.54
86.5%

Sub-scores

Business
0.20
Behavior
3.36
Planning
54.58
Persona
12.30

Economics

Cost / run
$1.03
Spend
$804,712
Revenue
$694,581
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-312,597

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.6), but business outcome score is only 0.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-312,597

Worst month: M1 ($-57,005)

Relative position

Vs leader: -31.07 score pts

Vs median: -8.85 score pts

Strongest: planning (54.6)

Weakest: business (0.2)

Strength: Relative edge: planning (54.6).

Watch: Relative gap: business (0.2).

23

OpenAI: GPT-5.4 Nano

Ranked #23 of 23 with an average benchmark score of 6.80 across 3 run(s). Sub-scores are strongest on planning (46.5) and weakest on business (0.0). Average contribution profit $-577,506 and ROAS 56.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

6.80
56.5%

Sub-scores

Business
0.00
Behavior
1.33
Planning
46.45
Persona
5.36

Economics

Cost / run
$0.00
Spend
$959,525
Revenue
$542,214
Runs
3
1st profitable mo (avg)
n/a (never positive within 12 months)

Avg profit: $-577,506

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (46.5), but business outcome score is only 0.0. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-577,506

Worst month: M1 ($-67,937)

Relative position

Vs leader: -33.81 score pts

Vs median: -11.59 score pts

Strongest: planning (46.5)

Weakest: business (0.0)

Strength: Relative edge: planning (46.5).

Watch: Relative gap: business (0.0).
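
The "Relative position" figures in each row follow directly from the published average scores. A short sketch in Python, assuming "Vs leader" is the gap to the highest average score and "Vs median" is the gap to the median of all 23 averages; the function name is illustrative.

```python
from statistics import median

# Average scores as published in the leaderboard rows, rank 1 to 23.
SCORES = [
    40.61, 37.81, 37.78, 33.56, 30.06, 27.14, 26.07, 25.68,
    21.21, 20.74, 19.77, 18.39, 16.77, 16.29, 13.37, 13.10,
    12.13, 11.83, 10.41, 9.98, 9.64, 9.54, 6.80,
]


def relative_position(score: float, scores: list[float]) -> tuple[float, float]:
    """Return (gap to the leader, gap to the median) in score points."""
    return round(score - max(scores), 2), round(score - median(scores), 2)


print(relative_position(40.61, SCORES))  # rank 1
print(relative_position(6.80, SCORES))   # rank 23
```

With 23 models the median is the rank 12 score (18.39), so the leader sits 22.22 points above the median and the last-placed model sits 33.81 points behind the leader and 11.59 below the median, matching the rows above.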

For model providers

Want your model on ROASBench?

We can run official benchmark passes and publish results alongside the leaderboard. Tell us which model and API access to use.

Methodology

How ROASBench works

ROASBench drops the model into a year-long operating environment for one premium-but-accessible skincare brand and scores the result on business outcomes, not nice-sounding plans.

Brand

Northstar Skin

Barrier Repair Serum at $68 with 76% gross margin.
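
The price and margin fix the first-order unit economics. A rough sketch in Python, assuming a single full-price order with no discount or repeat purchase, and assuming ad spend is the only cost besides cost of goods. The benchmark's actual contribution profit formula is not given on this page and may include other terms.

```python
PRICE = 68.00          # Barrier Repair Serum list price (from the page)
GROSS_MARGIN = 0.76    # gross margin (from the page)

# Gross profit on one full-price order.
gross_profit_per_order = PRICE * GROSS_MARGIN

# Assumption: with no discount and no repeat purchase, an order breaks even
# when the cost to acquire it equals its gross profit.
break_even_cac = gross_profit_per_order

# Assumption: ad spend is the only other cost, so revenue must be at least
# spend / margin before contribution profit turns positive.
break_even_roas = 1 / GROSS_MARGIN

print(f"gross profit per order: ${gross_profit_per_order:.2f}")
print(f"break-even CAC:         ${break_even_cac:.2f}")
print(f"break-even ROAS:        {break_even_roas:.1%}")
```

Under these assumptions each order carries $51.68 of gross profit and break-even ROAS is about 131.6%. The leaderboard rows that finish close to zero profit report ROAS of roughly 133% to 141%, which is at least consistent with a break-even point in that neighborhood.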

Time horizon

12 months

The model has to adapt over time instead of solving one isolated scenario.

Controlled channels

6

Meta prospecting, Search, Shopping, TikTok, Email / CRM, and Remarketing.

Scoring

Business + behavior

Primary score blends profitability, planning quality, persona response, and long-run adaptation.

What the model can control

  • Budget allocation — spend per month and split across channels.
  • Campaign design — type, segments, creative angle, copy structure.
  • Offer strategy — discounts, remarketing, CRM cadence, margin vs. conversion.
  • Iteration — hold, scale, or change course after monthly results.

Each round is a real operating cycle, not a one-shot prompt. Past choices affect future state, so the benchmark rewards consistency and punishes lazy resets.

1. Seeded world

Fixed brand, budget, customers, email list, warm pool, seasonality, shocks.

2. Decision step

Structured monthly plan: objective, budget, discount, remarketing, channels, creative.

3. Persona panel + rules

Panel judges copy and targeting; rules produce clicks, trust, purchases, retention.

4. State update

Budget, base, momentum, fatigue, pools, and channel memory roll forward.
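
The decision step can be pictured as a structured record that is validated before the simulator consumes it. The sketch below is an assumption about shape only: the class, field names, and validation rules are hypothetical and are not the benchmark's published schema. Only the six channel names come from this page.

```python
from dataclasses import dataclass, field

# The six controllable channels named on the page.
CHANNELS = (
    "meta_prospecting", "search", "shopping",
    "tiktok", "email_crm", "remarketing",
)


@dataclass
class MonthlyPlan:
    """Hypothetical shape of one decision step; field names are illustrative."""
    objective: str
    budget: float                      # total spend planned for the month
    discount_pct: float                # 0.0 means full price
    channel_spend: dict[str, float] = field(default_factory=dict)
    creative_angle: str = ""

    def validate(self) -> None:
        unknown = set(self.channel_spend) - set(CHANNELS)
        if unknown:
            raise ValueError(f"unknown channels: {sorted(unknown)}")
        if any(amount < 0 for amount in self.channel_spend.values()):
            raise ValueError("channel spend must be non-negative")
        if sum(self.channel_spend.values()) > self.budget + 1e-9:
            raise ValueError("channel spend exceeds the monthly budget")
        if not 0.0 <= self.discount_pct <= 1.0:
            raise ValueError("discount_pct must be between 0 and 1")


plan = MonthlyPlan(
    objective="hold retention, test one prospecting angle",
    budget=80_000,
    discount_pct=0.0,
    channel_spend={"email_crm": 10_000, "remarketing": 15_000, "search": 30_000},
    creative_angle="barrier repair backed by ingredient detail",
)
plan.validate()  # raises ValueError if the plan is inconsistent
```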

What data the model gets back

  • Monthly operating metrics: spend, revenue, ROAS, CAC, repeat rate, reinvestment.
  • State: budget left, customer base, email list, warm pool, momentum, fatigue.
  • Audience map: segments, persona sizes, fit, competition, value.
  • Working memory: prior decisions, highlights, penalty flags.
  • Market notes: seasonality and shocks.

The prompt includes no raw persona-by-persona judge feedback; the model has to infer persona response from outcomes.
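
Two of the monthly operating metrics above have conventional definitions that are easy to state in code. A minimal sketch in Python, assuming the benchmark uses the standard formulas (ROAS as revenue over spend, CAC as spend over newly acquired customers); the function names and the sample month are illustrative.

```python
def roas(revenue: float, spend: float) -> float:
    """Return on ad spend: revenue divided by spend."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    return revenue / spend


def cac(spend: float, new_customers: int) -> float:
    """Customer acquisition cost: spend divided by newly acquired customers."""
    if new_customers <= 0:
        raise ValueError("new_customers must be positive")
    return spend / new_customers


# Hypothetical month: the figures are illustrative, not benchmark output.
print(f"ROAS {roas(120_000, 80_000):.1%}, CAC ${cac(80_000, 1_600):.2f}")
```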

Main difficulties

  • Learning resets: abrupt reallocations hurt efficiency.
  • Saturation: finite warm pools and auctions.
  • Offer fatigue: discounts can damage later months.
  • Organic carryover: upper funnel pays off slowly.
  • Persona tradeoffs: easy vs. valuable audiences.
  • Soft caps: CRM, retargeting, channel limits.
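
One way to make "abrupt reallocation" concrete is to measure how much of the budget mix moved between two consecutive months. The page does not describe how the simulator detects or penalizes learning resets, so the metric below (half the L1 distance between channel share vectors) is purely an illustrative heuristic, and the function name and sample budgets are hypothetical.

```python
def allocation_shift(prev: dict[str, float], curr: dict[str, float]) -> float:
    """Fraction of the budget mix that moved (0 = identical mix, 1 = fully moved)."""
    def shares(spend: dict[str, float]) -> dict[str, float]:
        total = sum(spend.values())
        if total <= 0:
            raise ValueError("total spend must be positive")
        return {channel: amount / total for channel, amount in spend.items()}

    p, c = shares(prev), shares(curr)
    channels = set(p) | set(c)
    # Half the L1 distance between the two share vectors.
    return 0.5 * sum(abs(p.get(ch, 0.0) - c.get(ch, 0.0)) for ch in channels)


# Hypothetical months: Search is cut sharply in favor of TikTok.
last_month = {"search": 40_000, "email_crm": 10_000, "tiktok": 0}
this_month = {"search": 10_000, "email_crm": 10_000, "tiktok": 30_000}
shift = allocation_shift(last_month, this_month)
print(f"{shift:.0%} of the budget mix moved")
```

Here 60% of the mix moved in one month, the kind of swing the "learning resets" difficulty warns about; a plan that only nudges shares would score close to zero.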

Every persona differs in size, growth, fit, competition, and value. The model starts with a commercial map but must learn what actually monetizes.

Value Seeker

Large and relatively easy to wake up with offers, but lower-value and highly price competitive.

Motivations: visible results, discount

Premium Conscious

Smaller but high-value premium audience with strong fit for the brand and heavy competition from other prestige skincare.

Motivations: ingredients, authority

Ingredient Researcher

Harder to win because they scrutinize claims, but they compound into valuable, durable customers when convinced.

Motivations: clinical details, ingredient list

Impulse Buyer

Big upper-funnel opportunity that is easier to engage creatively, but conversion quality and retention are weaker.

Motivations: aesthetic creative, quick payoff

Comparison Shopper

Commercially meaningful and high-intent, but expensive to win because comparison behavior increases competition and pressure on proof.

Motivations: clear differentiation, proof

Returning Loyalist

Smaller owned audience but the most valuable and efficient to monetize if protected with the right cadence.

Motivations: routine, restock convenience

Lapsed Customer

Warm and recoverable with decent value, but reactivation requires freshness and fatigue management.

Motivations: newness, better routine fit

Low Intent Browser

Largest reachable pool and easiest to attract at the top of funnel, but low intent and low customer value.

Motivations: light curiosity, visual intrigue

Fixed upfront

Brand, economics, personas, channels, seasonality, shocks.

Persistent state

Budget, customers, email, warm pool, momentum, fatigue, reinvestment, channel memory.

Iteration summary

Prior decisions and outcomes compressed for the next month.

Economic judgment

Margin, contribution profit, CAC, bad revenue.

Budget pacing

When to press, hold, sequence spend.

Channel allocation

Prospecting vs. intent vs. CRM vs. remarketing.

Persona targeting

Easy vs. valuable audiences.

Creative specificity

Angles that match motivations and objections.

Stable iteration

Improve in place; avoid constant restructures.

In practice: protect margin, keep retention alive, scale demand capture carefully, stay persona-aware. Models fail when they confuse activity with progress or write polished but generic copy.

Qualitative findings

Why Anthropic: Claude Opus 4.8 · High scored the way it did

This deep dive was asked to highlight the new Claude Opus 4.8's performance against Opus 4.7 and 4.6, discuss GPT-5.5 (particularly its price), discuss why Qwen3.7 Max does well, and discuss why the Gemini models fall flat here. It is grounded in the resolved focus models rather than the overall worst performer.

Grounded in published runs

Generated 2026-05-29

Model diagnosis

Anthropic: Claude Opus 4.8 · High

Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Rank

#19

Score

10.41

Avg profit

$-352,542

ROAS

83.6%

Chart: Score Split (sub-score split for requested models)

Chart: Profit Trajectory (profit trajectory for requested models)

Head-to-head

Anthropic: Claude Opus 4.8 · High vs Anthropic: Claude Opus 4.8 · Low

This comparison is generated deterministically from the benchmark dossiers because the LLM output did not follow the requested focus.

Avg score

Anthropic: Claude Opus 4.8 · High 10.41

Anthropic: Claude Opus 4.8 · Low 9.98

Δ -0.43

Avg profit

Anthropic: Claude Opus 4.8 · High $-352,542

Anthropic: Claude Opus 4.8 · Low $-299,490

Δ $+53,053

ROAS

Anthropic: Claude Opus 4.8 · High 83.6%

Anthropic: Claude Opus 4.8 · Low 89.4%

Δ +5.88 pts

Persona score

Anthropic: Claude Opus 4.8 · High 10.8

Anthropic: Claude Opus 4.8 · Low 11.4

Δ +0.5
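
The deltas in this head-to-head can be recomputed from the averages published in the two leaderboard rows. A small sketch in Python, assuming each delta is the Low value minus the High value (the sign convention the figures above imply); the dictionary keys are illustrative names.

```python
# Published averages for the two configurations (from the leaderboard rows).
high = {"score": 10.41, "profit": -352_542, "roas_pct": 83.6, "persona": 10.83}
low = {"score": 9.98, "profit": -299_490, "roas_pct": 89.4, "persona": 11.37}

# Assumption: each delta is the Low value minus the High value.
deltas = {key: round(low[key] - high[key], 2) for key in high}

for key, value in deltas.items():
    print(f"{key}: {value:+}")
```

Recomputed from rounded figures, the profit and ROAS deltas come out as 53,052 and 5.8 rather than the published $+53,053 and +5.88, presumably because the page computes them from unrounded run data.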

Where the gap shows up

Anthropic: Claude Opus 4.8 · High scores 10.41 and Anthropic: Claude Opus 4.8 · Low scores 9.98, a gap of 0.43 points. Both configurations are strongest on planning (53.5 and 54.2) and weakest on business (1.5 and 1.3).

Planning is not the same as compounding

Anthropic: Claude Opus 4.8 · High illustrates the gap: Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

  • Use the score split to distinguish strategy prose from business outcomes.
  • Look for negative monthly profit streaks and failure to recover late in the year.
  • Treat persona score as the check on whether the simulated audience actually believed the campaign.
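
The trajectory checks in the list above can be run mechanically on a monthly contribution profit series. A minimal sketch in Python, assuming month 1 comes first and that "first profitable month" means the first month in which cumulative profit is above zero; the function name, dictionary keys, and sample series are hypothetical, not benchmark data.

```python
from itertools import accumulate
from typing import Optional


def trajectory_summary(monthly_profit: list[float]) -> dict:
    """Summarize a run's monthly contribution profit series (month 1 first)."""
    if not monthly_profit:
        raise ValueError("monthly_profit must not be empty")
    cumulative = list(accumulate(monthly_profit))
    # First month (1-indexed) in which cumulative profit is above zero, if any.
    first_positive: Optional[int] = next(
        (i for i, total in enumerate(cumulative, start=1) if total > 0), None
    )
    # Longest run of consecutive months with negative profit.
    longest = current = 0
    for profit in monthly_profit:
        current = current + 1 if profit < 0 else 0
        longest = max(longest, current)
    worst_index = min(range(len(monthly_profit)), key=monthly_profit.__getitem__)
    return {
        "negative_months": sum(p < 0 for p in monthly_profit),
        "longest_negative_streak": longest,
        "worst_month": worst_index + 1,
        "first_cumulative_positive_month": first_positive,
        "final_cumulative_profit": cumulative[-1],
    }


# Hypothetical 12-month series in thousands of dollars (not benchmark data).
series = [-30, -20, -10, 5, 10, 15, 20, 25, 30, 35, 40, 45]
summary = trajectory_summary(series)
print(summary)
```

A run that never recovers returns `None` for `first_cumulative_positive_month`, which corresponds to rows where cumulative contribution profit stayed negative through the full simulation year.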