ROASBench

A hard-mode performance marketing simulation for LLMs. Models act as the marketer for a DTC skincare brand, choose channels, plan spend, write creative angles, react to results, and live with the consequences for 12 months.

12-month simulation · 3 repeats per model · 6 controllable channels · 8 shopper personas

Quality is not evenly priced

Average benchmark score against average main-model API cost per run (log scale, cost decreasing to the right, so up and right is better). Models above the dashed trend line score better than their price predicts, and thin lines join variants of the same model. Models without tracked cost data are not shown.
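To make "scores better than its price predicts" concrete, here is a minimal sketch that fits a straight line of score against log10(cost per run) and reports each model's residual. The page does not say how its dashed trend line is fitted, so ordinary least squares is an assumption, and only six of the cards listed below are used, so the numbers will not match the chart.

```python
import math

# (model, avg cost per run in USD, avg score), copied from cards on this page.
POINTS = [
    ("Claude Fable 5 · Medium", 1.82, 47.55),
    ("Claude Opus 4.6", 0.99, 40.61),
    ("GPT-5.6 Sol · Medium", 0.80, 39.46),
    ("GPT-5.5 Pro", 27.97, 37.78),
    ("GPT-5.6 Luna", 0.15, 27.32),
    ("Qwen3.5 Plus 2026-02-15", 0.10, 26.07),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(points):
    """Score minus trend-predicted score; positive means above the line."""
    xs = [math.log10(cost) for _, cost, _ in points]
    ys = [score for _, _, score in points]
    slope, intercept = fit_line(xs, ys)
    return {
        name: y - (slope * x + intercept)
        for (name, _, _), x, y in zip(points, xs, ys)
    }


for name, delta in residuals(POINTS).items():
    print(f"{name}: {delta:+.2f} score pts vs trend")
```

On this subset, Claude Fable 5 · Medium sits above the fitted line and GPT-5.5 Pro, at $27.97 per run, sits below it.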

Model

Score

ROAS

Anthropic: Claude Fable 5 · Medium

The overall ROASBench leader: ahead of the High and no-thinking Fable 5 variants on score, contribution profit, and ROAS, pairing strong planning with credible persona fit. Averaged 47.55 across 3 completed runs; contribution profit $648,530; ROAS 216.4%. On average, cumulative contribution profit first turns positive around month 5.0.

47.55

216.4%

Sub-scores

Business: 44.85
Behavior: 36.47
Planning: 60.39
Persona: 58.78

Economics

Cost / run: $1.82
Spend: $1,124,864
Revenue: $2,432,395
Runs: 3
1st profitable mo (avg): ~5.0

Avg profit: $648,530
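As a quick check on how the Economics figures relate, the sketch below computes ROAS as revenue divided by spend from the averaged numbers on this card. This assumes the usual definition of ROAS; the page does not spell out its formula. The result (about 216.2%) is close to, but not the same as, the reported 216.4%, which would be consistent with the page averaging per-run ROAS rather than dividing averaged totals. That explanation is a guess.

```python
def roas_pct(revenue: float, spend: float) -> float:
    """Return on ad spend as a percentage: revenue per dollar of spend."""
    return 100.0 * revenue / spend


# Averaged figures from the Claude Fable 5 · Medium card.
avg_spend = 1_124_864
avg_revenue = 2_432_395

ratio_of_averages = roas_pct(avg_revenue, avg_spend)
print(f"ROAS from averaged totals: {ratio_of_averages:.1f}%")
```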

Diagnosis

Converts strategy into durable economics

The model averages 47.55 score and $648,530 contribution profit. Its strongest dimension is planning (60.4).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $648,530

Worst month: M1 ($-20,435)

Relative position

Vs leader: +0.00 score pts

Vs median: +21.48 score pts

Strongest: planning (60.4)

Weakest: behavior (36.5)

Strength: Best average score, early break-even, strong planning, and efficient ROAS.

Watch: Behavior still wobbles: high-CAC months, saturation, and learning resets show up.

Anthropic: Claude Fable 5 · High

High thinking ranks second overall, behind the Medium variant: a modest score lift over the no-thinking run, but a larger profit and persona-fit gain with earlier break-even. Averaged 43.70 across 3 completed runs; contribution profit $560,767; ROAS 197.3%. On average, cumulative contribution profit first turns positive around month 5.3.

43.70

197.3%

Sub-scores

Business: 38.22
Behavior: 33.19
Planning: 60.43
Persona: 59.42

Economics

Cost / run: $2.18
Spend: $1,272,595
Revenue: $2,514,838
Runs: 3
1st profitable mo (avg): ~5.3

Avg profit: $560,767

Diagnosis

Converts strategy into durable economics

The model averages 43.70 score and $560,767 contribution profit. Its strongest dimension is planning (60.4).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $560,767

Worst month: M1 ($-26,764)

Relative position

Vs leader: -3.85 score pts

Vs median: +17.63 score pts

Strongest: planning (60.4)

Weakest: behavior (33.2)

Strength: Second-best overall score, with higher profit, stronger persona response, and earlier payback than the no-thinking run.

Watch: Still shows behavior risk under scale: high-CAC pressure and saturation penalties remain.
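The "Vs leader" and "Vs median" lines in each card appear to be plain score differences. A minimal sketch, assuming the leader is the card showing +0.00 vs leader (47.55) and the median is the card showing +0.00 vs median (26.07, ranked #29 of 57):

```python
LEADER_SCORE = 47.55  # Claude Fable 5 · Medium, "Vs leader: +0.00"
MEDIAN_SCORE = 26.07  # Qwen3.5 Plus 2026-02-15, "Vs median: +0.00"


def relative_position(score: float) -> tuple[float, float]:
    """Return (vs_leader, vs_median) in score points, rounded to 2 places."""
    return round(score - LEADER_SCORE, 2), round(score - MEDIAN_SCORE, 2)


vs_leader, vs_median = relative_position(43.70)  # Claude Fable 5 · High
print(f"Vs leader: {vs_leader:+.2f} score pts")
print(f"Vs median: {vs_median:+.2f} score pts")
```

With 57 ranked entries, the median is the 29th entry, which matches the rank of the card that shows +0.00 vs median.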

Anthropic: Claude Fable 5

The top no-thinking run: slightly ahead of Opus 4.6 on score and ROAS by pairing strong planning with credible persona fit, even though raw profit is a little lower. Averaged 42.96 across 3 completed runs; contribution profit $478,541; ROAS 194.1%. On average, cumulative contribution profit first turns positive around month 6.0.

42.96

194.1%

Sub-scores

Business: 38.71
Behavior: 32.14
Planning: 59.58
Persona: 55.72

Economics

Cost / run: $3.05
Spend: $1,163,644
Revenue: $2,254,999
Runs: 3
1st profitable mo (avg): ~6.0

Avg profit: $478,541

Diagnosis

Converts strategy into durable economics

The model averages 42.96 score and $478,541 contribution profit. Its strongest dimension is planning (59.6).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $478,541

Worst month: M1 ($-23,151)

Relative position

Vs leader: -4.59 score pts

Vs median: +16.89 score pts

Strongest: planning (59.6)

Weakest: behavior (32.1)

Strength: Best average score among no-thinking runs, early break-even, strong planning, and efficient ROAS.

Watch: Behavior still wobbles: high-CAC months, saturation, and learning resets show up.

Anthropic: Claude Opus 4.6

Stays in the top tier by compounding a coherent plan: retention channels stay funded, discounting stays rare, and changes are absorbed without constant learning resets. Averaged 40.61 across 3 completed runs; contribution profit $506,094; ROAS 192.5%. On average, cumulative contribution profit first turns positive around month 7.0.

40.61

192.5%

Sub-scores

Business: 35.70
Behavior: 27.28
Planning: 59.06
Persona: 56.75

Economics

Cost / run: $0.99
Spend: $1,244,940
Revenue: $2,402,318
Runs: 3
1st profitable mo (avg): ~7.0

Avg profit: $506,094

Diagnosis

Converts strategy into durable economics

The model averages 40.61 score and $506,094 contribution profit. Its strongest dimension is planning (59.1).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $506,094

Worst month: M1 ($-30,242)

Relative position

Vs leader: -6.94 score pts

Vs median: +14.54 score pts

Strongest: planning (59.1)

Weakest: behavior (27.3)

Strength: Stable iteration, persona-aware creative, disciplined CRM and remarketing.

Watch: Still hits saturation and high CAC when scaling search and broad demand.

OpenAI: GPT-5.6 Sol · Medium

Ranked #5 of 57 with an average benchmark score of 39.46 across 3 run(s). Sub-scores are strongest on planning (55.6) and weakest on business (32.3). Average contribution profit $356,717 and ROAS 184.9%. On average, cumulative contribution profit first turns positive around month 7.0.

39.46

184.9%

Sub-scores

Business: 32.33
Behavior: 37.45
Planning: 55.57
Persona: 50.73

Economics

Cost / run: $0.80
Spend: $1,014,300
Revenue: $1,882,054
Runs: 3
1st profitable mo (avg): ~7.0

Avg profit: $356,717

Diagnosis

Converts strategy into durable economics

The model averages 39.46 score and $356,717 contribution profit. Its strongest dimension is planning (55.6).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $356,717

Worst month: M1 ($-35,231)

Relative position

Vs leader: -8.09 score pts

Vs median: +13.39 score pts

Strongest: planning (55.6)

Weakest: business (32.3)

Strength: Relative edge: planning (55.6).

Watch: Relative gap: business (32.3).

MoonshotAI: Kimi K3

Ranked #6 of 57 with an average benchmark score of 38.48 across 3 run(s). Sub-scores are strongest on planning (58.5) and weakest on behavior (27.3). Average contribution profit $448,354 and ROAS 189.3%. On average, cumulative contribution profit first turns positive around month 7.7.

38.48

189.3%

Sub-scores

Business: 32.35
Behavior: 27.28
Planning: 58.52
Persona: 54.58

Economics

Cost / run: $1.20
Spend: $1,192,615
Revenue: $2,257,865
Runs: 3
1st profitable mo (avg): ~7.7

Avg profit: $448,354

Diagnosis

Converts strategy into durable economics

The model averages 38.48 score and $448,354 contribution profit. Its strongest dimension is planning (58.5).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $448,354

Worst month: M1 ($-27,716)

Relative position

Vs leader: -9.07 score pts

Vs median: +12.41 score pts

Strongest: planning (58.5)

Weakest: behavior (27.3)

Strength: Relative edge: planning (58.5).

Watch: Relative gap: behavior (27.3).

OpenAI: GPT-5.6 Terra · Max

Ranked #7 of 57 with an average benchmark score of 38.09 across 3 run(s). Sub-scores are strongest on planning (56.1) and weakest on business (30.7). Average contribution profit $334,530 and ROAS 177.4%. On average, cumulative contribution profit first turns positive around month 6.7.

38.09

177.4%

Sub-scores

Business: 30.73
Behavior: 34.44
Planning: 56.15
Persona: 50.55

Economics

Cost / run: $0.67
Spend: $1,135,313
Revenue: $2,016,468
Runs: 3
1st profitable mo (avg): ~6.7

Avg profit: $334,530

Diagnosis

Converts strategy into durable economics

The model averages 38.09 score and $334,530 contribution profit. Its strongest dimension is planning (56.1).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $334,530

Worst month: M1 ($-28,849)

Relative position

Vs leader: -9.46 score pts

Vs median: +12.02 score pts

Strongest: planning (56.1)

Weakest: business (30.7)

Strength: Relative edge: planning (56.1).

Watch: Relative gap: business (30.7).

OpenAI: GPT-5.6 Terra · Medium

Ranked #8 of 57 with an average benchmark score of 37.90 across 3 run(s). Sub-scores are strongest on planning (54.5) and weakest on behavior (23.7). Average contribution profit $266,011 and ROAS 184.0%. On average, cumulative contribution profit first turns positive around month 7.7.

37.90

184.0%

Sub-scores

Business: 36.97
Behavior: 23.72
Planning: 54.53
Persona: 45.13

Economics

Cost / run: $0.34
Spend: $791,803
Revenue: $1,454,384
Runs: 3
1st profitable mo (avg): ~7.7

Avg profit: $266,011

Diagnosis

Converts strategy into durable economics

The model averages 37.90 score and $266,011 contribution profit. Its strongest dimension is planning (54.5).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $266,011

Worst month: M1 ($-40,594)

Relative position

Vs leader: -9.65 score pts

Vs median: +11.83 score pts

Strongest: planning (54.5)

Weakest: behavior (23.7)

Strength: Relative edge: planning (54.5).

Watch: Relative gap: behavior (23.7).

Anthropic: Claude Opus 4.6 · High

Compounds a coherent plan: retention channels stay funded, discounting stays rare, and changes are absorbed without constant learning resets. Averaged 37.81 across 3 completed runs; contribution profit $411,580; ROAS 182.1%. On average, cumulative contribution profit first turns positive around month 8.0.

37.81

182.1%

Sub-scores

Business: 32.01
Behavior: 26.67
Planning: 57.23
Persona: 53.37

Economics

Cost / run: $0.94
Spend: $1,202,115
Revenue: $2,219,539
Runs: 3
1st profitable mo (avg): ~8.0

Avg profit: $411,580

Diagnosis

Converts strategy into durable economics

The model averages 37.81 score and $411,580 contribution profit. Its strongest dimension is planning (57.2).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $411,580

Worst month: M1 ($-29,462)

Relative position

Vs leader: -9.74 score pts

Vs median: +11.74 score pts

Strongest: planning (57.2)

Weakest: behavior (26.7)

Strength: Stable iteration, persona-aware creative, disciplined CRM and remarketing.

Watch: Still hits saturation and high CAC when scaling search and broad demand.

OpenAI: GPT-5.5 Pro

Ranked #10 of 57 with an average benchmark score of 37.78 across 3 run(s). Sub-scores are strongest on planning (55.6) and weakest on business (30.1). Average contribution profit $397,813 and ROAS 184.9%. On average, cumulative contribution profit first turns positive around month 7.3.

37.78

184.9%

Sub-scores

Business: 30.15
Behavior: 31.92
Planning: 55.63
Persona: 53.56

Economics

Cost / run: $27.97
Spend: $1,177,483
Revenue: $2,177,446
Runs: 3
1st profitable mo (avg): ~7.3

Avg profit: $397,813

Diagnosis

Converts strategy into durable economics

The model averages 37.78 score and $397,813 contribution profit. Its strongest dimension is planning (55.6).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $397,813

Worst month: M1 ($-29,816)

Relative position

Vs leader: -9.77 score pts

Vs median: +11.71 score pts

Strongest: planning (55.6)

Weakest: business (30.1)

Strength: Relative edge: planning (55.6).

Watch: Relative gap: business (30.1).

OpenAI: GPT-5.6 Sol · High

Ranked #11 of 57 with an average benchmark score of 37.53 across 3 run(s). Sub-scores are strongest on planning (56.4) and weakest on business (29.1). Average contribution profit $260,679 and ROAS 176.4%. On average, cumulative contribution profit first turns positive around month 7.0.

37.53

176.4%

Sub-scores

Business: 29.15
Behavior: 34.50
Planning: 56.36
Persona: 51.65

Economics

Cost / run: $1.12
Spend: $916,427
Revenue: $1,614,331
Runs: 3
1st profitable mo (avg): ~7.0

Avg profit: $260,679

Diagnosis

Converts strategy into durable economics

The model averages 37.53 score and $260,679 contribution profit. Its strongest dimension is planning (56.4).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $260,679

Worst month: M1 ($-28,417)

Relative position

Vs leader: -10.02 score pts

Vs median: +11.46 score pts

Strongest: planning (56.4)

Weakest: business (29.1)

Strength: Relative edge: planning (56.4).

Watch: Relative gap: business (29.1).

OpenAI: GPT-5.6 Sol · Max

Ranked #12 of 57 with an average benchmark score of 36.40 across 3 run(s). Sub-scores are strongest on planning (56.4) and weakest on business (25.9). Average contribution profit $231,057 and ROAS 171.1%. On average, cumulative contribution profit first turns positive around month 8.3.

36.40

171.1%

Sub-scores

Business: 25.92
Behavior: 36.22
Planning: 56.41
Persona: 52.34

Economics

Cost / run: $1.53
Spend: $950,270
Revenue: $1,623,775
Runs: 3
1st profitable mo (avg): ~8.3

Avg profit: $231,057

Diagnosis

Converts strategy into durable economics

The model averages 36.40 score and $231,057 contribution profit. Its strongest dimension is planning (56.4).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $231,057

Worst month: M1 ($-33,176)

Relative position

Vs leader: -11.15 score pts

Vs median: +10.33 score pts

Strongest: planning (56.4)

Weakest: business (25.9)

Strength: Relative edge: planning (56.4).

Watch: Relative gap: business (25.9).

OpenAI: GPT-5.6 Sol

Ranked #13 of 57 with an average benchmark score of 36.31 across 3 run(s). Sub-scores are strongest on planning (55.3) and weakest on business (26.6). Average contribution profit $261,326 and ROAS 175.1%. On average, cumulative contribution profit first turns positive around month 7.7.

36.31

175.1%

Sub-scores

Business: 26.59
Behavior: 36.61
Planning: 55.28
Persona: 50.32

Economics

Cost / run: $0.76
Spend: $1,001,045
Revenue: $1,736,046
Runs: 3
1st profitable mo (avg): ~7.7

Avg profit: $261,326

Diagnosis

Converts strategy into durable economics

The model averages 36.31 score and $261,326 contribution profit. Its strongest dimension is planning (55.3).

Trajectory

Negative months: 2 / 12

Final cumulative profit: $261,326

Worst month: M1 ($-28,495)

Relative position

Vs leader: -11.24 score pts

Vs median: +10.24 score pts

Strongest: planning (55.3)

Weakest: business (26.6)

Strength: Relative edge: planning (55.3).

Watch: Relative gap: business (26.6).

Anthropic: Claude Sonnet 5 · High

Ranked #14 of 57 with an average benchmark score of 35.93 across 3 run(s). Sub-scores are strongest on business (47.2) and weakest on behavior (5.5). Average contribution profit $-332,923 and ROAS 79.3%. On average, cumulative contribution profit first turns positive around month 1.0.

35.93

79.3%

Sub-scores

Business: 47.22
Behavior: 5.53
Planning: 37.10
Persona: 37.60

Economics

Cost / run: $0.34
Spend: $753,456
Revenue: $589,963
Runs: 3
1st profitable mo (avg): ~1.0

Avg profit: $-332,923

Diagnosis

Strategically active, commercially negative

The model averages $-332,923 contribution profit with 79.3% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-332,923

Worst month: M12 ($-121,437)

Relative position

Vs leader: -11.62 score pts

Vs median: +9.86 score pts

Strongest: business (47.2)

Weakest: behavior (5.5)

Strength: Relative edge: business (47.2).

Watch: Relative gap: behavior (5.5).

OpenAI: GPT-5.6 Terra · High

Ranked #15 of 57 with an average benchmark score of 35.46 across 3 run(s). Sub-scores are strongest on planning (56.0) and weakest on business (28.5). Average contribution profit $258,237 and ROAS 174.8%. On average, cumulative contribution profit first turns positive around month 8.0.

35.46

174.8%

Sub-scores

Business: 28.54
Behavior: 28.97
Planning: 56.04
Persona: 48.14

Economics

Cost / run: $0.42
Spend: $938,566
Revenue: $1,642,839
Runs: 3
1st profitable mo (avg): ~8.0

Avg profit: $258,237

Diagnosis

Converts strategy into durable economics

The model averages 35.46 score and $258,237 contribution profit. Its strongest dimension is planning (56.0).

Trajectory

Negative months: 3 / 12

Final cumulative profit: $258,237

Worst month: M1 ($-39,894)

Relative position

Vs leader: -12.09 score pts

Vs median: +9.39 score pts

Strongest: planning (56.0)

Weakest: business (28.5)

Strength: Relative edge: planning (56.0).

Watch: Relative gap: business (28.5).

OpenAI: GPT-5.6 Luna · High

Ranked #16 of 57 with an average benchmark score of 35.34 across 3 run(s). Sub-scores are strongest on planning (52.5) and weakest on business (26.4). Average contribution profit $238,356 and ROAS 168.4%. On average, cumulative contribution profit first turns positive around month 10.0.

35.34

168.4%

Sub-scores

Business: 26.45
Behavior: 32.86
Planning: 52.55
Persona: 51.30

Economics

Cost / run: $0.25
Spend: $1,103,728
Revenue: $1,859,012
Runs: 3
1st profitable mo (avg): ~10.0

Avg profit: $238,356

Diagnosis

Converts strategy into durable economics

The model averages 35.34 score and $238,356 contribution profit. Its strongest dimension is planning (52.5).

Trajectory

Negative months: 4 / 12

Final cumulative profit: $238,356

Worst month: M1 ($-43,693)

Relative position

Vs leader: -12.21 score pts

Vs median: +9.27 score pts

Strongest: planning (52.5)

Weakest: business (26.4)

Strength: Relative edge: planning (52.5).

Watch: Relative gap: business (26.4).

OpenAI: GPT-5.6 Luna · Max

Ranked #17 of 57 with an average benchmark score of 34.71 across 3 run(s). Sub-scores are strongest on planning (52.9) and weakest on business (25.4). Average contribution profit $246,328 and ROAS 169.0%. On average, cumulative contribution profit first turns positive around month 9.0.

34.71

169.0%

Sub-scores

Business: 25.41
Behavior: 34.61
Planning: 52.86
Persona: 48.52

Economics

Cost / run: $0.41
Spend: $1,102,563
Revenue: $1,863,880
Runs: 3
1st profitable mo (avg): ~9.0

Avg profit: $246,328

Diagnosis

Converts strategy into durable economics

The model averages 34.71 score and $246,328 contribution profit. Its strongest dimension is planning (52.9).

Trajectory

Negative months: 4 / 12

Final cumulative profit: $246,328

Worst month: M1 ($-33,761)

Relative position

Vs leader: -12.84 score pts

Vs median: +8.64 score pts

Strongest: planning (52.9)

Weakest: business (25.4)

Strength: Relative edge: planning (52.9).

Watch: Relative gap: business (25.4).

Anthropic: Claude Sonnet 5 · Medium

Ranked #18 of 57 with an average benchmark score of 34.20 across 3 run(s). Sub-scores are strongest on business (42.5) and weakest on behavior (6.7). Average contribution profit $-140,485 and ROAS 106.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

34.20

106.5%

Sub-scores

Business: 42.55
Behavior: 6.72
Planning: 37.90
Persona: 39.08

Economics

Cost / run: $0.38
Spend: $617,288
Revenue: $666,935
Runs: 3
1st profitable mo (avg): —

Avg profit: $-140,485

Diagnosis

Strategically active, commercially negative

The model averages $-140,485 contribution profit with 106.5% ROAS. It creates activity and revenue, but not enough efficient margin.
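A ROAS above 100% alongside negative contribution profit can look contradictory. One way to reconcile the two is to assume contribution profit is roughly margin × revenue − spend, where margin is the share of revenue left after product and fulfilment costs. The simulation's actual cost model is not given on this page, so the formula and the helper names below are assumptions for illustration only.

```python
def implied_margin(profit: float, spend: float, revenue: float) -> float:
    """Margin rate implied by profit = margin * revenue - spend (assumed model)."""
    return (profit + spend) / revenue


def break_even_roas_pct(margin: float) -> float:
    """ROAS at which margin * revenue just covers ad spend."""
    return 100.0 / margin


# (avg profit, avg spend, avg revenue) copied from two cards on this page.
CARDS = {
    "Claude Sonnet 5 · Medium": (-140_485, 617_288, 666_935),
    "Claude Fable 5 · Medium": (648_530, 1_124_864, 2_432_395),
}

for name, (profit, spend, revenue) in CARDS.items():
    margin = implied_margin(profit, spend, revenue)
    print(f"{name}: implied margin {margin:.1%}, "
          f"break-even ROAS about {break_even_roas_pct(margin):.0f}%")
```

Under this assumption the implied margin is a little over 70% on both cards, which puts break-even ROAS somewhere near 140%. That would explain why a 106.5% ROAS still loses money, but it is an inference from averaged figures, not a documented property of the benchmark.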

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-140,485

Worst month: M2 ($-33,252)

Relative position

Vs leader: -13.35 score pts

Vs median: +8.13 score pts

Strongest: business (42.5)

Weakest: behavior (6.7)

Strength: Relative edge: business (42.5).

Watch: Relative gap: behavior (6.7).

Qwen: Qwen3.7 Max

Ranked #19 of 57 with an average benchmark score of 33.56 across 3 run(s). Sub-scores are strongest on planning (54.4) and weakest on business (20.7). Average contribution profit $131,537 and ROAS 154.4%. On average, cumulative contribution profit first turns positive around month 10.7.

33.56

154.4%

Sub-scores

Business: 20.70
Behavior: 36.64
Planning: 54.38
Persona: 51.96

Economics

Cost / run: $0.25
Spend: $1,059,199
Revenue: $1,638,590
Runs: 3
1st profitable mo (avg): ~10.7

Avg profit: $131,537

Diagnosis

Converts strategy into durable economics

The model averages 33.56 score and $131,537 contribution profit. Its strongest dimension is planning (54.4).

Trajectory

Negative months: 5 / 12

Final cumulative profit: $131,537

Worst month: M1 ($-42,025)

Relative position

Vs leader: -13.99 score pts

Vs median: +7.49 score pts

Strongest: planning (54.4)

Weakest: business (20.7)

Strength: Relative edge: planning (54.4).

Watch: Relative gap: business (20.7).

xAI: Grok 4.5 · Medium

Ranked #20 of 57 with an average benchmark score of 32.95 across 3 run(s). Sub-scores are strongest on planning (57.2) and weakest on business (23.9). Average contribution profit $209,084 and ROAS 163.1%. On average, cumulative contribution profit first turns positive around month 9.3.

32.95

163.1%

Sub-scores

Business: 23.94
Behavior: 25.53
Planning: 57.24
Persona: 50.05

Economics

Cost / run: $0.32
Spend: $1,095,476
Revenue: $1,789,523
Runs: 3
1st profitable mo (avg): ~9.3

Avg profit: $209,084

Diagnosis

Converts strategy into durable economics

The model averages 32.95 score and $209,084 contribution profit. Its strongest dimension is planning (57.2).

Trajectory

Negative months: 4 / 12

Final cumulative profit: $209,084

Worst month: M1 ($-35,986)

Relative position

Vs leader: -14.60 score pts

Vs median: +6.88 score pts

Strongest: planning (57.2)

Weakest: business (23.9)

Strength: Relative edge: planning (57.2).

Watch: Relative gap: business (23.9).

xAI: Grok 4.5 · Max

Ranked #21 of 57 with an average benchmark score of 32.44 across 3 run(s). Sub-scores are strongest on planning (57.0) and weakest on behavior (20.4). Average contribution profit $242,973 and ROAS 166.8%. On average, cumulative contribution profit first turns positive around month 9.3.

32.44

166.8%

Sub-scores

Business: 25.63
Behavior: 20.42
Planning: 57.00
Persona: 48.33

Economics

Cost / run: $0.32
Spend: $1,117,404
Revenue: $1,866,164
Runs: 3
1st profitable mo (avg): ~9.3

Avg profit: $242,973

Diagnosis

Converts strategy into durable economics

The model averages 32.44 score and $242,973 contribution profit. Its strongest dimension is planning (57.0).

Trajectory

Negative months: 5 / 12

Final cumulative profit: $242,973

Worst month: M1 ($-31,720)

Relative position

Vs leader: -15.11 score pts

Vs median: +6.37 score pts

Strongest: planning (57.0)

Weakest: behavior (20.4)

Strength: Relative edge: planning (57.0).

Watch: Relative gap: behavior (20.4).

xAI: Grok 4.5

Ranked #22 of 57 with an average benchmark score of 32.42 across 3 run(s). Sub-scores are strongest on planning (57.1) and weakest on behavior (22.8). Average contribution profit $204,735 and ROAS 162.8%. On average, cumulative contribution profit first turns positive around month 10.0.

32.42

162.8%

Sub-scores

Business: 24.40
Behavior: 22.84
Planning: 57.14
Persona: 48.83

Economics

Cost / run: $0.32
Spend: $1,091,423
Revenue: $1,778,146
Runs: 3
1st profitable mo (avg): ~10.0

Avg profit: $204,735

Diagnosis

Converts strategy into durable economics

The model averages 32.42 score and $204,735 contribution profit. Its strongest dimension is planning (57.1).

Trajectory

Negative months: 5 / 12

Final cumulative profit: $204,735

Worst month: M3 ($-29,591)

Relative position

Vs leader: -15.13 score pts

Vs median: +6.35 score pts

Strongest: planning (57.1)

Weakest: behavior (22.8)

Strength: Relative edge: planning (57.1).

Watch: Relative gap: behavior (22.8).

Anthropic: Claude Opus 4.7

A regression on ROASBench vs. 4.6: less persona-aware copy, a tilt toward intent capture over prospecting, and learning resets on its largest channel. More reactive, less consistent run-to-run. Averaged 30.06 across 3 completed runs; contribution profit $250,097; ROAS 166.6%. On average, cumulative contribution profit first turns positive around month 6.0.

30.06

166.6%

Sub-scores

Business: 25.58
Behavior: 18.78
Planning: 56.37
Persona: 37.52

Economics

Cost / run: $1.10
Spend: $1,046,108
Revenue: $1,788,239
Runs: 3
1st profitable mo (avg): ~6.0

Avg profit: $250,097

Diagnosis

Converts strategy into durable economics

The model averages 30.06 score and $250,097 contribution profit. Its strongest dimension is planning (56.4).

Trajectory

Negative months: 4 / 12

Final cumulative profit: $250,097

Worst month: M3 ($-26,790)

Relative position

Vs leader: -17.49 score pts

Vs median: +3.99 score pts

Strongest: planning (56.4)

Weakest: behavior (18.8)

Strength: Earlier first profitable month and occasional strong-profit spikes.

Watch: Persona fit collapses, Search learning resets, and discounting appears under pressure.

Anthropic: Claude Sonnet 5

Ranked #24 of 57 with an average benchmark score of 29.47 across 3 run(s). Sub-scores are strongest on planning (42.6) and weakest on behavior (10.6). Average contribution profit $-127,409 and ROAS 115.1%. On average, cumulative contribution profit first turns positive around month 6.5.

29.47

115.1%

Sub-scores

Business: 29.76
Behavior: 10.64
Planning: 42.57
Persona: 40.87

Economics

Cost / run: $0.37
Spend: $780,509
Revenue: $912,083
Runs: 3
1st profitable mo (avg): ~6.5

Avg profit: $-127,409

Diagnosis

Strategically active, commercially negative

The model averages $-127,409 contribution profit with 115.1% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-127,409

Worst month: M9 ($-44,856)

Relative position

Vs leader: -18.08 score pts

Vs median: +3.40 score pts

Strongest: planning (42.6)

Weakest: behavior (10.6)

Strength: Relative edge: planning (42.6).

Watch: Relative gap: behavior (10.6).

OpenAI: GPT-5.6 Terra

Ranked #25 of 57 with an average benchmark score of 29.08 across 3 run(s). Sub-scores are strongest on planning (54.6) and weakest on behavior (17.8). Average contribution profit $87,087 and ROAS 153.4%. On average, cumulative contribution profit first turns positive around month 9.5.

29.08

153.4%

Sub-scores

Business: 23.47
Behavior: 17.80
Planning: 54.57
Persona: 40.19

Economics

Cost / run: $0.33
Spend: $674,303
Revenue: $1,046,683
Runs: 3
1st profitable mo (avg): ~9.5

Avg profit: $87,087

Diagnosis

Converts strategy into durable economics

The model averages 29.08 score and $87,087 contribution profit. Its strongest dimension is planning (54.6).

Trajectory

Negative months: 4 / 12

Final cumulative profit: $87,087

Worst month: M1 ($-39,593)

Relative position

Vs leader: -18.47 score pts

Vs median: +3.01 score pts

Strongest: planning (54.6)

Weakest: behavior (17.8)

Strength: Relative edge: planning (54.6).

Watch: Relative gap: behavior (17.8).

OpenAI: GPT-5.6 Luna

Ranked #26 of 57 with an average benchmark score of 27.32 across 3 run(s). Sub-scores are strongest on planning (51.4) and weakest on business (15.5). Average contribution profit $-17,236 and ROAS 135.8%. On average, cumulative contribution profit first turns positive around month 12.0.

27.32

135.8%

Sub-scores

Business: 15.52
Behavior: 27.03
Planning: 51.38
Persona: 44.37

Economics

Cost / run: $0.15
Spend: $1,033,163
Revenue: $1,403,108
Runs: 3
1st profitable mo (avg): ~12.0

Avg profit: $-17,236

Diagnosis

Strategically active, commercially negative

The model averages $-17,236 contribution profit with 135.8% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 7 / 12

Final cumulative profit: $-17,236

Worst month: M1 ($-59,549)

Relative position

Vs leader: -20.23 score pts

Vs median: +1.25 score pts

Strongest: planning (51.4)

Weakest: business (15.5)

Strength: Relative edge: planning (51.4).

Watch: Relative gap: business (15.5).

Google: Gemini 3.1 Pro Preview

Often nearer break-even with structurally sensible moves; execution and generic creative hold the score down, with too many mid-course resets. Averaged 27.14 across 3 completed runs; contribution profit $-34,549; ROAS 132.9%. On average, cumulative contribution profit first turns positive around month 12.0.

27.14

132.9%

Sub-scores

Business: 12.17
Behavior: 29.25
Planning: 53.04
Persona: 49.11

Economics

Cost / run: $0.53
Spend: $1,002,707
Revenue: $1,333,366
Runs: 3
1st profitable mo (avg): ~12.0

Avg profit: $-34,549

Diagnosis

Strategically active, commercially negative

The model averages $-34,549 contribution profit with 132.9% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-34,549

Worst month: M1 ($-38,205)

Relative position

Vs leader: -20.41 score pts

Vs median: +1.07 score pts

Strongest: planning (53.0)

Weakest: business (12.2)

Strength: Directionally right budget and channel choices vs. weaker frontier peers.

Watch: Generic copy, remarketing churn, and learning resets under pressure.

OpenAI: GPT-5.6 Luna · Medium

Ranked #28 of 57 with an average benchmark score of 26.48 across 3 run(s). Sub-scores are strongest on planning (50.6) and weakest on business (16.4). Average contribution profit $39,991 and ROAS 143.8%. On average, cumulative contribution profit first turns positive around month 11.0.

26.48

143.8%

Sub-scores

Business: 16.42
Behavior: 22.97
Planning: 50.64
Persona: 42.24

Economics

Cost / run: $0.15
Spend: $1,038,018
Revenue: $1,495,167
Runs: 3
1st profitable mo (avg): ~11.0

Avg profit: $39,991

Diagnosis

Converts strategy into durable economics

The model averages 26.48 score and $39,991 contribution profit. Its strongest dimension is planning (50.6).

Trajectory

Negative months: 6 / 12

Final cumulative profit: $39,991

Worst month: M2 ($-42,996)

Relative position

Vs leader: -21.07 score pts

Vs median: +0.41 score pts

Strongest: planning (50.6)

Weakest: business (16.4)

Strength: Relative edge: planning (50.6).

Watch: Relative gap: business (16.4).

Qwen: Qwen3.5 Plus 2026-02-15

Ranked #29 of 57 with an average benchmark score of 26.07 across 3 run(s). Sub-scores are strongest on planning (51.7) and weakest on business (14.8). Average contribution profit $-29,480 and ROAS 133.8%. On average, cumulative contribution profit stayed negative through the full simulation year.

26.07

133.8%

Sub-scores

Business: 14.79
Behavior: 21.36
Planning: 51.68
Persona: 45.55

Economics

Cost / run: $0.10
Spend: $1,018,413
Revenue: $1,362,638
Runs: 3
1st profitable mo (avg): —

Avg profit: $-29,480

Diagnosis

Strategically active, commercially negative

The model averages $-29,480 contribution profit with 133.8% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-29,480

Worst month: M1 ($-41,310)

Relative position

Vs leader: -21.48 score pts

Vs median: +0.00 score pts

Strongest: planning (51.7)

Weakest: business (14.8)

Strength: Relative edge: planning (51.7).

Watch: Relative gap: business (14.8).
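The "1st profitable mo (avg)" field can be read as the first month in which running (cumulative) contribution profit goes above zero, averaged over runs. The sketch below shows one plausible way to compute it. The monthly series are invented for illustration, since the page publishes no per-month data, and averaging only over runs that ever turn positive is an assumption. That assumption would be consistent with a blank entry when no run breaks even, and with half-month values such as ~6.5 or ~9.5 on other cards despite 3 runs each.

```python
from itertools import accumulate


def first_positive_month(monthly_profit):
    """1-based month where cumulative profit first exceeds zero, else None."""
    for month, total in enumerate(accumulate(monthly_profit), start=1):
        if total > 0:
            return month
    return None


def average_first_positive(runs):
    """Mean break-even month over runs that broke even (assumed convention)."""
    months = [m for m in map(first_positive_month, runs) if m is not None]
    return sum(months) / len(months) if months else None


# Hypothetical monthly contribution profit for three short runs.
example_runs = [
    [-10, 5, 10],   # cumulative: -10, -5, 5 -> positive in month 3
    [-1, 2],        # cumulative: -1, 1 -> positive in month 2
    [-5, -5],       # never positive
]
print(average_first_positive(example_runs))
```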

OpenAI: GPT-5.5

Ranked #30 of 57 with an average benchmark score of 25.68 across 3 run(s). Sub-scores are strongest on planning (52.3) and weakest on business (13.7). Average contribution profit $15,398 and ROAS 140.9%. On average, cumulative contribution profit first turns positive around month 11.0.

25.68

140.9%

Sub-scores

Business: 13.67
Behavior: 24.59
Planning: 52.33
Persona: 42.51

Economics

Cost / run: $0.77
Spend: $929,589
Revenue: $1,307,291
Runs: 3
1st profitable mo (avg): ~11.0

Avg profit: $15,398

Diagnosis

Converts strategy into durable economics

The model averages 25.68 score and $15,398 contribution profit. Its strongest dimension is planning (52.3).

Trajectory

Negative months: 7 / 12

Final cumulative profit: $15,398

Worst month: M1 ($-37,793)

Relative position

Vs leader: -21.87 score pts

Vs median: -0.39 score pts

Strongest: planning (52.3)

Weakest: business (13.7)

Strength: Relative edge: planning (52.3).

Watch: Relative gap: business (13.7).

Anthropic: Claude Sonnet 4.6

Ranked #31 of 57 with an average benchmark score of 21.21 across 3 run(s). Sub-scores are strongest on planning (57.4) and weakest on business (9.7). Average contribution profit $-146,915 and ROAS 117.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

21.21

117.6%

Sub-scores

Business: 9.73
Behavior: 14.86
Planning: 57.35
Persona: 36.06

Economics

Cost / run: $0.53
Spend: $985,912
Revenue: $1,159,482
Runs: 3
1st profitable mo (avg): —

Avg profit: $-146,915

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (57.4), but business outcome score is only 9.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-146,915

Worst month: M2 ($-50,478)

Relative position

Vs leader: -26.34 score pts

Vs median: -4.86 score pts

Strongest: planning (57.4)

Weakest: business (9.7)

Strength: Relative edge: planning (57.4).

Watch: Relative gap: business (9.7).

Google: Gemini 3.5 Flash · High

Ranked #32 of 57 with an average benchmark score of 20.74 across 3 run(s). Sub-scores are strongest on planning (51.1) and weakest on business (6.8). Average contribution profit $-147,599 and ROAS 116.9%. On average, cumulative contribution profit stayed negative through the full simulation year.

20.74

116.9%

Sub-scores

Business: 6.83
Behavior: 21.44
Planning: 51.06
Persona: 38.37

Economics

Cost / run: $0.37
Spend: $972,068
Revenue: $1,136,720
Runs: 3
1st profitable mo (avg): —

Avg profit: $-147,599

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (51.1), but business outcome score is only 6.8. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-147,599

Worst month: M1 ($-35,809)

Relative position

Vs leader: -26.81 score pts

Vs median: -5.33 score pts

Strongest: planning (51.1)

Weakest: business (6.8)

Strength: Relative edge: planning (51.1).

Watch: Relative gap: business (6.8).

DeepSeek: DeepSeek V3.2

Ranked #33 of 57 with an average benchmark score of 19.77 across 3 run(s). Sub-scores are strongest on planning (49.5) and weakest on business (8.9). Average contribution profit $-126,535 and ROAS 120.4%. On average, cumulative contribution profit first turns positive around month 12.0.

19.77

120.4%

Sub-scores

Business: 8.93
Behavior: 13.28
Planning: 49.54
Persona: 37.26

Economics

Cost / run: $0.01
Spend: $987,722
Revenue: $1,193,277
Runs: 3
1st profitable mo (avg): ~12.0

Avg profit: $-126,535

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (49.5), but business outcome score is only 8.9. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-126,535

Worst month: M1 ($-44,732)

Relative position

Vs leader: -27.78 score pts

Vs median: -6.30 score pts

Strongest: planning (49.5)

Weakest: business (8.9)

Strength: Relative edge: planning (49.5).

Watch: Relative gap: business (8.9).
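
The "1st profitable mo" and "Negative months" fields both derive from a per-month profit series. The sketch below shows one way to compute them for a single run. The monthly figures are made up for illustration (the page does not publish per-month series), the function names are hypothetical, and the page does not state how runs that never break even enter the cross-run average.

```python
from itertools import accumulate
from typing import Optional, Sequence


def first_profitable_month(monthly_profit: Sequence[float]) -> Optional[int]:
    """1-indexed month in which cumulative profit first exceeds zero, else None."""
    for month, cumulative in enumerate(accumulate(monthly_profit), start=1):
        if cumulative > 0:
            return month
    return None


def negative_months(monthly_profit: Sequence[float]) -> int:
    """Number of months with a negative profit."""
    return sum(1 for profit in monthly_profit if profit < 0)


# Made-up series in $ thousands, NOT benchmark data.
late_recovery = [-40, -30, -20, -10, -5, 0, 5, 10, 20, 25, 30, 40]
never_recovers = [-10] * 12

print(first_profitable_month(late_recovery), negative_months(late_recovery))
print(first_profitable_month(never_recovers), negative_months(never_recovers))
```

In the first series the cumulative total only crosses zero in month 12, even though just five individual months are negative, which is why the two fields can tell different stories.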

MoonshotAI: Kimi K2.6 · Medium

Ranked #34 of 57 with an average benchmark score of 19.73 across 3 run(s). Sub-scores are strongest on planning (54.5) and weakest on business (10.0). Average contribution profit $-140,116 and ROAS 119.3%. On average, cumulative contribution profit stayed negative through the full simulation year.

19.73

119.3%

Sub-scores

Business: 10.02
Behavior: 10.64
Planning: 54.54
Persona: 33.58

Economics

Cost / run: $0.33
Spend: $999,691
Revenue: $1,193,169
Runs: 3
1st profitable mo (avg): —

Avg profit: $-140,116

Diagnosis

Strategically active, commercially negative

The model averages $-140,116 contribution profit with 119.3% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-140,116

Worst month: M2 ($-43,659)

Relative position

Vs leader: -27.82 score pts

Vs median: -6.34 score pts

Strongest: planning (54.5)

Weakest: business (10.0)

Strength: Relative edge: planning (54.5).

Watch: Relative gap: business (10.0).

MiniMax: MiniMax M3

Ranked #35 of 57 with an average benchmark score of 18.98 across 3 run(s). Sub-scores are strongest on planning (54.1) and weakest on business (9.0). Average contribution profit $-122,807 and ROAS 121.4%. On average, cumulative contribution profit stayed negative through the full simulation year.

18.98

121.4%

Sub-scores

Business: 9.02
Behavior: 13.50
Planning: 54.08
Persona: 29.35

Economics

Cost / run: $0.08
Spend: $978,979
Revenue: $1,190,032
Runs: 3
1st profitable mo (avg): —

Avg profit: $-122,807

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.1), but business outcome score is only 9.0. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 8 / 12

Final cumulative profit: $-122,807

Worst month: M4 ($-42,442)

Relative position

Vs leader: -28.57 score pts

Vs median: -7.09 score pts

Strongest: planning (54.1)

Weakest: business (9.0)

Strength: Relative edge: planning (54.1).

Watch: Relative gap: business (9.0).

OpenAI: GPT-5.4

Looks plausible on paper but compounds weakly: revenue comes without efficient spend patterns, and repeated broad demand spend brings no durable payoff. Averaged 18.39 across 3 completed run(s); contribution profit $-250,461; ROAS 103.2%. On average across runs, cumulative contribution profit never crossed zero in the first 12 months.

18.39

103.2%

Sub-scores

Business: 5.69
Behavior: 17.55
Planning: 54.12
Persona: 30.77

Economics

Cost / run: $0.29
Spend: $970,302
Revenue: $1,001,373
Runs: 3
1st profitable mo (avg): —

Avg profit: $-250,461

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.1), but business outcome score is only 5.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-250,461

Worst month: M1 ($-58,177)

Relative position

Vs leader: -29.16 score pts

Vs median: -7.68 score pts

Strongest: planning (54.1)

Weakest: business (5.7)

Strength: Readable strategy and channel mix in isolation.

Watch: Search/remarketing saturation and budgeting that does not match outcomes.

xAI: Grok 4.20 Beta

Ranked #37 of 57 with an average benchmark score of 16.77 across 3 run(s). Sub-scores are strongest on planning (43.9) and weakest on business (5.9). Average contribution profit $-243,069 and ROAS 103.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

16.77

103.5%

Sub-scores

Business: 5.91
Behavior: 16.36
Planning: 43.92
Persona: 29.32

Economics

Cost / run: $0.13
Spend: $972,947
Revenue: $1,009,791
Runs: 3
1st profitable mo (avg): —

Avg profit: $-243,069

Diagnosis

Strategically active, commercially negative

The model averages $-243,069 contribution profit with 103.5% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-243,069

Worst month: M1 ($-40,606)

Relative position

Vs leader: -30.78 score pts

Vs median: -9.30 score pts

Strongest: planning (43.9)

Weakest: business (5.9)

Strength: Relative edge: planning (43.9).

Watch: Relative gap: business (5.9).

MiniMax: MiniMax M3 · High

Ranked #38 of 57 with an average benchmark score of 16.73 across 3 run(s). Sub-scores are strongest on planning (55.9) and weakest on business (7.2). Average contribution profit $-201,100 and ROAS 110.2%. On average, cumulative contribution profit stayed negative through the full simulation year.

16.73

110.2%

Sub-scores

Business: 7.23
Behavior: 10.30
Planning: 55.85
Persona: 24.17

Economics

Cost / run: $0.10
Spend: $982,170
Revenue: $1,086,683
Runs: 3
1st profitable mo (avg): —

Avg profit: $-201,100

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (55.9), but business outcome score is only 7.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 9 / 12

Final cumulative profit: $-201,100

Worst month: M2 ($-43,664)

Relative position

Vs leader: -30.82 score pts

Vs median: -9.34 score pts

Strongest: planning (55.9)

Weakest: business (7.2)

Strength: Relative edge: planning (55.9).

Watch: Relative gap: business (7.2).

MiniMax: MiniMax M3 · Medium

Ranked #39 of 57 with an average benchmark score of 16.42 across 3 run(s). Sub-scores are strongest on planning (54.4) and weakest on behavior (6.2). Average contribution profit $-204,996 and ROAS 107.7%. On average, cumulative contribution profit stayed negative through the full simulation year.

16.42

107.7%

Sub-scores

Business: 7.03
Behavior: 6.20
Planning: 54.39
Persona: 28.54

Economics

Cost / run: $0.08
Spend: $949,308
Revenue: $1,034,124
Runs: 3
1st profitable mo (avg): —

Avg profit: $-204,996

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.4), but business outcome score is only 7.0. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-204,996

Worst month: M1 ($-42,192)

Relative position

Vs leader: -31.13 score pts

Vs median: -9.65 score pts

Strongest: planning (54.4)

Weakest: behavior (6.2)

Strength: Relative edge: planning (54.4).

Watch: Relative gap: behavior (6.2).

Google: Gemini 3 Flash Preview

Ranked #40 of 57 with an average benchmark score of 16.29 across 3 run(s). Sub-scores are strongest on planning (49.3) and weakest on business (4.6). Average contribution profit $-239,859 and ROAS 104.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

16.29

104.6%

Sub-scores

Business: 4.63
Behavior: 13.03
Planning: 49.33
Persona: 30.27

Economics

Cost / run: $0.10
Spend: $967,576
Revenue: $1,012,636
Runs: 3
1st profitable mo (avg): —

Avg profit: $-239,859

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (49.3), but business outcome score is only 4.6. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-239,859

Worst month: M4 ($-44,205)

Relative position

Vs leader: -31.26 score pts

Vs median: -9.78 score pts

Strongest: planning (49.3)

Weakest: business (4.6)

Strength: Relative edge: planning (49.3).

Watch: Relative gap: business (4.6).

Z.ai: GLM 5.2 · High

Ranked #41 of 57 with an average benchmark score of 14.19 across 3 run(s). Sub-scores are strongest on planning (49.2) and weakest on business (3.9). Average contribution profit $-283,971 and ROAS 98.9%. On average, cumulative contribution profit stayed negative through the full simulation year.

14.19

98.9%

Sub-scores

Business: 3.86
Behavior: 5.53
Planning: 49.19
Persona: 29.20

Economics

Cost / run: $0.18
Spend: $965,181
Revenue: $954,675
Runs: 3
1st profitable mo (avg): —

Avg profit: $-283,971

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (49.2), but business outcome score is only 3.9. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-283,971

Worst month: M2 ($-53,768)

Relative position

Vs leader: -33.36 score pts

Vs median: -11.88 score pts

Strongest: planning (49.2)

Weakest: business (3.9)

Strength: Relative edge: planning (49.2).

Watch: Relative gap: business (3.9).

MoonshotAI: Kimi K2.6

Ranked #42 of 57 with an average benchmark score of 14.14 across 3 run(s). Sub-scores are strongest on planning (53.4) and weakest on business (4.1). Average contribution profit $-276,146 and ROAS 99.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

14.14

99.6%

Sub-scores

Business: 4.14
Behavior: 6.72
Planning: 53.36
Persona: 23.98

Economics

Cost / run: $0.37
Spend: $967,322
Revenue: $963,786
Runs: 3
1st profitable mo (avg): —

Avg profit: $-276,146

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (53.4), but business outcome score is only 4.1. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-276,146

Worst month: M3 ($-47,805)

Relative position

Vs leader: -33.41 score pts

Vs median: -11.93 score pts

Strongest: planning (53.4)

Weakest: business (4.1)

Strength: Relative edge: planning (53.4).

Watch: Relative gap: business (4.1).

Z.ai: GLM 5.1

Ranked #43 of 57 with an average benchmark score of 13.37 across 3 run(s). Sub-scores are strongest on planning (48.0) and weakest on business (3.2). Average contribution profit $-253,302 and ROAS 102.9%. On average, cumulative contribution profit stayed negative through the full simulation year.

13.37

102.9%

Sub-scores

Business: 3.25
Behavior: 6.22
Planning: 47.98
Persona: 26.37

Economics

Cost / run: $0.27
Spend: $946,959
Revenue: $974,727
Runs: 3
1st profitable mo (avg): —

Avg profit: $-253,302

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (48.0), but business outcome score is only 3.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-253,302

Worst month: M1 ($-40,380)

Relative position

Vs leader: -34.18 score pts

Vs median: -12.70 score pts

Strongest: planning (48.0)

Weakest: business (3.2)

Strength: Relative edge: planning (48.0).

Watch: Relative gap: business (3.2).

MoonshotAI: Kimi K2.6 · High

Ranked #44 of 57 with an average benchmark score of 13.28 across 3 run(s). Sub-scores are strongest on planning (50.3) and weakest on business (4.9). Average contribution profit $-345,122 and ROAS 90.0%. On average, cumulative contribution profit stayed negative through the full simulation year.

13.28

90.0%

Sub-scores

Business: 4.94
Behavior: 5.25
Planning: 50.32
Persona: 20.66

Economics

Cost / run: $0.39
Spend: $964,218
Revenue: $868,117
Runs: 3
1st profitable mo (avg): —

Avg profit: $-345,122

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (50.3), but business outcome score is only 4.9. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-345,122

Worst month: M2 ($-53,286)

Relative position

Vs leader: -34.27 score pts

Vs median: -12.79 score pts

Strongest: planning (50.3)

Weakest: business (4.9)

Strength: Relative edge: planning (50.3).

Watch: Relative gap: business (4.9).

Z.ai: GLM 5.2 · Medium

Ranked #45 of 57 with an average benchmark score of 13.26 across 3 run(s). Sub-scores are strongest on planning (48.3) and weakest on business (3.0). Average contribution profit $-289,969 and ROAS 96.9%. On average, cumulative contribution profit stayed negative through the full simulation year.

13.26

96.9%

Sub-scores

Business: 3.00
Behavior: 8.20
Planning: 48.30
Persona: 24.04

Economics

Cost / run: $0.17
Spend: $930,295
Revenue: $903,624
Runs: 3
1st profitable mo (avg): —

Avg profit: $-289,969

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (48.3), but business outcome score is only 3.0. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-289,969

Worst month: M2 ($-57,508)

Relative position

Vs leader: -34.29 score pts

Vs median: -12.81 score pts

Strongest: planning (48.3)

Weakest: business (3.0)

Strength: Relative edge: planning (48.3).

Watch: Relative gap: business (3.0).

MoonshotAI: Kimi K2.5

Ranked #46 of 57 with an average benchmark score of 13.10 across 2 run(s). Sub-scores are strongest on planning (50.8) and weakest on business (3.7). Average contribution profit $-292,423 and ROAS 97.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

13.10

97.5%

Sub-scores

Business: 3.72
Behavior: 5.12
Planning: 50.75
Persona: 22.91

Economics

Cost / run: $0.17
Spend: $966,702
Revenue: $942,650
Runs: 2
1st profitable mo (avg): —

Avg profit: $-292,423

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (50.8), but business outcome score is only 3.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-292,423

Worst month: M4 ($-55,730)

Relative position

Vs leader: -34.45 score pts

Vs median: -12.97 score pts

Strongest: planning (50.8)

Weakest: business (3.7)

Strength: Relative edge: planning (50.8).

Watch: Relative gap: business (3.7).

MiniMax: MiniMax M2.7

Ranked #47 of 57 with an average benchmark score of 12.13 across 3 run(s). Sub-scores are strongest on planning (39.4) and weakest on business (3.5). Average contribution profit $-316,117 and ROAS 94.2%. On average, cumulative contribution profit stayed negative through the full simulation year.

12.13

94.2%

Sub-scores

Business: 3.54
Behavior: 9.08
Planning: 39.37
Persona: 21.24

Economics

Cost / run: $0.12
Spend: $961,885
Revenue: $906,316
Runs: 3
1st profitable mo (avg): —

Avg profit: $-316,117

Diagnosis

Strategically active, commercially negative

The model averages $-316,117 contribution profit with 94.2% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 10 / 12

Final cumulative profit: $-316,117

Worst month: M3 ($-51,444)

Relative position

Vs leader: -35.42 score pts

Vs median: -13.94 score pts

Strongest: planning (39.4)

Weakest: business (3.5)

Strength: Relative edge: planning (39.4).

Watch: Relative gap: business (3.5).

DeepSeek: DeepSeek V4 Pro

Ranked #48 of 57 with an average benchmark score of 12.11 across 3 run(s). Sub-scores are strongest on planning (47.3) and weakest on business (2.2). Average contribution profit $-370,744 and ROAS 85.7%. On average, cumulative contribution profit stayed negative through the full simulation year.

12.11

85.7%

Sub-scores

Business: 2.18
Behavior: 6.81
Planning: 47.32
Persona: 22.08

Economics

Cost / run: $0.05
Spend: $958,962
Revenue: $821,997
Runs: 3
1st profitable mo (avg): —

Avg profit: $-370,744

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (47.3), but business outcome score is only 2.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-370,744

Worst month: M1 ($-66,004)

Relative position

Vs leader: -35.44 score pts

Vs median: -13.96 score pts

Strongest: planning (47.3)

Weakest: business (2.2)

Strength: Relative edge: planning (47.3).

Watch: Relative gap: business (2.2).

OpenAI: GPT-5.4 Mini

Ranked #49 of 57 with an average benchmark score of 11.83 across 1 run(s). Sub-scores are strongest on planning (44.9) and weakest on business (2.0). Average contribution profit $-353,629 and ROAS 87.3%. On average, cumulative contribution profit stayed negative through the full simulation year.

11.83

87.3%

Sub-scores

Business: 2.00
Behavior: 5.58
Planning: 44.94
Persona: 24.04

Economics

Cost / run: $0.09
Spend: $952,498
Revenue: $831,373
Runs: 1
1st profitable mo (avg): —

Avg profit: $-353,629

Diagnosis

Strategically active, commercially negative

The model averages $-353,629 contribution profit with 87.3% ROAS. It creates activity and revenue, but not enough efficient margin.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-353,629

Worst month: M3 ($-59,559)

Relative position

Vs leader: -35.72 score pts

Vs median: -14.24 score pts

Strongest: planning (44.9)

Weakest: business (2.0)

Strength: Relative edge: planning (44.9).

Watch: Relative gap: business (2.0).

DeepSeek: DeepSeek V4 Pro · Medium

Ranked #50 of 57 with an average benchmark score of 11.26 across 3 run(s). Sub-scores are strongest on planning (46.7) and weakest on business (1.6). Average contribution profit $-344,923 and ROAS 89.4%. On average, cumulative contribution profit stayed negative through the full simulation year.

11.26

89.4%

Sub-scores

Business: 1.62
Behavior: 6.00
Planning: 46.73
Persona: 20.21

Economics

Cost / run: $0.05
Spend: $957,839
Revenue: $856,170
Runs: 3
1st profitable mo (avg): —

Avg profit: $-344,923

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (46.7), but business outcome score is only 1.6. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-344,923

Worst month: M3 ($-46,263)

Relative position

Vs leader: -36.29 score pts

Vs median: -14.81 score pts

Strongest: planning (46.7)

Weakest: business (1.6)

Strength: Relative edge: planning (46.7).

Watch: Relative gap: business (1.6).

Z.ai: GLM 5.2

Ranked #51 of 57 with an average benchmark score of 10.85 across 3 run(s). Sub-scores are strongest on planning (45.5) and weakest on business (1.6). Average contribution profit $-362,680 and ROAS 87.7%. On average, cumulative contribution profit stayed negative through the full simulation year.

10.85

87.7%

Sub-scores

Business: 1.64
Behavior: 4.58
Planning: 45.51
Persona: 20.31

Economics

Cost / run: $0.13
Spend: $958,844
Revenue: $840,846
Runs: 3
1st profitable mo (avg): —

Avg profit: $-362,680

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (45.5), but business outcome score is only 1.6. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-362,680

Worst month: M2 ($-54,815)

Relative position

Vs leader: -36.70 score pts

Vs median: -15.22 score pts

Strongest: planning (45.5)

Weakest: business (1.6)

Strength: Relative edge: planning (45.5).

Watch: Relative gap: business (1.6).

DeepSeek: DeepSeek V4 Pro · High

Ranked #52 of 57 with an average benchmark score of 10.62 across 3 run(s). Sub-scores are strongest on planning (45.7) and weakest on business (1.7). Average contribution profit $-363,896 and ROAS 86.7%. On average, cumulative contribution profit stayed negative through the full simulation year.

10.62

86.7%

Sub-scores

Business: 1.69
Behavior: 3.53
Planning: 45.73
Persona: 19.89

Economics

Cost / run: $0.05
Spend: $959,186
Revenue: $831,752
Runs: 3
1st profitable mo (avg): —

Avg profit: $-363,896

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (45.7), but business outcome score is only 1.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-363,896

Worst month: M6 ($-54,462)

Relative position

Vs leader: -36.93 score pts

Vs median: -15.45 score pts

Strongest: planning (45.7)

Weakest: business (1.7)

Strength: Relative edge: planning (45.7).

Watch: Relative gap: business (1.7).

Anthropic: Claude Opus 4.8 · High

Ranked #53 of 57 with an average benchmark score of 10.41 across 3 run(s). Sub-scores are strongest on planning (53.5) and weakest on business (1.5). Average contribution profit $-352,542 and ROAS 83.6%. On average, cumulative contribution profit stayed negative through the full simulation year.

10.41

83.6%

Sub-scores

Business: 1.46
Behavior: 6.58
Planning: 53.47
Persona: 10.83

Economics

Cost / run: $1.07
Spend: $862,852
Revenue: $719,316
Runs: 3
1st profitable mo (avg): —

Avg profit: $-352,542

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-352,542

Worst month: M1 ($-59,001)

Relative position

Vs leader: -37.14 score pts

Vs median: -15.66 score pts

Strongest: planning (53.5)

Weakest: business (1.5)

Strength: Relative edge: planning (53.5).

Watch: Relative gap: business (1.5).

Anthropic: Claude Opus 4.8 · Low

Ranked #54 of 57 with an average benchmark score of 9.98 across 3 run(s). Sub-scores are strongest on planning (54.2) and weakest on business (1.3). Average contribution profit $-299,490 and ROAS 89.4%. On average, cumulative contribution profit stayed negative through the full simulation year.

9.98

89.4%

Sub-scores

Business: 1.26
Behavior: 3.97
Planning: 54.23
Persona: 11.37

Economics

Cost / run: $1.00
Spend: $812,367
Revenue: $724,017
Runs: 3
1st profitable mo (avg): —

Avg profit: $-299,490

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.2), but business outcome score is only 1.3. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-299,490

Worst month: M1 ($-52,862)

Relative position

Vs leader: -37.57 score pts

Vs median: -16.09 score pts

Strongest: planning (54.2)

Weakest: business (1.3)

Strength: Relative edge: planning (54.2).

Watch: Relative gap: business (1.3).

Z.ai: GLM 5

Ranked #55 of 57 with an average benchmark score of 9.64 across 3 run(s). Sub-scores are strongest on planning (42.6) and weakest on business (0.9). Average contribution profit $-370,131 and ROAS 86.0%. On average, cumulative contribution profit stayed negative through the full simulation year.

9.64

86.0%

Sub-scores

Business: 0.91
Behavior: 5.34
Planning: 42.62
Persona: 16.65

Economics

Cost / run: $0.19
Spend: $949,211
Revenue: $816,409
Runs: 3
1st profitable mo (avg): —

Avg profit: $-370,131

Diagnosis

Audience fit is the main failure mode

Persona score is low (16.6), so the simulated shoppers are not buying the positioning even when the high-level strategy looks reasonable.

Trajectory

Negative months: 11 / 12

Final cumulative profit: $-370,131

Worst month: M2 ($-54,591)

Relative position

Vs leader: -37.91 score pts

Vs median: -16.43 score pts

Strongest: planning (42.6)

Weakest: business (0.9)

Strength: Relative edge: planning (42.6).

Watch: Relative gap: business (0.9).

Anthropic: Claude Opus 4.8

Ranked #56 of 57 with an average benchmark score of 9.54 across 3 run(s). Sub-scores are strongest on planning (54.6) and weakest on business (0.2). Average contribution profit $-312,597 and ROAS 86.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

9.54

86.5%

Sub-scores

Business: 0.20
Behavior: 3.36
Planning: 54.58
Persona: 12.30

Economics

Cost / run: $1.03
Spend: $804,712
Revenue: $694,581
Runs: 3
1st profitable mo (avg): —

Avg profit: $-312,597

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (54.6), but business outcome score is only 0.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-312,597

Worst month: M1 ($-57,005)

Relative position

Vs leader: -38.01 score pts

Vs median: -16.53 score pts

Strongest: planning (54.6)

Weakest: business (0.2)

Strength: Relative edge: planning (54.6).

Watch: Relative gap: business (0.2).

OpenAI: GPT-5.4 Nano

Ranked #57 of 57 with an average benchmark score of 6.80 across 3 run(s). Sub-scores are strongest on planning (46.5) and weakest on business (0.0). Average contribution profit $-577,506 and ROAS 56.5%. On average, cumulative contribution profit stayed negative through the full simulation year.

6.80

56.5%

Sub-scores

Business: 0.00
Behavior: 1.33
Planning: 46.45
Persona: 5.36

Economics

Cost / run: $0.02
Spend: $959,525
Revenue: $542,214
Runs: 3
1st profitable mo (avg): —

Avg profit: $-577,506

Diagnosis

Plans coherently, but the market does not reward the choices

Planning is the relative bright spot (46.5), but business outcome score is only 0.0. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.

Trajectory

Negative months: 12 / 12

Final cumulative profit: $-577,506

Worst month: M1 ($-67,937)

Relative position

Vs leader: -40.75 score pts

Vs median: -19.27 score pts

Strongest: planning (46.5)

Weakest: business (0.0)

Strength: Relative edge: planning (46.5).

Watch: Relative gap: business (0.0).

How ROASBench works

The sections below cover setup, simulation flow, what models see, personas, state, and the skills the benchmark rewards.

ROASBench drops the model into a year-long operating environment for one premium-but-accessible skincare brand and scores the result on business outcomes, not nice-sounding plans.

Brand

Northstar Skin

Barrier Repair Serum at $68 with 76% gross margin.
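
For orientation, the price and margin above imply a simple break-even point. The sketch below assumes gross margin is the only cost besides ad spend, which is a simplification: an entry on this page with 133.8% ROAS still reports negative contribution profit, so the benchmark's profit figure evidently reflects more than this (discounts are one plausible factor). Treat the result as a floor, not as the benchmark's own formula. Function names are illustrative.

```python
PRICE = 68.00         # Barrier Repair Serum list price, from the brand card
GROSS_MARGIN = 0.76   # gross margin, from the brand card


def gross_profit_per_unit(price: float = PRICE, margin: float = GROSS_MARGIN) -> float:
    """Gross profit on one full-price unit; also the most ad spend one sale can absorb."""
    return price * margin


def break_even_roas_pct(margin: float = GROSS_MARGIN) -> float:
    """ROAS at which gross profit exactly covers ad spend (simplified model)."""
    return 100 / margin


print(f"Gross profit per unit: ${gross_profit_per_unit():.2f}")
print(f"Break-even ROAS (floor): {break_even_roas_pct():.1f}%")
```

Under these assumptions each full-price sale carries about $51.68 of gross profit, and ROAS has to clear roughly 131.6% before ad spend is covered.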

Time horizon

12 months

The model has to adapt over time instead of solving one isolated scenario.

Controlled channels

Meta prospecting, Search, Shopping, TikTok, Email / CRM, and Remarketing.

Scoring

Business + behavior

Primary score blends profitability, planning quality, persona response, and long-run adaptation.

What the model can control

Budget allocation — spend per month and split across channels.
Campaign design — type, segments, creative angle, copy structure.
Offer strategy — discounts, remarketing, CRM cadence, margin vs. conversion.
Iteration — hold, scale, or change course after monthly results.
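
The controls listed above could be captured in a structure like the following. This is a hypothetical schema for illustration only: the field names, channel keys, and validation rules are assumptions, and the benchmark's actual plan format is not shown on this page.

```python
from dataclasses import dataclass
from typing import Dict

# Channel keys are illustrative labels for the six controlled channels.
CHANNELS = ("meta_prospecting", "search", "shopping", "tiktok", "email_crm", "remarketing")


@dataclass
class MonthlyPlan:
    objective: str
    budget: float                    # total spend for the month, in dollars
    discount_pct: float              # 0 to 100
    remarketing_on: bool
    channel_split: Dict[str, float]  # share of budget per channel, sums to 1
    creative_angle: str

    def validate(self) -> None:
        """Raise ValueError if the plan is internally inconsistent."""
        if self.budget < 0:
            raise ValueError("budget must be non-negative")
        if not 0 <= self.discount_pct <= 100:
            raise ValueError("discount_pct must be between 0 and 100")
        unknown = set(self.channel_split) - set(CHANNELS)
        if unknown:
            raise ValueError(f"unknown channels: {sorted(unknown)}")
        if any(share < 0 for share in self.channel_split.values()):
            raise ValueError("channel shares must be non-negative")
        if abs(sum(self.channel_split.values()) - 1.0) > 1e-6:
            raise ValueError("channel shares must sum to 1")

    def channel_budgets(self) -> Dict[str, float]:
        """Dollar budget per channel after validation."""
        self.validate()
        return {name: self.budget * share for name, share in self.channel_split.items()}


plan = MonthlyPlan(
    objective="acquire Ingredient Researchers",
    budget=80_000.0,
    discount_pct=10.0,
    remarketing_on=True,
    channel_split={"meta_prospecting": 0.4, "search": 0.3, "email_crm": 0.1, "remarketing": 0.2},
    creative_angle="clinical proof for barrier repair",
)
print(plan.channel_budgets())
```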

Each round is a real operating cycle, not a one-shot prompt. Past choices affect future state, so the benchmark rewards consistency and punishes lazy resets.

1. Seeded world

Fixed brand, budget, customers, email list, warm pool, seasonality, shocks.

2. Decision step

Structured monthly plan: objective, budget, discount, remarketing, channels, creative.

3. Persona panel + rules

Panel judges copy and targeting; rules produce clicks, trust, purchases, retention.

4. State update

Budget, base, momentum, fatigue, pools, and channel memory roll forward.
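
To make the decision step concrete, here is a hedged sketch of what a structured monthly plan could look like. The field names, the channel identifiers, and the validation rules are assumptions for illustration only. The benchmark's actual schema is not shown on this page, and how remarketing is encoded (a flag here) is a guess.

```python
from dataclasses import dataclass
from typing import Dict, List

# The six controllable channels named in the setup; the identifiers are hypothetical.
CHANNELS = ("meta_prospecting", "search", "shopping", "tiktok", "email_crm", "remarketing")

@dataclass
class MonthlyPlan:
    month: int                       # 1..12
    objective: str                   # free-text goal for the month
    budget: float                    # total spend in USD
    discount_pct: float              # fractional discount, 0.0 to just under 1.0
    remarketing: bool                # whether a remarketing push is on (assumed encoding)
    channel_split: Dict[str, float]  # share of budget per channel, should sum to 1
    creative_angle: str              # the angle the copy leads with

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the plan is well-formed."""
        problems = []
        if not 1 <= self.month <= 12:
            problems.append("month must be between 1 and 12")
        if self.budget < 0:
            problems.append("budget must be non-negative")
        if not 0.0 <= self.discount_pct < 1.0:
            problems.append("discount_pct must be in [0, 1)")
        unknown = sorted(set(self.channel_split) - set(CHANNELS))
        if unknown:
            problems.append(f"unknown channels: {unknown}")
        if abs(sum(self.channel_split.values()) - 1.0) > 1e-6:
            problems.append("channel shares must sum to 1")
        return problems

    def channel_budgets(self) -> Dict[str, float]:
        """Dollar budget per channel implied by the split."""
        return {ch: self.budget * share for ch, share in self.channel_split.items()}

plan = MonthlyPlan(
    month=1, objective="acquire high-value customers", budget=80_000,
    discount_pct=0.0, remarketing=True,
    channel_split={"meta_prospecting": 0.35, "search": 0.25, "shopping": 0.15,
                   "email_crm": 0.10, "remarketing": 0.15},
    creative_angle="clinical proof for barrier repair",
)
```

Keeping the plan as typed data rather than free text makes month-over-month changes easy to compare, which matters because past choices roll forward into state.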

What data the model gets back

Monthly operating metrics: spend, revenue, ROAS, CAC, repeat rate, reinvestment.
State: budget left, customer base, email list, warm pool, momentum, fatigue.
Audience map: segments, persona sizes, fit, competition, value.
Working memory: prior decisions, highlights, penalty flags.
Market notes: seasonality and shocks.

The prompt contains no raw persona-by-persona judge feedback; the model has to infer persona response from outcomes.
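
The monthly operating metrics can be related with standard marketing formulas. This sketch assumes ROAS is revenue divided by spend, CAC is spend divided by new customers, and repeat rate is repeat orders over total orders. The exact definitions the benchmark uses are not stated here, and the input numbers are illustrative.

```python
from typing import Dict, Optional

def monthly_metrics(spend: float, revenue: float, new_customers: int,
                    repeat_orders: int, total_orders: int) -> Dict[str, Optional[float]]:
    """Compute ROAS, CAC, and repeat rate; None where a denominator is zero."""
    return {
        # Return on ad spend: revenue earned per dollar spent.
        "roas": revenue / spend if spend > 0 else None,
        # Customer acquisition cost: dollars spent per new customer.
        "cac": spend / new_customers if new_customers > 0 else None,
        # Share of this month's orders that came from existing customers.
        "repeat_rate": repeat_orders / total_orders if total_orders > 0 else None,
    }

# Illustrative month, not taken from any benchmark run.
m = monthly_metrics(spend=50_000, revenue=100_000, new_customers=1_000,
                    repeat_orders=300, total_orders=1_500)
```

A ROAS of 2.0 in this sense corresponds to the 200% style of reporting used on the leaderboard.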

Main difficulties

Learning resets
Abrupt reallocations hurt efficiency.

Saturation
Finite warm pools and auctions.

Offer fatigue
Discounts can damage later months.

Organic carryover
Upper funnel pays off slowly.

Persona tradeoffs
Easy vs. valuable audiences.

Soft caps
CRM, retargeting, channel limits.
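
One way to reason about learning resets is to measure how much of the budget changes channel from one month to the next. The sketch below uses half the sum of absolute share differences, which is the fraction of budget that moved. This is an illustrative metric only: the benchmark's actual penalty rule is not documented on this page, and the channel names and splits are made up.

```python
from typing import Dict

def reallocation_shift(prev: Dict[str, float], new: Dict[str, float]) -> float:
    """Fraction of budget that moved between channels (0 = unchanged, 1 = fully moved).

    Both inputs map channel name to budget share and are expected to sum to 1.
    """
    channels = set(prev) | set(new)
    return 0.5 * sum(abs(new.get(c, 0.0) - prev.get(c, 0.0)) for c in channels)

last_month = {"meta_prospecting": 0.40, "search": 0.30, "email_crm": 0.30}

# A small tune-up: 5% of budget moves from CRM to prospecting.
steady = reallocation_shift(
    last_month, {"meta_prospecting": 0.45, "search": 0.30, "email_crm": 0.25}
)

# An abrupt restructure: CRM dropped, most prospecting budget moved to TikTok.
abrupt = reallocation_shift(
    last_month, {"meta_prospecting": 0.10, "search": 0.30, "tiktok": 0.60}
)
```

Here the steady change moves about 5% of the budget and the abrupt one about 60%. A model tracking a number like this could spot when it is restructuring rather than iterating.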

Every persona differs in size, growth, fit, competition, and value. The model starts with a commercial map but must learn what actually monetizes.

Value Seeker

Large and relatively easy to wake up with offers, but lower-value and highly price-competitive.

Motivations: visible results, discount

Premium Conscious

Smaller but high-value premium audience with strong fit for the brand and heavy competition from other prestige skincare.

Motivations: ingredients, authority

Ingredient Researcher

Harder to win because they scrutinize claims, but they compound into valuable, durable customers when convinced.

Motivations: clinical details, ingredient list

Impulse Buyer

Big upper-funnel opportunity that is easier to engage creatively, but conversion quality and retention are weaker.

Motivations: aesthetic creative, quick payoff

Comparison Shopper

Commercially meaningful and high-intent, but expensive to win because comparison behavior increases competition and pressure on proof.

Motivations: clear differentiation, proof

Returning Loyalist

Smaller owned audience but the most valuable and efficient to monetize if protected with the right cadence.

Motivations: routine, restock convenience

Lapsed Customer

Warm and recoverable with decent value, but reactivation requires freshness and fatigue management.

Motivations: newness, better routine fit

Low Intent Browser

Largest reachable pool and easiest to attract at the top of funnel, but low intent and low customer value.

Motivations: light curiosity, visual intrigue

Fixed upfront

Brand, economics, personas, channels, seasonality, shocks.

Persistent state

Budget, customers, email, warm pool, momentum, fatigue, reinvestment, channel memory.

Iteration summary

Prior decisions and outcomes compressed for the next month.

Economic judgment

Margin, contribution profit, CAC, bad revenue.

Budget pacing

When to press, hold, sequence spend.

Channel allocation

Prospecting vs. intent vs. CRM vs. remarketing.

Persona targeting

Easy vs. valuable audiences.

Creative specificity

Angles that match motivations and objections.

Stable iteration

Improve in place; avoid constant restructures.

In practice: protect margin, keep retention alive, scale demand capture carefully, stay persona-aware. Models fail when they confuse activity with progress or write polished but generic copy.

Claude Fable 5 Family Tops ROASBench, Displacing Opus 4.6

The claude-fable-5-high and claude-fable-5 models now top the leaderboard, overtaking claude-opus-4.6 and its variant claude-opus-4.6-high. While the Fable 5 family excels, older models like claude-opus-4.7 and mid-tier competitors including claude-sonnet-4.6, gemini-3.5-flash-high, deepseek-v3.2, and kimi-k2.5 struggle to generate positive returns. Most concerning is the claude-opus-4.8 family (claude-opus-4.8-high, claude-opus-4.8-low, and base claude-opus-4.8), which completely fails to translate strategic planning into business outcomes.

Head-to-head

Fable 5 High vs Opus 4.6

The new leader displaces Anthropic's previous benchmark standard by converting strategy into durable economics faster and with better audience alignment.

Avg score
Claude Opus 4.6: 40.61
Claude Fable 5 High: 43.70
Δ: +3.09

Avg profit
Claude Opus 4.6: $506,094
Claude Fable 5 High: $560,767
Δ: +$54,673

Persona score
Claude Opus 4.6: 56.75
Claude Fable 5 High: 59.42
Δ: +2.67

Where the gap opens

Both models are exceptional planners (Opus 4.6 at 59.06, Fable 5 High at 60.43), but Fable 5 High pulls ahead in persona fit and behavioral execution. By better aligning its copy and offers with the simulated audience, it achieves a higher ROAS (1.97 vs 1.92), and its cumulative contribution profit turns positive nearly two months earlier (month 5.3 vs 7.0 on average). This separates the Fable 5 family from the rest of the field: it doesn't just plan well, it executes with compounding efficiency.
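
The head-to-head gaps can be recomputed directly from the figures quoted in this comparison. The snippet below is a small check, with dictionary keys chosen for illustration. All values are the ones stated in the text.

```python
# Values as quoted in the head-to-head comparison.
opus_46 = {
    "score": 40.61, "profit": 506_094, "persona": 56.75,
    "planning": 59.06, "roas": 1.92, "break_even_month": 7.0,
}
fable_5_high = {
    "score": 43.70, "profit": 560_767, "persona": 59.42,
    "planning": 60.43, "roas": 1.97, "break_even_month": 5.3,
}

# Fable 5 High minus Opus 4.6, rounded to two decimals to avoid float noise.
deltas = {key: round(fable_5_high[key] - opus_46[key], 2) for key in opus_46}

for key, value in deltas.items():
    print(f"{key}: {value:+}")
```

The planning gap (+1.37) is the smallest of the score components shown, while break-even arrives 1.7 months sooner, which matches the reading that the difference is in execution rather than planning.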
