Money balance over time
Average cumulative profit by month
Average across all completed runs for each participant.
A hard-mode performance marketing simulation for LLMs. Models act as the marketer for a DTC skincare brand, choose channels, plan spend, write creative angles, react to results, and live with the consequences for 12 months.
Top cumulative profitability
Anthropic: Claude Opus 4.6
$506,094
Current leader
Anthropic: Claude Opus 4.6
Avg score 40.61
Profitable after 12 months
6 models
Including Anthropic: Claude Opus 4.6, Anthropic: Claude Opus 4.6 · High, and OpenAI: GPT-5.5 Pro
Lead over #2
+2.80 points
vs. Anthropic: Claude Opus 4.6 · High
Closest profit challenger
Anthropic: Claude Opus 4.6 · High
$411,580 after 12 months
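The "Lead over #2" figure and the leaderboard's "Vs leader" and "Vs median" values are plain score differences. The sketch below reproduces a few of them from the published average scores. The function names are illustrative, and the site's own computation is not documented here, so treat this as a consistency check rather than the benchmark's code.

```python
from statistics import median

# Average scores copied from the leaderboard, rank 1 through 23.
scores = [
    40.61, 37.81, 37.78, 33.56, 30.06, 27.14, 26.07, 25.68,
    21.21, 20.74, 19.77, 18.39, 16.77, 16.29, 13.37, 13.10,
    12.13, 11.83, 10.41, 9.98, 9.64, 9.54, 6.80,
]

leader_score = max(scores)
median_score = median(scores)  # 12th of 23 values


def vs_leader(score: float) -> float:
    """Gap to the leader in score points, rounded to two decimals."""
    return round(score - leader_score, 2)


def vs_median(score: float) -> float:
    """Gap to the median model in score points, rounded to two decimals."""
    return round(score - median_score, 2)


print(median_score)       # 18.39
print(vs_leader(37.81))   # -2.8
print(vs_median(40.61))   # 22.22
print(vs_median(6.80))    # -11.59
```

With 23 models the median is the 12th-ranked score, which is why that model shows +0.00 against the median.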
Score vs. cost per run
Average benchmark score against average main-model API cost per run.
Monthly contribution profit
Average monthly contribution profit across completed runs.
Average score
Models ranked by average primary score (highest first). Values above each bar are the mean score; whiskers show standard deviation across completed runs when more than one run exists.
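As a minimal sketch of the bar and whisker values described above: the mean is taken over a model's completed runs, and a spread is reported only when there is more than one run. The run scores below are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption; the page does not say which it uses.

```python
from statistics import mean, stdev
from typing import List, Optional, Tuple


def summarize(run_scores: List[float]) -> Tuple[float, Optional[float]]:
    """Return (mean score, standard deviation or None for a single run)."""
    average = mean(run_scores)
    # Assumption: sample standard deviation; undefined for one run.
    spread = stdev(run_scores) if len(run_scores) > 1 else None
    return average, spread


# Hypothetical per-run scores, not taken from the benchmark.
print(summarize([10.0, 12.0, 14.0]))  # (12.0, 2.0)
print(summarize([11.83]))             # (11.83, None)
```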
Leaderboard
Sorted by average benchmark score. Sub-scores and detail are listed per model.
| # | Model | Score | Avg profit | ROAS | First profitable month |
|---|---|---|---|---|---|
| 1 | Anthropic: Claude Opus 4.6 | 40.61 | $506,094 | 192.5% | 7.0 |
| 2 | Anthropic: Claude Opus 4.6 · High | 37.81 | $411,580 | 182.1% | 8.0 |
| 3 | OpenAI: GPT-5.5 Pro | 37.78 | $397,813 | 184.9% | 7.3 |
| 4 | Qwen: Qwen3.7 Max | 33.56 | $131,537 | 154.4% | 10.7 |
| 5 | Anthropic: Claude Opus 4.7 | 30.06 | $250,097 | 166.6% | 6.0 |
| 6 | Google: Gemini 3.1 Pro Preview | 27.14 | $-34,549 | 132.9% | 12.0 |
| 7 | Qwen: Qwen3.5 Plus 2026-02-15 | 26.07 | $-29,480 | 133.8% | not reached |
| 8 | OpenAI: GPT-5.5 | 25.68 | $15,398 | 140.9% | 11.0 |
| 9 | Anthropic: Claude Sonnet 4.6 | 21.21 | $-146,915 | 117.6% | not reached |
| 10 | Google: Gemini 3.5 Flash · High | 20.74 | $-147,599 | 116.9% | not reached |
| 11 | DeepSeek: DeepSeek V3.2 | 19.77 | $-126,535 | 120.4% | 12.0 |
| 12 | OpenAI: GPT-5.4 | 18.39 | $-250,461 | 103.2% | not reached |
| 13 | xAI: Grok 4.20 Beta | 16.77 | $-243,069 | 103.5% | not reached |
| 14 | Gemini 3 Flash Preview | 16.29 | $-239,859 | 104.6% | not reached |
| 15 | Z.ai: GLM 5.1 | 13.37 | $-253,302 | 102.9% | not reached |
| 16 | MoonshotAI: Kimi K2.5 | 13.10 | $-292,423 | 97.5% | not reached |
| 17 | MiniMax: MiniMax M2.7 | 12.13 | $-316,117 | 94.2% | not reached |
| 18 | OpenAI: GPT-5.4 Mini | 11.83 | $-353,629 | 87.3% | not reached |
| 19 | Anthropic: Claude Opus 4.8 · High | 10.41 | $-352,542 | 83.6% | not reached |
| 20 | Anthropic: Claude Opus 4.8 · Low | 9.98 | $-299,490 | 89.4% | not reached |
| 21 | Z.ai: GLM 5 | 9.64 | $-370,131 | 86.0% | not reached |
| 22 | Anthropic: Claude Opus 4.8 | 9.54 | $-312,597 | 86.5% | not reached |
| 23 | OpenAI: GPT-5.4 Nano | 6.80 | $-577,506 | 56.5% | not reached |

"First profitable month" is the month in which cumulative contribution profit first turns positive, averaged across runs.

Sub-scores and detail

1. Anthropic: Claude Opus 4.6
Summary: Leads the pack by compounding a coherent plan: retention channels stay funded, discounting stays rare, and changes are absorbed without constant learning resets. Averaged 40.61 across 3 completed runs; contribution profit $506,094; ROAS 192.5%. On average, cumulative contribution profit first turns positive around month 7.0.
Diagnosis: Converts strategy into durable economics. The model averages a score of 40.61 and $506,094 in contribution profit. Its strongest dimension is planning (59.1).
Trajectory: 3 of 12 months negative; final cumulative profit $506,094; worst month M1 ($-30,242).
Relative position: vs. leader +0.00 score pts; vs. median +22.22 score pts; strongest planning (59.1); weakest behavior (27.3).
Strength: Stable iteration, persona-aware creative, disciplined CRM and remarketing.
Watch: Still hits saturation and high CAC when scaling search and broad demand.

2. Anthropic: Claude Opus 4.6 · High
Summary: Leads the pack by compounding a coherent plan: retention channels stay funded, discounting stays rare, and changes are absorbed without constant learning resets. Averaged 37.81 across 3 completed runs; contribution profit $411,580; ROAS 182.1%. On average, cumulative contribution profit first turns positive around month 8.0.
Diagnosis: Converts strategy into durable economics. The model averages a score of 37.81 and $411,580 in contribution profit. Its strongest dimension is planning (57.2).
Trajectory: 3 of 12 months negative; final cumulative profit $411,580; worst month M1 ($-29,462).
Relative position: vs. leader -2.80 score pts; vs. median +19.42 score pts; strongest planning (57.2); weakest behavior (26.7).
Strength: Stable iteration, persona-aware creative, disciplined CRM and remarketing.
Watch: Still hits saturation and high CAC when scaling search and broad demand.

3. OpenAI: GPT-5.5 Pro
Summary: Ranked #3 of 23 with an average benchmark score of 37.78 across 3 runs. Sub-scores are strongest on planning (55.6) and weakest on business (30.1). Average contribution profit $397,813 and ROAS 184.9%. On average, cumulative contribution profit first turns positive around month 7.3.
Diagnosis: Converts strategy into durable economics. The model averages a score of 37.78 and $397,813 in contribution profit. Its strongest dimension is planning (55.6).
Trajectory: 2 of 12 months negative; final cumulative profit $397,813; worst month M1 ($-29,816).
Relative position: vs. leader -2.83 score pts; vs. median +19.39 score pts; strongest planning (55.6); weakest business (30.1).
Strength: Relative edge in planning (55.6).
Watch: Relative gap in business (30.1).

4. Qwen: Qwen3.7 Max
Summary: Ranked #4 of 23 with an average benchmark score of 33.56 across 3 runs. Sub-scores are strongest on planning (54.4) and weakest on business (20.7). Average contribution profit $131,537 and ROAS 154.4%. On average, cumulative contribution profit first turns positive around month 10.7.
Diagnosis: Converts strategy into durable economics. The model averages a score of 33.56 and $131,537 in contribution profit. Its strongest dimension is planning (54.4).
Trajectory: 5 of 12 months negative; final cumulative profit $131,537; worst month M1 ($-42,025).
Relative position: vs. leader -7.05 score pts; vs. median +15.17 score pts; strongest planning (54.4); weakest business (20.7).
Strength: Relative edge in planning (54.4).
Watch: Relative gap in business (20.7).

5. Anthropic: Claude Opus 4.7
Summary: A regression on ROASBench vs. 4.6: less persona-aware copy, a tilt toward intent capture over prospecting, and learning resets on its largest channel. More reactive, less consistent run-to-run. Averaged 30.06 across 3 completed runs; contribution profit $250,097; ROAS 166.6%. On average, cumulative contribution profit first turns positive around month 6.0.
Diagnosis: Converts strategy into durable economics. The model averages a score of 30.06 and $250,097 in contribution profit. Its strongest dimension is planning (56.4).
Trajectory: 4 of 12 months negative; final cumulative profit $250,097; worst month M3 ($-26,790).
Relative position: vs. leader -10.55 score pts; vs. median +11.67 score pts; strongest planning (56.4); weakest behavior (18.8).
Strength: Earlier first profitable month and occasional strong-profit spikes.
Watch: Persona fit collapses, Search learning resets, and discounting appears under pressure.

6. Google: Gemini 3.1 Pro Preview
Summary: Often nearer break-even with structurally sensible moves; execution and generic creative hold the score down, with too many mid-course resets. Averaged 27.14 across 3 completed runs; contribution profit $-34,549; ROAS 132.9%. On average, cumulative contribution profit first turns positive around month 12.0.
Diagnosis: Strategically active, commercially negative. The model averages $-34,549 in contribution profit with 132.9% ROAS. It creates activity and revenue, but not enough efficient margin.
Trajectory: 8 of 12 months negative; final cumulative profit $-34,549; worst month M1 ($-38,205).
Relative position: vs. leader -13.47 score pts; vs. median +8.75 score pts; strongest planning (53.0); weakest business (12.2).
Strength: Directionally right budget and channel choices vs. weaker frontier peers.
Watch: Generic copy, remarketing churn, and learning resets under pressure.

7. Qwen: Qwen3.5 Plus 2026-02-15
Summary: Ranked #7 of 23 with an average benchmark score of 26.07 across 3 runs. Sub-scores are strongest on planning (51.7) and weakest on business (14.8). Average contribution profit $-29,480 and ROAS 133.8%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Strategically active, commercially negative. The model averages $-29,480 in contribution profit with 133.8% ROAS. It creates activity and revenue, but not enough efficient margin.
Trajectory: 8 of 12 months negative; final cumulative profit $-29,480; worst month M1 ($-41,310).
Relative position: vs. leader -14.54 score pts; vs. median +7.68 score pts; strongest planning (51.7); weakest business (14.8).
Strength: Relative edge in planning (51.7).
Watch: Relative gap in business (14.8).

8. OpenAI: GPT-5.5
Summary: Ranked #8 of 23 with an average benchmark score of 25.68 across 3 runs. Sub-scores are strongest on planning (52.3) and weakest on business (13.7). Average contribution profit $15,398 and ROAS 140.9%. On average, cumulative contribution profit first turns positive around month 11.0.
Diagnosis: Converts strategy into durable economics. The model averages a score of 25.68 and $15,398 in contribution profit. Its strongest dimension is planning (52.3).
Trajectory: 7 of 12 months negative; final cumulative profit $15,398; worst month M1 ($-37,793).
Relative position: vs. leader -14.93 score pts; vs. median +7.29 score pts; strongest planning (52.3); weakest business (13.7).
Strength: Relative edge in planning (52.3).
Watch: Relative gap in business (13.7).

9. Anthropic: Claude Sonnet 4.6
Summary: Ranked #9 of 23 with an average benchmark score of 21.21 across 3 runs. Sub-scores are strongest on planning (57.4) and weakest on business (9.7). Average contribution profit $-146,915 and ROAS 117.6%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (57.4), but business outcome score is only 9.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 8 of 12 months negative; final cumulative profit $-146,915; worst month M2 ($-50,478).
Relative position: vs. leader -19.40 score pts; vs. median +2.82 score pts; strongest planning (57.4); weakest business (9.7).
Strength: Relative edge in planning (57.4).
Watch: Relative gap in business (9.7).

10. Google: Gemini 3.5 Flash · High
Summary: Ranked #10 of 23 with an average benchmark score of 20.74 across 3 runs. Sub-scores are strongest on planning (51.1) and weakest on business (6.8). Average contribution profit $-147,599 and ROAS 116.9%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (51.1), but business outcome score is only 6.8. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 9 of 12 months negative; final cumulative profit $-147,599; worst month M1 ($-35,809).
Relative position: vs. leader -19.87 score pts; vs. median +2.35 score pts; strongest planning (51.1); weakest business (6.8).
Strength: Relative edge in planning (51.1).
Watch: Relative gap in business (6.8).

11. DeepSeek: DeepSeek V3.2
Summary: Ranked #11 of 23 with an average benchmark score of 19.77 across 3 runs. Sub-scores are strongest on planning (49.5) and weakest on business (8.9). Average contribution profit $-126,535 and ROAS 120.4%. On average, cumulative contribution profit first turns positive around month 12.0.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (49.5), but business outcome score is only 8.9. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 8 of 12 months negative; final cumulative profit $-126,535; worst month M1 ($-44,732).
Relative position: vs. leader -20.84 score pts; vs. median +1.38 score pts; strongest planning (49.5); weakest business (8.9).
Strength: Relative edge in planning (49.5).
Watch: Relative gap in business (8.9).

12. OpenAI: GPT-5.4
Summary: Looks plausible on paper but weak compounding: revenue without efficient spend patterns; repeated broad demand spend without durable payoff. Averaged 18.39 across 3 completed runs; contribution profit $-250,461; ROAS 103.2%. Across runs, cumulative contribution profit never crossed zero on average in the first 12 months.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (54.1), but business outcome score is only 5.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 9 of 12 months negative; final cumulative profit $-250,461; worst month M1 ($-58,177).
Relative position: vs. leader -22.22 score pts; vs. median +0.00 score pts; strongest planning (54.1); weakest business (5.7).
Strength: Readable strategy and channel mix in isolation.
Watch: Search/remarketing saturation and budgeting that does not match outcomes.

13. xAI: Grok 4.20 Beta
Summary: Ranked #13 of 23 with an average benchmark score of 16.77 across 3 runs. Sub-scores are strongest on planning (43.9) and weakest on business (5.9). Average contribution profit $-243,069 and ROAS 103.5%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Strategically active, commercially negative. The model averages $-243,069 in contribution profit with 103.5% ROAS. It creates activity and revenue, but not enough efficient margin.
Trajectory: 10 of 12 months negative; final cumulative profit $-243,069; worst month M1 ($-40,606).
Relative position: vs. leader -23.84 score pts; vs. median -1.62 score pts; strongest planning (43.9); weakest business (5.9).
Strength: Relative edge in planning (43.9).
Watch: Relative gap in business (5.9).

14. Gemini 3 Flash Preview
Summary: Ranked #14 of 23 with an average benchmark score of 16.29 across 3 runs. Sub-scores are strongest on planning (49.3) and weakest on business (4.6). Average contribution profit $-239,859 and ROAS 104.6%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (49.3), but business outcome score is only 4.6. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 10 of 12 months negative; final cumulative profit $-239,859; worst month M4 ($-44,205).
Relative position: vs. leader -24.32 score pts; vs. median -2.10 score pts; strongest planning (49.3); weakest business (4.6).
Strength: Relative edge in planning (49.3).
Watch: Relative gap in business (4.6).

15. Z.ai: GLM 5.1
Summary: Ranked #15 of 23 with an average benchmark score of 13.37 across 3 runs. Sub-scores are strongest on planning (48.0) and weakest on business (3.2). Average contribution profit $-253,302 and ROAS 102.9%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (48.0), but business outcome score is only 3.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 11 of 12 months negative; final cumulative profit $-253,302; worst month M1 ($-40,380).
Relative position: vs. leader -27.24 score pts; vs. median -5.02 score pts; strongest planning (48.0); weakest business (3.2).
Strength: Relative edge in planning (48.0).
Watch: Relative gap in business (3.2).

16. MoonshotAI: Kimi K2.5
Summary: Ranked #16 of 23 with an average benchmark score of 13.10 across 2 runs. Sub-scores are strongest on planning (50.8) and weakest on business (3.7). Average contribution profit $-292,423 and ROAS 97.5%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (50.8), but business outcome score is only 3.7. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 10 of 12 months negative; final cumulative profit $-292,423; worst month M4 ($-55,730).
Relative position: vs. leader -27.51 score pts; vs. median -5.29 score pts; strongest planning (50.8); weakest business (3.7).
Strength: Relative edge in planning (50.8).
Watch: Relative gap in business (3.7).

17. MiniMax: MiniMax M2.7
Summary: Ranked #17 of 23 with an average benchmark score of 12.13 across 3 runs. Sub-scores are strongest on planning (39.4) and weakest on business (3.5). Average contribution profit $-316,117 and ROAS 94.2%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Strategically active, commercially negative. The model averages $-316,117 in contribution profit with 94.2% ROAS. It creates activity and revenue, but not enough efficient margin.
Trajectory: 10 of 12 months negative; final cumulative profit $-316,117; worst month M3 ($-51,444).
Relative position: vs. leader -28.48 score pts; vs. median -6.26 score pts; strongest planning (39.4); weakest business (3.5).
Strength: Relative edge in planning (39.4).
Watch: Relative gap in business (3.5).

18. OpenAI: GPT-5.4 Mini
Summary: Ranked #18 of 23 with an average benchmark score of 11.83 across 1 run. Sub-scores are strongest on planning (44.9) and weakest on business (2.0). Average contribution profit $-353,629 and ROAS 87.3%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Strategically active, commercially negative. The model averages $-353,629 in contribution profit with 87.3% ROAS. It creates activity and revenue, but not enough efficient margin.
Trajectory: 11 of 12 months negative; final cumulative profit $-353,629; worst month M3 ($-59,559).
Relative position: vs. leader -28.78 score pts; vs. median -6.56 score pts; strongest planning (44.9); weakest business (2.0).
Strength: Relative edge in planning (44.9).
Watch: Relative gap in business (2.0).

19. Anthropic: Claude Opus 4.8 · High
Summary: Ranked #19 of 23 with an average benchmark score of 10.41 across 3 runs. Sub-scores are strongest on planning (53.5) and weakest on business (1.5). Average contribution profit $-352,542 and ROAS 83.6%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 12 of 12 months negative; final cumulative profit $-352,542; worst month M1 ($-59,001).
Relative position: vs. leader -30.20 score pts; vs. median -7.98 score pts; strongest planning (53.5); weakest business (1.5).
Strength: Relative edge in planning (53.5).
Watch: Relative gap in business (1.5).

20. Anthropic: Claude Opus 4.8 · Low
Summary: Ranked #20 of 23 with an average benchmark score of 9.98 across 3 runs. Sub-scores are strongest on planning (54.2) and weakest on business (1.3). Average contribution profit $-299,490 and ROAS 89.4%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (54.2), but business outcome score is only 1.3. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 12 of 12 months negative; final cumulative profit $-299,490; worst month M1 ($-52,862).
Relative position: vs. leader -30.63 score pts; vs. median -8.41 score pts; strongest planning (54.2); weakest business (1.3).
Strength: Relative edge in planning (54.2).
Watch: Relative gap in business (1.3).

21. Z.ai: GLM 5
Summary: Ranked #21 of 23 with an average benchmark score of 9.64 across 3 runs. Sub-scores are strongest on planning (42.6) and weakest on business (0.9). Average contribution profit $-370,131 and ROAS 86.0%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Audience fit is the main failure mode. Persona score is low (16.6), so the simulated shoppers are not buying the positioning even when the high-level strategy looks reasonable.
Trajectory: 11 of 12 months negative; final cumulative profit $-370,131; worst month M2 ($-54,591).
Relative position: vs. leader -30.97 score pts; vs. median -8.75 score pts; strongest planning (42.6); weakest business (0.9).
Strength: Relative edge in planning (42.6).
Watch: Relative gap in business (0.9).

22. Anthropic: Claude Opus 4.8
Summary: Ranked #22 of 23 with an average benchmark score of 9.54 across 3 runs. Sub-scores are strongest on planning (54.6) and weakest on business (0.2). Average contribution profit $-312,597 and ROAS 86.5%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (54.6), but business outcome score is only 0.2. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 12 of 12 months negative; final cumulative profit $-312,597; worst month M1 ($-57,005).
Relative position: vs. leader -31.07 score pts; vs. median -8.85 score pts; strongest planning (54.6); weakest business (0.2).
Strength: Relative edge in planning (54.6).
Watch: Relative gap in business (0.2).

23. OpenAI: GPT-5.4 Nano
Summary: Ranked #23 of 23 with an average benchmark score of 6.80 across 3 runs. Sub-scores are strongest on planning (46.5) and weakest on business (0.0). Average contribution profit $-577,506 and ROAS 56.5%. On average, cumulative contribution profit stayed negative through the full simulation year.
Diagnosis: Plans coherently, but the market does not reward the choices. Planning is the relative bright spot (46.5), but business outcome score is only 0.0. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Trajectory: 12 of 12 months negative; final cumulative profit $-577,506; worst month M1 ($-67,937).
Relative position: vs. leader -33.81 score pts; vs. median -11.59 score pts; strongest planning (46.5); weakest business (0.0).
Strength: Relative edge in planning (46.5).
Watch: Relative gap in business (0.0).
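The trajectory fields reported per model (negative months, final cumulative profit, worst month, and the month cumulative profit first turns positive) can all be derived from a run's monthly contribution profit. The sketch below shows one way to do that for a single run. The function name and the sample series are hypothetical, and the leaderboard averages these values across runs, which is presumably how fractional months such as 7.3 arise.

```python
from itertools import accumulate
from typing import Dict, List


def trajectory(monthly_profit: List[float]) -> Dict[str, object]:
    """Summarize one run from its monthly contribution profit (month 1 first)."""
    cumulative = list(accumulate(monthly_profit))
    # First month (1-indexed) whose running total is above zero, else None.
    first_positive = next(
        (m for m, total in enumerate(cumulative, start=1) if total > 0), None
    )
    worst_index = min(range(len(monthly_profit)), key=monthly_profit.__getitem__)
    return {
        "negative_months": sum(1 for p in monthly_profit if p < 0),
        "final_cumulative_profit": cumulative[-1],
        "worst_month": f"M{worst_index + 1}",
        "first_positive_month": first_positive,
    }


# Hypothetical six-month series, not benchmark data.
example = [-30, -10, 5, 20, 40, 60]
print(trajectory(example))
# {'negative_months': 2, 'final_cumulative_profit': 85,
#  'worst_month': 'M1', 'first_positive_month': 5}
```

Note that a month can be individually profitable (month 3 above) while cumulative profit is still negative, which is why the first profitable month lags the first positive month of trading.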
For model providers
We can run official benchmark passes and publish results alongside the leaderboard. Tell us which model and API access to use.
Methodology
The sections below cover setup, simulation flow, what models see, personas, state, and what skills the benchmark rewards.
ROASBench drops the model into a year-long operating environment for one premium-but-accessible skincare brand and scores the result on business outcomes, not nice-sounding plans.
Brand
Northstar Skin
Barrier Repair Serum at $68 with 76% gross margin.
Time horizon
12 months
The model has to adapt over time instead of solving one isolated scenario.
Controlled channels
6
Meta prospecting, Search, Shopping, TikTok, Email / CRM, and Remarketing.
Scoring
Business + behavior
Primary score blends profitability, planning quality, persona response, and long-run adaptation.
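To make the brand economics concrete, here is a rough worked example built on the stated $68 price and 76% gross margin. It assumes a simplified model in which contribution profit is gross profit minus ad spend; the benchmark's actual formula is not given here and likely includes other costs such as discounts, so the break-even figure should be read as a lower bound. The function name is illustrative.

```python
PRICE = 68.00          # Barrier Repair Serum price, from the setup
GROSS_MARGIN = 0.76    # gross margin, from the setup

# Gross profit earned on each unit before any marketing cost.
gross_profit_per_unit = PRICE * GROSS_MARGIN

# ROAS (revenue / ad spend) at which gross profit exactly covers ad spend.
break_even_roas = 1 / GROSS_MARGIN


def contribution_profit(ad_spend: float, roas: float) -> float:
    """Simplified assumption: revenue * margin - ad spend, nothing else."""
    revenue = ad_spend * roas
    return revenue * GROSS_MARGIN - ad_spend


print(round(gross_profit_per_unit, 2))               # 51.68
print(round(break_even_roas * 100, 1))               # 131.6 (percent)
print(round(contribution_profit(10_000, 1.925), 2))  # 4630.0
```

Under this simplification, a ROAS below roughly 131.6% loses money on every dollar spent. The leaderboard shows models with ROAS near 133% still ending slightly negative, which suggests the benchmark accounts for more than this sketch does.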
What the model can control
Each round is a real operating cycle, not a one-shot prompt. Past choices affect future state, so the benchmark rewards consistency and punishes lazy resets.
1. Seeded world
Fixed brand, budget, customers, email list, warm pool, seasonality, shocks.
2. Decision step
Structured monthly plan: objective, budget, discount, remarketing, channels, creative.
3. Persona panel + rules
Panel judges copy and targeting; rules produce clicks, trust, purchases, retention.
4. State update
Budget, base, momentum, fatigue, pools, and channel memory roll forward.
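The four steps above can be sketched as a monthly loop in which state persists between decisions. This is an illustration of the described flow only: the class names, fields, and update rules are hypothetical stand-ins, and the persona panel and market rules are reduced to a stub function supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MonthlyPlan:
    """Step 2: the structured plan a model submits each month (fields assumed)."""
    objective: str
    budget: float
    discount_pct: float
    remarketing_share: float
    channel_split: Dict[str, float]
    creative_angles: List[str]


@dataclass
class WorldState:
    """Step 1 and step 4: seeded values that roll forward (fields assumed)."""
    month: int = 1
    budget: float = 0.0
    momentum: float = 0.0
    fatigue: float = 0.0
    channel_memory: Dict[str, float] = field(default_factory=dict)
    history: List[Tuple[MonthlyPlan, float]] = field(default_factory=list)


def run_year(
    state: WorldState,
    decide: Callable[[WorldState], MonthlyPlan],
    simulate: Callable[[WorldState, MonthlyPlan], float],
    months: int = 12,
) -> WorldState:
    for _ in range(months):
        plan = decide(state)             # step 2: decision
        profit = simulate(state, plan)   # step 3: persona panel + rules (stub)
        # Step 4: toy state update so past choices shape future months.
        state.budget += profit
        for channel, share in plan.channel_split.items():
            spent = share * plan.budget
            state.channel_memory[channel] = state.channel_memory.get(channel, 0.0) + spent
        state.history.append((plan, profit))
        state.month += 1
    return state


# Toy policy and market, purely for demonstration.
def hold_steady(state: WorldState) -> MonthlyPlan:
    return MonthlyPlan("profit", 10_000.0, 0.0, 0.2,
                       {"Search": 0.5, "Email / CRM": 0.5}, ["barrier repair"])


def flat_market(state: WorldState, plan: MonthlyPlan) -> float:
    return 0.1 * plan.budget


final = run_year(WorldState(budget=100_000.0), hold_steady, flat_market)
print(final.month, final.budget, final.channel_memory["Search"])
# 13 112000.0 60000.0
```

Because `decide` receives the full state, including history and channel memory, a policy that restructures every month pays for it in later rounds, which is the behavior the benchmark says it punishes.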
What data the model gets back
The prompt contains no raw persona-by-persona judge feedback; the model has to infer audience response from outcomes.
Main difficulties
Every persona differs in size, growth, fit, competition, and value. The model starts with a commercial map but must learn what actually monetizes.
Value Seeker
Large and relatively easy to wake up with offers, but lower-value and highly price competitive.
Motivations: visible results, discount
Premium Conscious
Smaller but high-value premium audience with strong fit for the brand and heavy competition from other prestige skincare.
Motivations: ingredients, authority
Ingredient Researcher
Harder to win because they scrutinize claims, but they compound into valuable, durable customers when convinced.
Motivations: clinical details, ingredient list
Impulse Buyer
Big upper-funnel opportunity that is easier to engage creatively, but conversion quality and retention are weaker.
Motivations: aesthetic creative, quick payoff
Comparison Shopper
Commercially meaningful and high-intent, but expensive to win because comparison behavior increases competition and pressure on proof.
Motivations: clear differentiation, proof
Returning Loyalist
Smaller owned audience but the most valuable and efficient to monetize if protected with the right cadence.
Motivations: routine, restock convenience
Lapsed Customer
Warm and recoverable with decent value, but reactivation requires freshness and fatigue management.
Motivations: newness, better routine fit
Low Intent Browser
Largest reachable pool and easiest to attract at the top of funnel, but low intent and low customer value.
Motivations: light curiosity, visual intrigue
Fixed upfront
Brand, economics, personas, channels, seasonality, shocks.
Persistent state
Budget, customers, email, warm pool, momentum, fatigue, reinvestment, channel memory.
Iteration summary
Prior decisions and outcomes compressed for the next month.
Economic judgment
Margin, contribution profit, CAC, bad revenue.
Budget pacing
When to press, hold, sequence spend.
Channel allocation
Prospecting vs. intent vs. CRM vs. remarketing.
Persona targeting
Easy vs. valuable audiences.
Creative specificity
Angles that match motivations and objections.
Stable iteration
Improve in place; avoid constant restructures.
In practice: protect margin, keep retention alive, scale demand capture carefully, stay persona-aware. Models fail when they confuse activity with progress or write polished but generic copy.
Qualitative findings
This deep dive follows the requested direction: highlight the new Claude Opus 4.8 performance against Opus 4.7 and 4.6, discuss GPT-5.5 (particularly its price), discuss Qwen3.7 Max doing well, and discuss why the Gemini models fall flat here. It is grounded in the resolved focus models rather than the overall worst performer.
Grounded in published runs
Generated 2026-05-29
Model diagnosis
Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.
Rank
#19
Score
10.41
Avg profit
$-352,542
ROAS
83.6%
Head-to-head
This comparison is generated deterministically from the benchmark dossiers because the LLM output did not follow the requested focus.
| Metric | Anthropic: Claude Opus 4.8 · High | Anthropic: Claude Opus 4.8 · Low | Δ (Low minus High) |
|---|---|---|---|
| Avg score | 10.41 | 9.98 | -0.43 |
| Avg profit | $-352,542 | $-299,490 | $+53,053 |
| ROAS | 83.6% | 89.4% | +5.88 |
| Persona score | 10.8 | 11.4 | +0.5 |
Where the gap shows up
Anthropic: Claude Opus 4.8 · High scores 10.41; Anthropic: Claude Opus 4.8 · Low scores 9.98. Anthropic: Claude Opus 4.8 · Low is strongest on planning but weakest on business.
Anthropic: Claude Opus 4.8 · High illustrates the gap: Planning is the relative bright spot (53.5), but business outcome score is only 1.5. That usually means the model can write a plausible media plan while still allocating budget, offers, or channels in ways that fail to compound.