Score vs. cost
Average task cost vs. overall score
Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
Business benchmark collection
Benchmarks for testing whether models can brief, prioritise, rewrite, and communicate in ways that reduce executive workload.
Which models reduce cognitive load without creating extra work or risky communication?
At a glance
Top model
claude-opus-4.8-low
82.0
Lowest cost / eval
glm-5.1
$0.0115
Median rank score
78.5
Last refresh
2026-06-02
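The median rank score above appears to be the middle value of the 23 overall scores listed in the Full leaderboard table. A minimal check, assuming the median is taken over those overall scores (the page does not state how it is computed):

```python
from statistics import median

# Overall scores copied from the Full leaderboard table (23 models).
overall_scores = [
    82.0, 81.92, 81.92, 81.83, 81.75, 81.58, 81.25, 80.83,
    80.58, 80.0, 79.17, 78.5, 78.17, 77.92, 76.67, 76.42,
    76.08, 75.17, 73.25, 72.08, 71.17, 68.5, 66.67,
]

# With an odd number of entries, the median is the 12th value when sorted.
median_score = median(overall_scores)
print(median_score)
```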
Overall ranking
Higher is better. Scores come from completed judged runs.
Benchmark heatmap
Cells are colored by rank within each benchmark: the top ten are shaded in greens, and anything below the top ten is red.
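The coloring rule can be sketched as follows. This is an illustration only: the palette values, the number of green shades, the tie handling, and the function names are all assumptions, not the dashboard's actual implementation.

```python
# Hypothetical palette: darkest green marks the best ranks.
GREENS = ["#00441b", "#1b7837", "#5aae61", "#a6dba0", "#d9f0d3"]
RED = "#d73027"


def rank_within_benchmark(scores):
    """Map model -> 1-based rank for one benchmark (ties share the best rank)."""
    ordered = sorted(scores.values(), reverse=True)
    return {model: ordered.index(score) + 1 for model, score in scores.items()}


def cell_color(rank, top_n=10):
    """Return a green shade for ranks 1..top_n, red for anything lower."""
    if rank > top_n:
        return RED
    # Spread the top_n ranks evenly across the available green shades.
    bucket = (rank - 1) * len(GREENS) // top_n
    return GREENS[bucket]


# Usage with made-up model names and scores.
ranks = rank_within_benchmark({"model-a": 80.0, "model-b": 85.0, "model-c": 80.0})
colors = {model: cell_color(rank) for model, rank in ranks.items()}
```

Ranking per benchmark rather than overall means a model can be green in one column and red in another, which is how the heatmap exposes uneven performance.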
| Rank | Model | Overall | Tactful Rewrite Test | Message Risk Review | Useful in Five Minutes | Priority Triage Test |
|---|---|---|---|---|---|---|
| 1 | 12 scored tests | 82.0 | 76.3 | 83.3 | 83.7 | 84.7 |
| 2 | 12 scored tests | 81.9 | 74.0 | 84.3 | 84.3 | 85.0 |
| 3 | 12 scored tests | 81.9 | 78.0 | 82.7 | 83.7 | 83.3 |
| 4 | 12 scored tests | 81.8 | 74.0 | 83.7 | 84.7 | 84.7 |
| 5 | 12 scored tests | 81.8 | 79.0 | 83.0 | 81.3 | 84.0 |
| 6 | 12 scored tests | 81.6 | 80.0 | 82.0 | 82.3 | 82.0 |
| 7 | 12 scored tests | 81.2 | 74.0 | 82.0 | 84.3 | 84.7 |
| 8 | 12 scored tests | 80.8 | 74.0 | 82.0 | 83.7 | 83.7 |
| 9 | 12 scored tests | 80.6 | 71.3 | 83.0 | 83.7 | 84.3 |
| 10 | 12 scored tests | 80.0 | 82.7 | 79.0 | 80.7 | 77.7 |
| 11 | 12 scored tests | 79.2 | 77.0 | 80.0 | 77.0 | 82.7 |
| 12 | 12 scored tests | 78.5 | 77.3 | 80.7 | 72.3 | 83.7 |
| 13 | 12 scored tests | 78.2 | 77.3 | 78.7 | 77.3 | 79.3 |
| 14 | 12 scored tests | 77.9 | 82.7 | 81.3 | 65.0 | 82.7 |
| 15 | 12 scored tests | 76.7 | 77.0 | 78.3 | 77.7 | 73.7 |
| 16 | 12 scored tests | 76.4 | 78.3 | 81.7 | 76.3 | 69.3 |
| 17 | 12 scored tests | 76.1 | 72.3 | 78.7 | 71.7 | 81.7 |
| 18 | 12 scored tests | 75.2 | 80.0 | 80.7 | 81.3 | 58.7 |
| 19 | 12 scored tests | 73.2 | 74.0 | 75.0 | 65.7 | 78.3 |
| 20 | 12 scored tests | 72.1 | 79.0 | 75.0 | 64.0 | 70.3 |
| 21 | 12 scored tests | 71.2 | 77.7 | 82.7 | 67.0 | 57.3 |
| 22 | 12 scored tests | 68.5 | 81.3 | 66.7 | 72.0 | 54.0 |
| 23 | 12 scored tests | 66.7 | 61.3 | 78.3 | 73.0 | 54.0 |
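The page does not say how the Overall column is derived, but the rows are numerically consistent with an unweighted mean of the four benchmark columns. A quick check on a few rows copied from the heatmap, with the tolerance chosen as an assumption to absorb rounding in the displayed values:

```python
from statistics import mean

# (rank, overall, [Tactful Rewrite, Message Risk, Five Minutes, Priority Triage])
rows = [
    (1, 82.0, [76.3, 83.3, 83.7, 84.7]),
    (2, 81.9, [74.0, 84.3, 84.3, 85.0]),
    (10, 80.0, [82.7, 79.0, 80.7, 77.7]),
    (18, 75.2, [80.0, 80.7, 81.3, 58.7]),
    (23, 66.7, [61.3, 78.3, 73.0, 54.0]),
]

TOLERANCE = 0.1  # assumed allowance for one-decimal rounding in the table


def overall_matches_mean(overall, benchmark_scores, tolerance=TOLERANCE):
    """True if the displayed overall score is within tolerance of the plain mean."""
    return abs(mean(benchmark_scores) - overall) <= tolerance


mismatches = [
    rank for rank, overall, scores in rows
    if not overall_matches_mean(overall, scores)
]
print(mismatches)
```

If the relationship holds, a single weak benchmark (for example 58.7 on Priority Triage at rank 18) pulls the overall score down by a quarter of the gap.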
Full leaderboard
| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
| | 82.0 · Strong | 12/12 | $0.0284 | 20.7s | Wrapper text, Over word count |
| | 81.92 · Strong | 12/12 | $0.0292 | 23.1s | Wrapper text |
| | 81.92 · Strong | 12/12 | $0.0301 | 24.9s | Wrapper text |
| | 81.83 · Strong | 12/12 | $0.0295 | 30.7s | Over word count, Wrapper text |
| | 81.75 · Strong | 12/12 | $0.0285 | 20.7s | Wrapper text |
| | 81.58 · Strong | 12/12 | $0.0126 | 50.9s | Unsupported invention |
| | 81.25 · Strong | 12/12 | $0.0284 | 19.4s | Wrapper text |
| | 80.83 · Strong | 12/12 | $0.0218 | 23.5s | Wrapper text |
| | 80.58 · Strong | 12/12 | $0.0295 | 30.0s | Wrapper text, Over word count, Unsupported invention |
| | 80.0 · Strong | 12/12 | $0.0272 | 19.8s | - |
| | 79.17 · Usable | 12/12 | $0.0132 | 46.9s | Unsupported invention |
| | 78.5 · Usable | 12/12 | $0.0156 | 17.1s | Unsupported invention, Over word count, Wrapper text |
| | 78.17 · Usable | 12/12 | $0.0203 | 19.5s | Wrapper text |
| | 77.92 · Usable | 12/12 | $0.2074 | 48.9s | Incomplete output, Missing required element |
| | 76.67 · Usable | 12/12 | $0.0157 | 14.3s | Wrapper text |
| | 76.42 · Usable | 12/12 | $0.0261 | 17.9s | Incomplete output, Unsupported invention, Wrapper text |
| | 76.08 · Usable | 12/12 | $0.0121 | 45.6s | Unsupported invention, Wrapper text, Incomplete output, Malformed output |
| | 75.17 · Usable | 12/12 | $0.0115 | 45.3s | Incomplete output, Malformed output |
| | 73.25 · Usable | 12/12 | $0.0131 | 13.5s | Wrapper text, Unsupported invention |
| | 72.08 · Usable | 12/12 | $0.0131 | 21.1s | Unsupported invention, Unsafe or misleading |
| | 71.17 · Usable | 12/12 | $0.0119 | 53.9s | Incomplete output, Unsupported invention, Malformed output, Unsafe or misleading |
| | 68.5 · Needs editing | 12/12 | $0.0134 | 46.5s | Incomplete output, Missing required element, Unsupported invention |
| | 66.67 · Needs editing | 12/12 | $0.0155 | 15.6s | Wrapper text, Missing required element |
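One way to read the score and cost columns together is to keep only the entries that no other entry beats on both score and cost (a Pareto frontier). This is an analysis added here for illustration, not something the dashboard computes; entries are identified by leaderboard position because model names are not shown in the table.

```python
# (leaderboard position, score, average cost per task in USD), from the table above.
MODELS = [
    (1, 82.0, 0.0284), (2, 81.92, 0.0292), (3, 81.92, 0.0301),
    (4, 81.83, 0.0295), (5, 81.75, 0.0285), (6, 81.58, 0.0126),
    (7, 81.25, 0.0284), (8, 80.83, 0.0218), (9, 80.58, 0.0295),
    (10, 80.0, 0.0272), (11, 79.17, 0.0132), (12, 78.5, 0.0156),
    (13, 78.17, 0.0203), (14, 77.92, 0.2074), (15, 76.67, 0.0157),
    (16, 76.42, 0.0261), (17, 76.08, 0.0121), (18, 75.17, 0.0115),
    (19, 73.25, 0.0131), (20, 72.08, 0.0131), (21, 71.17, 0.0119),
    (22, 68.5, 0.0134), (23, 66.67, 0.0155),
]


def pareto_frontier(models):
    """Return positions of entries not beaten on both score and cost."""
    # Cheapest first; at equal cost, the higher score comes first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier, best_score = [], float("-inf")
    for position, score, _cost in ordered:
        # An entry is on the frontier only if it beats every cheaper entry on score.
        if score > best_score:
            frontier.append(position)
            best_score = score
    return frontier


frontier = pareto_frontier(MODELS)
print(frontier)
```

On this data the frontier runs from the cheapest entry ($0.0115, which matches the lowest cost shown under At a glance) up to the top-scoring entry at 82.0.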
Test cases
Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.
| Test | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|
| Angry contractor follow-up (exec_rewrite_001) | Tactful Rewrite Test | 77.4 | 85.0 | 74.0 | gpt-5.5 · 85 | gemini-3.5-flash-high · 74 | Wrapper text ×13, Over word count ×2 |
| Investor disagreement (exec_rewrite_002) | Tactful Rewrite Test | 74.3 | 82.0 | 52.0 | glm-5.1 · 82 | gpt-5.4-nano · 52 | Wrapper text ×6, Unsupported invention ×5, Incomplete output ×1 |
| Client scope creep response (exec_rewrite_003) | Tactful Rewrite Test | 77.7 | 83.0 | 56.0 | kimi-k2.5 · 83 | gpt-5.4-nano · 56 | Wrapper text ×7, Over word count ×2, Incomplete output ×1 |
| Defensive client email (exec_risk_001) | Message Risk Review | 82.0 | 84.0 | 77.0 | claude-opus-4.7 · 84 | grok-4.20-beta · 77 | - |
| Too-blunt team feedback (exec_risk_002) | Message Risk Review | 77.7 | 86.0 | 56.0 | claude-opus-4.7 · 86 | minimax-m2.7 · 56 | Unsupported invention ×2, Wrapper text ×2 |
| Overpromising sales reply (exec_risk_003) | Message Risk Review | 80.7 | 84.0 | 64.0 | kimi-k2.5 · 84 | minimax-m2.7 · 64 | Wrapper text ×2, Unsupported invention ×1 |
| Client escalation prep (exec_5min_001) | Useful in Five Minutes | 75.7 | 86.0 | 38.0 | claude-opus-4.6 · 86 | kimi-k2.5 · 38 | Unsupported invention ×5, Unsafe or misleading ×1, Wrapper text ×1 |
| Investor call prep (exec_5min_002) | Useful in Five Minutes | 73.0 | 86.0 | 34.0 | claude-sonnet-4.6 · 86 | gpt-5.5-pro · 34 | Unsupported invention ×4, Unsafe or misleading ×1, Incomplete output ×1 |
| Internal conflict meeting (exec_5min_003) | Useful in Five Minutes | 82.5 | 86.0 | 74.0 | kimi-k2.5 · 86 | gpt-5.4-mini · 74 | Wrapper text ×1 |
| Noisy founder inbox (exec_triage_001) | Priority Triage Test | 75.4 | 85.0 | 18.0 | claude-opus-4.7 · 85 | kimi-k2.5 · 18 | Incomplete output ×2, Wrapper text ×1 |
| Monday morning Slack digest (exec_triage_002) | Priority Triage Test | 75.9 | 84.0 | 10.0 | claude-opus-4.7 · 84 | glm-5.1 · 10 | Incomplete output ×2, Malformed output ×2 |
| Travel day with urgent client issues (exec_triage_003) | Priority Triage Test | 78.2 | 86.0 | 30.0 | claude-opus-4.7 · 86 | minimax-m2.7 · 30 | Incomplete output ×1, Missing required element ×1 |
Example outputs
winner
The model demonstrates exceptional, expert-level judgment by factoring in signal loss during the flight, providing highly practical next actions, and adding a perfectly timed execution plan for the 45 minutes before boarding.
# Travel Day Triage

| Priority | Item | Reason | Next Action | Delegate |
|----------|------|--------|-------------|----------|
| **P0** | Team: deploy risky fix? | Irreversible risk; needs founder call before you lose signal on the flight | Reply NO-GO until you land, or set clear go/no-go criteria in one message | No |
| **P0** | Client login issue | Major client + production issue = revenue and trust risk | Ack client personally, hand to support/eng lead with owner + ETA | Yes (execution) |
| **P1** | Flight in 45 min | Hard deadline; missing it cascades into everything else | Head to gate now; handle items below from phone/lounge | No |
| **P1** | Investor deck tonight | Time-boxed commitment to a key stakeholder | Send existing draft + note on updates, or have COS/associate polish v2 | Yes (polish) |
| **P2** | Hotel check-in time | Low effort, avoids arrival friction | One-line reply with ETA | Yes (assistant) |
| **P3** | Newsletter draft | Due tomorrow; can be done in-flight | Draft on plane; review on landing | Yes (draft) |
| **P4** | Bank promo email | Noise | Archive / filter rule | Yes (auto) |

**Suggested order in the next 45 min:**
1. (2 min) Message team: hold the deploy.
2. (3 min) Reply to client + loop in support owner.
3. (1 min) Hotel ETA reply.
4. (2 min) Forward deck draft to whoever is polishing, with notes.
5. Walk to gate. Newsletter on the plane.
failure
The model response is severely truncated, providing only the header and a partial first row of the requested table, failing the task completely.
| Item | Priority | Reason | Next Action | Delegate? |
| :--- | :--- | :--- | :--- | :--- |
| **Production alert (7am)** | P0 (Critical
mid performer
The model perfectly followed all instructions, including the negative constraint on shared accountability and the word count limit, providing a highly usable rewrite and accurate risk assessment without any fluff.
**Main Risks**
* **Relationship Damage:** Accusatory language ("your team didn't") threatens a commercially vital partnership.
* **Perceived Unprofessionalism:** Dismissive phrasing ("not sure why this is coming up again") appears defensive and rude.
* **Eroded Trust:** Stating you "can't promise anything" creates ambiguity and reduces client confidence in your delivery.
**Safer Rewrite**
I understand your frustration regarding the timeline. While the initial proposal outlined the schedule, we recognize that recent delays stemmed from both the late arrival of assets and our internal scheduling adjustments. We value this partnership and want to resolve this quickly. Our team is prioritizing your project now that we have the necessary materials. We are targeting completion by the end of this week. I will send a confirmation update by Thursday afternoon to lock in the delivery date. Thank you for your patience as we work to get this across the line.
**Communication Strategy**
The strategy prioritizes de-escalation by validating the client's frustration, neutrally sharing factual causes for the delay, and replacing ambiguity with a specific commitment.
Methodology
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
LLM judge
A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
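A minimal sketch of how a judge reply might be validated and turned into a 0-100 score. The JSON shape (`{"traits": {...}}`), the trait names, and the unweighted mean are assumptions made for illustration; the page does not publish the judge's schema or weighting.

```python
import json


def parse_judge_reply(raw, expected_traits):
    """Parse a judge reply and reject anything outside the assumed schema."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    traits = data["traits"]
    if set(traits) != set(expected_traits):
        raise ValueError(f"unexpected trait set: {sorted(traits)}")
    for name, value in traits.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} must be a number from 0 to 10")
    return {name: float(traits[name]) for name in expected_traits}


def normalize(trait_scores):
    """Scale the mean 0-10 trait score to the 0-100 range."""
    return round(10 * sum(trait_scores.values()) / len(trait_scores), 1)


# Hypothetical trait names for a rewrite benchmark.
reply = '{"traits": {"tone": 8, "clarity": 9, "brevity": 7}}'
scores = parse_judge_reply(reply, ["tone", "clarity", "brevity"])
final_score = normalize(scores)
print(final_score)
```

Rejecting malformed replies outright, rather than repairing them, keeps a judge formatting failure from silently becoming a score.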
Heuristics
Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
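The checks described above could look roughly like this. The function name, the flag labels (borrowed from the Frequent problems column), the thresholds, and the table-row rule are assumptions for illustration, not the benchmark's actual rules; safety flags are omitted because nothing on the page describes how they are detected.

```python
def heuristic_flags(output, max_words, required_sections=(), banned_phrases=()):
    """Return a list of problem labels found by simple deterministic checks."""
    flags = []
    lowered = output.lower()

    # Length: plain whitespace word count against the prompt's limit.
    if len(output.split()) > max_words:
        flags.append("Over word count")

    # Banned phrases: case-insensitive substring match.
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        flags.append("Banned phrase")

    # Required sections: every expected heading must appear somewhere.
    if any(section.lower() not in lowered for section in required_sections):
        flags.append("Missing required element")

    # Format validity: each markdown table row should close with a pipe.
    table_rows = [ln.strip() for ln in output.splitlines() if ln.strip().startswith("|")]
    if any(not row.endswith("|") for row in table_rows):
        flags.append("Malformed output")

    return flags


# A truncated table like the failure example trips the format check.
truncated = "| Item | Priority |\n| :--- | :--- |\n| **Production alert (7am)** | P0 (Critical"
print(heuristic_flags(truncated, max_words=150))
```

Because these checks are deterministic, the same output always produces the same flags, which complements the judge's more subjective trait scores.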
Calibrated ceiling
Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.