Score vs. cost
Average task cost vs overall score
Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
business benchmark collection
Benchmarks for testing whether models can create clear, specific, non-generic business content that follows a brief and preserves a brand voice.
Which models can produce useful business content without generic AI sludge?
At a glance
Top model: gpt-5.5-pro (84.0)
Lowest cost / eval: gpt-5.4-nano ($0.0163)
Median rank score: 79.08
Last refresh: 2026-06-02
Overall ranking
Higher is better. Scores come from completed judged runs.
Benchmark heatmap
Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.
| Rank | Model | Overall | Generic Copy Index | Empty Insight Test | Brief Adherence Test | Point of View Test |
|---|---|---|---|---|---|---|
| 1 | 12 scored tests | 84.0 | 84.3 | 84.3 | 83.0 | 84.3 |
| 2 | 12 scored tests | 82.8 | 85.0 | 82.3 | 84.7 | 79.0 |
| 3 | 12 scored tests | 82.2 | 81.3 | 82.0 | 82.7 | 83.0 |
| 4 | 12 scored tests | 82.2 | 84.7 | 84.0 | 80.7 | 79.7 |
| 5 | 12 scored tests | 82.1 | 81.7 | 82.3 | 83.3 | 81.0 |
| 6 | 12 scored tests | 81.7 | 84.3 | 80.0 | 82.3 | 80.0 |
| 7 | 12 scored tests | 81.4 | 81.7 | 82.3 | 81.3 | 80.3 |
| 8 | 12 scored tests | 81.3 | 79.7 | 82.7 | 80.7 | 82.3 |
| 9 | 12 scored tests | 81.1 | 82.7 | 80.3 | 82.3 | 79.0 |
| 10 | 12 scored tests | 80.9 | 82.7 | 80.0 | 79.7 | 81.3 |
| 11 | 12 scored tests | 80.2 | 82.3 | 80.3 | 76.7 | 81.7 |
| 12 | 12 scored tests | 79.1 | 79.3 | 77.7 | 80.3 | 79.0 |
| 13 | 12 scored tests | 78.9 | 80.0 | 78.0 | 77.3 | 80.3 |
| 14 | 12 scored tests | 78.3 | 74.7 | 79.3 | 80.3 | 79.0 |
| 15 | 12 scored tests | 78.1 | 82.0 | 77.0 | 74.7 | 78.7 |
| 16 | 12 scored tests | 77.9 | 77.7 | 78.3 | 76.7 | 79.0 |
| 17 | 12 scored tests | 76.4 | 77.7 | 80.0 | 69.0 | 79.0 |
| 18 | 12 scored tests | 76.3 | 81.3 | 78.7 | 64.3 | 81.0 |
| 19 | 12 scored tests | 71.1 | 71.3 | 75.3 | 68.0 | 69.7 |
| 20 | 12 scored tests | 70.7 | 80.3 | 80.3 | 61.3 | 60.7 |
| 21 | 12 scored tests | 70.6 | 78.0 | 73.3 | 57.0 | 74.0 |
| 22 | 12 scored tests | 65.0 | 71.0 | 41.0 | 72.0 | 76.0 |
| 23 | 12 scored tests | 58.5 | 69.0 | 79.3 | 25.3 | 60.3 |
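The heatmap coloring rule can be sketched in a few lines of Python. This is an illustration only, not the dashboard's code: the function names, the number of green shades, and the tie handling (tied scores share the better rank) are assumptions, and the model labels in the usage note are placeholders.

```python
def rank_within_benchmark(scores: dict[str, float]) -> dict[str, int]:
    """Return a 1-based rank per model for one benchmark (higher score is better).

    Assumption: tied scores share the better rank (competition ranking).
    """
    return {
        model: 1 + sum(other > score for other in scores.values())
        for model, score in scores.items()
    }


def color_bucket(rank: int, top_n: int = 10, green_shades: int = 5) -> str:
    """Map a rank to a color label: greens for the top ten, red below it.

    Assumption: the top ten are spread evenly over `green_shades` shades,
    with shade 1 reserved for the best ranks.
    """
    if rank > top_n:
        return "red"
    shade = (rank - 1) * green_shades // top_n + 1
    return f"green-{shade}"
```

For example, `rank_within_benchmark({"model_a": 84.3, "model_b": 85.0, "model_c": 81.3})` ranks `model_b` first, and `color_bucket(11)` returns `"red"`.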
Full leaderboard
| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
|  | 84.0 Strong | 12/12 | $0.1899 | 64.3s | - |
|  | 82.75 Strong | 12/12 | $0.0305 | 24.3s | Wrapper text |
|  | 82.25 Strong | 12/12 | $0.0219 | 19.4s | - |
|  | 82.25 Strong | 12/12 | $0.0301 | 22.1s | - |
|  | 82.08 Strong | 12/12 | $0.0299 | 21.4s | - |
|  | 81.67 Strong | 12/12 | $0.0292 | 20.9s | - |
|  | 81.42 Strong | 12/12 | $0.0221 | 21.1s | Wrapper text |
|  | 81.33 Strong | 12/12 | $0.0187 | 59.9s | - |
|  | 81.08 Strong | 12/12 | $0.0254 | 24.8s | - |
|  | 80.92 Strong | 12/12 | $0.0190 | 86.1s | - |
|  | 80.25 Strong | 12/12 | $0.0280 | 21.2s | Unsupported invention, Over word count |
|  | 79.08 Usable | 12/12 | $0.0173 | 42.7s | Malformed output, Under word count |
|  | 78.92 Usable | 12/12 | $0.0341 | 21.2s | Unsupported invention |
|  | 78.33 Usable | 12/12 | $0.0251 | 28.2s | Unsupported invention |
|  | 78.08 Usable | 12/12 | $0.0163 | 14.4s | Under word count |
|  | 77.92 Usable | 12/12 | $0.0173 | 99.7s | - |
|  | 76.42 Usable | 12/12 | $0.0168 | 14.1s | Under word count |
|  | 76.33 Usable | 12/12 | $0.0189 | 17.9s | Under word count, Incomplete output |
|  | 71.08 Usable | 12/12 | $0.0167 | 20.8s | Under word count, Outside word count, Banned phrase, Contains em dash |
|  | 70.67 Usable | 12/12 | $0.0173 | 55.0s | Wrapper text, Incomplete output, Malformed output, Outside word count |
|  | 70.58 Usable | 12/12 | $0.0317 | 23.4s | Over word count, Wrapper text, Outside word count, Under word count |
|  | 65.0 Needs editing | 12/12 | $0.0198 | 41.9s | Incomplete output, Unsupported invention, Wrapper text, Over word count |
|  | 58.5 Weak | 12/12 | $0.0348 | 25.5s | Incomplete output, Under word count, Missing required element, Missing specific example |
Test cases
Each row is one prompt, with its average, maximum, and minimum score, the top and lowest-scoring models, and the problems judges flagged most often.
| Test | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|
| AI consulting LinkedIn post (content_generic_001) | Generic Copy Index | 79.9 | 86.0 | 74.0 | claude-opus-4.7 · 86 | glm-5 · 74 | Wrapper text ×2, Under word count ×1 |
| B2B SaaS launch announcement (content_generic_002) | Generic Copy Index | 80.4 | 86.0 | 60.0 | claude-sonnet-4.6 · 86 | deepseek-v3.2 · 60 | Under word count ×1, Unsupported invention ×1 |
| Founder newsletter intro about product lessons (content_generic_003) | Generic Copy Index | 78.8 | 85.0 | 40.0 | claude-opus-4.7 · 85 | gemini-3.1-pro-preview · 40 | Under word count ×1, Incomplete output ×1 |
| AI transformation without saying anything (content_empty_001) | Empty Insight Test | 78.1 | 85.0 | 10.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 10 | Over word count ×1, Incomplete output ×1, Under word count ×1 |
| Remote work productivity lessons (content_empty_002) | Empty Insight Test | 78.7 | 85.0 | 38.0 | claude-sonnet-4.6 · 85 | minimax-m2.7 · 38 | Over word count ×3, Wrapper text ×1, Incomplete output ×1 |
| Customer research beats internal opinions (content_empty_003) | Empty Insight Test | 77.8 | 85.0 | 72.0 | gpt-5.5-pro · 85 | grok-4.20-beta · 72 | Outside word count ×1 |
| Casual founder update (content_brief_001) | Brief Adherence Test | 71.1 | 85.0 | 26.0 | claude-opus-4.7 · 85 | gemini-3.1-pro-preview · 26 | Under word count ×5, Incomplete output ×2, Banned phrase ×1 |
| Technical blog intro with no hype (content_brief_002) | Brief Adherence Test | 78.9 | 86.0 | 26.0 | claude-opus-4.7 · 86 | gemini-3.1-pro-preview · 26 | Under word count ×1, Incomplete output ×1, Over word count ×1 |
| Agency case study summary with no exaggerated claims (content_brief_003) | Brief Adherence Test | 72.2 | 84.0 | 24.0 | glm-5 · 84 | gemini-3.1-pro-preview · 24 | Under word count ×3, Unsupported invention ×3, Over word count ×1 |
| Opinionated AI evals post (content_pov_001) | Point of View Test | 81.3 | 86.0 | 73.0 | gpt-5.5-pro · 86 | deepseek-v3.2 · 73 | Wrapper text ×2, Outside word count ×1 |
| Copilots fail because workflows are unclear (content_pov_002) | Point of View Test | 74.8 | 85.0 | 19.0 | kimi-k2.5 · 85 | glm-5.1 · 19 | Under word count ×2, Outside word count ×2, Incomplete output ×2 |
| Founder-led sales is not optional (content_pov_003) | Point of View Test | 77.1 | 85.0 | 67.0 | gpt-5.5-pro · 85 | minimax-m2.7 · 67 | Over word count ×1, Malformed output ×1 |
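As a rough illustration of how one row of the table above could be derived from per-model scores, here is a minimal Python sketch. The function name, the output keys, and the model labels in the usage note are hypothetical, and rounding the average to one decimal is an assumption based on the displayed values.

```python
from statistics import mean


def summarize_test(scores: dict[str, float]) -> dict[str, object]:
    """Summarize one test case from a mapping of model name to score."""
    top_model = max(scores, key=scores.get)
    lowest_model = min(scores, key=scores.get)
    return {
        "avg": round(mean(scores.values()), 1),  # assumed one-decimal rounding
        "max": scores[top_model],
        "min": scores[lowest_model],
        "top_model": top_model,
        "lowest_model": lowest_model,
    }
```

Calling `summarize_test({"model_a": 86.0, "model_b": 80.0, "model_c": 74.0})` reports `model_a` as the top model and `model_c` as the lowest.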
Model profiles
12 scored tests · Strong · 84.0
12 scored tests · Strong · 82.75
12 scored tests · Strong · 82.25
12 scored tests · Strong · 82.25
12 scored tests · Strong · 82.08
12 scored tests · Strong · 81.67
12 scored tests · Strong · 81.42
12 scored tests · Strong · 81.33
12 scored tests · Strong · 81.08
12 scored tests · Strong · 80.92
12 scored tests · Strong · 80.25
12 scored tests · Usable · 79.08
Example outputs
winner
The response perfectly executes the prompt with exceptional domain specificity (e.g., 'anecdotal Slack feedback', 'increasing refusals'), meets the strict word count constraint, and maintains a highly authentic, expert tone devoid of generic AI sludge.
Most companies do not have an AI model problem. They have an evaluation problem. It’s tempting to believe the next model release will fix everything: better reasoning, bigger context, lower latency. And yes, model choice matters. But for most AI product teams, the bottleneck is not access to intelligence. It’s knowing, with confidence, whether the system is getting better. Teams ship prompts, RAG changes, tool calls, guardrails, fine-tunes, and agent workflows — then judge quality by a few demos, anecdotal Slack feedback, or whether the last scary failure disappeared. That makes progress feel fast but fragile. Without evaluation, you can’t answer basic product questions: • Did this change improve accuracy or just verbosity? • Are we better on high-value customer cases or only easy ones? • Did we reduce hallucinations while increasing refusals? • Is a cheaper model good enough for 80% of traffic? • Which failures are acceptable, and which block launch? A fair counterpoint: some use cases really do need better models — especially frontier reasoning, specialized domains, strict latency/cost constraints, or multimodal performance. But even there, evaluation is how you prove that “better” translates into product value. Practical takeaway: before swapping models again, build a small eval set from real user tasks, define success and failure clearly, run it on every change, and make model decisions from evidence — not vibes.
failure
The model's response is severely truncated, failing to meet the word count constraint and missing almost all required elements.
**AI Success Starts with Process, Not Just Tools** A growing number of firms treat AI as a plug‑and‑play product, buying the latest model and expecting immediate results. The reality is that without rethinking the underlying workflows, data pipelines, and decision‑making culture, even the most sophisticated algorithm
mid performer
The model perfectly adhered to all constraints, including the strict word count and negative constraints, while capturing an authentic, non-launchy founder tone that requires zero editing.
We just shipped the first version of our evaluation dashboard. It is still very early and admittedly quite rough around the edges, but the core utility is there. Right now, you can use it to compare how different prompts perform across various models. The goal is to help developers move past vibes-based testing and get actual data on what works. Building in public means sharing things before they are fully polished, and this dashboard is definitely in that raw phase. We need real-world feedback to figure out what to fix first. Next on our roadmap is adding role-based benchmark packs to make targeted testing easier. If you are currently struggling to keep track of how your prompts behave when you switch models, I would appreciate you giving this a spin. Let me know what breaks and what we should improve.
Methodology
Each model output is scored by an LLM judge that must return strict JSON. Deterministic heuristics support the judge, and the result is normalized to a 0-100 score.
LLM judge
A stable judge model scores each benchmark-specific trait from 0 to 10.
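To make the judge step concrete, here is a minimal Python sketch of validating a strict-JSON verdict and turning 0 to 10 trait scores into a 0-100 score. The JSON shape (a top-level `traits` object), the trait names, and the unweighted mean are assumptions for illustration; the source does not specify the schema or how traits are weighted.

```python
import json


def parse_judge_verdict(raw: str, expected_traits: set[str]) -> dict[str, float]:
    """Parse a judge reply strictly; any deviation raises ValueError.

    Assumed shape: {"traits": {"<trait name>": <number from 0 to 10>, ...}}
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    traits = data.get("traits") if isinstance(data, dict) else None
    if not isinstance(traits, dict) or set(traits) != set(expected_traits):
        raise ValueError("judge reply must score exactly the expected traits")
    for name, value in traits.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"trait {name!r} is not a number")
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} is outside the 0 to 10 range")
    return {name: float(value) for name, value in traits.items()}


def normalize_to_100(traits: dict[str, float]) -> float:
    """Assumed normalization: unweighted mean of trait scores, scaled by 10."""
    return round(10 * sum(traits.values()) / len(traits), 1)
```

Under these assumptions, a reply of `{"traits": {"specificity": 9, "voice": 8}}` normalizes to 85.0, while a reply with a missing trait or an out-of-range value is rejected instead of being scored.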
Heuristics
Deterministic checks catch length violations, banned phrases, missing required sections, invalid formatting, and safety flags.
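A few of these checks are simple enough to sketch in Python. The code below is illustrative only: the flag names mirror problem labels that appear in the leaderboard, but the function name, the word-count bounds, and the banned-phrase list are hypothetical, and the real checks may differ.

```python
def run_heuristics(
    text: str,
    min_words: int,
    max_words: int,
    banned_phrases: list[str],
) -> list[str]:
    """Return the deterministic flags raised by one model output."""
    flags = []

    # Length check: whitespace-separated tokens count as words (an assumption).
    word_count = len(text.split())
    if word_count < min_words:
        flags.append("Under word count")
    elif word_count > max_words:
        flags.append("Over word count")

    # Banned phrases are matched case-insensitively as plain substrings.
    lowered = text.lower()
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        flags.append("Banned phrase")

    # U+2014 is the em dash character.
    if "\u2014" in text:
        flags.append("Contains em dash")

    return flags
```

For instance, `run_heuristics("Unlock synergy today", 5, 50, ["unlock synergy"])` returns `["Under word count", "Banned phrase"]`. Because the checks are deterministic, the same output always produces the same flags, which is what lets them back up a judge model.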
Calibrated ceiling
Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.