business benchmark collection

Content & Brand

Benchmarks for testing whether models can create clear, specific, non-generic business content that follows a brief and preserves a brand voice.

Which models can produce useful business content without generic AI sludge?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model: gpt-5.5-pro (84.0)

Lowest cost / eval: gpt-5.4-nano ($0.0163)

Median rank score: 79.08

Last refresh: 2026-06-02
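The median figure can be reproduced from the overall scores in the full leaderboard on this page. The sketch below assumes that "Median rank score" is the plain median of those 23 per-configuration scores; the page does not state this, but the result matches the published value.

```python
from statistics import median

# Overall scores copied from the full leaderboard on this page
# (23 model configurations, highest to lowest).
overall_scores = [
    84.0, 82.75, 82.25, 82.25, 82.08, 81.67, 81.42, 81.33,
    81.08, 80.92, 80.25, 79.08, 78.92, 78.33, 78.08, 77.92,
    76.42, 76.33, 71.08, 70.67, 70.58, 65.0, 58.5,
]

# With an odd number of entries, the median is the middle value (the 12th).
median_score = median(overall_scores)
print(median_score)  # 79.08
```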

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
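As a rough illustration of the X axis, the sketch below averages per-task cost with model and judge cost added together. The `TaskRun` type, its field names, and the dollar amounts are hypothetical; the page does not publish per-run cost breakdowns.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One benchmark task run (hypothetical structure, not the site's schema)."""
    model_cost_usd: float  # cost of generating the output
    judge_cost_usd: float  # cost of judging that output


def average_task_cost(runs: list[TaskRun]) -> float:
    """Mean of (model cost + judge cost) over all runs for one model."""
    return sum(r.model_cost_usd + r.judge_cost_usd for r in runs) / len(runs)


# Illustrative values only.
runs = [TaskRun(0.010, 0.005), TaskRun(0.020, 0.005)]
avg_cost = average_task_cost(runs)
```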

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.
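The coloring rule can be sketched as follows, using the Generic Copy Index scores published on this page. Two things are assumptions: ties share the best rank (the page does not say how ties are broken, and the displayed scores are rounded), and the function returns a single "green" label rather than the exact shade, because the shade boundaries are not published.

```python
def rank_within_benchmark(scores: dict[str, float]) -> dict[str, int]:
    """Rank = 1 + number of models with a strictly higher score (ties share a rank)."""
    return {
        model: 1 + sum(other > score for other in scores.values())
        for model, score in scores.items()
    }


def cell_color(rank: int, top_n: int = 10) -> str:
    """Top-n ranks are green (the real page varies the shade); the rest are red."""
    return "green" if rank <= top_n else "red"


# Generic Copy Index scores as displayed on this page.
generic_copy_index = {
    "gpt-5.5-pro": 84.3, "claude-opus-4.7": 85.0, "gpt-5.4": 81.3,
    "claude-opus-4.8-low": 84.7, "claude-opus-4.8": 81.7,
    "claude-opus-4.8-high": 84.3, "claude-sonnet-4.6": 81.7,
    "kimi-k2.5": 79.7, "claude-opus-4.6-high": 82.7, "qwen3.7-max": 82.7,
    "gpt-5.5": 82.3, "glm-5": 79.3, "gemini-3.5-flash-high": 80.0,
    "claude-opus-4.6": 74.7, "gpt-5.4-nano": 82.0,
    "qwen3.5-plus-02-15": 77.7, "gpt-5.4-mini": 77.7,
    "gemini-3-flash-preview": 81.3, "deepseek-v3.2": 71.3, "glm-5.1": 80.3,
    "grok-4.20-beta": 78.0, "minimax-m2.7": 71.0,
    "gemini-3.1-pro-preview": 69.0,
}

ranks = rank_within_benchmark(generic_copy_index)
colors = {model: cell_color(rank) for model, rank in ranks.items()}
```

Under these assumptions, gpt-5.4-nano ranks 8th on this benchmark and is green, even though it sits 15th overall.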

| Rank | Model | Scored tests | Overall | Generic Copy Index | Empty Insight Test | Brief Adherence Test | Point of View Test |
|---|---|---|---|---|---|---|---|
| 1 | gpt-5.5-pro | 12 | 84.0 | 84.3 | 84.3 | 83.0 | 84.3 |
| 2 | claude-opus-4.7 | 12 | 82.8 | 85.0 | 82.3 | 84.7 | 79.0 |
| 3 | gpt-5.4 | 12 | 82.2 | 81.3 | 82.0 | 82.7 | 83.0 |
| 4 | claude-opus-4.8-low | 12 | 82.2 | 84.7 | 84.0 | 80.7 | 79.7 |
| 5 | claude-opus-4.8 | 12 | 82.1 | 81.7 | 82.3 | 83.3 | 81.0 |
| 6 | claude-opus-4.8-high | 12 | 81.7 | 84.3 | 80.0 | 82.3 | 80.0 |
| 7 | claude-sonnet-4.6 | 12 | 81.4 | 81.7 | 82.3 | 81.3 | 80.3 |
| 8 | kimi-k2.5 | 12 | 81.3 | 79.7 | 82.7 | 80.7 | 82.3 |
| 9 | claude-opus-4.6-high | 12 | 81.1 | 82.7 | 80.3 | 82.3 | 79.0 |
| 10 | qwen3.7-max | 12 | 80.9 | 82.7 | 80.0 | 79.7 | 81.3 |
| 11 | gpt-5.5 | 12 | 80.2 | 82.3 | 80.3 | 76.7 | 81.7 |
| 12 | glm-5 | 12 | 79.1 | 79.3 | 77.7 | 80.3 | 79.0 |
| 13 | gemini-3.5-flash-high | 12 | 78.9 | 80.0 | 78.0 | 77.3 | 80.3 |
| 14 | claude-opus-4.6 | 12 | 78.3 | 74.7 | 79.3 | 80.3 | 79.0 |
| 15 | gpt-5.4-nano | 12 | 78.1 | 82.0 | 77.0 | 74.7 | 78.7 |
| 16 | qwen3.5-plus-02-15 | 12 | 77.9 | 77.7 | 78.3 | 76.7 | 79.0 |
| 17 | gpt-5.4-mini | 12 | 76.4 | 77.7 | 80.0 | 69.0 | 79.0 |
| 18 | gemini-3-flash-preview | 12 | 76.3 | 81.3 | 78.7 | 64.3 | 81.0 |
| 19 | deepseek-v3.2 | 12 | 71.1 | 71.3 | 75.3 | 68.0 | 69.7 |
| 20 | glm-5.1 | 12 | 70.7 | 80.3 | 80.3 | 61.3 | 60.7 |
| 21 | grok-4.20-beta | 12 | 70.6 | 78.0 | 73.3 | 57.0 | 74.0 |
| 22 | minimax-m2.7 | 12 | 65.0 | 71.0 | 41.0 | 72.0 | 76.0 |
| 23 | gemini-3.1-pro-preview | 12 | 58.5 | 69.0 | 79.3 | 25.3 | 60.3 |
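The Overall column appears to be the unweighted mean of the four benchmark scores. The check below is a sketch under that assumption, using three rows from the heatmap and the two-decimal scores from the full leaderboard. The benchmark scores shown on the page are rounded to one decimal, so the comparison allows a small tolerance.

```python
# Benchmark scores in heatmap order: Generic Copy Index, Empty Insight Test,
# Brief Adherence Test, Point of View Test.
benchmark_scores = {
    "gpt-5.5-pro": [84.3, 84.3, 83.0, 84.3],
    "claude-opus-4.7": [85.0, 82.3, 84.7, 79.0],
    "gpt-5.4": [81.3, 82.0, 82.7, 83.0],
}

# Overall scores as listed in the full leaderboard.
published_overall = {
    "gpt-5.5-pro": 84.0,
    "claude-opus-4.7": 82.75,
    "gpt-5.4": 82.25,
}

# Unweighted mean of the four benchmark scores per model.
computed_overall = {
    model: sum(scores) / len(scores)
    for model, scores in benchmark_scores.items()
}

for model, value in computed_overall.items():
    # Rounded inputs can shift the mean slightly, hence the tolerance.
    assert abs(value - published_overall[model]) < 0.05, model
```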

Full leaderboard

Quality, cost, and speed

| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
| gpt-5.5-pro | 84.0 · Strong | 12/12 | $0.1899 | 64.3s | - |
| claude-opus-4.7 | 82.75 · Strong | 12/12 | $0.0305 | 24.3s | Wrapper text |
| gpt-5.4 | 82.25 · Strong | 12/12 | $0.0219 | 19.4s | - |
| claude-opus-4.8-low | 82.25 · Strong | 12/12 | $0.0301 | 22.1s | - |
| claude-opus-4.8 | 82.08 · Strong | 12/12 | $0.0299 | 21.4s | - |
| claude-opus-4.8-high | 81.67 · Strong | 12/12 | $0.0292 | 20.9s | - |
| claude-sonnet-4.6 | 81.42 · Strong | 12/12 | $0.0221 | 21.1s | Wrapper text |
| kimi-k2.5 | 81.33 · Strong | 12/12 | $0.0187 | 59.9s | - |
| claude-opus-4.6-high | 81.08 · Strong | 12/12 | $0.0254 | 24.8s | - |
| qwen3.7-max | 80.92 · Strong | 12/12 | $0.0190 | 86.1s | - |
| gpt-5.5 | 80.25 · Strong | 12/12 | $0.0280 | 21.2s | Unsupported invention, Over word count |
| glm-5 | 79.08 · Usable | 12/12 | $0.0173 | 42.7s | Malformed output, Under word count |
| gemini-3.5-flash-high | 78.92 · Usable | 12/12 | $0.0341 | 21.2s | Unsupported invention |
| claude-opus-4.6 | 78.33 · Usable | 12/12 | $0.0251 | 28.2s | Unsupported invention |
| gpt-5.4-nano | 78.08 · Usable | 12/12 | $0.0163 | 14.4s | Under word count |
| qwen3.5-plus-02-15 | 77.92 · Usable | 12/12 | $0.0173 | 99.7s | - |
| gpt-5.4-mini | 76.42 · Usable | 12/12 | $0.0168 | 14.1s | Under word count |
| gemini-3-flash-preview | 76.33 · Usable | 12/12 | $0.0189 | 17.9s | Under word count, Incomplete output |
| deepseek-v3.2 | 71.08 · Usable | 12/12 | $0.0167 | 20.8s | Under word count, Outside word count, Banned phrase, Contains em dash |
| glm-5.1 | 70.67 · Usable | 12/12 | $0.0173 | 55.0s | Wrapper text, Incomplete output, Malformed output, Outside word count |
| grok-4.20-beta | 70.58 · Usable | 12/12 | $0.0317 | 23.4s | Over word count, Wrapper text, Outside word count, Under word count |
| minimax-m2.7 | 65.0 · Needs editing | 12/12 | $0.0198 | 41.9s | Incomplete output, Unsupported invention, Wrapper text, Over word count |
| gemini-3.1-pro-preview | 58.5 · Weak | 12/12 | $0.0348 | 25.5s | Incomplete output, Under word count, Missing required element, Missing specific example |

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

| Test | Test ID | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|---|
| AI consulting LinkedIn post | content_generic_001 | Generic Copy Index | 79.9 | 86.0 | 74.0 | claude-opus-4.7 · 86 | glm-5 · 74 | Wrapper text ×2, Under word count ×1 |
| B2B SaaS launch announcement | content_generic_002 | Generic Copy Index | 80.4 | 86.0 | 60.0 | claude-sonnet-4.6 · 86 | deepseek-v3.2 · 60 | Under word count ×1, Unsupported invention ×1 |
| Founder newsletter intro about product lessons | content_generic_003 | Generic Copy Index | 78.8 | 85.0 | 40.0 | claude-opus-4.7 · 85 | gemini-3.1-pro-preview · 40 | Under word count ×1, Incomplete output ×1 |
| AI transformation without saying anything | content_empty_001 | Empty Insight Test | 78.1 | 85.0 | 10.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 10 | Over word count ×1, Incomplete output ×1, Under word count ×1 |
| Remote work productivity lessons | content_empty_002 | Empty Insight Test | 78.7 | 85.0 | 38.0 | claude-sonnet-4.6 · 85 | minimax-m2.7 · 38 | Over word count ×3, Wrapper text ×1, Incomplete output ×1 |
| Customer research beats internal opinions | content_empty_003 | Empty Insight Test | 77.8 | 85.0 | 72.0 | gpt-5.5-pro · 85 | grok-4.20-beta · 72 | Outside word count ×1 |
| Casual founder update | content_brief_001 | Brief Adherence Test | 71.1 | 85.0 | 26.0 | claude-opus-4.7 · 85 | gemini-3.1-pro-preview · 26 | Under word count ×5, Incomplete output ×2, Banned phrase ×1 |
| Technical blog intro with no hype | content_brief_002 | Brief Adherence Test | 78.9 | 86.0 | 26.0 | claude-opus-4.7 · 86 | gemini-3.1-pro-preview · 26 | Under word count ×1, Incomplete output ×1, Over word count ×1 |
| Agency case study summary with no exaggerated claims | content_brief_003 | Brief Adherence Test | 72.2 | 84.0 | 24.0 | glm-5 · 84 | gemini-3.1-pro-preview · 24 | Under word count ×3, Unsupported invention ×3, Over word count ×1 |
| Opinionated AI evals post | content_pov_001 | Point of View Test | 81.3 | 86.0 | 73.0 | gpt-5.5-pro · 86 | deepseek-v3.2 · 73 | Wrapper text ×2, Outside word count ×1 |
| Copilots fail because workflows are unclear | content_pov_002 | Point of View Test | 74.8 | 85.0 | 19.0 | kimi-k2.5 · 85 | glm-5.1 · 19 | Under word count ×2, Outside word count ×2, Incomplete output ×2 |
| Founder-led sales is not optional | content_pov_003 | Point of View Test | 77.1 | 85.0 | 67.0 | gpt-5.5-pro · 85 | minimax-m2.7 · 67 | Over word count ×1, Malformed output ×1 |

Model profiles

Strengths, weaknesses, and tradeoffs

gpt-5.5-pro

12 scored tests · Strong

84.0

Highest traits

specificity 8.51
constraint adherence 8.5
concision 8.5
voice and tone 8.5
practical takeaway 8.47

Lowest traits

structure 7.93
task completion 8.23
tone control 8.27
originality 8.3
tone fit 8.37

claude-opus-4.7

12 scored tests · Strong

82.75

Highest traits

tone fit 8.7
voice and tone 8.6
constraint adherence 8.5
task completion 8.4
concision 8.4

Lowest traits

tone control 8.07
structure 8.17
argument quality 8.18
substance 8.23
practical takeaway 8.27

gpt-5.4

12 scored tests · Strong

82.25

Highest traits

constraint adherence 8.5
practical takeaway 8.43
tone fit 8.37
tone control 8.3
voice and tone 8.3

Lowest traits

structure 7.93
originality 7.97
concision 8.0
task completion 8.07
clarity 8.17

claude-opus-4.8-low

12 scored tests · Strong

82.25

Highest traits

concision 8.5
voice and tone 8.5
clarity 8.4
substance 8.4
originality 8.4

Lowest traits

task completion 7.83
structure 7.93
tone control 8.0
argument quality 8.15
practical takeaway 8.17

claude-opus-4.8

12 scored tests · Strong

82.08

Highest traits

constraint adherence 8.5
tone fit 8.47
clarity 8.33
voice and tone 8.33
task completion 8.3

Lowest traits

originality 7.93
structure 8.0
tone control 8.13
practical takeaway 8.13
concision 8.13

claude-opus-4.8-high

12 scored tests · Strong

81.67

Highest traits

voice and tone 8.5
concision 8.4
tone fit 8.33
constraint adherence 8.3
originality 8.23

Lowest traits

substance 7.9
structure 7.93
argument quality 8.05
practical takeaway 8.07
tone control 8.1

claude-sonnet-4.6

12 scored tests · Strong

81.42

Highest traits

voice and tone 8.43
constraint adherence 8.33
tone fit 8.3
clarity 8.27
argument quality 8.17

Lowest traits

structure 7.83
task completion 7.97
tone control 8.07
specificity 8.13
practical takeaway 8.13

kimi-k2.5

12 scored tests · Strong

81.33

Highest traits

clarity 8.4
practical takeaway 8.33
tone fit 8.23
substance 8.23
specificity 8.2

Lowest traits

originality 7.67
structure 7.83
task completion 8.0
tone control 8.03
concision 8.07

claude-opus-4.6-high

12 scored tests · Strong

81.08

Highest traits

constraint adherence 8.5
tone fit 8.37
voice and tone 8.33
concision 8.23
clarity 8.17

Lowest traits

tone control 7.63
structure 7.93
argument quality 7.97
task completion 7.97
substance 8.0

qwen3.7-max

12 scored tests · Strong

80.92

Highest traits

voice and tone 8.4
practical takeaway 8.23
specificity 8.18
constraint adherence 8.17
concision 8.17

Lowest traits

structure 7.67
task completion 7.87
substance 7.9
argument quality 8.02
usefulness 8.05

gpt-5.5

12 scored tests · Strong

80.25

Highest traits

tone fit 8.33
voice and tone 8.33
concision 8.3
specificity 8.28
practical takeaway 8.23

Lowest traits

constraint adherence 7.33
task completion 7.67
originality 7.97
structure 8.0
tone control 8.03

glm-5

12 scored tests · Usable

79.08

Highest traits

constraint adherence 8.33
tone fit 8.23
practical takeaway 8.2
voice and tone 8.03
concision 7.97

Lowest traits

structure 7.5
clarity 7.5
tone control 7.73
task completion 7.77
substance 7.8

Frequent problems

Where models break

Under word count 15
Incomplete output 9
Wrapper text 8
Over word count 8
Outside word count 4
Unsupported invention 4
Missing required element 3
Malformed output 2
Missing specific example 2
Contains em dash 2
Missing specific lesson 1
Banned phrase 1

Example outputs

Winner, middle, failure

winner

gpt-5.5-pro · 86

The response perfectly executes the prompt with exceptional domain specificity (e.g., 'anecdotal Slack feedback', 'increasing refusals'), meets the strict word count constraint, and maintains a highly authentic, expert tone devoid of generic AI sludge.

Most companies do not have an AI model problem. They have an evaluation problem.

It’s tempting to believe the next model release will fix everything: better reasoning, bigger context, lower latency. And yes, model choice matters.

But for most AI product teams, the bottleneck is not access to intelligence. It’s knowing, with confidence, whether the system is getting better.

Teams ship prompts, RAG changes, tool calls, guardrails, fine-tunes, and agent workflows — then judge quality by a few demos, anecdotal Slack feedback, or whether the last scary failure disappeared. That makes progress feel fast but fragile.

Without evaluation, you can’t answer basic product questions:

• Did this change improve accuracy or just verbosity?  
• Are we better on high-value customer cases or only easy ones?  
• Did we reduce hallucinations while increasing refusals?  
• Is a cheaper model good enough for 80% of traffic?  
• Which failures are acceptable, and which block launch?

A fair counterpoint: some use cases really do need better models — especially frontier reasoning, specialized domains, strict latency/cost constraints, or multimodal performance. But even there, evaluation is how you prove that “better” translates into product value.

Practical takeaway: before swapping models again, build a small eval set from real user tasks, define success and failure clearly, run it on every change, and make model decisions from evidence — not vibes.

failure

minimax-m2.7 · 10

The model's response is severely truncated, failing to meet the word count constraint and missing almost all required elements.

**AI Success Starts with Process, Not Just Tools**

A growing number of firms treat AI as a plug‑and‑play product, buying the latest model and expecting immediate results. The reality is that without rethinking the underlying workflows, data pipelines, and decision‑making culture, even the most sophisticated algorithm

mid performer

gemini-3.5-flash-high · 80

The model perfectly adhered to all constraints, including the strict word count and negative constraints, while capturing an authentic, non-launchy founder tone that requires zero editing.

We just shipped the first version of our evaluation dashboard. It is still very early and admittedly quite rough around the edges, but the core utility is there. Right now, you can use it to compare how different prompts perform across various models.

The goal is to help developers move past vibes-based testing and get actual data on what works. Building in public means sharing things before they are fully polished, and this dashboard is definitely in that raw phase. We need real-world feedback to figure out what to fix first.

Next on our roadmap is adding role-based benchmark packs to make targeted testing easier. If you are currently struggling to keep track of how your prompts behave when you switch models, I would appreciate you giving this a spin. Let me know what breaks and what we should improve.

Methodology

How scores are produced

Each model output is scored by a strict LLM judge that returns JSON, supported by deterministic heuristics, and the result is normalized to a 0-100 score.

LLM judge

A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
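A minimal sketch of what such deterministic checks might look like. The flag labels are taken from the "Frequent problems" list on this page; the function name, its parameters, and the whitespace-based word counting are assumptions, since the page does not publish the actual rules or thresholds.

```python
def heuristic_flags(
    text: str,
    min_words: int,
    max_words: int,
    banned_phrases: list[str],
) -> list[str]:
    """Return problem flags for one model output (hypothetical rule set)."""
    flags: list[str] = []

    # Length check: count whitespace-separated words.
    word_count = len(text.split())
    if word_count < min_words:
        flags.append("Under word count")
    elif word_count > max_words:
        flags.append("Over word count")

    # Banned phrases: case-insensitive substring match.
    lowered = text.lower()
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        flags.append("Banned phrase")

    # Em dash check (U+2014).
    if "\u2014" in text:
        flags.append("Contains em dash")

    return flags


short_post = "Short game-changing post"
short_flags = heuristic_flags(short_post, 50, 100, ["game-changing"])

dash_post = " ".join(["word"] * 60) + " \u2014 done"
dash_flags = heuristic_flags(dash_post, 50, 100, ["game-changing"])
```

Checks like these are cheap to run and give the same result every time, which suits hard constraints such as word limits that an LLM judge may estimate inconsistently.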

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.
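The scoring steps in this section can be sketched roughly as follows. The JSON shape (`{"traits": {...}}`), the function name, and the plain-mean normalization are assumptions; the page does not publish the exact formula, nor does it say how heuristic flags change the final score, so that step is left out.

```python
import json


def normalized_score(judge_json: str) -> float:
    """Map 0-10 trait scores from a judge's JSON reply to a 0-100 score.

    Assumes a hypothetical payload of the form {"traits": {"name": number}}.
    """
    traits = json.loads(judge_json)["traits"]
    if not traits:
        raise ValueError("judge returned no trait scores")

    # Strict validation: every trait must be a number within 0-10.
    for name, value in traits.items():
        if not isinstance(value, (int, float)) or not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} out of range: {value!r}")

    # Mean trait score on the 0-10 scale, rescaled to 0-100.
    return round(10 * sum(traits.values()) / len(traits), 1)


reply = '{"traits": {"specificity": 8, "concision": 9, "structure": 7}}'
score = normalized_score(reply)
print(score)  # 80.0
```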