business benchmark collection

Content & Brand

Benchmarks for testing whether models can create clear, specific, non-generic business content that follows a brief and preserves a brand voice.

Which models can produce useful business content without generic AI sludge?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

gpt-5.5-pro

84.0

Lowest cost / eval

gpt-5.4-nano

$0.0163

Median rank score

79.08

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across shades of green, and anything below the top ten is red.

Legend: the color scale runs from "Below top 10" to "#1".
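The per-benchmark ranking behind the cell colors can be sketched as follows. This is a minimal illustration, not the site's actual code: the function name `rank_buckets`, the two-bucket simplification (one "green" instead of several shades), and the tie handling are all assumptions. The five scores are taken from the Brief Adherence Test column, and `top_n` is lowered to 3 so the small sample shows both buckets.

```python
def rank_buckets(scores, top_n=10):
    """Rank models within one benchmark and assign a color bucket.

    Hypothetical helper: ranks are 1-based, higher score ranks first,
    and ties keep their input order because sorted() is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {
        model: (rank, "green" if rank <= top_n else "red")
        for rank, model in enumerate(ordered, start=1)
    }


# Five Brief Adherence Test scores copied from the heatmap table.
brief_adherence = {
    "gpt-5.5-pro": 83.0,
    "claude-opus-4.7": 84.7,
    "gpt-5.4": 82.7,
    "grok-4.20-beta": 57.0,
    "gemini-3.1-pro-preview": 25.3,
}

buckets = rank_buckets(brief_adherence, top_n=3)
for model, (rank, color) in buckets.items():
    print(rank, model, color)
```

With the real page, `top_n` would stay at 10 and the green bucket would be subdivided into shades by rank.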

Rank	Model	Scored tests	Overall	Generic Copy Index	Empty Insight Test	Brief Adherence Test	Point of View Test
1	gpt-5.5-pro	12	84.0	84.3	84.3	83.0	84.3
2	claude-opus-4.7	12	82.8	85.0	82.3	84.7	79.0
3	gpt-5.4	12	82.2	81.3	82.0	82.7	83.0
4	claude-opus-4.8-low	12	82.2	84.7	84.0	80.7	79.7
5	claude-opus-4.8	12	82.1	81.7	82.3	83.3	81.0
6	claude-opus-4.8-high	12	81.7	84.3	80.0	82.3	80.0
7	claude-sonnet-4.6	12	81.4	81.7	82.3	81.3	80.3
8	kimi-k2.5	12	81.3	79.7	82.7	80.7	82.3
9	claude-opus-4.6-high	12	81.1	82.7	80.3	82.3	79.0
10	qwen3.7-max	12	80.9	82.7	80.0	79.7	81.3
11	gpt-5.5	12	80.2	82.3	80.3	76.7	81.7
12	glm-5	12	79.1	79.3	77.7	80.3	79.0
13	gemini-3.5-flash-high	12	78.9	80.0	78.0	77.3	80.3
14	claude-opus-4.6	12	78.3	74.7	79.3	80.3	79.0
15	gpt-5.4-nano	12	78.1	82.0	77.0	74.7	78.7
16	qwen3.5-plus-02-15	12	77.9	77.7	78.3	76.7	79.0
17	gpt-5.4-mini	12	76.4	77.7	80.0	69.0	79.0
18	gemini-3-flash-preview	12	76.3	81.3	78.7	64.3	81.0
19	deepseek-v3.2	12	71.1	71.3	75.3	68.0	69.7
20	glm-5.1	12	70.7	80.3	80.3	61.3	60.7
21	grok-4.20-beta	12	70.6	78.0	73.3	57.0	74.0
22	minimax-m2.7	12	65.0	71.0	41.0	72.0	76.0
23	gemini-3.1-pro-preview	12	58.5	69.0	79.3	25.3	60.3
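
The Overall column appears to be the equal-weight mean of the four benchmark columns. That formula is inferred from the numbers, not stated on the page, so treat the check below as a sanity test rather than the documented method. It uses three rows copied from the table above and allows a small gap because the published benchmark scores are rounded to one decimal.

```python
# Rows copied from the heatmap table: (published overall, four benchmark scores).
rows = {
    "gpt-5.5-pro": (84.0, [84.3, 84.3, 83.0, 84.3]),
    "claude-opus-4.7": (82.8, [85.0, 82.3, 84.7, 79.0]),
    "gemini-3.1-pro-preview": (58.5, [69.0, 79.3, 25.3, 60.3]),
}


def mean(values):
    """Plain arithmetic mean (assumed equal weight per benchmark)."""
    return sum(values) / len(values)


for model, (overall, benchmarks) in rows.items():
    gap = abs(mean(benchmarks) - overall)
    print(f"{model}: mean={mean(benchmarks):.3f} published={overall} gap={gap:.3f}")
```

Each gap stays below 0.1, which is consistent with rounding alone.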

Full leaderboard

Quality, cost, and speed

Model	Score	Rating	Tests	Avg cost / task	Avg seconds / task	Frequent problems
gpt-5.5-pro	84.0	Strong	12/12	$0.1899	64.3s	-
claude-opus-4.7	82.75	Strong	12/12	$0.0305	24.3s	Wrapper text
gpt-5.4	82.25	Strong	12/12	$0.0219	19.4s	-
claude-opus-4.8-low	82.25	Strong	12/12	$0.0301	22.1s	-
claude-opus-4.8	82.08	Strong	12/12	$0.0299	21.4s	-
claude-opus-4.8-high	81.67	Strong	12/12	$0.0292	20.9s	-
claude-sonnet-4.6	81.42	Strong	12/12	$0.0221	21.1s	Wrapper text
kimi-k2.5	81.33	Strong	12/12	$0.0187	59.9s	-
claude-opus-4.6-high	81.08	Strong	12/12	$0.0254	24.8s	-
qwen3.7-max	80.92	Strong	12/12	$0.0190	86.1s	-
gpt-5.5	80.25	Strong	12/12	$0.0280	21.2s	Unsupported invention, Over word count
glm-5	79.08	Usable	12/12	$0.0173	42.7s	Malformed output, Under word count
gemini-3.5-flash-high	78.92	Usable	12/12	$0.0341	21.2s	Unsupported invention
claude-opus-4.6	78.33	Usable	12/12	$0.0251	28.2s	Unsupported invention
gpt-5.4-nano	78.08	Usable	12/12	$0.0163	14.4s	Under word count
qwen3.5-plus-02-15	77.92	Usable	12/12	$0.0173	99.7s	-
gpt-5.4-mini	76.42	Usable	12/12	$0.0168	14.1s	Under word count
gemini-3-flash-preview	76.33	Usable	12/12	$0.0189	17.9s	Under word count, Incomplete output
deepseek-v3.2	71.08	Usable	12/12	$0.0167	20.8s	Under word count, Outside word count, Banned phrase, Contains em dash
glm-5.1	70.67	Usable	12/12	$0.0173	55.0s	Wrapper text, Incomplete output, Malformed output, Outside word count
grok-4.20-beta	70.58	Usable	12/12	$0.0317	23.4s	Over word count, Wrapper text, Outside word count, Under word count
minimax-m2.7	65.0	Needs editing	12/12	$0.0198	41.9s	Incomplete output, Unsupported invention, Wrapper text, Over word count
gemini-3.1-pro-preview	58.5	Weak	12/12	$0.0348	25.5s	Incomplete output, Under word count, Missing required element, Missing specific example
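
Two of the "At a glance" figures can be reproduced from this leaderboard. The sketch below assumes that "Median rank score" means the median of the Score column and that "Lowest cost / eval" means the smallest average cost per task; neither definition is spelled out on the page. The `leaderboard` dictionary is just the table's score and cost columns copied by hand.

```python
from statistics import median

# model -> (score, average cost per task in USD), copied from the leaderboard.
leaderboard = {
    "gpt-5.5-pro": (84.0, 0.1899),
    "claude-opus-4.7": (82.75, 0.0305),
    "gpt-5.4": (82.25, 0.0219),
    "claude-opus-4.8-low": (82.25, 0.0301),
    "claude-opus-4.8": (82.08, 0.0299),
    "claude-opus-4.8-high": (81.67, 0.0292),
    "claude-sonnet-4.6": (81.42, 0.0221),
    "kimi-k2.5": (81.33, 0.0187),
    "claude-opus-4.6-high": (81.08, 0.0254),
    "qwen3.7-max": (80.92, 0.0190),
    "gpt-5.5": (80.25, 0.0280),
    "glm-5": (79.08, 0.0173),
    "gemini-3.5-flash-high": (78.92, 0.0341),
    "claude-opus-4.6": (78.33, 0.0251),
    "gpt-5.4-nano": (78.08, 0.0163),
    "qwen3.5-plus-02-15": (77.92, 0.0173),
    "gpt-5.4-mini": (76.42, 0.0168),
    "gemini-3-flash-preview": (76.33, 0.0189),
    "deepseek-v3.2": (71.08, 0.0167),
    "glm-5.1": (70.67, 0.0173),
    "grok-4.20-beta": (70.58, 0.0317),
    "minimax-m2.7": (65.0, 0.0198),
    "gemini-3.1-pro-preview": (58.5, 0.0348),
}

# With 23 rows the median is the 12th score when sorted.
median_score = median(score for score, _ in leaderboard.values())

# Cheapest model by average cost per task.
cheapest = min(leaderboard, key=lambda model: leaderboard[model][1])

print(median_score)  # 79.08
print(cheapest)      # gpt-5.4-nano
```

Both results match the summary cards: 79.08 and gpt-5.4-nano at $0.0163.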

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

Test	Test ID	Benchmark	Avg	Max	Min	Top model	Lowest model	Frequent problems
AI consulting LinkedIn post	content_generic_001	Generic Copy Index	79.9	86.0	74.0	claude-opus-4.7 · 86	glm-5 · 74	Wrapper text ×2, Under word count ×1
B2B SaaS launch announcement	content_generic_002	Generic Copy Index	80.4	86.0	60.0	claude-sonnet-4.6 · 86	deepseek-v3.2 · 60	Under word count ×1, Unsupported invention ×1
Founder newsletter intro about product lessons	content_generic_003	Generic Copy Index	78.8	85.0	40.0	claude-opus-4.7 · 85	gemini-3.1-pro-preview · 40	Under word count ×1, Incomplete output ×1
AI transformation without saying anything	content_empty_001	Empty Insight Test	78.1	85.0	10.0	claude-opus-4.7 · 85	minimax-m2.7 · 10	Over word count ×1, Incomplete output ×1, Under word count ×1
Remote work productivity lessons	content_empty_002	Empty Insight Test	78.7	85.0	38.0	claude-sonnet-4.6 · 85	minimax-m2.7 · 38	Over word count ×3, Wrapper text ×1, Incomplete output ×1
Customer research beats internal opinions	content_empty_003	Empty Insight Test	77.8	85.0	72.0	gpt-5.5-pro · 85	grok-4.20-beta · 72	Outside word count ×1
Casual founder update	content_brief_001	Brief Adherence Test	71.1	85.0	26.0	claude-opus-4.7 · 85	gemini-3.1-pro-preview · 26	Under word count ×5, Incomplete output ×2, Banned phrase ×1
Technical blog intro with no hype	content_brief_002	Brief Adherence Test	78.9	86.0	26.0	claude-opus-4.7 · 86	gemini-3.1-pro-preview · 26	Under word count ×1, Incomplete output ×1, Over word count ×1
Agency case study summary with no exaggerated claims	content_brief_003	Brief Adherence Test	72.2	84.0	24.0	glm-5 · 84	gemini-3.1-pro-preview · 24	Under word count ×3, Unsupported invention ×3, Over word count ×1
Opinionated AI evals post	content_pov_001	Point of View Test	81.3	86.0	73.0	gpt-5.5-pro · 86	deepseek-v3.2 · 73	Wrapper text ×2, Outside word count ×1
Copilots fail because workflows are unclear	content_pov_002	Point of View Test	74.8	85.0	19.0	kimi-k2.5 · 85	glm-5.1 · 19	Under word count ×2, Outside word count ×2, Incomplete output ×2
Founder-led sales is not optional	content_pov_003	Point of View Test	77.1	85.0	67.0	gpt-5.5-pro · 85	minimax-m2.7 · 67	Over word count ×1, Malformed output ×1

Model profiles

Strengths, weaknesses, and tradeoffs

gpt-5.5-pro

12 scored tests · Strong

84.0

Highest traits

specificity 8.51

constraint adherence 8.5

concision 8.5

voice and tone 8.5

practical takeaway 8.47

Lowest traits

structure 7.93

task completion 8.23

tone control 8.27

originality 8.3

tone fit 8.37

claude-opus-4.7

12 scored tests · Strong

82.75

Highest traits

tone fit 8.7

voice and tone 8.6

constraint adherence 8.5

task completion 8.4

concision 8.4

Lowest traits

tone control 8.07

structure 8.17

argument quality 8.18

substance 8.23

practical takeaway 8.27

gpt-5.4

12 scored tests · Strong

82.25

Highest traits

constraint adherence 8.5

practical takeaway 8.43

tone fit 8.37

tone control 8.3

voice and tone 8.3

Lowest traits

structure 7.93

originality 7.97

concision 8.0

task completion 8.07

clarity 8.17

claude-opus-4.8-low

12 scored tests · Strong

82.25

Highest traits

concision 8.5

voice and tone 8.5

clarity 8.4

substance 8.4

originality 8.4

Lowest traits

task completion 7.83

structure 7.93

tone control 8.0

argument quality 8.15

practical takeaway 8.17

claude-opus-4.8

12 scored tests · Strong

82.08

Highest traits

constraint adherence 8.5

tone fit 8.47

clarity 8.33

voice and tone 8.33

task completion 8.3

Lowest traits

originality 7.93

structure 8.0

tone control 8.13

practical takeaway 8.13

concision 8.13

claude-opus-4.8-high

12 scored tests · Strong

81.67

Highest traits

voice and tone 8.5

concision 8.4

tone fit 8.33

constraint adherence 8.3

originality 8.23

Lowest traits

substance 7.9

structure 7.93

argument quality 8.05

practical takeaway 8.07

tone control 8.1

claude-sonnet-4.6

12 scored tests · Strong

81.42

Highest traits

voice and tone 8.43

constraint adherence 8.33

tone fit 8.3

clarity 8.27

argument quality 8.17

Lowest traits

structure 7.83

task completion 7.97

tone control 8.07

specificity 8.13

practical takeaway 8.13

kimi-k2.5

12 scored tests · Strong

81.33

Highest traits

clarity 8.4

practical takeaway 8.33

tone fit 8.23

substance 8.23

specificity 8.2

Lowest traits

originality 7.67

structure 7.83

task completion 8.0

tone control 8.03

concision 8.07

claude-opus-4.6-high

12 scored tests · Strong

81.08

Highest traits

constraint adherence 8.5

tone fit 8.37

voice and tone 8.33

concision 8.23

clarity 8.17

Lowest traits

tone control 7.63

structure 7.93

argument quality 7.97

task completion 7.97

substance 8.0

qwen3.7-max

12 scored tests · Strong

80.92

Highest traits

voice and tone 8.4

practical takeaway 8.23

specificity 8.18

constraint adherence 8.17

concision 8.17

Lowest traits

structure 7.67

task completion 7.87

substance 7.9

argument quality 8.02

usefulness 8.05

gpt-5.5

12 scored tests · Strong

80.25

Highest traits

tone fit 8.33

voice and tone 8.33

concision 8.3

specificity 8.28

practical takeaway 8.23

Lowest traits

constraint adherence 7.33

task completion 7.67

originality 7.97

structure 8.0

tone control 8.03

glm-5

12 scored tests · Usable

79.08

Highest traits

constraint adherence 8.33

tone fit 8.23

practical takeaway 8.2

voice and tone 8.03

concision 7.97

Lowest traits

structure 7.5

clarity 7.5

tone control 7.73

task completion 7.77

substance 7.8

Frequent problems

Where models break

Under word count 15

Incomplete output 9

Wrapper text 8

Over word count 8

Outside word count 4

Unsupported invention 4

Missing required element 3

Malformed output 2

Missing specific example 2

Contains em dash 2

Missing specific lesson 1

Banned phrase 1

Example outputs

Winner, middle, failure

winner

gpt-5.5-pro · 86

The response perfectly executes the prompt with exceptional domain specificity (e.g., 'anecdotal Slack feedback', 'increasing refusals'), meets the strict word count constraint, and maintains a highly authentic, expert tone devoid of generic AI sludge.

Most companies do not have an AI model problem. They have an evaluation problem.

It’s tempting to believe the next model release will fix everything: better reasoning, bigger context, lower latency. And yes, model choice matters.

But for most AI product teams, the bottleneck is not access to intelligence. It’s knowing, with confidence, whether the system is getting better.

Teams ship prompts, RAG changes, tool calls, guardrails, fine-tunes, and agent workflows — then judge quality by a few demos, anecdotal Slack feedback, or whether the last scary failure disappeared. That makes progress feel fast but fragile.

Without evaluation, you can’t answer basic product questions:

• Did this change improve accuracy or just verbosity?  
• Are we better on high-value customer cases or only easy ones?  
• Did we reduce hallucinations while increasing refusals?  
• Is a cheaper model good enough for 80% of traffic?  
• Which failures are acceptable, and which block launch?

A fair counterpoint: some use cases really do need better models — especially frontier reasoning, specialized domains, strict latency/cost constraints, or multimodal performance. But even there, evaluation is how you prove that “better” translates into product value.

Practical takeaway: before swapping models again, build a small eval set from real user tasks, define success and failure clearly, run it on every change, and make model decisions from evidence — not vibes.

failure

minimax-m2.7 · 10

The model's response is severely truncated, failing to meet the word count constraint and missing almost all required elements.

**AI Success Starts with Process, Not Just Tools**

A growing number of firms treat AI as a plug‑and‑play product, buying the latest model and expecting immediate results. The reality is that without rethinking the underlying workflows, data pipelines, and decision‑making culture, even the most sophisticated algorithm

mid performer

gemini-3.5-flash-high · 80

The model perfectly adhered to all constraints, including the strict word count and negative constraints, while capturing an authentic, non-launchy founder tone that requires zero editing.

We just shipped the first version of our evaluation dashboard. It is still very early and admittedly quite rough around the edges, but the core utility is there. Right now, you can use it to compare how different prompts perform across various models.

The goal is to help developers move past vibes-based testing and get actual data on what works. Building in public means sharing things before they are fully polished, and this dashboard is definitely in that raw phase. We need real-world feedback to figure out what to fix first.

Next on our roadmap is adding role-based benchmark packs to make targeted testing easier. If you are currently struggling to keep track of how your prompts behave when you switch models, I would appreciate you giving this a spin. Let me know what breaks and what we should improve.

Methodology

How scores are produced

Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, and the result is normalized to a 0-100 score.

LLM judge

A stable judge model scores each benchmark-specific trait from 0 to 10.
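
As a rough illustration of the judge step, the sketch below parses a strict-JSON judge reply and maps 0-10 trait scores onto the 0-100 scale. The page does not publish the reply schema or the aggregation rule, so the `traits` field name, the function `normalize_judge_output`, and the "mean of traits times ten" formula are all assumptions, and the way heuristic flags feed into the final score is left out entirely.

```python
import json


def normalize_judge_output(raw: str) -> float:
    """Parse a judge reply and convert 0-10 trait scores to a 0-100 score.

    Assumed reply shape (hypothetical): {"traits": {"<trait name>": <0-10>}}.
    """
    data = json.loads(raw)  # strict: anything that is not valid JSON raises
    traits = data["traits"]
    if not traits:
        raise ValueError("judge returned no trait scores")
    for name, value in traits.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} has an invalid score: {value!r}")
    # Assumed aggregation: unweighted mean of traits, rescaled to 0-100.
    return round(sum(traits.values()) / len(traits) * 10, 2)


reply = '{"traits": {"specificity": 8.5, "concision": 8.0, "structure": 7.5}}'
print(normalize_judge_output(reply))  # 80.0
```

Rejecting malformed replies outright, rather than trying to repair them, keeps a judge formatting failure from silently turning into a score.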

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
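
A deterministic check of this kind might look like the sketch below. The flag labels are borrowed from the "Frequent problems" lists on this page, but the function `run_heuristics`, its parameters, and the matching rules (whitespace word count, case-insensitive substring match) are hypothetical; format-validity and safety checks are omitted.

```python
def run_heuristics(text, min_words, max_words, banned_phrases, required_elements):
    """Return problem flags for one model output (illustrative rules only)."""
    flags = []

    # Length check: words are counted by splitting on whitespace.
    word_count = len(text.split())
    if word_count < min_words:
        flags.append("Under word count")
    elif word_count > max_words:
        flags.append("Over word count")

    lowered = text.lower()
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        flags.append("Banned phrase")
    if any(element.lower() not in lowered for element in required_elements):
        flags.append("Missing required element")

    # U+2014 is the em dash character.
    if "\u2014" in text:
        flags.append("Contains em dash")
    return flags


sample = "Practical takeaway: game-changing results \u2014 fast."
flags = run_heuristics(
    sample,
    min_words=50,
    max_words=150,
    banned_phrases=["game-changing"],
    required_elements=["practical takeaway"],
)
print(flags)  # ['Under word count', 'Banned phrase', 'Contains em dash']
```

Because these checks are pure string operations, the same output always produces the same flags, which is what makes them a useful counterweight to a model-based judge.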

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.