business benchmark collection

Landing Pages

Benchmarks for testing whether models can turn a product brief into a clear, persuasive, conversion-aware landing page.

Which models can create landing pages that are clear, specific, persuasive, and buildable?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.8-high

83.08

Lowest cost / eval

qwen3.5-plus-02-15

$0.0113

Median rank score

79.67

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten ranks are shaded in greens, and anything below the top ten is red.

Rank	Model	Scored tests	Overall	Hero Clarity Test	Five-Second Clarity Test	Objection Handling Test	Landing Page Structure Test
1	claude-opus-4.8-high	12	83.1	82.3	83.0	85.3	81.7
2	kimi-k2.5	12	82.8	81.0	79.7	85.0	85.3
3	claude-opus-4.6-high	12	82.5	85.0	83.7	80.0	81.3
4	claude-opus-4.8-low	12	82.4	82.7	79.0	84.7	83.3
5	qwen3.7-max	12	82.2	80.7	82.0	85.0	81.0
6	claude-opus-4.7	12	81.4	80.3	79.0	85.3	81.0
7	claude-opus-4.8	12	81.3	78.7	80.0	85.3	81.3
8	gemini-3-flash-preview	12	81.2	80.7	79.0	84.0	81.0
9	qwen3.5-plus-02-15	12	80.8	77.3	82.0	82.0	81.7
10	gpt-5.5-pro	12	80.7	78.7	81.3	81.3	81.3
11	claude-opus-4.6	12	80.3	81.7	80.7	81.0	78.0
12	gpt-5.4-mini	12	79.7	77.7	77.3	83.3	80.3
13	gpt-5.5	12	79.7	75.7	79.3	82.7	81.0
14	glm-5.1	12	79.3	77.3	80.0	83.7	76.3
15	gpt-5.4	12	78.7	82.3	77.3	78.0	77.0
16	deepseek-v3.2	12	78.2	75.3	78.7	79.7	79.0
17	gpt-5.4-nano	12	77.5	77.7	79.3	72.3	80.7
18	claude-sonnet-4.6	12	76.8	77.0	79.0	79.3	71.7
19	gemini-3.5-flash-high	12	76.1	77.7	80.3	84.3	62.0
20	glm-5	12	75.6	75.0	79.0	81.3	67.0
21	grok-4.20-beta	12	75.1	72.7	78.3	68.7	80.7
22	gemini-3.1-pro-preview	12	74.5	78.7	82.0	85.0	52.3
23	minimax-m2.7	12	67.8	72.7	76.7	69.3	52.7
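
The coloring rule stated above the table (rank within a benchmark, top ten in green, everything else in red) could be sketched roughly as follows. This is an illustration only, not the site's code: the function name `rank_buckets`, the `top_n` parameter, and the tie handling (tied scores share a rank) are assumptions, and the separate green shades are collapsed into a single "green" bucket. The usage lines take five Hero Clarity Test scores from the table and use `top_n=3` only to keep the example short.

```python
def rank_buckets(scores, top_n=10):
    """Map each model to (rank, color bucket) within one benchmark column.

    Assumption: tied scores share the same rank (competition ranking).
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    result = {}
    prev_score, prev_rank = None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        # Reuse the previous rank when the score is tied with the row above.
        rank = prev_rank if score == prev_score else position
        result[model] = (rank, "green" if rank <= top_n else "red")
        prev_score, prev_rank = score, rank
    return result


# Five Hero Clarity Test scores copied from the table above.
hero_clarity = {
    "claude-opus-4.6-high": 85.0,
    "claude-opus-4.8-low": 82.7,
    "claude-opus-4.8-high": 82.3,
    "gpt-5.4": 82.3,
    "kimi-k2.5": 81.0,
}

buckets = rank_buckets(hero_clarity, top_n=3)
for model, (rank, color) in buckets.items():
    print(rank, model, color)
```

With the page's actual rule, `top_n` would stay at its default of 10 and the input would be a full 23-model column.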

Full leaderboard

Quality, cost, and speed

Model	Score	Tests	Avg cost / task	Avg seconds / task	Frequent problems
claude-opus-4.8-high	83.08 Strong	12/12	$0.0349	24.1s	-
kimi-k2.5	82.75 Strong	12/12	$0.0130	64.0s	Unsupported invention
claude-opus-4.6-high	82.5 Strong	12/12	$0.0365	37.0s	Incomplete output Unsupported invention
claude-opus-4.8-low	82.42 Strong	12/12	$0.0361	24.6s	Wrapper text
qwen3.7-max	82.17 Strong	12/12	$0.0123	57.6s	-
claude-opus-4.7	81.42 Strong	12/12	$0.0394	31.5s	Incomplete output Unsupported invention Wrapper text
claude-opus-4.8	81.33 Strong	12/12	$0.0340	22.7s	Wrapper text
gemini-3-flash-preview	81.17 Strong	12/12	$0.0168	18.1s	Unsupported invention
qwen3.5-plus-02-15	80.75 Strong	12/12	$0.0113	56.3s	-
gpt-5.5-pro	80.67 Strong	12/12	$0.1658	49.3s	-
claude-opus-4.6	80.33 Strong	12/12	$0.0361	36.7s	Incomplete output Unsupported invention
gpt-5.4-mini	79.67 Usable	12/12	$0.0146	14.0s	-
gpt-5.5	79.67 Usable	12/12	$0.0309	22.9s	-
glm-5.1	79.33 Usable	12/12	$0.0120	64.2s	Incomplete output
gpt-5.4	78.67 Usable	12/12	$0.0220	21.3s	Wrapper text
deepseek-v3.2	78.17 Usable	12/12	$0.0122	32.9s	-
gpt-5.4-nano	77.5 Usable	12/12	$0.0129	14.4s	Wrapper text Missing required element
claude-sonnet-4.6	76.75 Usable	12/12	$0.0287	32.7s	Incomplete output Unsupported invention Wrapper text
gemini-3.5-flash-high	76.08 Usable	12/12	$0.0286	20.2s	Incomplete output Missing required element
glm-5	75.58 Usable	12/12	$0.0124	48.5s	Wrapper text Incomplete output Missing required element
grok-4.20-beta	75.08 Usable	12/12	$0.0138	17.3s	Unsupported invention Wrapper text
gemini-3.1-pro-preview	74.5 Usable	12/12	$0.0336	27.2s	Incomplete output
minimax-m2.7	67.83 Needs editing	12/12	$0.0124	36.9s	Unsupported invention Incomplete output Missing required element
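
The "At a glance" figures can be recomputed from the leaderboard rows above. The sketch below is a hypothetical helper (`at_a_glance` and the row layout are not from the site); it assumes the median is taken over the 23 listed model scores and that each model contributes 12 completed runs.

```python
from statistics import median

# (model, score, avg cost per task in USD), copied from the leaderboard table.
ROWS = [
    ("claude-opus-4.8-high", 83.08, 0.0349),
    ("kimi-k2.5", 82.75, 0.0130),
    ("claude-opus-4.6-high", 82.5, 0.0365),
    ("claude-opus-4.8-low", 82.42, 0.0361),
    ("qwen3.7-max", 82.17, 0.0123),
    ("claude-opus-4.7", 81.42, 0.0394),
    ("claude-opus-4.8", 81.33, 0.0340),
    ("gemini-3-flash-preview", 81.17, 0.0168),
    ("qwen3.5-plus-02-15", 80.75, 0.0113),
    ("gpt-5.5-pro", 80.67, 0.1658),
    ("claude-opus-4.6", 80.33, 0.0361),
    ("gpt-5.4-mini", 79.67, 0.0146),
    ("gpt-5.5", 79.67, 0.0309),
    ("glm-5.1", 79.33, 0.0120),
    ("gpt-5.4", 78.67, 0.0220),
    ("deepseek-v3.2", 78.17, 0.0122),
    ("gpt-5.4-nano", 77.5, 0.0129),
    ("claude-sonnet-4.6", 76.75, 0.0287),
    ("gemini-3.5-flash-high", 76.08, 0.0286),
    ("glm-5", 75.58, 0.0124),
    ("grok-4.20-beta", 75.08, 0.0138),
    ("gemini-3.1-pro-preview", 74.5, 0.0336),
    ("minimax-m2.7", 67.83, 0.0124),
]


def at_a_glance(rows, tests_per_model=12):
    """Summarize the leaderboard: top model, cheapest model, median score."""
    top = max(rows, key=lambda r: r[1])
    cheapest = min(rows, key=lambda r: r[2])
    return {
        "top_model": top[0],
        "top_score": top[1],
        "lowest_cost_model": cheapest[0],
        "lowest_cost": cheapest[2],
        "median_score": median(r[1] for r in rows),
        "completed_runs": len(rows) * tests_per_model,
    }


summary = at_a_glance(ROWS)
print(summary)
```

Under these assumptions the helper reproduces the published values: claude-opus-4.8-high at 83.08, qwen3.5-plus-02-15 at $0.0113, a median of 79.67, and 276 completed runs.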

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

Test	Benchmark	Avg	Max	Min	Top model	Lowest model	Frequent problems
AI bookkeeping tool for freelancers landing_hero_001	Hero Clarity Test	76.8	86.0	70.0	claude-opus-4.6-high · 86	minimax-m2.7 · 70	Wrapper text ×2
Compliance tracker for clinics landing_hero_002	Hero Clarity Test	78.9	84.0	73.0	kimi-k2.5 · 84	qwen3.5-plus-02-15 · 73	Wrapper text ×2
Hiring platform for hospitality teams landing_hero_003	Hero Clarity Test	80.2	86.0	74.0	claude-opus-4.7 · 86	claude-opus-4.8 · 74	Wrapper text ×3 Unsupported invention ×1
Internal knowledge bot for law firms landing_5sec_001	Five-Second Clarity Test	81.4	86.0	74.0	claude-opus-4.6-high · 86	claude-opus-4.8 · 74	Wrapper text ×1
AI survey analysis tool landing_5sec_002	Five-Second Clarity Test	81.0	85.0	74.0	claude-opus-4.7 · 85	claude-opus-4.8-low · 74	Wrapper text ×1
Warehouse shift planning software landing_5sec_003	Five-Second Clarity Test	77.2	82.0	68.0	gemini-3.1-pro-preview · 82	claude-opus-4.7 · 68	Unsupported invention ×7
AI customer support agent objections landing_objection_001	Objection Handling Test	80.1	86.0	65.0	claude-opus-4.8 · 86	minimax-m2.7 · 65	Unsupported invention ×2 Missing required element ×1
Payroll software migration landing_objection_002	Objection Handling Test	79.9	85.0	60.0	gpt-5.4-mini · 85	minimax-m2.7 · 60	Unsupported invention ×4 Wrapper text ×2
Automated outbound sales tool landing_objection_003	Objection Handling Test	83.5	86.0	76.0	claude-opus-4.7 · 86	gpt-5.4-nano · 76	-
B2B analytics tool page structure landing_structure_001	Landing Page Structure Test	73.2	85.0	32.0	gpt-5.5 · 85	gemini-3.5-flash-high · 32	Incomplete output ×7 Missing required element ×3 Wrapper text ×1
AI writing platform for law firms landing_structure_002	Landing Page Structure Test	82.3	86.0	70.0	claude-opus-4.7 · 86	claude-sonnet-4.6 · 70	Incomplete output ×3
Personal finance app for expats landing_structure_003	Landing Page Structure Test	73.7	86.0	40.0	kimi-k2.5 · 86	minimax-m2.7 · 40	Incomplete output ×6 Wrapper text ×3

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.8-high

12 scored tests · Strong

83.08

Highest traits

credibility 8.6
risk handling 8.6
tone 8.5
objection coverage 8.5
audience fit 8.38

Lowest traits

differentiation 7.9
proof placement 8.0
cta quality 8.03
buildability 8.13
structure 8.17

kimi-k2.5

12 scored tests · Strong

82.75

Highest traits

risk handling 8.53
objection coverage 8.5
structure 8.47
buildability 8.47
proof placement 8.47

Lowest traits

trust 7.73
cta quality 7.82
differentiation 7.97
message clarity 8.2
clarity 8.23

claude-opus-4.6-high

12 scored tests · Strong

82.5

Highest traits

audience fit 8.67
clarity 8.5
differentiation 8.43
conversion logic 8.4
specificity 8.39

Lowest traits

credibility 7.7
objection coverage 8.1
cta quality 8.12
structure 8.17
buildability 8.17

claude-opus-4.8-low

12 scored tests · Strong

82.42

Highest traits

credibility 8.5
risk handling 8.5
objection coverage 8.5
structure 8.43
audience fit 8.43

Lowest traits

cta quality 8.0
trust 8.07
differentiation 8.07
message clarity 8.23
buildability 8.3

qwen3.7-max

12 scored tests · Strong

82.17

Highest traits

credibility 8.5
risk handling 8.5
objection coverage 8.47
tone 8.37
specificity 8.28

Lowest traits

differentiation 7.5
cta quality 7.85
structure 8.03
conversion logic 8.07
buildability 8.1

claude-opus-4.7

12 scored tests · Strong

81.42

Highest traits

credibility 8.6
risk handling 8.6
tone 8.5
objection coverage 8.5
audience fit 8.47

Lowest traits

trust 6.93
cta quality 7.95
buildability 8.0
differentiation 8.03
structure 8.1

claude-opus-4.8

12 scored tests · Strong

81.33

Highest traits

credibility 8.7
risk handling 8.7
tone 8.6
objection coverage 8.5
audience fit 8.37

Lowest traits

differentiation 7.83
cta quality 7.9
structure 8.03
proof placement 8.07
conversion logic 8.13

gemini-3-flash-preview

12 scored tests · Strong

81.17

Highest traits

credibility 8.5
risk handling 8.47
objection coverage 8.33
audience fit 8.28
tone 8.23

Lowest traits

differentiation 7.57
trust 7.67
cta quality 7.83
structure 8.0
buildability 8.07

qwen3.5-plus-02-15

12 scored tests · Strong

80.75

Highest traits

credibility 8.3
risk handling 8.3
objection coverage 8.23
message clarity 8.2
buildability 8.17

Lowest traits

differentiation 6.73
cta quality 7.83
specificity 8.05
tone 8.1
audience fit 8.12

gpt-5.5-pro

12 scored tests · Strong

80.67

Highest traits

risk handling 8.27
credibility 8.2
message clarity 8.2
buildability 8.17
conversion logic 8.17

Lowest traits

differentiation 7.4
cta quality 7.78
tone 8.0
trust 8.0
structure 8.03

claude-opus-4.6

12 scored tests · Strong

80.33

Highest traits

tone 8.5
audience fit 8.4
clarity 8.4
objection coverage 8.33
proof placement 8.27

Lowest traits

structure 7.5
trust 7.7
buildability 7.73
differentiation 7.73
credibility 7.83

gpt-5.4-mini

12 scored tests · Usable

79.67

Highest traits

credibility 8.5
risk handling 8.5
objection coverage 8.33
tone 8.17
buildability 8.13

Lowest traits

cta quality 7.33
differentiation 7.33
audience fit 7.85
specificity 7.86
trust 7.9

Frequent problems

Where models break

Incomplete output 16

Wrapper text 15

Unsupported invention 14

Missing required element 4
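
These totals match the per-test "Frequent problems" counts in the test case table. A minimal sketch of that tally, assuming the totals are a plain sum of the per-test counts (the dictionary layout and variable names are illustrative, not taken from the site):

```python
from collections import Counter

# Per-test problem counts, copied from the "Test cases" table.
PER_TEST = {
    "landing_hero_001": {"Wrapper text": 2},
    "landing_hero_002": {"Wrapper text": 2},
    "landing_hero_003": {"Wrapper text": 3, "Unsupported invention": 1},
    "landing_5sec_001": {"Wrapper text": 1},
    "landing_5sec_002": {"Wrapper text": 1},
    "landing_5sec_003": {"Unsupported invention": 7},
    "landing_objection_001": {"Unsupported invention": 2, "Missing required element": 1},
    "landing_objection_002": {"Unsupported invention": 4, "Wrapper text": 2},
    "landing_objection_003": {},
    "landing_structure_001": {"Incomplete output": 7, "Missing required element": 3, "Wrapper text": 1},
    "landing_structure_002": {"Incomplete output": 3},
    "landing_structure_003": {"Incomplete output": 6, "Wrapper text": 3},
}

totals = Counter()
for problems in PER_TEST.values():
    # Counter.update adds the counts from each per-test mapping.
    totals.update(problems)

for problem, count in totals.most_common():
    print(problem, count)
```

Most of the "Incomplete output" flags come from the three Landing Page Structure tests, which is consistent with that benchmark having the widest score spread.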

Example outputs

Winner, middle, failure

winner

kimi-k2.5 · 86

The model demonstrates unusually deep task-specific judgment by incorporating highly accurate niche details (FBAR, FATCA, data residency, repatriation) into a sophisticated, highly buildable conversion flow.

 **1. The Expat Reality Check (Hero)**
* **Title:** "Your Money Lives in Three Countries. Your Dashboard Should Too."
* **Purpose:** Immediate pattern interruption + identification of the core struggle; establish the "multi-jurisdictional" pain point within 3 seconds.
* **Key Copy Points:**
  * Headline acknowledges the fragmentation (accounts in London, taxes in Berlin, investments in Singapore).
  * Subhead: "One dashboard to track real-time currency exposure, cross-border tax liabilities, and account balances—without the 3 AM spreadsheet panic."
  * Primary CTA: "Create your expat money map" ( prominent, high-contrast button).
  * Risk reversal: "Free for 30 days. No credit card required to map your first country."
* **Visual/Component:** Interactive world map visualization showing animated data lines connecting 3-4 major financial hubs (e.g., NYC, London, Tokyo) converging into a single, clean mobile interface. Auto-plays on loop to demonstrate consolidation.

**2. The Hidden Tax Trap (Problem Agitation)**
* **Title:** "Are You Accidentally Tax Non-Compliant?"
* **Purpose:** Agitate the specific anxiety of expat tax complexity (FBAR, FATCA, double taxation treaties) to heighten urgency.
* **Key Copy Points:**
  * "You track expenses in euros, earn in dollars, and owe taxes in both."
  * "Missing one FBAR filing can cost $10,000. We auto-flag reporting thresholds across your jurisdictions."
  * "Currency swings just ate your rent budget—again."
* **Visual/Component:** Split-screen "Anxiety Dashboard" vs. "Clarity Dashboard." Left side: chaotic spreadsheet screenshots, red alert icons, and overlapping calendar reminders. Right side: Clean, green-status tax checklist with countdown timers to filing deadlines per country.

**3. The Money Map Method (Solution)**
* **Titl

failure

gemini-3.5-flash-high · 32

The model's response cuts off abruptly after the second section, failing to provide a complete landing page structure, missing the final CTA, and failing to fully address the constraints.

Here is a conversion-aware landing page structure tailored for B2B SaaS growth and product teams.

---

### Section 1: The Hero
* **Title:** Stop Guessing Why Your Free Trials Don't Convert
* **Purpose:** Instantly grab the attention of growth and product leaders by stating the exact problem solved, the primary benefit, and a low-friction call to action.
* **Key Copy Points:**
* **Headline:** Turn trial drop-offs into paying customers.
* **Subheadline:** Map every friction point in your trial user journey automatically. See exactly where users get stuck, fix the leaks, and boost your trial-to-paid conversion rate by an average of 18%.
* **Primary CTA Button:** Request a Demo
* **Micro-copy (under CTA):** *See your trial bottlenecks in a 15-minute personalized walkthrough.*
* **Suggested Visual/Component:** A high-fidelity, interactive mock-up of the dashboard. It should show a user funnel with a highlighted drop-off point (e.g., a red warning circle over the step: *"Invited team member - 72% drop-off"*).

---

### Section 2: Integration & Speed Bar
* **Title:** Plugs Directly Into Your Existing Growth Stack
* **Purpose:** Build immediate technical credibility and handle the "how much dev work is this?" objection right out of the gate.
* **Key Copy Points:**
* No new tracking scripts required. We sync with the tools you already use.
* Setup is usually complete in under 2 weeks.
* One-click integrations with: **Segment**, **HubSpot**, **Stripe**, and **PostHog**.
* **Suggested Visual/Component:** A clean, horizontal logo bar featuring Segment, HubSpot, Stripe, and PostHog, with a subtle "Connected" green check

mid performer

deepseek-v3.2 · 81

The model perfectly follows all constraints, delivering clear, concise, and highly relevant copy that directly addresses the target audience's objections with the provided proof points.

**Headline:**  
Your Firm's Knowledge. Your Private AI. Zero Hallucinations.

**Subheadline:**  
A secure internal AI assistant that answers questions directly from your approved documents—and cites every source. Deployed inside your firm's private cloud, never on public servers.

**Benefit Bullets:**
- **Confidentiality Assured:** Operates entirely within your firm’s private cloud—no external data exposure.
- **Answers You Can Trust:** Every response is grounded in your internal documents and includes verifiable source citations.
- **Instant Expertise:** Quickly find precise information across memos, case files, and policies without manual searches.

**Primary CTA:**  
Book a Security Review

**Secondary CTA:**  
See a Confidential Demo

Methodology

How scores are produced

Each model output is scored by an LLM judge that returns strict JSON, supported by deterministic heuristics, and the result is then normalized to a 0-100 score.

LLM judge

A stable judge model scores each benchmark-specific trait from 0 to 10.
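
As a rough illustration of how 0-10 trait scores could map onto the 0-100 scale, the sketch below takes an unweighted mean and multiplies by ten. This is an assumption: the page does not publish the actual weighting, calibration, or how heuristics adjust the result, and the function name `normalize` and the example trait values are hypothetical.

```python
from statistics import mean


def normalize(trait_scores):
    """Turn 0-10 trait scores into one 0-100 score (assumed: unweighted mean x 10)."""
    for name, value in trait_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} is outside the 0-10 range: {value}")
    return round(mean(trait_scores.values()) * 10, 2)


# Hypothetical judge output for a single test.
example = {"clarity": 8.0, "tone": 9.0, "cta quality": 7.0}
score = normalize(example)
print(score)
```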

Heuristics

Deterministic checks cover output length, banned phrases, required sections, format validity, and safety flags.
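
A minimal sketch of what three of these checks (length, banned phrases, required sections) might look like. Everything here is hypothetical: the function name `run_checks`, the flag strings, the word-count limit, and the case-insensitive substring matching are assumptions rather than the benchmark's actual rules, and format validity and safety checks are left out.

```python
def run_checks(text, required_sections, banned_phrases, max_words):
    """Return a list of flag strings for one model output (empty list = clean)."""
    flags = []

    # Length check: count whitespace-separated words.
    if len(text.split()) > max_words:
        flags.append("too_long")

    lowered = text.lower()

    # Banned phrases: case-insensitive substring match.
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")

    # Required sections: each label must appear somewhere in the output.
    for section in required_sections:
        if section.lower() not in lowered:
            flags.append(f"missing_section:{section}")

    return flags


output = "Headline: Ship faster\nPrimary CTA: Book a demo"
flags = run_checks(
    output,
    required_sections=["Headline", "Subheadline", "Primary CTA"],
    banned_phrases=["game-changing"],
    max_words=50,
)
print(flags)
```

Because the checks are deterministic, the same output always yields the same flags, which makes them a stable complement to the LLM judge.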

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.