business benchmark collection

Landing Pages

Benchmarks for testing whether models can turn a product brief into a clear, persuasive, conversion-aware landing page.

Which models can create landing pages that are clear, specific, persuasive, and buildable?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model: claude-opus-4.8-high (83.08)

Lowest cost / eval: qwen3.5-plus-02-15 ($0.0113)

Median rank score: 79.67

Last refresh: 2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across shades of green, and anything below the top ten is red.
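The coloring rule above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names, the five-shade palette, the hex values, and the tie handling (tied scores share the better rank) are all assumptions.

```python
def rank_within_benchmark(scores: dict[str, float]) -> dict[str, int]:
    """Rank models on one benchmark, 1 = best. Tied scores share a rank (assumption)."""
    ordered = sorted(scores.values(), reverse=True)
    return {model: ordered.index(score) + 1 for model, score in scores.items()}


def rank_color(rank: int, top_n: int = 10) -> str:
    """Map a within-benchmark rank to a hex color."""
    # Hypothetical palette: darkest green for the best ranks, lighter toward rank 10.
    greens = ["#00441b", "#1b7837", "#5aae61", "#a6dba0", "#d9f0d3"]
    red = "#d73027"
    if rank < 1:
        raise ValueError("rank must be >= 1")
    if rank > top_n:
        return red
    # Spread the top_n ranks evenly across the available green shades.
    bucket = (rank - 1) * len(greens) // top_n
    return greens[bucket]


if __name__ == "__main__":
    # Objection Handling Test scores for three models, copied from the heatmap.
    objection = {"claude-opus-4.8-high": 85.3, "kimi-k2.5": 85.0, "claude-opus-4.7": 85.3}
    for model, rank in rank_within_benchmark(objection).items():
        print(model, rank, rank_color(rank))
```

With ten ranks and five shades, each shade covers two adjacent ranks; a ten-shade palette would give every top-ten rank its own color.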

| Rank | Model | Scored tests | Overall | Hero Clarity Test | Five-Second Clarity Test | Objection Handling Test | Landing Page Structure Test |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4.8-high | 12 | 83.1 | 82.3 | 83.0 | 85.3 | 81.7 |
| 2 | kimi-k2.5 | 12 | 82.8 | 81.0 | 79.7 | 85.0 | 85.3 |
| 3 | claude-opus-4.6-high | 12 | 82.5 | 85.0 | 83.7 | 80.0 | 81.3 |
| 4 | claude-opus-4.8-low | 12 | 82.4 | 82.7 | 79.0 | 84.7 | 83.3 |
| 5 | qwen3.7-max | 12 | 82.2 | 80.7 | 82.0 | 85.0 | 81.0 |
| 6 | claude-opus-4.7 | 12 | 81.4 | 80.3 | 79.0 | 85.3 | 81.0 |
| 7 | claude-opus-4.8 | 12 | 81.3 | 78.7 | 80.0 | 85.3 | 81.3 |
| 8 | gemini-3-flash-preview | 12 | 81.2 | 80.7 | 79.0 | 84.0 | 81.0 |
| 9 | qwen3.5-plus-02-15 | 12 | 80.8 | 77.3 | 82.0 | 82.0 | 81.7 |
| 10 | gpt-5.5-pro | 12 | 80.7 | 78.7 | 81.3 | 81.3 | 81.3 |
| 11 | claude-opus-4.6 | 12 | 80.3 | 81.7 | 80.7 | 81.0 | 78.0 |
| 12 | gpt-5.4-mini | 12 | 79.7 | 77.7 | 77.3 | 83.3 | 80.3 |
| 13 | gpt-5.5 | 12 | 79.7 | 75.7 | 79.3 | 82.7 | 81.0 |
| 14 | glm-5.1 | 12 | 79.3 | 77.3 | 80.0 | 83.7 | 76.3 |
| 15 | gpt-5.4 | 12 | 78.7 | 82.3 | 77.3 | 78.0 | 77.0 |
| 16 | deepseek-v3.2 | 12 | 78.2 | 75.3 | 78.7 | 79.7 | 79.0 |
| 17 | gpt-5.4-nano | 12 | 77.5 | 77.7 | 79.3 | 72.3 | 80.7 |
| 18 | claude-sonnet-4.6 | 12 | 76.8 | 77.0 | 79.0 | 79.3 | 71.7 |
| 19 | gemini-3.5-flash-high | 12 | 76.1 | 77.7 | 80.3 | 84.3 | 62.0 |
| 20 | glm-5 | 12 | 75.6 | 75.0 | 79.0 | 81.3 | 67.0 |
| 21 | grok-4.20-beta | 12 | 75.1 | 72.7 | 78.3 | 68.7 | 80.7 |
| 22 | gemini-3.1-pro-preview | 12 | 74.5 | 78.7 | 82.0 | 85.0 | 52.3 |
| 23 | minimax-m2.7 | 12 | 67.8 | 72.7 | 76.7 | 69.3 | 52.7 |
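The Overall column appears to be the unweighted mean of the four benchmark columns. That is an inference from the published numbers, not something the page states, so the sketch below only checks it for three rows. The per-benchmark values come from the heatmap and the two-decimal overall scores from the full leaderboard; the function name and the 0.05 tolerance (to absorb one-decimal rounding of the benchmark columns) are assumptions.

```python
from statistics import mean

# Per-benchmark scores in column order: Hero, Five-Second, Objection, Structure.
benchmark_scores = {
    "claude-opus-4.8-high": [82.3, 83.0, 85.3, 81.7],
    "kimi-k2.5": [81.0, 79.7, 85.0, 85.3],
    "minimax-m2.7": [72.7, 76.7, 69.3, 52.7],
}

# Overall scores as published in the full leaderboard.
published_overall = {
    "claude-opus-4.8-high": 83.08,
    "kimi-k2.5": 82.75,
    "minimax-m2.7": 67.83,
}


def overall_from_benchmarks(scores: list[float]) -> float:
    """Assumed aggregation: equal weight per benchmark."""
    return mean(scores)


def gaps(tolerance: float = 0.05) -> dict[str, bool]:
    """Report whether each recomputed overall lands within tolerance of the published one."""
    return {
        model: abs(overall_from_benchmarks(scores) - published_overall[model]) <= tolerance
        for model, scores in benchmark_scores.items()
    }


if __name__ == "__main__":
    for model, ok in gaps().items():
        recomputed = overall_from_benchmarks(benchmark_scores[model])
        print(f"{model}: recomputed {recomputed:.2f}, published {published_overall[model]:.2f}, within tolerance: {ok}")
```

Because each benchmark contains three tests, an equal-weight mean over benchmarks is the same as an equal-weight mean over all 12 tests.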

Full leaderboard

Quality, cost, and speed

| Model | Score | Rating | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|---|
| claude-opus-4.8-high | 83.08 | Strong | 12/12 | $0.0349 | 24.1s | - |
| kimi-k2.5 | 82.75 | Strong | 12/12 | $0.0130 | 64.0s | Unsupported invention |
| claude-opus-4.6-high | 82.5 | Strong | 12/12 | $0.0365 | 37.0s | Incomplete output, Unsupported invention |
| claude-opus-4.8-low | 82.42 | Strong | 12/12 | $0.0361 | 24.6s | Wrapper text |
| qwen3.7-max | 82.17 | Strong | 12/12 | $0.0123 | 57.6s | - |
| claude-opus-4.7 | 81.42 | Strong | 12/12 | $0.0394 | 31.5s | Incomplete output, Unsupported invention, Wrapper text |
| claude-opus-4.8 | 81.33 | Strong | 12/12 | $0.0340 | 22.7s | Wrapper text |
| gemini-3-flash-preview | 81.17 | Strong | 12/12 | $0.0168 | 18.1s | Unsupported invention |
| qwen3.5-plus-02-15 | 80.75 | Strong | 12/12 | $0.0113 | 56.3s | - |
| gpt-5.5-pro | 80.67 | Strong | 12/12 | $0.1658 | 49.3s | - |
| claude-opus-4.6 | 80.33 | Strong | 12/12 | $0.0361 | 36.7s | Incomplete output, Unsupported invention |
| gpt-5.4-mini | 79.67 | Usable | 12/12 | $0.0146 | 14.0s | - |
| gpt-5.5 | 79.67 | Usable | 12/12 | $0.0309 | 22.9s | - |
| glm-5.1 | 79.33 | Usable | 12/12 | $0.0120 | 64.2s | Incomplete output |
| gpt-5.4 | 78.67 | Usable | 12/12 | $0.0220 | 21.3s | Wrapper text |
| deepseek-v3.2 | 78.17 | Usable | 12/12 | $0.0122 | 32.9s | - |
| gpt-5.4-nano | 77.5 | Usable | 12/12 | $0.0129 | 14.4s | Wrapper text, Missing required element |
| claude-sonnet-4.6 | 76.75 | Usable | 12/12 | $0.0287 | 32.7s | Incomplete output, Unsupported invention, Wrapper text |
| gemini-3.5-flash-high | 76.08 | Usable | 12/12 | $0.0286 | 20.2s | Incomplete output, Missing required element |
| glm-5 | 75.58 | Usable | 12/12 | $0.0124 | 48.5s | Wrapper text, Incomplete output, Missing required element |
| grok-4.20-beta | 75.08 | Usable | 12/12 | $0.0138 | 17.3s | Unsupported invention, Wrapper text |
| gemini-3.1-pro-preview | 74.5 | Usable | 12/12 | $0.0336 | 27.2s | Incomplete output |
| minimax-m2.7 | 67.83 | Needs editing | 12/12 | $0.0124 | 36.9s | Unsupported invention, Incomplete output, Missing required element |

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

| Test | Test ID | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|---|
| AI bookkeeping tool for freelancers | landing_hero_001 | Hero Clarity Test | 76.8 | 86.0 | 70.0 | claude-opus-4.6-high · 86 | minimax-m2.7 · 70 | Wrapper text ×2 |
| Compliance tracker for clinics | landing_hero_002 | Hero Clarity Test | 78.9 | 84.0 | 73.0 | kimi-k2.5 · 84 | qwen3.5-plus-02-15 · 73 | Wrapper text ×2 |
| Hiring platform for hospitality teams | landing_hero_003 | Hero Clarity Test | 80.2 | 86.0 | 74.0 | claude-opus-4.7 · 86 | claude-opus-4.8 · 74 | Wrapper text ×3, Unsupported invention ×1 |
| Internal knowledge bot for law firms | landing_5sec_001 | Five-Second Clarity Test | 81.4 | 86.0 | 74.0 | claude-opus-4.6-high · 86 | claude-opus-4.8 · 74 | Wrapper text ×1 |
| AI survey analysis tool | landing_5sec_002 | Five-Second Clarity Test | 81.0 | 85.0 | 74.0 | claude-opus-4.7 · 85 | claude-opus-4.8-low · 74 | Wrapper text ×1 |
| Warehouse shift planning software | landing_5sec_003 | Five-Second Clarity Test | 77.2 | 82.0 | 68.0 | gemini-3.1-pro-preview · 82 | claude-opus-4.7 · 68 | Unsupported invention ×7 |
| AI customer support agent objections | landing_objection_001 | Objection Handling Test | 80.1 | 86.0 | 65.0 | claude-opus-4.8 · 86 | minimax-m2.7 · 65 | Unsupported invention ×2, Missing required element ×1 |
| Payroll software migration | landing_objection_002 | Objection Handling Test | 79.9 | 85.0 | 60.0 | gpt-5.4-mini · 85 | minimax-m2.7 · 60 | Unsupported invention ×4, Wrapper text ×2 |
| Automated outbound sales tool | landing_objection_003 | Objection Handling Test | 83.5 | 86.0 | 76.0 | claude-opus-4.7 · 86 | gpt-5.4-nano · 76 | - |
| B2B analytics tool page structure | landing_structure_001 | Landing Page Structure Test | 73.2 | 85.0 | 32.0 | gpt-5.5 · 85 | gemini-3.5-flash-high · 32 | Incomplete output ×7, Missing required element ×3, Wrapper text ×1 |
| AI writing platform for law firms | landing_structure_002 | Landing Page Structure Test | 82.3 | 86.0 | 70.0 | claude-opus-4.7 · 86 | claude-sonnet-4.6 · 70 | Incomplete output ×3 |
| Personal finance app for expats | landing_structure_003 | Landing Page Structure Test | 73.7 | 86.0 | 40.0 | kimi-k2.5 · 86 | minimax-m2.7 · 40 | Incomplete output ×6, Wrapper text ×3 |
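Each row of the test table summarizes one prompt across all models. A minimal sketch of that aggregation follows; the function name and the placeholder model names and scores in the usage example are illustrative only, not published results.

```python
from statistics import mean


def summarize_test(scores: dict[str, float]) -> dict:
    """Collapse per-model scores for one test into the columns shown in the table."""
    if not scores:
        raise ValueError("at least one scored run is required")
    # On ties, max/min return the first model encountered (assumption about tie handling).
    top_model = max(scores, key=scores.get)
    lowest_model = min(scores, key=scores.get)
    return {
        "avg": round(mean(scores.values()), 1),
        "max": scores[top_model],
        "min": scores[lowest_model],
        "top_model": top_model,
        "lowest_model": lowest_model,
    }


if __name__ == "__main__":
    # Placeholder data for illustration only.
    example = {"model-a": 86, "model-b": 78, "model-c": 70}
    print(summarize_test(example))
```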

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.8-high

12 scored tests · Strong

83.08

Highest traits

credibility 8.6
risk handling 8.6
tone 8.5
objection coverage 8.5
audience fit 8.38

Lowest traits

differentiation 7.9
proof placement 8.0
cta quality 8.03
buildability 8.13
structure 8.17

kimi-k2.5

12 scored tests · Strong

82.75

Highest traits

risk handling 8.53
objection coverage 8.5
structure 8.47
buildability 8.47
proof placement 8.47

Lowest traits

trust 7.73
cta quality 7.82
differentiation 7.97
message clarity 8.2
clarity 8.23

claude-opus-4.6-high

12 scored tests · Strong

82.5

Highest traits

audience fit 8.67
clarity 8.5
differentiation 8.43
conversion logic 8.4
specificity 8.39

Lowest traits

credibility 7.7
objection coverage 8.1
cta quality 8.12
structure 8.17
buildability 8.17

claude-opus-4.8-low

12 scored tests · Strong

82.42

Highest traits

credibility 8.5
risk handling 8.5
objection coverage 8.5
structure 8.43
audience fit 8.43

Lowest traits

cta quality 8.0
trust 8.07
differentiation 8.07
message clarity 8.23
buildability 8.3

qwen3.7-max

12 scored tests · Strong

82.17

Highest traits

credibility 8.5
risk handling 8.5
objection coverage 8.47
tone 8.37
specificity 8.28

Lowest traits

differentiation 7.5
cta quality 7.85
structure 8.03
conversion logic 8.07
buildability 8.1

claude-opus-4.7

12 scored tests · Strong

81.42

Highest traits

credibility 8.6
risk handling 8.6
tone 8.5
objection coverage 8.5
audience fit 8.47

Lowest traits

trust 6.93
cta quality 7.95
buildability 8.0
differentiation 8.03
structure 8.1

claude-opus-4.8

12 scored tests · Strong

81.33

Highest traits

credibility 8.7
risk handling 8.7
tone 8.6
objection coverage 8.5
audience fit 8.37

Lowest traits

differentiation 7.83
cta quality 7.9
structure 8.03
proof placement 8.07
conversion logic 8.13

gemini-3-flash-preview

12 scored tests · Strong

81.17

Highest traits

credibility 8.5
risk handling 8.47
objection coverage 8.33
audience fit 8.28
tone 8.23

Lowest traits

differentiation 7.57
trust 7.67
cta quality 7.83
structure 8.0
buildability 8.07

qwen3.5-plus-02-15

12 scored tests · Strong

80.75

Highest traits

credibility 8.3
risk handling 8.3
objection coverage 8.23
message clarity 8.2
buildability 8.17

Lowest traits

differentiation 6.73
cta quality 7.83
specificity 8.05
tone 8.1
audience fit 8.12

gpt-5.5-pro

12 scored tests · Strong

80.67

Highest traits

risk handling 8.27
credibility 8.2
message clarity 8.2
buildability 8.17
conversion logic 8.17

Lowest traits

differentiation 7.4
cta quality 7.78
tone 8.0
trust 8.0
structure 8.03

claude-opus-4.6

12 scored tests · Strong

80.33

Highest traits

tone 8.5
audience fit 8.4
clarity 8.4
objection coverage 8.33
proof placement 8.27

Lowest traits

structure 7.5
trust 7.7
buildability 7.73
differentiation 7.73
credibility 7.83

gpt-5.4-mini

12 scored tests · Usable

79.67

Highest traits

credibility 8.5
risk handling 8.5
objection coverage 8.33
tone 8.17
buildability 8.13

Lowest traits

cta quality 7.33
differentiation 7.33
audience fit 7.85
specificity 7.86
trust 7.9

Frequent problems

Where models break

Incomplete output 16
Wrapper text 15
Unsupported invention 14
Missing required element 4

Example outputs

Winner, middle, failure

winner

kimi-k2.5 · 86

The model demonstrates unusually deep task-specific judgment by incorporating highly accurate niche details (FBAR, FATCA, data residency, repatriation) into a sophisticated, highly buildable conversion flow.

 **1. The Expat Reality Check (Hero)**
* **Title:** "Your Money Lives in Three Countries. Your Dashboard Should Too."
* **Purpose:** Immediate pattern interruption + identification of the core struggle; establish the "multi-jurisdictional" pain point within 3 seconds.
* **Key Copy Points:**
  * Headline acknowledges the fragmentation (accounts in London, taxes in Berlin, investments in Singapore).
  * Subhead: "One dashboard to track real-time currency exposure, cross-border tax liabilities, and account balances—without the 3 AM spreadsheet panic."
  * Primary CTA: "Create your expat money map" ( prominent, high-contrast button).
  * Risk reversal: "Free for 30 days. No credit card required to map your first country."
* **Visual/Component:** Interactive world map visualization showing animated data lines connecting 3-4 major financial hubs (e.g., NYC, London, Tokyo) converging into a single, clean mobile interface. Auto-plays on loop to demonstrate consolidation.

**2. The Hidden Tax Trap (Problem Agitation)**
* **Title:** "Are You Accidentally Tax Non-Compliant?"
* **Purpose:** Agitate the specific anxiety of expat tax complexity (FBAR, FATCA, double taxation treaties) to heighten urgency.
* **Key Copy Points:**
  * "You track expenses in euros, earn in dollars, and owe taxes in both."
  * "Missing one FBAR filing can cost $10,000. We auto-flag reporting thresholds across your jurisdictions."
  * "Currency swings just ate your rent budget—again."
* **Visual/Component:** Split-screen "Anxiety Dashboard" vs. "Clarity Dashboard." Left side: chaotic spreadsheet screenshots, red alert icons, and overlapping calendar reminders. Right side: Clean, green-status tax checklist with countdown timers to filing deadlines per country.

**3. The Money Map Method (Solution)**
* **Titl

failure

gemini-3.5-flash-high · 32

The model's response cuts off abruptly after the second section, failing to provide a complete landing page structure, missing the final CTA, and failing to fully address the constraints.

Here is a conversion-aware landing page structure tailored for B2B SaaS growth and product teams. 

---

### Section 1: The Hero
* **Title:** Stop Guessing Why Your Free Trials Don't Convert
* **Purpose:** Instantly grab the attention of growth and product leaders by stating the exact problem solved, the primary benefit, and a low-friction call to action.
* **Key Copy Points:**
  * **Headline:** Turn trial drop-offs into paying customers.
  * **Subheadline:** Map every friction point in your trial user journey automatically. See exactly where users get stuck, fix the leaks, and boost your trial-to-paid conversion rate by an average of 18%.
  * **Primary CTA Button:** Request a Demo
  * **Micro-copy (under CTA):** *See your trial bottlenecks in a 15-minute personalized walkthrough.*
* **Suggested Visual/Component:** A high-fidelity, interactive mock-up of the dashboard. It should show a user funnel with a highlighted drop-off point (e.g., a red warning circle over the step: *"Invited team member - 72% drop-off"*).

---

### Section 2: Integration & Speed Bar
* **Title:** Plugs Directly Into Your Existing Growth Stack
* **Purpose:** Build immediate technical credibility and handle the "how much dev work is this?" objection right out of the gate.
* **Key Copy Points:**
  * No new tracking scripts required. We sync with the tools you already use.
  * Setup is usually complete in under 2 weeks.
  * One-click integrations with: **Segment**, **HubSpot**, **Stripe**, and **PostHog**.
* **Suggested Visual/Component:** A clean, horizontal logo bar featuring Segment, HubSpot, Stripe, and PostHog, with a subtle "Connected" green check

mid performer

deepseek-v3.2 · 81

The model perfectly follows all constraints, delivering clear, concise, and highly relevant copy that directly addresses the target audience's objections with the provided proof points.

**Headline:**  
Your Firm's Knowledge. Your Private AI. Zero Hallucinations.

**Subheadline:**  
A secure internal AI assistant that answers questions directly from your approved documents—and cites every source. Deployed inside your firm's private cloud, never on public servers.

**Benefit Bullets:**
- **Confidentiality Assured:** Operates entirely within your firm’s private cloud—no external data exposure.
- **Answers You Can Trust:** Every response is grounded in your internal documents and includes verifiable source citations.
- **Instant Expertise:** Quickly find precise information across memos, case files, and policies without manual searches.

**Primary CTA:**  
Book a Security Review

**Secondary CTA:**  
See a Confidential Demo

Methodology

How scores are produced

Each model output is scored by an LLM judge that must return strict JSON, supported by deterministic heuristics, and the result is then normalized to a 0-100 score.
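The page does not publish the normalization formula. One plausible reading, sketched here under stated assumptions, is an equal-weight mean of the 0-10 trait scores scaled to 0-100, with an optional deduction for heuristic failures. The function name, the equal weighting, and the `penalty` parameter are all hypothetical.

```python
def normalize_score(trait_scores: dict[str, float], penalty: float = 0.0) -> float:
    """Turn 0-10 trait scores into a single 0-100 score (assumed formula)."""
    if not trait_scores:
        raise ValueError("at least one trait score is required")
    for trait, value in trait_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"trait {trait!r} is outside the 0-10 range: {value}")
    # Equal-weight mean of the traits, scaled from 0-10 to 0-100 (assumption).
    raw = 10 * sum(trait_scores.values()) / len(trait_scores)
    # Hypothetical heuristic deduction, then clamp to the published 0-100 range.
    return round(max(0.0, min(100.0, raw - penalty)), 2)


if __name__ == "__main__":
    print(normalize_score({"clarity": 8.0, "tone": 9.0}))
    print(normalize_score({"clarity": 8.0, "tone": 9.0}, penalty=5.0))
```

Trait averages in the model profiles sit mostly between 7 and 9 while overall scores sit between 67 and 84, which is at least compatible with a times-ten scaling, but the real weighting may differ.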

LLM judge

A stable judge model scores each benchmark-specific trait from 0 to 10.
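A strict JSON judge implies that replies with wrapper text, missing traits, or out-of-range values are rejected rather than repaired. The sketch below shows one way to enforce that with the standard `json` module; the reply shape (a flat object of trait name to number) and the function name are assumptions, since the page does not show the judge's schema.

```python
import json


def parse_judge_reply(reply: str, expected_traits: set[str]) -> dict[str, float]:
    """Parse a judge reply that must be a bare JSON object of trait -> 0-10 score."""
    # json.loads raises json.JSONDecodeError (a ValueError) if any prose surrounds the JSON.
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if set(data) != expected_traits:
        raise ValueError(f"unexpected trait set: {sorted(data)}")
    scores: dict[str, float] = {}
    for trait, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"trait {trait!r} is not numeric")
        if not 0 <= value <= 10:
            raise ValueError(f"trait {trait!r} is outside the 0-10 range")
        scores[trait] = float(value)
    return scores


if __name__ == "__main__":
    print(parse_judge_reply('{"clarity": 8, "tone": 8.5}', {"clarity", "tone"}))
```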

Heuristics

Deterministic checks catch length violations, banned phrases, missing required sections, invalid formats, and safety flags.
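Three of these checks (length, banned phrases, required sections) can be sketched as plain string tests. The function name, flag labels, word-count limit, and case-insensitive substring matching are assumptions for illustration; format validity and safety checks are left out because the page gives no detail on them.

```python
def run_heuristics(
    output: str,
    *,
    max_words: int,
    banned_phrases: list[str],
    required_sections: list[str],
) -> list[str]:
    """Return a list of flag strings; an empty list means every check passed."""
    flags: list[str] = []
    # Length check: whitespace-separated word count against a per-test limit.
    if len(output.split()) > max_words:
        flags.append("too_long")
    lowered = output.lower()
    # Banned phrases: case-insensitive substring match.
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")
    # Required sections: each label must appear somewhere in the output.
    for section in required_sections:
        if section.lower() not in lowered:
            flags.append(f"missing_section:{section}")
    return flags


if __name__ == "__main__":
    sample = "Headline: Ship faster\nPrimary CTA: Book a demo"
    print(run_heuristics(
        sample,
        max_words=100,
        banned_phrases=["game-changing"],
        required_sections=["Headline", "Subheadline", "Primary CTA"],
    ))
```

Because the checks are deterministic, the same output always yields the same flags, which makes them a useful cross-check on a judge whose scores can vary between calls.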

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.