Score vs. cost
Average task cost vs overall score
Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
business benchmark collection
Benchmarks for testing whether models can turn a product brief into a clear, persuasive, conversion-aware landing page.
Which models can create landing pages that are clear, specific, persuasive, and buildable?
At a glance
Top model
claude-opus-4.8-high
83.08
Lowest cost / eval
qwen3.5-plus-02-15
$0.0113
Median rank score
79.67
Last refresh
2026-06-02
Overall ranking
Higher is better. Scores come from completed judged runs.
Benchmark heatmap
Cells are colored by rank within each benchmark: the top ten are split across shades of green, and anything below the top ten is red.
| Rank | Model | Overall | Hero Clarity Test | Five-Second Clarity Test | Objection Handling Test | Landing Page Structure Test |
|---|---|---|---|---|---|---|
| 1 | 12 scored tests | 83.1 | 82.3 | 83.0 | 85.3 | 81.7 |
| 2 | 12 scored tests | 82.8 | 81.0 | 79.7 | 85.0 | 85.3 |
| 3 | 12 scored tests | 82.5 | 85.0 | 83.7 | 80.0 | 81.3 |
| 4 | 12 scored tests | 82.4 | 82.7 | 79.0 | 84.7 | 83.3 |
| 5 | 12 scored tests | 82.2 | 80.7 | 82.0 | 85.0 | 81.0 |
| 6 | 12 scored tests | 81.4 | 80.3 | 79.0 | 85.3 | 81.0 |
| 7 | 12 scored tests | 81.3 | 78.7 | 80.0 | 85.3 | 81.3 |
| 8 | 12 scored tests | 81.2 | 80.7 | 79.0 | 84.0 | 81.0 |
| 9 | 12 scored tests | 80.8 | 77.3 | 82.0 | 82.0 | 81.7 |
| 10 | 12 scored tests | 80.7 | 78.7 | 81.3 | 81.3 | 81.3 |
| 11 | 12 scored tests | 80.3 | 81.7 | 80.7 | 81.0 | 78.0 |
| 12 | 12 scored tests | 79.7 | 77.7 | 77.3 | 83.3 | 80.3 |
| 13 | 12 scored tests | 79.7 | 75.7 | 79.3 | 82.7 | 81.0 |
| 14 | 12 scored tests | 79.3 | 77.3 | 80.0 | 83.7 | 76.3 |
| 15 | 12 scored tests | 78.7 | 82.3 | 77.3 | 78.0 | 77.0 |
| 16 | 12 scored tests | 78.2 | 75.3 | 78.7 | 79.7 | 79.0 |
| 17 | 12 scored tests | 77.5 | 77.7 | 79.3 | 72.3 | 80.7 |
| 18 | 12 scored tests | 76.8 | 77.0 | 79.0 | 79.3 | 71.7 |
| 19 | 12 scored tests | 76.1 | 77.7 | 80.3 | 84.3 | 62.0 |
| 20 | 12 scored tests | 75.6 | 75.0 | 79.0 | 81.3 | 67.0 |
| 21 | 12 scored tests | 75.1 | 72.7 | 78.3 | 68.7 | 80.7 |
| 22 | 12 scored tests | 74.5 | 78.7 | 82.0 | 85.0 | 52.3 |
| 23 | 12 scored tests | 67.8 | 72.7 | 76.7 | 69.3 | 52.7 |
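
The Overall column appears to be the unweighted mean of the four benchmark columns. That is an inference from the published numbers, not a documented formula. A minimal Python check using the rank 1 and rank 2 rows:

```python
from statistics import mean

# Benchmark scores copied from the heatmap rows for rank 1 and rank 2.
# Column order: Hero Clarity, Five-Second Clarity, Objection Handling, Structure.
rows = {
    1: [82.3, 83.0, 85.3, 81.7],
    2: [81.0, 79.7, 85.0, 85.3],
}

def overall(benchmark_scores):
    """Assumed aggregation: plain average of the benchmark columns."""
    return mean(benchmark_scores)

for rank, scores in rows.items():
    print(rank, overall(scores))
```

Both results land within rounding distance of the published overall scores (83.08 and 82.75). The benchmark columns are themselves rounded to one decimal, so an exact match is not expected for every row.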
Full leaderboard
| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
| | 83.08 Strong | 12/12 | $0.0349 | 24.1s | - |
| | 82.75 Strong | 12/12 | $0.0130 | 64.0s | Unsupported invention |
| | 82.5 Strong | 12/12 | $0.0365 | 37.0s | Incomplete output, Unsupported invention |
| | 82.42 Strong | 12/12 | $0.0361 | 24.6s | Wrapper text |
| | 82.17 Strong | 12/12 | $0.0123 | 57.6s | - |
| | 81.42 Strong | 12/12 | $0.0394 | 31.5s | Incomplete output, Unsupported invention, Wrapper text |
| | 81.33 Strong | 12/12 | $0.0340 | 22.7s | Wrapper text |
| | 81.17 Strong | 12/12 | $0.0168 | 18.1s | Unsupported invention |
| | 80.75 Strong | 12/12 | $0.0113 | 56.3s | - |
| | 80.67 Strong | 12/12 | $0.1658 | 49.3s | - |
| | 80.33 Strong | 12/12 | $0.0361 | 36.7s | Incomplete output, Unsupported invention |
| | 79.67 Usable | 12/12 | $0.0146 | 14.0s | - |
| | 79.67 Usable | 12/12 | $0.0309 | 22.9s | - |
| | 79.33 Usable | 12/12 | $0.0120 | 64.2s | Incomplete output |
| | 78.67 Usable | 12/12 | $0.0220 | 21.3s | Wrapper text |
| | 78.17 Usable | 12/12 | $0.0122 | 32.9s | - |
| | 77.5 Usable | 12/12 | $0.0129 | 14.4s | Wrapper text, Missing required element |
| | 76.75 Usable | 12/12 | $0.0287 | 32.7s | Incomplete output, Unsupported invention, Wrapper text |
| | 76.08 Usable | 12/12 | $0.0286 | 20.2s | Incomplete output, Missing required element |
| | 75.58 Usable | 12/12 | $0.0124 | 48.5s | Wrapper text, Incomplete output, Missing required element |
| | 75.08 Usable | 12/12 | $0.0138 | 17.3s | Unsupported invention, Wrapper text |
| | 74.5 Usable | 12/12 | $0.0336 | 27.2s | Incomplete output |
| | 67.83 Needs editing | 12/12 | $0.0124 | 36.9s | Unsupported invention, Incomplete output, Missing required element |
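
One way to read the score-versus-cost view is to look for models that no other model beats on both axes at once. The sketch below computes that Pareto frontier from the Score and Avg cost / task columns of the leaderboard. `pareto_frontier` is a hypothetical helper, and nothing on the page says the chart itself draws such a frontier.

```python
# (avg cost per task in USD, score) pairs copied from the leaderboard, in rank order.
points = [
    (0.0349, 83.08), (0.0130, 82.75), (0.0365, 82.50), (0.0361, 82.42),
    (0.0123, 82.17), (0.0394, 81.42), (0.0340, 81.33), (0.0168, 81.17),
    (0.0113, 80.75), (0.1658, 80.67), (0.0361, 80.33), (0.0146, 79.67),
    (0.0309, 79.67), (0.0120, 79.33), (0.0220, 78.67), (0.0122, 78.17),
    (0.0129, 77.50), (0.0287, 76.75), (0.0286, 76.08), (0.0124, 75.58),
    (0.0138, 75.08), (0.0336, 74.50), (0.0124, 67.83),
]

def pareto_frontier(pairs):
    """Keep points that no cheaper-or-equal point outscores."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, higher score first.
    for cost, score in sorted(pairs, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier

frontier = pareto_frontier(points)
for cost, score in frontier:
    print(f"${cost:.4f}  {score:.2f}")
```

On these numbers the frontier has four points: $0.0113 at 80.75, $0.0123 at 82.17, $0.0130 at 82.75, and $0.0349 at 83.08. Every other row is matched or beaten on score by a cheaper model.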
Test cases
Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.
| Test | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|
| AI bookkeeping tool for freelancers `landing_hero_001` | Hero Clarity Test | 76.8 | 86.0 | 70.0 | claude-opus-4.6-high · 86 | minimax-m2.7 · 70 | Wrapper text ×2 |
| Compliance tracker for clinics `landing_hero_002` | Hero Clarity Test | 78.9 | 84.0 | 73.0 | kimi-k2.5 · 84 | qwen3.5-plus-02-15 · 73 | Wrapper text ×2 |
| Hiring platform for hospitality teams `landing_hero_003` | Hero Clarity Test | 80.2 | 86.0 | 74.0 | claude-opus-4.7 · 86 | claude-opus-4.8 · 74 | Wrapper text ×3, Unsupported invention ×1 |
| Internal knowledge bot for law firms `landing_5sec_001` | Five-Second Clarity Test | 81.4 | 86.0 | 74.0 | claude-opus-4.6-high · 86 | claude-opus-4.8 · 74 | Wrapper text ×1 |
| AI survey analysis tool `landing_5sec_002` | Five-Second Clarity Test | 81.0 | 85.0 | 74.0 | claude-opus-4.7 · 85 | claude-opus-4.8-low · 74 | Wrapper text ×1 |
| Warehouse shift planning software `landing_5sec_003` | Five-Second Clarity Test | 77.2 | 82.0 | 68.0 | gemini-3.1-pro-preview · 82 | claude-opus-4.7 · 68 | Unsupported invention ×7 |
| AI customer support agent objections `landing_objection_001` | Objection Handling Test | 80.1 | 86.0 | 65.0 | claude-opus-4.8 · 86 | minimax-m2.7 · 65 | Unsupported invention ×2, Missing required element ×1 |
| Payroll software migration `landing_objection_002` | Objection Handling Test | 79.9 | 85.0 | 60.0 | gpt-5.4-mini · 85 | minimax-m2.7 · 60 | Unsupported invention ×4, Wrapper text ×2 |
| Automated outbound sales tool `landing_objection_003` | Objection Handling Test | 83.5 | 86.0 | 76.0 | claude-opus-4.7 · 86 | gpt-5.4-nano · 76 | - |
| B2B analytics tool page structure `landing_structure_001` | Landing Page Structure Test | 73.2 | 85.0 | 32.0 | gpt-5.5 · 85 | gemini-3.5-flash-high · 32 | Incomplete output ×7, Missing required element ×3, Wrapper text ×1 |
| AI writing platform for law firms `landing_structure_002` | Landing Page Structure Test | 82.3 | 86.0 | 70.0 | claude-opus-4.7 · 86 | claude-sonnet-4.6 · 70 | Incomplete output ×3 |
| Personal finance app for expats `landing_structure_003` | Landing Page Structure Test | 73.7 | 86.0 | 40.0 | kimi-k2.5 · 86 | minimax-m2.7 · 40 | Incomplete output ×6, Wrapper text ×3 |
Example outputs
winner
The model demonstrates unusually deep task-specific judgment by incorporating highly accurate niche details (FBAR, FATCA, data residency, repatriation) into a sophisticated, highly buildable conversion flow.
**1. The Expat Reality Check (Hero)**

* **Title:** "Your Money Lives in Three Countries. Your Dashboard Should Too."
* **Purpose:** Immediate pattern interruption + identification of the core struggle; establish the "multi-jurisdictional" pain point within 3 seconds.
* **Key Copy Points:**
    * Headline acknowledges the fragmentation (accounts in London, taxes in Berlin, investments in Singapore).
    * Subhead: "One dashboard to track real-time currency exposure, cross-border tax liabilities, and account balances—without the 3 AM spreadsheet panic."
    * Primary CTA: "Create your expat money map" ( prominent, high-contrast button).
    * Risk reversal: "Free for 30 days. No credit card required to map your first country."
* **Visual/Component:** Interactive world map visualization showing animated data lines connecting 3-4 major financial hubs (e.g., NYC, London, Tokyo) converging into a single, clean mobile interface. Auto-plays on loop to demonstrate consolidation.

**2. The Hidden Tax Trap (Problem Agitation)**

* **Title:** "Are You Accidentally Tax Non-Compliant?"
* **Purpose:** Agitate the specific anxiety of expat tax complexity (FBAR, FATCA, double taxation treaties) to heighten urgency.
* **Key Copy Points:**
    * "You track expenses in euros, earn in dollars, and owe taxes in both."
    * "Missing one FBAR filing can cost $10,000. We auto-flag reporting thresholds across your jurisdictions."
    * "Currency swings just ate your rent budget—again."
* **Visual/Component:** Split-screen "Anxiety Dashboard" vs. "Clarity Dashboard." Left side: chaotic spreadsheet screenshots, red alert icons, and overlapping calendar reminders. Right side: Clean, green-status tax checklist with countdown timers to filing deadlines per country.

**3. The Money Map Method (Solution)**

* **Titl
failure
The model's response cuts off abruptly after the second section, failing to provide a complete landing page structure, missing the final CTA, and failing to fully address the constraints.
Here is a conversion-aware landing page structure tailored for B2B SaaS growth and product teams.

---

### Section 1: The Hero

* **Title:** Stop Guessing Why Your Free Trials Don't Convert
* **Purpose:** Instantly grab the attention of growth and product leaders by stating the exact problem solved, the primary benefit, and a low-friction call to action.
* **Key Copy Points:**
    * **Headline:** Turn trial drop-offs into paying customers.
    * **Subheadline:** Map every friction point in your trial user journey automatically. See exactly where users get stuck, fix the leaks, and boost your trial-to-paid conversion rate by an average of 18%.
    * **Primary CTA Button:** Request a Demo
    * **Micro-copy (under CTA):** *See your trial bottlenecks in a 15-minute personalized walkthrough.*
* **Suggested Visual/Component:** A high-fidelity, interactive mock-up of the dashboard. It should show a user funnel with a highlighted drop-off point (e.g., a red warning circle over the step: *"Invited team member - 72% drop-off"*).

---

### Section 2: Integration & Speed Bar

* **Title:** Plugs Directly Into Your Existing Growth Stack
* **Purpose:** Build immediate technical credibility and handle the "how much dev work is this?" objection right out of the gate.
* **Key Copy Points:**
    * No new tracking scripts required. We sync with the tools you already use.
    * Setup is usually complete in under 2 weeks.
    * One-click integrations with: **Segment**, **HubSpot**, **Stripe**, and **PostHog**.
* **Suggested Visual/Component:** A clean, horizontal logo bar featuring Segment, HubSpot, Stripe, and PostHog, with a subtle "Connected" green check
mid performer
The model perfectly follows all constraints, delivering clear, concise, and highly relevant copy that directly addresses the target audience's objections with the provided proof points.
**Headline:** Your Firm's Knowledge. Your Private AI. Zero Hallucinations.

**Subheadline:** A secure internal AI assistant that answers questions directly from your approved documents—and cites every source. Deployed inside your firm's private cloud, never on public servers.

**Benefit Bullets:**

- **Confidentiality Assured:** Operates entirely within your firm’s private cloud—no external data exposure.
- **Answers You Can Trust:** Every response is grounded in your internal documents and includes verifiable source citations.
- **Instant Expertise:** Quickly find precise information across memos, case files, and policies without manual searches.

**Primary CTA:** Book a Security Review

**Secondary CTA:** See a Confidential Demo
Methodology
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
LLM judge
A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
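
How the 0 to 10 trait scores roll up into the 0-100 score is not spelled out here. A minimal sketch, assuming equal trait weights and a plain rescale; the trait names, the sample reply, and the `normalize` helper are all hypothetical:

```python
import json

# Hypothetical judge reply: strict JSON with one 0-10 score per trait.
judge_reply = '{"clarity": 8, "specificity": 9, "persuasion": 8, "buildability": 7}'

def normalize(reply: str) -> float:
    """Average the trait scores and rescale from 0-10 to 0-100 (assumed formula)."""
    traits = json.loads(reply)  # raises json.JSONDecodeError on malformed JSON
    for name, value in traits.items():
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} out of range: {value}")
    return 10.0 * sum(traits.values()) / len(traits)

print(normalize(judge_reply))  # 80.0
```

Parsing with `json.loads` and rejecting out-of-range values is one way to keep the judge output strict. The real pipeline may weight traits differently or fold in heuristic penalties.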
Heuristics
Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
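
The deterministic checks might look something like the following. The word limit, banned phrases, required section names, wrapper-text pattern, and flag labels are all placeholders, since the real configuration is not published here; safety flags are left out.

```python
import re

BANNED_PHRASES = ["game-changing", "revolutionary"]  # placeholder list
REQUIRED_SECTIONS = ["Headline", "Primary CTA"]      # placeholder section names
MAX_WORDS = 400                                      # placeholder length limit

def run_heuristics(output: str) -> list[str]:
    """Return a list of flags; an empty list means every check passed."""
    flags = []
    if len(output.split()) > MAX_WORDS:
        flags.append("too_long")
    lowered = output.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            flags.append(f"banned_phrase:{phrase}")
    for section in REQUIRED_SECTIONS:
        # Expect bold labels such as "**Headline:**" in the output.
        if not re.search(rf"\*\*{re.escape(section)}:\*\*", output):
            flags.append(f"missing_section:{section}")
    # Assumed format check: a chatty preamble before the deliverable.
    if re.match(r"\s*(here is|here's|sure)\b", lowered):
        flags.append("wrapper_text")
    return flags

sample = (
    "**Headline:** Your Firm's Knowledge. Your Private AI.\n"
    "**Primary CTA:** Book a Security Review"
)
print(run_heuristics(sample))  # []
```

Because every check is a pure function of the output text, the same output always yields the same flags, which is what makes these checks a stable complement to the LLM judge.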
Calibrated ceiling
Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.