Score vs. cost
Average task cost vs overall score
Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
Business benchmark collection
Benchmarks for testing whether models can improve startup pitches, critique weak claims, and anticipate investor concerns.
Which models can make startup pitches clearer, more credible, and harder to pick apart?
At a glance
Top model: claude-opus-4.7 · 84.5
Lowest cost / eval: glm-5 · $0.0120
Median rank score: 78.75
Last refresh: 2026-06-02
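The page does not say how the median rank score is computed. As a quick check, and assuming it is simply the median of the 23 overall scores listed in the full leaderboard, the sketch below reproduces the reported figure.

```python
from statistics import median

# Overall scores copied from the "Full leaderboard" table (23 models).
scores = [
    84.5, 83.08, 82.83, 82.67, 82.25, 82.08, 79.83, 79.75,
    79.67, 79.33, 78.83, 78.75, 78.5, 78.42, 78.08, 77.5,
    76.17, 76.0, 75.58, 72.5, 71.92, 64.5, 46.25,
]

# With an odd number of models, the median is the middle (12th) score.
median_score = median(scores)
print(median_score)  # -> 78.75
```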
Overall ranking
Higher is better. Scores come from completed judged runs.
Benchmark heatmap
Cells are colored by rank within each benchmark: the top ten are shown in graded shades of green; anything below the top ten is red.
| Rank | Model | Overall | Founder Reality Check | Investor Question Test | Deck Doctor | Market Sizing Reality Check |
|---|---|---|---|---|---|---|
| 1 | 12 scored tests | 84.5 | 84.7 | 84.0 | 85.0 | 84.3 |
| 2 | 12 scored tests | 83.1 | 83.0 | 83.7 | 82.0 | 83.7 |
| 3 | 12 scored tests | 82.8 | 83.3 | 83.3 | 82.3 | 82.3 |
| 4 | 12 scored tests | 82.7 | 82.7 | 82.3 | 82.3 | 83.3 |
| 5 | 12 scored tests | 82.2 | 82.7 | 81.0 | 82.3 | 83.0 |
| 6 | 12 scored tests | 82.1 | 83.7 | 79.7 | 82.7 | 82.3 |
| 7 | 12 scored tests | 79.8 | 81.0 | 78.0 | 81.3 | 79.0 |
| 8 | 12 scored tests | 79.8 | 82.0 | 81.0 | 74.3 | 81.7 |
| 9 | 12 scored tests | 79.7 | 81.7 | 77.0 | 76.0 | 84.0 |
| 10 | 12 scored tests | 79.3 | 83.3 | 72.3 | 76.7 | 85.0 |
| 11 | 12 scored tests | 78.8 | 83.7 | 68.7 | 82.3 | 80.7 |
| 12 | 12 scored tests | 78.8 | 80.0 | 76.0 | 80.3 | 78.7 |
| 13 | 12 scored tests | 78.5 | 83.0 | 72.7 | 75.0 | 83.3 |
| 14 | 12 scored tests | 78.4 | 81.7 | 77.7 | 69.3 | 85.0 |
| 15 | 12 scored tests | 78.1 | 80.7 | 81.7 | 70.7 | 79.3 |
| 16 | 12 scored tests | 77.5 | 64.3 | 82.0 | 81.7 | 82.0 |
| 17 | 12 scored tests | 76.2 | 81.0 | 73.3 | 75.0 | 75.3 |
| 18 | 12 scored tests | 76.0 | 80.7 | 68.3 | 70.7 | 84.3 |
| 19 | 12 scored tests | 75.6 | 82.3 | 66.3 | 77.0 | 76.7 |
| 20 | 12 scored tests | 72.5 | 82.0 | 46.7 | 82.0 | 79.3 |
| 21 | 12 scored tests | 71.9 | 79.7 | 59.3 | 76.0 | 72.7 |
| 22 | 12 scored tests | 64.5 | 75.7 | 38.0 | 78.0 | 66.3 |
| 23 | 12 scored tests | 46.2 | 33.3 | 45.7 | 60.7 | 45.3 |
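The table does not state how the Overall column is derived. The published rows are consistent with an unweighted mean of the four benchmark columns, and the sketch below checks that reading on a few rows copied from the heatmap. The tolerance and the helper name `mean_of_benchmarks` are illustrative assumptions, not part of the benchmark's tooling.

```python
# (rank, published overall, [Founder Reality Check, Investor Question Test,
#                            Deck Doctor, Market Sizing Reality Check])
rows = [
    (1, 84.5, [84.7, 84.0, 85.0, 84.3]),
    (2, 83.1, [83.0, 83.7, 82.0, 83.7]),
    (8, 79.8, [82.0, 81.0, 74.3, 81.7]),
    (20, 72.5, [82.0, 46.7, 82.0, 79.3]),
    (23, 46.2, [33.3, 45.7, 60.7, 45.3]),
]


def mean_of_benchmarks(benchmark_scores):
    """Unweighted mean of the per-benchmark scores (assumed aggregation)."""
    return sum(benchmark_scores) / len(benchmark_scores)


# The published values are rounded to one decimal, so allow a small tolerance.
TOLERANCE = 0.1

mismatches = [
    rank
    for rank, overall, benchmark_scores in rows
    if abs(mean_of_benchmarks(benchmark_scores) - overall) > TOLERANCE
]
print(mismatches)  # -> []
```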
Full leaderboard
| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
| | 84.5 Strong | 12/12 | $0.0515 | 39.7s | - |
| | 83.08 Strong | 12/12 | $0.0672 | 47.5s | - |
| | 82.83 Strong | 12/12 | $0.0483 | 33.7s | - |
| | 82.67 Strong | 12/12 | $0.0510 | 33.7s | - |
| | 82.25 Strong | 12/12 | $0.0159 | 65.1s | Incomplete output |
| | 82.08 Strong | 12/12 | $0.0490 | 33.1s | Incomplete output |
| | 79.83 Usable | 12/12 | $0.0155 | 67.0s | Incomplete output, Missing required element |
| | 79.75 Usable | 12/12 | $0.0156 | 62.0s | Unsupported invention |
| | 79.67 Usable | 12/12 | $0.0143 | 64.0s | Unsupported invention, Incomplete output, Missing required element |
| | 79.33 Usable | 12/12 | $0.0581 | 60.7s | Incomplete output, Missing required element, Unsupported invention |
| | 78.83 Usable | 12/12 | $0.0382 | 47.7s | Incomplete output, Missing required element |
| | 78.75 Usable | 12/12 | $0.0342 | 34.8s | Wrapper text |
| | 78.5 Usable | 12/12 | $0.0188 | 20.7s | Unsupported invention, Incomplete output |
| | 78.42 Usable | 12/12 | $0.0158 | 20.0s | Unsupported invention |
| | 78.08 Usable | 12/12 | $0.0148 | 57.6s | Unsupported invention |
| | 77.5 Usable | 12/12 | $0.4846 | 81.1s | Incomplete output, Missing required element, Unsupported invention |
| | 76.17 Usable | 12/12 | $0.0191 | 19.2s | Wrapper text |
| | 76.0 Usable | 12/12 | $0.0602 | 60.2s | Incomplete output, Unsupported invention, Missing required element |
| | 75.58 Usable | 12/12 | $0.0178 | 21.3s | Wrapper text |
| | 72.5 Usable | 12/12 | $0.0374 | 31.1s | Incomplete output, Missing required element |
| | 71.92 Usable | 12/12 | $0.0315 | 20.8s | Incomplete output, Missing required element, Unsupported invention |
| | 64.5 Needs editing | 12/12 | $0.0120 | 61.2s | Incomplete output, Missing required element, Wrapper text |
| | 46.25 Weak | 12/12 | $0.0130 | 52.2s | Incomplete output, Missing required element, Unsupported invention, Wrapper text |
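One way to read the score and cost columns together is to ask which models are not beaten on both at once, that is, which sit on the cost/score Pareto frontier. The sketch below is an illustrative reading of the published numbers, not part of the benchmark itself. Rows are identified by their position in the leaderboard (1 = top), and the function name `pareto_frontier` is hypothetical.

```python
# (leaderboard row, overall score, average cost per task in USD), top to bottom.
rows = [
    (1, 84.5, 0.0515), (2, 83.08, 0.0672), (3, 82.83, 0.0483),
    (4, 82.67, 0.0510), (5, 82.25, 0.0159), (6, 82.08, 0.0490),
    (7, 79.83, 0.0155), (8, 79.75, 0.0156), (9, 79.67, 0.0143),
    (10, 79.33, 0.0581), (11, 78.83, 0.0382), (12, 78.75, 0.0342),
    (13, 78.5, 0.0188), (14, 78.42, 0.0158), (15, 78.08, 0.0148),
    (16, 77.5, 0.4846), (17, 76.17, 0.0191), (18, 76.0, 0.0602),
    (19, 75.58, 0.0178), (20, 72.5, 0.0374), (21, 71.92, 0.0315),
    (22, 64.5, 0.0120), (23, 46.25, 0.0130),
]


def pareto_frontier(rows):
    """Return the rows that no cheaper-or-equal row beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, take the higher score first.
    for row, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append(row)
            best_score = score
    return frontier


frontier = pareto_frontier(rows)
print(frontier)  # -> [22, 9, 7, 5, 3, 1]
```

On these numbers, six of the 23 rows are on the frontier. Every other row is matched or beaten on score by something cheaper.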
Test cases
Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.
| Test | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|
| Enterprise self-serve fantasy (pitch_reality_001) | Founder Reality Check | 79.1 | 86.0 | 52.0 | claude-opus-4.7 · 86 | gpt-5.5-pro · 52 | Incomplete output ×3, Wrapper text ×2, Missing required element ×1 |
| Marketplace with no supply strategy (pitch_reality_002) | Founder Reality Check | 78.9 | 85.0 | 10.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 10 | Incomplete output ×3, Missing required element ×1 |
| Consumer app with no retention data (pitch_reality_003) | Founder Reality Check | 78.8 | 84.0 | 10.0 | gemini-3-flash-preview · 84 | minimax-m2.7 · 10 | Incomplete output ×2, Missing required element ×1, Malformed output ×1 |
| AI customer support startup (pitch_questions_001) | Investor Question Test | 74.3 | 86.0 | 41.0 | glm-5.1 · 86 | minimax-m2.7 · 41 | Incomplete output ×10, Missing required element ×6, Wrapper text ×2 |
| Vertical SaaS for dentists (pitch_questions_002) | Investor Question Test | 70.6 | 85.0 | 34.0 | qwen3.7-max · 85 | glm-5 · 34 | Incomplete output ×9, Missing required element ×7, Wrapper text ×3 |
| AI education app for parents (pitch_questions_003) | Investor Question Test | 71.5 | 84.0 | 18.0 | claude-opus-4.7 · 84 | glm-5 · 18 | Incomplete output ×8, Missing required element ×4, Unsupported invention ×1 |
| Weak problem slide (pitch_deck_001) | Deck Doctor | 77.4 | 86.0 | 46.0 | claude-opus-4.7 · 86 | deepseek-v3.2 · 46 | Unsupported invention ×6 |
| Vague solution slide (pitch_deck_002) | Deck Doctor | 79.7 | 86.0 | 60.0 | gpt-5.5-pro · 86 | minimax-m2.7 · 60 | Unsupported invention ×5, Wrapper text ×1 |
| Overstuffed traction slide (pitch_deck_003) | Deck Doctor | 75.5 | 86.0 | 49.0 | kimi-k2.5 · 86 | minimax-m2.7 · 49 | Unsupported invention ×5 |
| Lazy TAM slide (pitch_market_001) | Market Sizing Reality Check | 81.4 | 85.0 | 61.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 61 | - |
| Global education market TAM (pitch_market_002) | Market Sizing Reality Check | 77.0 | 86.0 | 44.0 | grok-4.20-beta · 86 | glm-5 · 44 | Incomplete output ×4, Missing required element ×1 |
| Creator economy TAM (pitch_market_003) | Market Sizing Reality Check | 78.7 | 86.0 | 28.0 | claude-opus-4.6-high · 86 | minimax-m2.7 · 28 | Incomplete output ×5, Missing required element ×1 |
Example outputs
winner
The model provides an exceptionally realistic, expert-level critique with highly specific industry references (SignalFire, Tubular Labs, TubeBuddy), a precise breakdown of the GMV vs. SaaS fallacy, and a highly actionable revised bottom-up TAM.
# Critique: "1% of the Creator Economy" TAM Claim
## The Typical Claim
*"The creator economy is $250B+. We just need 1% = $2.5B addressable market."*
This is a textbook example of the "1% of China" fallacy dressed in creator economy clothing. Let me dismantle it layer by layer.
---
## Weaknesses
### 1. The Denominator is Wildly Inflated
- The "$250B creator economy" includes everything: OnlyFans payouts, Roblox developer earnings, Substack subscriptions, TikTok creator funds, Cameo, Patreon, merch platforms, and creator-focused venture investment itself. Your analytics tool touches almost none of these revenue streams.
- You don't serve "the creator economy." You serve **one platform** (YouTube), **one segment** (mid-tier), with **one value prop** (analytics + sponsorship workflow).
### 2. The Segment Math Doesn't Support It
- There are roughly **300,000-500,000** YouTube channels with 50k-500k subscribers globally.
- Many are inactive, non-English, hobbyist, or in niches where sponsorships are irrelevant (e.g., religious content, foreign-language repost channels, music lyric channels).
- Realistic serviceable channels might be **100,000-200,000**.
- At what ACV? A mid-tier YouTuber making $2k-$15k/month in total revenue is not paying $500/month for tooling. Realistic pricing: **$20-$80/month**, maybe $100-$200/month if sponsorship workflow genuinely closes deals.
- That gives you a **SAM of $24M-$192M/year**, not $2.5B.
### 3. "1% of X" Hides the Actual Go-to-Market Question
- The framing implies passive capture ("we just need...") when the real question is: **through what mechanism do you acquire and retain 100,000 creators who are notoriously fickle, tool-fatigued, and cost-sensitive?**
- Creators in this tier are underserved by sales teams but over-served by
failure
The model response is severely truncated, cutting off mid-sentence in the first section and completely missing the requested assumptions, realistic path, and tactful explanation.
## Quick Summary of the Plan

The founder is building a **local‑services marketplace** (e.g., plumbers, cleaners, tutors, handymen) that assumes **demand will automatically pull in supply**. The idea is:

1. **Build demand** (users looking for a service) through marketing, SEO, and community outreach.
2. **Expect providers** (the “supply”) to join on their own because they see a growing pool of potential customers.
3. **Avoid exclusive contracts** – providers can be on many platforms, and the market already has many incumbents (Thumbtack, HomeAdvisor, Angie’s List, Yelp, etc.).

The plan sounds elegant – “if you build it, they will come” – but local‑service marketplaces are notoriously **supply‑constrained**. Below is a structured review: **risks, assumptions to test, a realistic path forward, and a tactful explanation** of why the plan may need adjustment.

---

## 1. Risks

| Risk | Why It Matters | Typical Outcome if Unmitigated |
|------|----------------|--------------------------------|
| **Chicken‑and‑Egg Failure** | Without enough supply, demand sees poor selection → low conversion → high churn for both sides. | Marketplace never reaches liquidity; both sides leave. |
| **Supply‑Side Adoption Friction** | Providers are busy, risk‑averse, and already on other platforms. They need a clear value proposition beyond “you might get leads.” | Low provider activation, high drop‑out rate, poor supply density. |
| **Incumbent Advantage** | Established players have brand trust, SEO dominance, and large provider networks. | New entrant struggles to acquire demand organically; CAC rises. |
| **Trust & Quality Control** | Local services are high‑stakes (home entry, safety). Users need confidence in provider vetting. | Negative reviews, disputes, and liability claims
mid performer
The model provides highly realistic, actionable, and specific advice grounded in established marketplace theory, effectively challenging the founder's assumptions while maintaining a constructive and tactful tone.
Here is a review of the founder's plan, structured to be constructive while clearly highlighting the significant challenges associated with a "demand-first" strategy in a crowded local services market.

### 1. Tactful Explanation of the Core Issue

**The "Field of Dreams" Fallacy**

The current plan relies on the hypothesis that *demand creates supply*. In marketplace theory, this is often referred to as the "Field of Dreams" approach (*if you build it, they will come*). While this works for content platforms (like YouTube) or network effects where the product exists without the other side (like a messaging app), it is historically the highest-risk approach for **transactional local services marketplaces.**

In local services, the product *is* the supply. A customer searching for a plumber who finds no available providers does not have a "partial product experience"; they have a broken product. If they do not find a provider within minutes, they will leave for an incumbent (Thumbtack, Angi, Google) and likely never return.

**The "No Exclusivity" Challenge**

Because supply is non-exclusive, providers are already listed on multiple platforms. They have no incentive to monitor a new platform with low traffic. Without a guaranteed volume of high-quality leads, their attention will remain on the incumbents where they already have established reputations and workflows.

### 2. Key Risks

* **Liquidity Failure (The Empty Restaurant):** If a customer posts a request and does not receive a bid/booking within a short window (e.g., 1 hour for urgent, 24 hours for non-urgent), trust is broken permanently. High bounce rates on the demand side will spike Customer Acquisition Costs (CAC) because paid traffic will not convert.
* **Supply Churn:** If you do manage to onboard providers
Methodology
Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, and then normalized to a 0-100 score.
LLM judge
A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
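The page does not publish the judge's reply format or the aggregation rule. The sketch below shows one plausible shape for this step under stated assumptions: the reply is a JSON object with a `traits` mapping, and the 0-100 score is the unweighted mean of the 0-10 trait scores times ten. The function name, the key name, and the trait names are all hypothetical.

```python
import json


def normalize_judge_response(raw_json: str) -> float:
    """Parse a judge reply and map 0-10 trait scores to a 0-100 score.

    Assumed reply shape: {"traits": {"<trait name>": <number 0-10>, ...}}.
    Assumed aggregation: unweighted mean of the traits, times 10.
    """
    payload = json.loads(raw_json)  # strict: raises ValueError on non-JSON text
    traits = payload["traits"]
    if not isinstance(traits, dict) or not traits:
        raise ValueError("judge reply has no trait scores")
    for name, value in traits.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"trait {name!r} is not numeric")
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} is outside 0-10")
    return 10 * sum(traits.values()) / len(traits)


# Hypothetical trait names, for illustration only.
example = '{"traits": {"specificity": 9, "realism": 8, "tact": 8.5}}'
score = normalize_judge_response(example)
print(score)  # -> 85.0
```

Rejecting out-of-range or non-numeric values, rather than clamping them, keeps a malformed judge reply from silently turning into a score.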
Heuristics
Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
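As a rough illustration of what such checks could look like, the sketch below covers length, banned phrases, required sections, and format validity; safety flags are left out. The thresholds, flag strings, and the function name `run_heuristics` are assumptions made for this example, not the benchmark's actual rules.

```python
import json
import re


def run_heuristics(output, required_sections, banned_phrases,
                   min_chars=200, max_chars=8000, expect_json=False):
    """Return a list of problem strings found by simple deterministic checks."""
    problems = []
    text = output.strip()

    # Length check (bounds are illustrative defaults).
    if not (min_chars <= len(text) <= max_chars):
        problems.append("length")

    # Banned phrases, matched case-insensitively.
    lowered = text.lower()
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            problems.append(f"banned phrase: {phrase}")

    # A section counts as present if its name starts a line,
    # optionally after markdown heading marks such as "##".
    for section in required_sections:
        pattern = rf"^\s*#*\s*{re.escape(section)}"
        if not re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE):
            problems.append(f"missing section: {section}")

    # Format validity, shown here for outputs that must be JSON.
    if expect_json:
        try:
            json.loads(text)
        except ValueError:
            problems.append("invalid format")

    return problems


sample = (
    "## Risks\nSupply is thin.\n\n"
    "## Assumptions\nAs an AI language model, I think demand is untested."
)
problems = run_heuristics(
    sample,
    required_sections=["Risks", "Assumptions", "Realistic path"],
    banned_phrases=["As an AI language model"],
    min_chars=20,
)
print(problems)
# -> ['banned phrase: As an AI language model', 'missing section: Realistic path']
```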
Calibrated ceiling
Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.