business benchmark collection

Investor & Pitch

Benchmarks for testing whether models can improve startup pitches, critique weak claims, and anticipate investor concerns.

Which models can make startup pitches clearer, more credible, and harder to pick apart?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.7

84.5

Lowest cost / eval

glm-5

$0.0120

Median rank score

78.75

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.
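The coloring rule above can be sketched as a small function. This is a minimal illustration, not the page's actual implementation: the number of green shades (five here), the shade names, and the tie handling are all assumptions.

```python
def rank_within_benchmark(scores: dict[str, float]) -> dict[str, int]:
    """Rank models by score, highest first (rank 1 is best).

    Ties are broken by insertion order here; the real page may differ.
    """
    ordered = sorted(scores, key=lambda model: scores[model], reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def color_bucket(rank: int, top_n: int = 10, green_shades: int = 5) -> str:
    """Map a rank to a color bucket.

    Ranks 1..top_n are spread evenly over `green_shades` greens
    (green-1 is the strongest shade); anything below top_n is red.
    Both defaults are hypothetical.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank > top_n:
        return "red"
    shade = (rank - 1) * green_shades // top_n
    return f"green-{shade + 1}"
```

With these defaults, ranks 1 and 2 share the strongest green, ranks 9 and 10 share the weakest, and rank 11 onward is red.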

Color legend: red = below top 10; greens run up to #1.
| Rank | Model | Scored tests | Overall | Founder Reality Check | Investor Question Test | Deck Doctor | Market Sizing Reality Check |
|------|-------|--------------|---------|-----------------------|------------------------|-------------|-----------------------------|
| 1 | claude-opus-4.7 | 12 | 84.5 | 84.7 | 84.0 | 85.0 | 84.3 |
| 2 | gpt-5.5 | 12 | 83.1 | 83.0 | 83.7 | 82.0 | 83.7 |
| 3 | claude-opus-4.8 | 12 | 82.8 | 83.3 | 83.3 | 82.3 | 82.3 |
| 4 | claude-opus-4.8-low | 12 | 82.7 | 82.7 | 82.3 | 82.3 | 83.3 |
| 5 | qwen3.7-max | 12 | 82.2 | 82.7 | 81.0 | 82.3 | 83.0 |
| 6 | claude-opus-4.8-high | 12 | 82.1 | 83.7 | 79.7 | 82.7 | 82.3 |
| 7 | glm-5.1 | 12 | 79.8 | 81.0 | 78.0 | 81.3 | 79.0 |
| 8 | qwen3.5-plus-02-15 | 12 | 79.8 | 82.0 | 81.0 | 74.3 | 81.7 |
| 9 | kimi-k2.5 | 12 | 79.7 | 81.7 | 77.0 | 76.0 | 84.0 |
| 10 | claude-opus-4.6-high | 12 | 79.3 | 83.3 | 72.3 | 76.7 | 85.0 |
| 11 | claude-sonnet-4.6 | 12 | 78.8 | 83.7 | 68.7 | 82.3 | 80.7 |
| 12 | gpt-5.4 | 12 | 78.8 | 80.0 | 76.0 | 80.3 | 78.7 |
| 13 | gemini-3-flash-preview | 12 | 78.5 | 83.0 | 72.7 | 75.0 | 83.3 |
| 14 | grok-4.20-beta | 12 | 78.4 | 81.7 | 77.7 | 69.3 | 85.0 |
| 15 | deepseek-v3.2 | 12 | 78.1 | 80.7 | 81.7 | 70.7 | 79.3 |
| 16 | gpt-5.5-pro | 12 | 77.5 | 64.3 | 82.0 | 81.7 | 82.0 |
| 17 | gpt-5.4-mini | 12 | 76.2 | 81.0 | 73.3 | 75.0 | 75.3 |
| 18 | claude-opus-4.6 | 12 | 76.0 | 80.7 | 68.3 | 70.7 | 84.3 |
| 19 | gpt-5.4-nano | 12 | 75.6 | 82.3 | 66.3 | 77.0 | 76.7 |
| 20 | gemini-3.1-pro-preview | 12 | 72.5 | 82.0 | 46.7 | 82.0 | 79.3 |
| 21 | gemini-3.5-flash-high | 12 | 71.9 | 79.7 | 59.3 | 76.0 | 72.7 |
| 22 | glm-5 | 12 | 64.5 | 75.7 | 38.0 | 78.0 | 66.3 |
| 23 | minimax-m2.7 | 12 | 46.2 | 33.3 | 45.7 | 60.7 | 45.3 |

Full leaderboard

Quality, cost, and speed

| Model | Score | Rating | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|-------|-------|--------|-------|-----------------|--------------------|-------------------|
| claude-opus-4.7 | 84.5 | Strong | 12/12 | $0.0515 | 39.7s | - |
| gpt-5.5 | 83.08 | Strong | 12/12 | $0.0672 | 47.5s | - |
| claude-opus-4.8 | 82.83 | Strong | 12/12 | $0.0483 | 33.7s | - |
| claude-opus-4.8-low | 82.67 | Strong | 12/12 | $0.0510 | 33.7s | - |
| qwen3.7-max | 82.25 | Strong | 12/12 | $0.0159 | 65.1s | Incomplete output |
| claude-opus-4.8-high | 82.08 | Strong | 12/12 | $0.0490 | 33.1s | Incomplete output |
| glm-5.1 | 79.83 | Usable | 12/12 | $0.0155 | 67.0s | Incomplete output, Missing required element |
| qwen3.5-plus-02-15 | 79.75 | Usable | 12/12 | $0.0156 | 62.0s | Unsupported invention |
| kimi-k2.5 | 79.67 | Usable | 12/12 | $0.0143 | 64.0s | Unsupported invention, Incomplete output, Missing required element |
| claude-opus-4.6-high | 79.33 | Usable | 12/12 | $0.0581 | 60.7s | Incomplete output, Missing required element, Unsupported invention |
| claude-sonnet-4.6 | 78.83 | Usable | 12/12 | $0.0382 | 47.7s | Incomplete output, Missing required element |
| gpt-5.4 | 78.75 | Usable | 12/12 | $0.0342 | 34.8s | Wrapper text |
| gemini-3-flash-preview | 78.5 | Usable | 12/12 | $0.0188 | 20.7s | Unsupported invention, Incomplete output |
| grok-4.20-beta | 78.42 | Usable | 12/12 | $0.0158 | 20.0s | Unsupported invention |
| deepseek-v3.2 | 78.08 | Usable | 12/12 | $0.0148 | 57.6s | Unsupported invention |
| gpt-5.5-pro | 77.5 | Usable | 12/12 | $0.4846 | 81.1s | Incomplete output, Missing required element, Unsupported invention |
| gpt-5.4-mini | 76.17 | Usable | 12/12 | $0.0191 | 19.2s | Wrapper text |
| claude-opus-4.6 | 76.0 | Usable | 12/12 | $0.0602 | 60.2s | Incomplete output, Unsupported invention, Missing required element |
| gpt-5.4-nano | 75.58 | Usable | 12/12 | $0.0178 | 21.3s | Wrapper text |
| gemini-3.1-pro-preview | 72.5 | Usable | 12/12 | $0.0374 | 31.1s | Incomplete output, Missing required element |
| gemini-3.5-flash-high | 71.92 | Usable | 12/12 | $0.0315 | 20.8s | Incomplete output, Missing required element, Unsupported invention |
| glm-5 | 64.5 | Needs editing | 12/12 | $0.0120 | 61.2s | Incomplete output, Missing required element, Wrapper text |
| minimax-m2.7 | 46.25 | Weak | 12/12 | $0.0130 | 52.2s | Incomplete output, Missing required element, Unsupported invention, Wrapper text |

Test cases

Where the scores come from

Each row is one prompt, showing its score distribution, the top and lowest performers, and the problems judges flagged most often.

| Test | ID | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|------|----|-----------|-----|-----|-----|-----------|--------------|-------------------|
| Enterprise self-serve fantasy | pitch_reality_001 | Founder Reality Check | 79.1 | 86.0 | 52.0 | claude-opus-4.7 · 86 | gpt-5.5-pro · 52 | Incomplete output ×3, Wrapper text ×2, Missing required element ×1 |
| Marketplace with no supply strategy | pitch_reality_002 | Founder Reality Check | 78.9 | 85.0 | 10.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 10 | Incomplete output ×3, Missing required element ×1 |
| Consumer app with no retention data | pitch_reality_003 | Founder Reality Check | 78.8 | 84.0 | 10.0 | gemini-3-flash-preview · 84 | minimax-m2.7 · 10 | Incomplete output ×2, Missing required element ×1, Malformed output ×1 |
| AI customer support startup | pitch_questions_001 | Investor Question Test | 74.3 | 86.0 | 41.0 | glm-5.1 · 86 | minimax-m2.7 · 41 | Incomplete output ×10, Missing required element ×6, Wrapper text ×2 |
| Vertical SaaS for dentists | pitch_questions_002 | Investor Question Test | 70.6 | 85.0 | 34.0 | qwen3.7-max · 85 | glm-5 · 34 | Incomplete output ×9, Missing required element ×7, Wrapper text ×3 |
| AI education app for parents | pitch_questions_003 | Investor Question Test | 71.5 | 84.0 | 18.0 | claude-opus-4.7 · 84 | glm-5 · 18 | Incomplete output ×8, Missing required element ×4, Unsupported invention ×1 |
| Weak problem slide | pitch_deck_001 | Deck Doctor | 77.4 | 86.0 | 46.0 | claude-opus-4.7 · 86 | deepseek-v3.2 · 46 | Unsupported invention ×6 |
| Vague solution slide | pitch_deck_002 | Deck Doctor | 79.7 | 86.0 | 60.0 | gpt-5.5-pro · 86 | minimax-m2.7 · 60 | Unsupported invention ×5, Wrapper text ×1 |
| Overstuffed traction slide | pitch_deck_003 | Deck Doctor | 75.5 | 86.0 | 49.0 | kimi-k2.5 · 86 | minimax-m2.7 · 49 | Unsupported invention ×5 |
| Lazy TAM slide | pitch_market_001 | Market Sizing Reality Check | 81.4 | 85.0 | 61.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 61 | - |
| Global education market TAM | pitch_market_002 | Market Sizing Reality Check | 77.0 | 86.0 | 44.0 | grok-4.20-beta · 86 | glm-5 · 44 | Incomplete output ×4, Missing required element ×1 |
| Creator economy TAM | pitch_market_003 | Market Sizing Reality Check | 78.7 | 86.0 | 28.0 | claude-opus-4.6-high · 86 | minimax-m2.7 · 28 | Incomplete output ×5, Missing required element ×1 |

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.7

12 scored tests · Strong

84.5

Highest traits

investor fit 8.6
enterprise realism 8.57
tact 8.53
specificity 8.52
clarity 8.47

Lowest traits

concision 8.23
coverage 8.33
data needs 8.37
revised framing 8.37
question sharpness 8.37

gpt-5.5

12 scored tests · Strong

83.08

Highest traits

market realism 8.5
specificity 8.4
usefulness 8.4
tact 8.4
enterprise realism 8.4

Lowest traits

evidence quality 8.03
problem framing 8.1
revised framing 8.17
investor fit 8.23
assumption detection 8.23

claude-opus-4.8

12 scored tests · Strong

82.83

Highest traits

tact 8.47
enterprise realism 8.43
investor fit 8.4
investor realism 8.4
usefulness 8.37

Lowest traits

concision 8.07
data needs 8.1
problem framing 8.1
evidence quality 8.13
revised framing 8.17

claude-opus-4.8-low

12 scored tests · Strong

82.67

Highest traits

tact 8.37
specificity 8.36
data needs 8.33
market realism 8.33
enterprise realism 8.33

Lowest traits

evidence quality 8.1
concision 8.13
coverage 8.13
investor realism 8.17
alternative path quality 8.17

qwen3.7-max

12 scored tests · Strong

82.25

Highest traits

investor realism 8.5
question sharpness 8.47
tact 8.4
specificity 8.39
clarity 8.32

Lowest traits

usefulness 8.0
evidence quality 8.07
concision 8.1
alternative path quality 8.17
data needs 8.2

claude-opus-4.8-high

12 scored tests · Strong

82.08

Highest traits

tact 8.47
investor fit 8.4
clarity 8.38
enterprise realism 8.37
alternative path quality 8.33

Lowest traits

usefulness 7.87
coverage 7.97
question sharpness 8.07
data needs 8.13
revised framing 8.13

glm-5.1

12 scored tests · Usable

79.83

Highest traits

investor realism 8.43
investor fit 8.27
enterprise realism 8.27
market realism 8.23
specificity 8.22

Lowest traits

coverage 6.3
usefulness 6.67
revised framing 6.9
evidence quality 7.8
clarity 7.87

qwen3.5-plus-02-15

12 scored tests · Usable

79.75

Highest traits

enterprise realism 8.33
market realism 8.3
tact 8.27
alternative path quality 8.23
clarity 8.22

Lowest traits

evidence quality 6.0
investor fit 7.5
problem framing 7.67
question sharpness 7.97
data needs 8.0

kimi-k2.5

12 scored tests · Usable

79.67

Highest traits

market realism 8.5
revised framing 8.5
enterprise realism 8.37
data needs 8.23
clarity 8.2

Lowest traits

coverage 6.27
evidence quality 6.57
usefulness 6.63
investor fit 7.77
problem framing 7.8

claude-opus-4.6-high

12 scored tests · Usable

79.33

Highest traits

market realism 8.53
revised framing 8.5
clarity 8.43
data needs 8.4
tact 8.4

Lowest traits

usefulness 4.83
coverage 5.5
evidence quality 6.57
problem framing 7.57
investor fit 7.67

claude-sonnet-4.6

12 scored tests · Usable

78.83

Highest traits

tact 8.6
enterprise realism 8.47
investor fit 8.37
problem framing 8.3
assumption detection 8.3

Lowest traits

coverage 3.0
usefulness 4.5
question sharpness 7.73
investor realism 8.0
data needs 8.03

gpt-5.4

12 scored tests · Usable

78.75

Highest traits

enterprise realism 8.2
tact 8.17
usefulness 8.13
investor realism 8.13
concision 8.07

Lowest traits

data needs 7.8
market realism 7.83
revised framing 7.83
evidence quality 7.87
question sharpness 7.93

Compare models

Trait profile and test-by-test matchup

Pick two models to compare average trait scores and individual test outcomes from the published calibrated results.

Per-test comparison

Sorted by the biggest absolute score difference between the selected models.

Test Benchmark Delta
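The per-test ordering described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name and data layout are hypothetical, and the sample scores are the only two tests for which the test case table reports both a claude-opus-4.7 and a minimax-m2.7 score.

```python
def per_test_deltas(
    scores_a: dict[str, float], scores_b: dict[str, float]
) -> list[tuple[str, float]]:
    """Return (test_id, delta) pairs, largest absolute delta first.

    delta is score_a - score_b; only tests scored for both models are kept.
    Ties on absolute delta fall back to test id so the order is stable.
    """
    shared = scores_a.keys() & scores_b.keys()
    rows = [(test, scores_a[test] - scores_b[test]) for test in shared]
    return sorted(rows, key=lambda row: (-abs(row[1]), row[0]))


# Scores taken from the "Top model" and "Lowest model" columns of the test table.
opus_47 = {"pitch_market_001": 85, "pitch_reality_002": 85}
minimax_m27 = {"pitch_market_001": 61, "pitch_reality_002": 10}

for test_id, delta in per_test_deltas(opus_47, minimax_m27):
    print(f"{test_id}: {delta:+}")
```

Sorting on the absolute value keeps large gaps at the top regardless of which model is ahead, while the signed delta still shows the direction.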

Frequent problems

Where models break

Incomplete output 44
Missing required element 22
Unsupported invention 17
Wrapper text 9
Malformed output 1

Example outputs

Winner, middle, failure

winner

claude-opus-4.6-high · 86

The model provides an exceptionally realistic, expert-level critique with highly specific industry references (SignalFire, Tubular Labs, TubeBuddy), a precise breakdown of the GMV vs. SaaS fallacy, and a highly actionable revised bottom-up TAM.

# Critique: "1% of the Creator Economy" TAM Claim

## The Typical Claim
*"The creator economy is $250B+. We just need 1% = $2.5B addressable market."*

This is a textbook example of the "1% of China" fallacy dressed in creator economy clothing. Let me dismantle it layer by layer.

---

## Weaknesses

### 1. The Denominator is Wildly Inflated
- The "$250B creator economy" includes everything: OnlyFans payouts, Roblox developer earnings, Substack subscriptions, TikTok creator funds, Cameo, Patreon, merch platforms, and creator-focused venture investment itself. Your analytics tool touches almost none of these revenue streams.
- You don't serve "the creator economy." You serve **one platform** (YouTube), **one segment** (mid-tier), with **one value prop** (analytics + sponsorship workflow).

### 2. The Segment Math Doesn't Support It
- There are roughly **300,000-500,000** YouTube channels with 50k-500k subscribers globally.
- Many are inactive, non-English, hobbyist, or in niches where sponsorships are irrelevant (e.g., religious content, foreign-language repost channels, music lyric channels).
- Realistic serviceable channels might be **100,000-200,000**.
- At what ACV? A mid-tier YouTuber making $2k-$15k/month in total revenue is not paying $500/month for tooling. Realistic pricing: **$20-$80/month**, maybe $100-$200/month if sponsorship workflow genuinely closes deals.
- That gives you a **SAM of $24M-$192M/year**, not $2.5B.

### 3. "1% of X" Hides the Actual Go-to-Market Question
- The framing implies passive capture ("we just need...") when the real question is: **through what mechanism do you acquire and retain 100,000 creators who are notoriously fickle, tool-fatigued, and cost-sensitive?**
- Creators in this tier are underserved by sales teams but over-served by 

failure

minimax-m2.7 · 10

The model response is severely truncated, cutting off mid-sentence in the first section and completely missing the requested assumptions, realistic path, and tactful explanation.

## Quick Summary of the Plan
The founder is building a **local‑services marketplace** (e.g., plumbers, cleaners, tutors, handymen) that assumes **demand will automatically pull in supply**. The idea is:

1. **Build demand** (users looking for a service) through marketing, SEO, and community outreach.  
2. **Expect providers** (the “supply”) to join on their own because they see a growing pool of potential customers.  
3. **Avoid exclusive contracts** – providers can be on many platforms, and the market already has many incumbents (Thumbtack, HomeAdvisor, Angie’s List, Yelp, etc.).

The plan sounds elegant – “if you build it, they will come” – but local‑service marketplaces are notoriously **supply‑constrained**. Below is a structured review: **risks, assumptions to test, a realistic path forward, and a tactful explanation** of why the plan may need adjustment.

---

## 1. Risks

| Risk | Why It Matters | Typical Outcome if Unmitigated |
|------|----------------|--------------------------------|
| **Chicken‑and‑Egg Failure** | Without enough supply, demand sees poor selection → low conversion → high churn for both sides. | Marketplace never reaches liquidity; both sides leave. |
| **Supply‑Side Adoption Friction** | Providers are busy, risk‑averse, and already on other platforms. They need a clear value proposition beyond “you might get leads.” | Low provider activation, high drop‑out rate, poor supply density. |
| **Incumbent Advantage** | Established players have brand trust, SEO dominance, and large provider networks. | New entrant struggles to acquire demand organically; CAC rises. |
| **Trust & Quality Control** | Local services are high‑stakes (home entry, safety). Users need confidence in provider vetting. | Negative reviews, disputes, and liability claims

mid performer

qwen3.5-plus-02-15 · 82

The model provides highly realistic, actionable, and specific advice grounded in established marketplace theory, effectively challenging the founder's assumptions while maintaining a constructive and tactful tone.

Here is a review of the founder's plan, structured to be constructive while clearly highlighting the significant challenges associated with a "demand-first" strategy in a crowded local services market.

### 1. Tactful Explanation of the Core Issue
**The "Field of Dreams" Fallacy**
The current plan relies on the hypothesis that *demand creates supply*. In marketplace theory, this is often referred to as the "Field of Dreams" approach (*if you build it, they will come*). While this works for content platforms (like YouTube) or network effects where the product exists without the other side (like a messaging app), it is historically the highest-risk approach for **transactional local services marketplaces.**

In local services, the product *is* the supply. A customer searching for a plumber who finds no available providers does not have a "partial product experience"; they have a broken product. If they do not find a provider within minutes, they will leave for an incumbent (Thumbtack, Angi, Google) and likely never return.

**The "No Exclusivity" Challenge**
Because supply is non-exclusive, providers are already listed on multiple platforms. They have no incentive to monitor a new platform with low traffic. Without a guaranteed volume of high-quality leads, their attention will remain on the incumbents where they already have established reputations and workflows.

### 2. Key Risks
*   **Liquidity Failure (The Empty Restaurant):** If a customer posts a request and does not receive a bid/booking within a short window (e.g., 1 hour for urgent, 24 hours for non-urgent), trust is broken permanently. High bounce rates on the demand side will spike Customer Acquisition Costs (CAC) because paid traffic will not convert.
*   **Supply Churn:** If you do manage to onboard providers

Methodology

How scores are produced

Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, and the result is normalized to a 0-100 score.

LLM judge

A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
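One way the judge's trait scores could be turned into a 0-100 number is sketched below. The page does not publish its formula, so the unweighted mean multiplied by ten, the flat JSON shape, and the function name are all assumptions made for illustration.

```python
import json


def normalize_judge_reply(raw_json: str) -> float:
    """Parse a judge reply and normalize it to a 0-100 score.

    Assumes the reply is a flat JSON object mapping trait names to
    numbers in [0, 10]. The aggregation (unweighted mean times 10)
    is a guess, not the benchmark's documented formula.
    """
    traits = json.loads(raw_json)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(traits, dict) or not traits:
        raise ValueError("expected a non-empty JSON object of trait scores")
    for name, value in traits.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} must be a number from 0 to 10")
    return round(sum(traits.values()) / len(traits) * 10, 2)
```

Rejecting anything that is not a well-formed object of in-range numbers is what "strict" is taken to mean here; a production judge pipeline might instead retry the judge call.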

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
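A few of these checks can be sketched as plain string tests. This is a hypothetical illustration: the flag names, the character limit, and the section and phrase lists are assumptions, and format validity and safety checks are left out.

```python
def run_heuristics(
    text: str,
    required_sections: list[str],
    banned_phrases: list[str],
    max_chars: int = 6000,  # hypothetical limit
) -> list[str]:
    """Return deterministic flags for one model output.

    Covers only length, banned phrases, and required sections,
    using case-insensitive substring matching.
    """
    flags: list[str] = []
    lowered = text.lower()
    if len(text) > max_chars:
        flags.append("too_long")
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")
    for section in required_sections:
        if section.lower() not in lowered:
            flags.append(f"missing_section:{section}")
    return flags


# Example: an answer that stops before the requested "Assumptions" section.
print(run_heuristics("## Risks\nSupply is thin.", ["Risks", "Assumptions"], ["as an AI"]))
```

Because the checks are deterministic, the same output always produces the same flags, which makes them a useful cross-check on the judge's scores.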

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.