business benchmark collection

Investor & Pitch

Benchmarks for testing whether models can improve startup pitches, critique weak claims, and anticipate investor concerns.

Which models can make startup pitches clearer, more credible, and harder to pick apart?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.7

84.5

Lowest cost / eval

glm-5

$0.0120

Median rank score

78.75

Last refresh

2026-06-02

Score vs. cost

Average task cost vs. overall score

Each dot is one model. The x-axis is the average cost per benchmark task, including model and judge cost; the y-axis is the average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across shades of green, and anything below the top ten is red.

Rank	Model	Scored tests	Overall	Founder Reality Check	Investor Question Test	Deck Doctor	Market Sizing Reality Check
1	claude-opus-4.7	12	84.5	84.7	84.0	85.0	84.3
2	gpt-5.5	12	83.1	83.0	83.7	82.0	83.7
3	claude-opus-4.8	12	82.8	83.3	83.3	82.3	82.3
4	claude-opus-4.8-low	12	82.7	82.7	82.3	82.3	83.3
5	qwen3.7-max	12	82.2	82.7	81.0	82.3	83.0
6	claude-opus-4.8-high	12	82.1	83.7	79.7	82.7	82.3
7	glm-5.1	12	79.8	81.0	78.0	81.3	79.0
8	qwen3.5-plus-02-15	12	79.8	82.0	81.0	74.3	81.7
9	kimi-k2.5	12	79.7	81.7	77.0	76.0	84.0
10	claude-opus-4.6-high	12	79.3	83.3	72.3	76.7	85.0
11	claude-sonnet-4.6	12	78.8	83.7	68.7	82.3	80.7
12	gpt-5.4	12	78.8	80.0	76.0	80.3	78.7
13	gemini-3-flash-preview	12	78.5	83.0	72.7	75.0	83.3
14	grok-4.20-beta	12	78.4	81.7	77.7	69.3	85.0
15	deepseek-v3.2	12	78.1	80.7	81.7	70.7	79.3
16	gpt-5.5-pro	12	77.5	64.3	82.0	81.7	82.0
17	gpt-5.4-mini	12	76.2	81.0	73.3	75.0	75.3
18	claude-opus-4.6	12	76.0	80.7	68.3	70.7	84.3
19	gpt-5.4-nano	12	75.6	82.3	66.3	77.0	76.7
20	gemini-3.1-pro-preview	12	72.5	82.0	46.7	82.0	79.3
21	gemini-3.5-flash-high	12	71.9	79.7	59.3	76.0	72.7
22	glm-5	12	64.5	75.7	38.0	78.0	66.3
23	minimax-m2.7	12	46.2	33.3	45.7	60.7	45.3
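
The Overall column is consistent with a plain mean of the four benchmark columns (for claude-opus-4.7: (84.7 + 84.0 + 85.0 + 84.3) / 4 = 84.5), and the heatmap coloring depends only on a model's rank within each benchmark. A minimal Python sketch of both, using three rows from the table. The averaging formula is inferred from the published numbers, and the function names and the `"green"`/`"red"` labels are illustrative assumptions rather than the site's actual code:

```python
from statistics import mean

# Three rows copied from the heatmap table (published benchmark scores).
SCORES = {
    "claude-opus-4.7": {
        "Founder Reality Check": 84.7,
        "Investor Question Test": 84.0,
        "Deck Doctor": 85.0,
        "Market Sizing Reality Check": 84.3,
    },
    "gpt-5.5-pro": {
        "Founder Reality Check": 64.3,
        "Investor Question Test": 82.0,
        "Deck Doctor": 81.7,
        "Market Sizing Reality Check": 82.0,
    },
    "glm-5": {
        "Founder Reality Check": 75.7,
        "Investor Question Test": 38.0,
        "Deck Doctor": 78.0,
        "Market Sizing Reality Check": 66.3,
    },
}


def overall(model: str) -> float:
    """Mean of the four benchmark scores, rounded to one decimal (assumed formula)."""
    return round(mean(SCORES[model].values()), 1)


def rank_within(benchmark: str) -> dict[str, int]:
    """Rank the models on a single benchmark, where 1 is the best score."""
    ordered = sorted(SCORES, key=lambda m: SCORES[m][benchmark], reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def color_bucket(rank: int, top_n: int = 10) -> str:
    """Coarse heatmap bucket: green for the top ten, red below (labels are hypothetical)."""
    return "green" if rank <= top_n else "red"


for name in SCORES:
    print(name, overall(name))
```

With only three rows loaded, the ranks are relative to this subset; loading all 23 rows would reproduce the per-benchmark ranks that drive the cell colors.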

Full leaderboard

Quality, cost, and speed

Model	Score	Rating	Tests	Avg cost / task	Avg seconds / task	Frequent problems
claude-opus-4.7	84.5	Strong	12/12	$0.0515	39.7s	-
gpt-5.5	83.08	Strong	12/12	$0.0672	47.5s	-
claude-opus-4.8	82.83	Strong	12/12	$0.0483	33.7s	-
claude-opus-4.8-low	82.67	Strong	12/12	$0.0510	33.7s	-
qwen3.7-max	82.25	Strong	12/12	$0.0159	65.1s	Incomplete output
claude-opus-4.8-high	82.08	Strong	12/12	$0.0490	33.1s	Incomplete output
glm-5.1	79.83	Usable	12/12	$0.0155	67.0s	Incomplete output, Missing required element
qwen3.5-plus-02-15	79.75	Usable	12/12	$0.0156	62.0s	Unsupported invention
kimi-k2.5	79.67	Usable	12/12	$0.0143	64.0s	Unsupported invention, Incomplete output, Missing required element
claude-opus-4.6-high	79.33	Usable	12/12	$0.0581	60.7s	Incomplete output, Missing required element, Unsupported invention
claude-sonnet-4.6	78.83	Usable	12/12	$0.0382	47.7s	Incomplete output, Missing required element
gpt-5.4	78.75	Usable	12/12	$0.0342	34.8s	Wrapper text
gemini-3-flash-preview	78.5	Usable	12/12	$0.0188	20.7s	Unsupported invention, Incomplete output
grok-4.20-beta	78.42	Usable	12/12	$0.0158	20.0s	Unsupported invention
deepseek-v3.2	78.08	Usable	12/12	$0.0148	57.6s	Unsupported invention
gpt-5.5-pro	77.5	Usable	12/12	$0.4846	81.1s	Incomplete output, Missing required element, Unsupported invention
gpt-5.4-mini	76.17	Usable	12/12	$0.0191	19.2s	Wrapper text
claude-opus-4.6	76.0	Usable	12/12	$0.0602	60.2s	Incomplete output, Unsupported invention, Missing required element
gpt-5.4-nano	75.58	Usable	12/12	$0.0178	21.3s	Wrapper text
gemini-3.1-pro-preview	72.5	Usable	12/12	$0.0374	31.1s	Incomplete output, Missing required element
gemini-3.5-flash-high	71.92	Usable	12/12	$0.0315	20.8s	Incomplete output, Missing required element, Unsupported invention
glm-5	64.5	Needs editing	12/12	$0.0120	61.2s	Incomplete output, Missing required element, Wrapper text
minimax-m2.7	46.25	Weak	12/12	$0.0130	52.2s	Incomplete output, Missing required element, Unsupported invention, Wrapper text
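
One way to read the score and cost columns together is to ask which models are not beaten on both at once, that is, no other model is both cheaper and higher scoring. The page does not publish such a list; the sketch below is an added illustration in Python that applies a standard Pareto-frontier sweep to the score and average cost per task from the leaderboard. The function name `pareto_frontier` is hypothetical.

```python
# (model, score, average cost per task in USD), copied from the leaderboard table.
ROWS = [
    ("claude-opus-4.7", 84.5, 0.0515),
    ("gpt-5.5", 83.08, 0.0672),
    ("claude-opus-4.8", 82.83, 0.0483),
    ("claude-opus-4.8-low", 82.67, 0.0510),
    ("qwen3.7-max", 82.25, 0.0159),
    ("claude-opus-4.8-high", 82.08, 0.0490),
    ("glm-5.1", 79.83, 0.0155),
    ("qwen3.5-plus-02-15", 79.75, 0.0156),
    ("kimi-k2.5", 79.67, 0.0143),
    ("claude-opus-4.6-high", 79.33, 0.0581),
    ("claude-sonnet-4.6", 78.83, 0.0382),
    ("gpt-5.4", 78.75, 0.0342),
    ("gemini-3-flash-preview", 78.5, 0.0188),
    ("grok-4.20-beta", 78.42, 0.0158),
    ("deepseek-v3.2", 78.08, 0.0148),
    ("gpt-5.5-pro", 77.5, 0.4846),
    ("gpt-5.4-mini", 76.17, 0.0191),
    ("claude-opus-4.6", 76.0, 0.0602),
    ("gpt-5.4-nano", 75.58, 0.0178),
    ("gemini-3.1-pro-preview", 72.5, 0.0374),
    ("gemini-3.5-flash-high", 71.92, 0.0315),
    ("glm-5", 64.5, 0.0120),
    ("minimax-m2.7", 46.25, 0.0130),
]


def pareto_frontier(rows):
    """Return models that no other model beats on both cost and score.

    Sweep from cheapest to most expensive and keep a model only if it
    scores higher than every cheaper model seen so far.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; on a cost tie, consider the higher score first.
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


print(pareto_frontier(ROWS))
```

On these rows the sweep keeps glm-5, kimi-k2.5, glm-5.1, qwen3.7-max, claude-opus-4.8, and claude-opus-4.7, ordered from cheapest to most expensive. It ignores latency and the frequent-problem flags, so it is a reading aid rather than a recommendation.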

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

Test	Test ID	Benchmark	Avg	Max	Min	Top model	Lowest model	Frequent problems
Enterprise self-serve fantasy	pitch_reality_001	Founder Reality Check	79.1	86.0	52.0	claude-opus-4.7 · 86	gpt-5.5-pro · 52	Incomplete output ×3, Wrapper text ×2, Missing required element ×1
Marketplace with no supply strategy	pitch_reality_002	Founder Reality Check	78.9	85.0	10.0	claude-opus-4.7 · 85	minimax-m2.7 · 10	Incomplete output ×3, Missing required element ×1
Consumer app with no retention data	pitch_reality_003	Founder Reality Check	78.8	84.0	10.0	gemini-3-flash-preview · 84	minimax-m2.7 · 10	Incomplete output ×2, Missing required element ×1, Malformed output ×1
AI customer support startup	pitch_questions_001	Investor Question Test	74.3	86.0	41.0	glm-5.1 · 86	minimax-m2.7 · 41	Incomplete output ×10, Missing required element ×6, Wrapper text ×2
Vertical SaaS for dentists	pitch_questions_002	Investor Question Test	70.6	85.0	34.0	qwen3.7-max · 85	glm-5 · 34	Incomplete output ×9, Missing required element ×7, Wrapper text ×3
AI education app for parents	pitch_questions_003	Investor Question Test	71.5	84.0	18.0	claude-opus-4.7 · 84	glm-5 · 18	Incomplete output ×8, Missing required element ×4, Unsupported invention ×1
Weak problem slide	pitch_deck_001	Deck Doctor	77.4	86.0	46.0	claude-opus-4.7 · 86	deepseek-v3.2 · 46	Unsupported invention ×6
Vague solution slide	pitch_deck_002	Deck Doctor	79.7	86.0	60.0	gpt-5.5-pro · 86	minimax-m2.7 · 60	Unsupported invention ×5, Wrapper text ×1
Overstuffed traction slide	pitch_deck_003	Deck Doctor	75.5	86.0	49.0	kimi-k2.5 · 86	minimax-m2.7 · 49	Unsupported invention ×5
Lazy TAM slide	pitch_market_001	Market Sizing Reality Check	81.4	85.0	61.0	claude-opus-4.7 · 85	minimax-m2.7 · 61	-
Global education market TAM	pitch_market_002	Market Sizing Reality Check	77.0	86.0	44.0	grok-4.20-beta · 86	glm-5 · 44	Incomplete output ×4, Missing required element ×1
Creator economy TAM	pitch_market_003	Market Sizing Reality Check	78.7	86.0	28.0	claude-opus-4.6-high · 86	minimax-m2.7 · 28	Incomplete output ×5, Missing required element ×1

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.7

12 scored tests · Strong

84.5

Highest traits

investor fit 8.6

enterprise realism 8.57

tact 8.53

specificity 8.52

clarity 8.47

Lowest traits

concision 8.23

coverage 8.33

data needs 8.37

revised framing 8.37

question sharpness 8.37

gpt-5.5

12 scored tests · Strong

83.08

Highest traits

market realism 8.5

specificity 8.4

usefulness 8.4

tact 8.4

enterprise realism 8.4

Lowest traits

evidence quality 8.03

problem framing 8.1

revised framing 8.17

investor fit 8.23

assumption detection 8.23

claude-opus-4.8

12 scored tests · Strong

82.83

Highest traits

tact 8.47

enterprise realism 8.43

investor fit 8.4

investor realism 8.4

usefulness 8.37

Lowest traits

concision 8.07

data needs 8.1

problem framing 8.1

evidence quality 8.13

revised framing 8.17

claude-opus-4.8-low

12 scored tests · Strong

82.67

Highest traits

tact 8.37

specificity 8.36

data needs 8.33

market realism 8.33

enterprise realism 8.33

Lowest traits

evidence quality 8.1

concision 8.13

coverage 8.13

investor realism 8.17

alternative path quality 8.17

qwen3.7-max

12 scored tests · Strong

82.25

Highest traits

investor realism 8.5

question sharpness 8.47

tact 8.4

specificity 8.39

clarity 8.32

Lowest traits

usefulness 8.0

evidence quality 8.07

concision 8.1

alternative path quality 8.17

data needs 8.2

claude-opus-4.8-high

12 scored tests · Strong

82.08

Highest traits

tact 8.47

investor fit 8.4

clarity 8.38

enterprise realism 8.37

alternative path quality 8.33

Lowest traits

usefulness 7.87

coverage 7.97

question sharpness 8.07

data needs 8.13

revised framing 8.13

glm-5.1

12 scored tests · Usable

79.83

Highest traits

investor realism 8.43

investor fit 8.27

enterprise realism 8.27

market realism 8.23

specificity 8.22

Lowest traits

coverage 6.3

usefulness 6.67

revised framing 6.9

evidence quality 7.8

clarity 7.87

qwen3.5-plus-02-15

12 scored tests · Usable

79.75

Highest traits

enterprise realism 8.33

market realism 8.3

tact 8.27

alternative path quality 8.23

clarity 8.22

Lowest traits

evidence quality 6.0

investor fit 7.5

problem framing 7.67

question sharpness 7.97

data needs 8.0

kimi-k2.5

12 scored tests · Usable

79.67

Highest traits

market realism 8.5

revised framing 8.5

enterprise realism 8.37

data needs 8.23

clarity 8.2

Lowest traits

coverage 6.27

evidence quality 6.57

usefulness 6.63

investor fit 7.77

problem framing 7.8

claude-opus-4.6-high

12 scored tests · Usable

79.33

Highest traits

market realism 8.53

revised framing 8.5

clarity 8.43

data needs 8.4

tact 8.4

Lowest traits

usefulness 4.83

coverage 5.5

evidence quality 6.57

problem framing 7.57

investor fit 7.67

claude-sonnet-4.6

12 scored tests · Usable

78.83

Highest traits

tact 8.6

enterprise realism 8.47

investor fit 8.37

problem framing 8.3

assumption detection 8.3

Lowest traits

coverage 3.0

usefulness 4.5

question sharpness 7.73

investor realism 8.0

data needs 8.03

gpt-5.4

12 scored tests · Usable

78.75

Highest traits

enterprise realism 8.2

tact 8.17

usefulness 8.13

investor realism 8.13

concision 8.07

Lowest traits

data needs 7.8

market realism 7.83

revised framing 7.83

evidence quality 7.87

question sharpness 7.93

Frequent problems

Where models break

Incomplete output 44

Missing required element 22

Unsupported invention 17

Wrapper text 9

Malformed output 1

Example outputs

Winner, middle, failure

winner

claude-opus-4.6-high · 86

The model provides an exceptionally realistic, expert-level critique with highly specific industry references (SignalFire, Tubular Labs, TubeBuddy), a precise breakdown of the GMV vs. SaaS fallacy, and a highly actionable revised bottom-up TAM.

# Critique: "1% of the Creator Economy" TAM Claim

## The Typical Claim
*"The creator economy is $250B+. We just need 1% = $2.5B addressable market."*

This is a textbook example of the "1% of China" fallacy dressed in creator economy clothing. Let me dismantle it layer by layer.

---

## Weaknesses

### 1. The Denominator is Wildly Inflated
- The "$250B creator economy" includes everything: OnlyFans payouts, Roblox developer earnings, Substack subscriptions, TikTok creator funds, Cameo, Patreon, merch platforms, and creator-focused venture investment itself. Your analytics tool touches almost none of these revenue streams.
- You don't serve "the creator economy." You serve **one platform** (YouTube), **one segment** (mid-tier), with **one value prop** (analytics + sponsorship workflow).

### 2. The Segment Math Doesn't Support It
- There are roughly **300,000-500,000** YouTube channels with 50k-500k subscribers globally.
- Many are inactive, non-English, hobbyist, or in niches where sponsorships are irrelevant (e.g., religious content, foreign-language repost channels, music lyric channels).
- Realistic serviceable channels might be **100,000-200,000**.
- At what ACV? A mid-tier YouTuber making $2k-$15k/month in total revenue is not paying $500/month for tooling. Realistic pricing: **$20-$80/month**, maybe $100-$200/month if sponsorship workflow genuinely closes deals.
- That gives you a **SAM of $24M-$192M/year**, not $2.5B.

### 3. "1% of X" Hides the Actual Go-to-Market Question
- The framing implies passive capture ("we just need...") when the real question is: **through what mechanism do you acquire and retain 100,000 creators who are notoriously fickle, tool-fatigued, and cost-sensitive?**
- Creators in this tier are underserved by sales teams but over-served by

failure

minimax-m2.7 · 10

The model response is severely truncated, cutting off mid-sentence in the first section and completely missing the requested assumptions, realistic path, and tactful explanation.

## Quick Summary of the Plan
The founder is building a **local‑services marketplace** (e.g., plumbers, cleaners, tutors, handymen) that assumes **demand will automatically pull in supply**. The idea is:

1. **Build demand** (users looking for a service) through marketing, SEO, and community outreach.  
2. **Expect providers** (the “supply”) to join on their own because they see a growing pool of potential customers.  
3. **Avoid exclusive contracts** – providers can be on many platforms, and the market already has many incumbents (Thumbtack, HomeAdvisor, Angie’s List, Yelp, etc.).

The plan sounds elegant – “if you build it, they will come” – but local‑service marketplaces are notoriously **supply‑constrained**. Below is a structured review: **risks, assumptions to test, a realistic path forward, and a tactful explanation** of why the plan may need adjustment.

---

## 1. Risks

| Risk | Why It Matters | Typical Outcome if Unmitigated |
|------|----------------|--------------------------------|
| **Chicken‑and‑Egg Failure** | Without enough supply, demand sees poor selection → low conversion → high churn for both sides. | Marketplace never reaches liquidity; both sides leave. |
| **Supply‑Side Adoption Friction** | Providers are busy, risk‑averse, and already on other platforms. They need a clear value proposition beyond “you might get leads.” | Low provider activation, high drop‑out rate, poor supply density. |
| **Incumbent Advantage** | Established players have brand trust, SEO dominance, and large provider networks. | New entrant struggles to acquire demand organically; CAC rises. |
| **Trust & Quality Control** | Local services are high‑stakes (home entry, safety). Users need confidence in provider vetting. | Negative reviews, disputes, and liability claims

mid performer

qwen3.5-plus-02-15 · 82

The model provides highly realistic, actionable, and specific advice grounded in established marketplace theory, effectively challenging the founder's assumptions while maintaining a constructive and tactful tone.

Here is a review of the founder's plan, structured to be constructive while clearly highlighting the significant challenges associated with a "demand-first" strategy in a crowded local services market.

### 1. Tactful Explanation of the Core Issue
**The "Field of Dreams" Fallacy**
The current plan relies on the hypothesis that *demand creates supply*. In marketplace theory, this is often referred to as the "Field of Dreams" approach (*if you build it, they will come*). While this works for content platforms (like YouTube) or network effects where the product exists without the other side (like a messaging app), it is historically the highest-risk approach for **transactional local services marketplaces.**

In local services, the product *is* the supply. A customer searching for a plumber who finds no available providers does not have a "partial product experience"; they have a broken product. If they do not find a provider within minutes, they will leave for an incumbent (Thumbtack, Angi, Google) and likely never return.

**The "No Exclusivity" Challenge**
Because supply is non-exclusive, providers are already listed on multiple platforms. They have no incentive to monitor a new platform with low traffic. Without a guaranteed volume of high-quality leads, their attention will remain on the incumbents where they already have established reputations and workflows.

### 2. Key Risks
* **Liquidity Failure (The Empty Restaurant):** If a customer posts a request and does not receive a bid/booking within a short window (e.g., 1 hour for urgent, 24 hours for non-urgent), trust is broken permanently. High bounce rates on the demand side will spike Customer Acquisition Costs (CAC) because paid traffic will not convert.
* **Supply Churn:** If you do manage to onboard providers

Methodology

How scores are produced

Each model output is scored by an LLM judge that must return strict JSON, backed by deterministic heuristics, and the result is then normalized to a 0-100 score.

LLM judge

A stable judge model scores each benchmark-specific trait from 0 to 10.

Heuristics

Deterministic checks cover length, banned phrases, required sections, format validity, and safety flags.

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.
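
The exact aggregation formula is not published here. As a rough Python sketch under stated assumptions, the 0-10 trait scores could be averaged, scaled to 0-100, and reduced by a fixed penalty for each failed deterministic check. The function name `normalize`, the `penalty_per_check` parameter, and its default of 5 points are all hypothetical and chosen only for illustration:

```python
from statistics import mean


def normalize(trait_scores, failed_checks=(), penalty_per_check=5.0):
    """Map 0-10 judge trait scores to a 0-100 score (hypothetical formula).

    trait_scores:  dict of trait name -> judge score in [0, 10]
    failed_checks: names of deterministic checks that failed
    """
    if not trait_scores:
        raise ValueError("need at least one trait score")
    for trait, value in trait_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{trait} score {value} is outside 0-10")

    # Average the traits and rescale from the 0-10 range to 0-100.
    base = mean(trait_scores.values()) * 10

    # Assumed: each failed heuristic subtracts a flat penalty.
    penalized = base - penalty_per_check * len(failed_checks)

    # Clamp so the final score always stays within 0-100.
    return max(0.0, min(100.0, penalized))


print(normalize({"tact": 8.0, "clarity": 9.0}))
print(normalize({"tact": 8.0, "clarity": 9.0}, failed_checks=["wrapper_text"]))
```

A flat penalty is only one option; a real pipeline might instead cap the score when output is truncated, which would fit the very low scores seen on some incomplete responses.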