OpenAI · 9 models tested

Which OpenAI model should you use?

Q: What is the best OpenAI model?

GPT-5.5 — the highest average percentile (88) across the task areas we benchmarked, with 7 top-3 finishes.

Q: What is the best cheap OpenAI model?

GPT-5.4: within 5 percentile points of GPT-5.5 (83 vs. 88) at half the price, $17.50 per 1M tokens.

Q: Where does OpenAI lead every other lab?

An OpenAI model holds the outright #1 spot in 7 task areas: Chef / Home Cooking, Coding, Knowledge & Docs, Presentations & Decks, Product & Project Management, Research & Competitive Analysis, Structured Output.

Lineup: GPT-5.5 · GPT-5.4 · GPT-5.4 Mini · GPT-5.5 Pro

OpenAI's strongest model is GPT-5.5 (average percentile 88, top-3 in 7 of 22 task areas). On a budget, GPT-5.4 stays within 5 points (83) at $17.50 per 1M tokens.
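You can reproduce the "best cheap" pick from the Overall and Price columns of the lineup table with a simple rule: take the cheapest model whose overall score is within 5 points of the top scorer. Below is a minimal Python sketch. The rule and the `best_value` helper are hypothetical, because the page does not state exactly how it chooses its value model.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    overall: int          # mean percentile rank across the task areas tested
    price_per_1m: float   # USD per 1M tokens, as listed in the lineup table


LINEUP = [
    Model("GPT-5.5", 88, 35.0),
    Model("GPT-5.4", 83, 17.5),
    Model("GPT-5.4 Mini", 64, 5.25),
    Model("GPT-5.5 Pro", 62, 210.0),
    Model("GPT-5 Mini", 51, 2.25),
    Model("GPT-5.4 Nano", 23, 1.45),
    Model("o1", 7, 75.0),
    Model("GPT-4o", 2, 12.5),
    Model("GPT-4o Mini", 0, 0.75),
]


def best_value(models: list[Model], max_gap: int = 5) -> Model:
    """Cheapest model whose overall score is within max_gap points of the best.

    Hypothetical selection rule for illustration only.
    """
    top = max(m.overall for m in models)
    contenders = [m for m in models if top - m.overall <= max_gap]
    return min(contenders, key=lambda m: m.price_per_1m)


pick = best_value(LINEUP)
print(pick.name, pick.overall, pick.price_per_1m)  # GPT-5.4 83 17.5
```

If you widen `max_gap` to 30, the pick becomes GPT-5.4 Mini. That shows how sensitive a "value" label is to the tolerance you choose.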

The OpenAI lineup, ranked

Model	Task areas tested	Overall (avg percentile)	Code & data	Writing & comms	Business & strategy	Creative & visual	AA Intel†	Top-3s	Price / 1M tokens
GPT-5.5 (Best overall)	22	88	86★	92★	82	96★	55	7	$35.00
GPT-5.4 (Best value)	22	83	82	87	82★	80	51	3	$17.50
GPT-5.4 Mini	22	64	73	75	61	47	40	2	$5.25
GPT-5.5 Pro	6	62	n/a	94	46	69	n/a	1	$210.00
GPT-5 Mini	22	51	68	36	53	53	n/a	1	$2.25
GPT-5.4 Nano	6	23	n/a	37	20	20	38	0	$1.45
o1	1	7	n/a	n/a	n/a	7	n/a	0	$75.00
GPT-4o	1	2	n/a	n/a	n/a	2	n/a	0	$12.50
GPT-4o Mini	1	0	n/a	n/a	n/a	0	n/a	0	$0.75

Area figures are mean percentile ranks, grouped into bands: 85+, 70–84, 55–69, 40–54 and below 40. ★ = the lineup's best in that area. †AA Intel = Artificial Analysis Intelligence Index, shown as a third-party reference only.
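Each area figure averages percentile ranks. The page does not publish its formula, so this sketch assumes one common definition. A model's percentile rank in a task area is the share of the other models it outscores, with ties counting half. An area figure is then the mean of those ranks over the task areas in that group. `percentile_rank`, `area_figure` and the sample scores are all hypothetical.

```python
from statistics import mean


def percentile_rank(score: float, field: list[float]) -> float:
    """Share (0-100) of the other models in `field` that `score` beats; ties count half.

    `field` holds every model's score in one task area, including this one.
    Assumed definition; the page does not publish its exact formula.
    """
    others = list(field)
    others.remove(score)  # drop this model's own entry (raises ValueError if absent)
    if not others:
        return 100.0
    wins = sum(1 for s in others if score > s)
    ties = sum(1 for s in others if s == score)
    return 100.0 * (wins + 0.5 * ties) / len(others)


def area_figure(model_scores: dict[str, float], fields: dict[str, list[float]]) -> float:
    """Mean percentile rank over the task areas in one group, e.g. 'Code & data'."""
    return mean(percentile_rank(model_scores[area], fields[area]) for area in model_scores)


# Hypothetical scores for five models in two task areas.
fields = {
    "Coding": [9.1, 8.0, 7.5, 6.0, 5.2],
    "Structured Output": [8.8, 9.0, 7.0, 6.5, 4.0],
}
scores = {"Coding": 9.1, "Structured Output": 8.8}
print(area_figure(scores, fields))  # 87.5
```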

Best OpenAI model, task by task

The lineup's strongest model in each task area, and where it ranks against every model we test, not just OpenAI's. #1 means it also beats every rival lab's models.

Task area	OpenAI's best model	Rank vs. all models	Rating
Chef / Home Cooking	GPT-5.5 Pro	#1 / 126	Strong
Coding	GPT-5.5	#1 / 115	Excellent
Knowledge & Docs	GPT-5.4	#1 / 107	Excellent
Presentations & Decks	GPT-5.4	#1 / 107	Excellent
Product & Project Management	GPT-5 Mini	#1 / 107	Excellent
Research & Competitive Analysis	GPT-5.4	#1 / 107	Excellent
Structured Output	GPT-5.4 Mini	#1 / 110	Excellent
Content & Brand	GPT-5.5	#2 / 124	Strong
Creative & Comedy	GPT-5.5	#2 / 107	Elo
Customer Support	GPT-5.5	#2 / 113	Strong
Translation & Localization	GPT-5.5	#3 / 107	Excellent
Landing Pages	GPT-5.5	#4 / 69	Strong
Investor & Pitch	GPT-5.4	#5 / 63	Strong
Summarization & Meeting Notes	GPT-5.5	#8 / 107	Excellent
Frontend & Landing Pages	GPT-5 Mini	#9 / 106	Needs editing
Legal & HR	GPT-5.5	#9 / 107	Excellent
Executive Assistant	GPT-5.5	#10 / 112	Strong
RAG, Safety & Grounding	GPT-5.5	#14 / 110	Excellent
Data & Analytics	GPT-5 Mini	#25 / 110	Excellent
AI Strategy	GPT-5.5 Pro	#27 / 126	Strong
Sales	GPT-5.4 Mini	#38 / 107	Usable
Training & Education	GPT-5.5	#59 / 107	Excellent

Where another lab clearly wins

Honesty corner: the three task areas where even OpenAI's best model ranks furthest from the top.

Training & Education · OpenAI's best: #59 of 107 (GPT-5.5)
Sales · OpenAI's best: #38 of 107 (GPT-5.4 Mini)
AI Strategy · OpenAI's best: #27 of 126 (GPT-5.5 Pro)
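Two results follow mechanically from the task-by-task table: the "outright #1 in 7 task areas" count and the three honesty-corner picks. This Python sketch copies that table and derives both. `outright_wins` and `furthest_from_top` are illustrative helpers, not part of any published tool.

```python
Row = tuple[str, str, int, int]  # (task area, OpenAI's best model, rank, models ranked)

ROWS: list[Row] = [
    ("Chef / Home Cooking", "GPT-5.5 Pro", 1, 126),
    ("Coding", "GPT-5.5", 1, 115),
    ("Knowledge & Docs", "GPT-5.4", 1, 107),
    ("Presentations & Decks", "GPT-5.4", 1, 107),
    ("Product & Project Management", "GPT-5 Mini", 1, 107),
    ("Research & Competitive Analysis", "GPT-5.4", 1, 107),
    ("Structured Output", "GPT-5.4 Mini", 1, 110),
    ("Content & Brand", "GPT-5.5", 2, 124),
    ("Creative & Comedy", "GPT-5.5", 2, 107),
    ("Customer Support", "GPT-5.5", 2, 113),
    ("Translation & Localization", "GPT-5.5", 3, 107),
    ("Landing Pages", "GPT-5.5", 4, 69),
    ("Investor & Pitch", "GPT-5.4", 5, 63),
    ("Summarization & Meeting Notes", "GPT-5.5", 8, 107),
    ("Frontend & Landing Pages", "GPT-5 Mini", 9, 106),
    ("Legal & HR", "GPT-5.5", 9, 107),
    ("Executive Assistant", "GPT-5.5", 10, 112),
    ("RAG, Safety & Grounding", "GPT-5.5", 14, 110),
    ("Data & Analytics", "GPT-5 Mini", 25, 110),
    ("AI Strategy", "GPT-5.5 Pro", 27, 126),
    ("Sales", "GPT-5.4 Mini", 38, 107),
    ("Training & Education", "GPT-5.5", 59, 107),
]


def outright_wins(rows: list[Row]) -> list[str]:
    """Task areas where OpenAI's best model ranks #1 against every lab."""
    return [area for area, _model, rank, _field in rows if rank == 1]


def furthest_from_top(rows: list[Row], k: int = 3, normalized: bool = False) -> list[str]:
    """The k task areas where OpenAI's best model ranks lowest.

    By default this sorts on raw rank, which reproduces the honesty-corner picks;
    normalized=True sorts on rank / models ranked to account for field size.
    """
    def badness(row: Row) -> float:
        _area, _model, rank, field = row
        return rank / field if normalized else rank

    worst_first = sorted(rows, key=badness, reverse=True)
    return [area for area, _model, _rank, _field in worst_first[:k]]


print(len(outright_wins(ROWS)))  # 7
print(furthest_from_top(ROWS))  # ['Training & Education', 'Sales', 'AI Strategy']
```

The choice of sort key matters here. If you normalize by field size, AI Strategy (27 of 126) drops out and Data & Analytics (25 of 110) takes its place, because it sits relatively further from the top.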


This page is Spring Prompt in action.

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004 per run · 1.8 s latency
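The mockup's 3×3 grid reads row by row as prompt versions and column by column as Claude Opus, GPT-5 and Gemini. Picking the best combo then means scanning every cell for the highest quality score. In this minimal Python sketch, `best_combo` is a hypothetical helper. Only "v3" is named on the page, so labeling the first two rows "v1" and "v2" is an assumption.

```python
# Quality scores from the mockup: rows are prompt versions, columns are models.
# Only "v3" is named on the page; "v1" and "v2" are assumed labels.
MODELS = ["Claude Opus", "GPT-5", "Gemini"]
GRID = {
    "v1": [7.1, 6.8, 7.4],
    "v2": [8.3, 7.9, 8.0],
    "v3": [9.2, 8.6, 8.4],
}


def best_combo(grid: dict[str, list[float]], models: list[str]) -> tuple[str, str, float]:
    """Return (prompt version, model, score) for the highest-scoring cell."""
    cells = (
        (version, model, score)
        for version, row in grid.items()
        for model, score in zip(models, row)
    )
    return max(cells, key=lambda cell: cell[2])


print(best_combo(GRID, MODELS))  # ('v3', 'Claude Opus', 9.2)
```

This ranks on quality alone. Folding in cost and latency would need a weighting, which the page does not specify.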