42 models · 22 task areas · 37,277 graded runs

AI model leaderboard

Q: What is the best AI model overall right now?

GPT-5.5 leads our leaderboard with an average percentile of 88 across 22 task areas, finishing top-3 in 7 of them.

Q: What is the best value AI model?

GLM 5.2: within 13 percentile points of the leader at $3.93/1M tokens vs $35 for GPT-5.5.

Q: What is the best cheap AI model (under $5/1M tokens)?

Kimi K2.6 — average percentile 85 across 6 task areas at $4.07/1M tokens.

Q: Which model is best for code & data tasks?

Qwen3.7 Max leads the Code & data area group with an average percentile of 94 across those task areas.

Q: Which model is best for writing & comms tasks?

GPT-5.5 leads the Writing & comms area group with an average percentile of 92 across those task areas.

Q: Which model is best for business & strategy tasks?

Claude Opus 4.8 leads the Business & strategy area group with an average percentile of 90 across those task areas.

Q: Which model is best for creative & visual tasks?

GPT-5.5 leads the Creative & visual area group with an average percentile of 96 across those task areas.

Every model we test, ranked by how it actually performs on real work. Rankings are percentile-based across tasks, so no single benchmark dominates.

Podium

1. GPT-5.5 OpenAI

2. Claude Opus 4.8 Anthropic

3. Kimi K2.6 MoonshotAI

Current leader: GPT-5.5 — top-3 in 7 of 22 task areas.

Code & data

Qwen3.7 Max

avg percentile 94 in this area

Writing & comms

GPT-5.5

avg percentile 92 in this area

Business & strategy

Claude Opus 4.8

avg percentile 90 in this area

Creative & visual

GPT-5.5

avg percentile 96 in this area

#	Model	Overall	Code & data	Writing & comms	Business & strategy	Creative & visual	AA Intel†	Top-3s	Price / 1M tokens
1	GPT-5.5 OpenAI · 2 months ago	88	86	92★	82	96★	55	7	$35.0
2	Claude Opus 4.8 Anthropic · 1 month ago	87	84	87	90★	82	56	7	$30.0
3	Kimi K2.6 MoonshotAI	85	89	88	88	69	43	1	$4.07
4	Qwen3.7 Max Qwen · 1 month ago	85	94★	89	76	87	46	5	$5.0
5	Gemini 3.1 Pro Preview Google · 4 months ago	84	75	90	82	91	46	3	$14.0
6	Claude Opus 4.7 Anthropic · 3 months ago	84	—	90	91	69	54	1	$30.0
7	GPT-5.4 OpenAI · 4 months ago	83	82	87	82	80	51	3	$17.5
8	Claude Opus 4.5 Anthropic · 7 months ago	79	64	81	78	92	—	3	$30.0
9	Claude Sonnet 4.6 Anthropic · 4 months ago	76	57	72	85	83	47	4	$18.0
10	Gemini 3.5 Flash Google · 1 month ago	75	78	80	65	87	50	0	$10.5

11	Kimi K2.7 Code MoonshotAI · 20 days ago	75	72	68	84	67	42	1	$4.24
12	GLM 5.2 Z.ai · 16 days ago	74	81	81	72	63	51	0	$3.93
13	Claude Opus 4.6 Anthropic · 5 months ago	73	59	71	82	73	—	2	$30.0
14	Claude Fable 5 Anthropic · 23 days ago	68	40	73	73	81	60	6	$60.0
15	Qwen3.5 Plus 2026-02-15 Qwen · 4 months ago	68	76	72	66	61	—	1	$1.82
16	Kimi K2.5 MoonshotAI · 5 months ago	68	78	59	72	67	—	0	$2.4
17	DeepSeek V4 Pro DeepSeek · 2 months ago	68	68	66	66	73	44	0	$1.3
18	Gemini 3.1 Flash Lite Preview Google · 4 months ago	67	61	86	—	—	25	0	$1.75
19	MiniMax M3 MiniMax	66	63	54	77	64	44	3	$1.5
20	GPT-5.4 Mini OpenAI · 4 months ago	64	73	75	61	47	40	2	$5.25
21	Gemini 3 Flash Preview Google · 6 months ago	64	75	64	51	80	—	2	$3.5
22	GLM 5.1 Z.ai · 3 months ago	64	—	80	69	46	40	0	$4.0
23	GPT-5.5 Pro OpenAI · 2 months ago	62	—	94	47	69	—	1	$210.0
24	Claude Sonnet 4.5 Anthropic · 9 months ago	61	55	75	53	62	36	1	$18.0
25	Claude Sonnet 5 Anthropic · 2 days ago	60	64	56	63	56	53	1	$12.0
26	Gemini 3.1 Flash Lite Google · 2 months ago	58	61	70	45	64	—	0	$1.75
27	GLM 5 Z.ai · 5 months ago	55	55	69	48	51	—	0	$2.52
28	GPT-5 Mini OpenAI · 11 months ago	51	68	36	53	53	—	1	$2.25
29	DeepSeek V3.1 Terminus DeepSeek · 9 months ago	40	41	36	37	52	—	0	$1.22
30	DeepSeek V3.2 DeepSeek · 7 months ago	40	36	38	42	42	—	0	$0.57
31	Claude Haiku 4.5 Anthropic · 9 months ago	39	29	44	42	34	30	0	$6.0
32	Grok 4.20 Beta xAI · —	37	34	30	41	44	—	0	$8.0
33	Gemini 2.5 Pro Google · 1 year ago	37	—	—	—	37	26	0	$11.25
34	Grok 4.20 xAI · 3 months ago	36	33	24	39	53	—	0	$3.75
35	Mistral Medium 3.1 Mistral · 11 months ago	29	19	31	24	44	15	0	$2.4
36	GPT-5.4 Nano OpenAI · 4 months ago	23	—	37	20	20	38	0	$1.45
37	MiniMax M2.7 MiniMax · 3 months ago	17	35	12	17	8	38	0	$0.9
38	Gemini 2.5 Flash Google · 1 year ago	10	—	—	—	10	—	0	$0.75
39	o1 OpenAI · 1.5 years ago	7	—	—	—	7	—	0	$75.0
40	Qwen3 Coder Flash Qwen · 9 months ago	3	—	—	—	3	—	0	$1.17
41	GPT-4o OpenAI · 2.1 years ago	2	—	—	—	2	—	0	$12.5
42	GPT-4o Mini OpenAI · 2 years ago	0	—	—	—	0	—	0	$0.75

Overall and area figures are mean percentile ranks across the task areas each model entered. Scores are rank-normalized before averaging, so collections scored 0–100 and collections rated by Elo are never mixed as raw values. Color bands: 85+, 70–84, 55–69, 40–54, <40. ★ = best in that area. †AA Intel = Artificial Analysis Intelligence Index, a third-party benchmark shown for reference only and never blended into our rankings. Price = blended input+output cost per 1M tokens.
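To make the rank-normalization concrete, here is a minimal Python sketch of one way mean percentile ranks could be computed. The exact percentile convention (lowest score maps to 0, highest to 100, ties share the midpoint), the function names, and the sample data are all assumptions for illustration, not the leaderboard's actual pipeline.

```python
from statistics import mean


def percentile_ranks(scores):
    """Convert one area's raw scores to 0-100 percentile ranks.

    Assumed convention: lowest score -> 0, highest -> 100,
    tied models share the midpoint of their positions.
    """
    n = len(scores)
    if n == 1:
        # Assumption: a lone entrant is treated as the top of its area.
        return {model: 100.0 for model in scores}
    values = list(scores.values())
    ranks = {}
    for model, score in scores.items():
        below = sum(v < score for v in values)
        tied_others = sum(v == score for v in values) - 1
        ranks[model] = 100.0 * (below + 0.5 * tied_others) / (n - 1)
    return ranks


def overall_percentiles(area_scores):
    """Average each model's percentile over only the areas it entered."""
    per_area = {area: percentile_ranks(s) for area, s in area_scores.items()}
    models = {m for s in area_scores.values() for m in s}
    return {
        m: mean(p[m] for p in per_area.values() if m in p)
        for m in models
    }


# Hypothetical data: one area graded 0-100, one rated with Elo.
area_scores = {
    "coding": {"model-a": 92.0, "model-b": 75.0, "model-c": 40.0},
    "chat-elo": {"model-a": 1180.0, "model-b": 1250.0},
}
result = overall_percentiles(area_scores)
print(result)
```

Because each area is converted to ranks before averaging, the Elo values (around 1200) and the 0-100 grades contribute on the same scale, and a model that skipped an area is averaged only over the areas it entered.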

Quality vs. price — every model

Best value sits high and to the left. Log-scale price axis.
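One way to read "high and to the left" programmatically is as a Pareto frontier: the models that no cheaper-or-equal model matches or beats on quality. The sketch below uses a subset of rows copied from the table above. The function names are hypothetical, and the frontier rule and budget filter are assumptions about how one might define value, not the page's stated method.

```python
import math

# (model, overall percentile, price per 1M tokens): a subset of the table above.
ROWS = [
    ("GPT-5.5", 88, 35.0),
    ("Claude Opus 4.8", 87, 30.0),
    ("Kimi K2.6", 85, 4.07),
    ("Qwen3.7 Max", 85, 5.0),
    ("GLM 5.2", 74, 3.93),
    ("DeepSeek V4 Pro", 68, 1.3),
    ("GPT-5.5 Pro", 62, 210.0),
    ("DeepSeek V3.2", 40, 0.57),
]


def pareto_frontier(rows):
    """Return models not matched or beaten on quality by a cheaper-or-equal model."""
    frontier = []
    best_quality = -math.inf
    # Sweep from cheapest to priciest; at equal price, highest quality first.
    for name, quality, price in sorted(rows, key=lambda r: (r[2], -r[1])):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


def best_under_budget(rows, max_price):
    """Highest-ranked model priced strictly below max_price, or None."""
    affordable = [r for r in rows if r[2] < max_price]
    return max(affordable, key=lambda r: r[1])[0] if affordable else None


def log_price(price):
    """x-coordinate on a log-scale price axis (base 10)."""
    return math.log10(price)


frontier = pareto_frontier(ROWS)
cheap_pick = best_under_budget(ROWS, 5.0)
print(frontier)
print(cheap_pick)
```

On these rows, the under-$5 pick is Kimi K2.6, which agrees with the "best cheap model" answer on this page. Qwen3.7 Max is excluded only because its $5.0 price is not strictly under the threshold.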

The grades behind the ranks

Every rank on this page traces back to graded runs like these — a real test case, a real model output, and an independent judge's unedited verdict.

Coding

Top score · kimi-k2.6-medium 100/100

“The response provides a perfect, minimal, and idiomatic TypeScript solution that safely narrows the union type using a null check and a discriminated union tag. The explanation is concise and directly addresses the prompt's requirements.”

Lowest score · gemini-3.1-pro-preview 0/100

“The model triggered a false-positive safety refusal and completely failed to perform the requested code review.”

Sales

Top score · claude-opus-4.5-max 95/100

“The response is an exceptional, production-ready cold email that perfectly captures the requested scrappy voice, leverages the context trigger brilliantly, integrates verified facts, and closes with a highly effective CTA while staying well under the word limit.”

Lowest score · glm-5 0/100

“The model completely failed to generate the requested email, outputting only a single digit.”

Legal & HR

Top score · minimax-m3-high 100/100

“The model perfectly executed the task by acknowledging the lack of specific data and providing a highly structured, actionable, and professional template for the manager to use. It completely avoided fabricating any details, adhering strictly to the constraints.”

Lowest score · qwen3.5-plus-02-15-medium 0/100

“The original judge response was completely malformed and contained no extractable scores.”

Looking for the best model for one task?

Does more "thinking" actually make models better?
Are AI models getting better over time?
Do headline benchmark scores predict real-world performance?

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s