42 models · 22 task areas · 37,277 graded runs
AI model leaderboard
Every model we test, ranked by how it actually performs on real work. Rankings are percentile-based across tasks, so no single benchmark dominates.
Current leader: GPT-5.5 — top-3 in 7 of 22 task areas.
- Code & data: Qwen3.7 Max (avg percentile 94 in this area)
- Writing & comms: GPT-5.5 (avg percentile 92 in this area)
- Business & strategy: Claude Opus 4.8 (avg percentile 90 in this area)
- Creative & visual: GPT-5.5 (avg percentile 96 in this area)
| # | Model | Overall | Code & data | Writing & comms | Business & strategy | Creative & visual | AA Intel† | Top-3s | Price / 1M tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (OpenAI · 2 months ago) | 88 | 86 | 92★ | 82 | 96★ | 55 | 7 | $35.00 |
| 2 | Claude Opus 4.8 (Anthropic · 1 month ago) | 87 | 84 | 87 | 90★ | 82 | 56 | 7 | $30.00 |
| 3 | Kimi K2.6 (MoonshotAI) | 85 | 89 | 88 | 88 | 69 | 43 | 1 | $4.07 |
| 4 | Qwen3.7 Max (Qwen · 1 month ago) | 85 | 94★ | 89 | 76 | 87 | 46 | 5 | $5.00 |
| 5 | Gemini 3.1 Pro Preview (Google · 4 months ago) | 84 | 75 | 90 | 82 | 91 | 46 | 3 | $14.00 |
| 6 | Claude Opus 4.7 (Anthropic · 3 months ago) | 84 | — | 90 | 91 | 69 | 54 | 1 | $30.00 |
| 7 | GPT-5.4 (OpenAI · 4 months ago) | 83 | 82 | 87 | 82 | 80 | 51 | 3 | $17.50 |
| 8 | Claude Opus 4.5 (Anthropic · 7 months ago) | 79 | 64 | 81 | 78 | 92 | — | 3 | $30.00 |
| 9 | Claude Sonnet 4.6 (Anthropic · 4 months ago) | 76 | 57 | 72 | 85 | 83 | 47 | 4 | $18.00 |
| 10 | Gemini 3.5 Flash (Google · 1 month ago) | 75 | 78 | 80 | 65 | 87 | 50 | 0 | $10.50 |
| 11 | Kimi K2.7 Code (MoonshotAI · 20 days ago) | 75 | 72 | 68 | 84 | 67 | 42 | 1 | $4.24 |
| 12 | GLM 5.2 (Z.ai · 16 days ago) | 74 | 81 | 81 | 72 | 63 | 51 | 0 | $3.93 |
| 13 | Claude Opus 4.6 (Anthropic · 5 months ago) | 73 | 59 | 71 | 82 | 73 | — | 2 | $30.00 |
| 14 | Claude Fable 5 (Anthropic · 23 days ago) | 68 | 40 | 73 | 73 | 81 | 60 | 6 | $60.00 |
| 15 | Qwen3.5 Plus 2026-02-15 (Qwen · 4 months ago) | 68 | 76 | 72 | 66 | 61 | — | 1 | $1.82 |
| 16 | Kimi K2.5 (MoonshotAI · 5 months ago) | 68 | 78 | 59 | 72 | 67 | — | 0 | $2.40 |
| 17 | DeepSeek V4 Pro (DeepSeek · 2 months ago) | 68 | 68 | 66 | 66 | 73 | 44 | 0 | $1.30 |
| 18 | Gemini 3.1 Flash Lite Preview (Google · 4 months ago) | 67 | 61 | 86 | — | — | 25 | 0 | $1.75 |
| 19 | MiniMax M3 (MiniMax) | 66 | 63 | 54 | 77 | 64 | 44 | 3 | $1.50 |
| 20 | GPT-5.4 Mini (OpenAI · 4 months ago) | 64 | 73 | 75 | 61 | 47 | 40 | 2 | $5.25 |
| 21 | Gemini 3 Flash Preview (Google · 6 months ago) | 64 | 75 | 64 | 51 | 80 | — | 2 | $3.50 |
| 22 | GLM 5.1 (Z.ai · 3 months ago) | 64 | — | 80 | 69 | 46 | 40 | 0 | $4.00 |
| 23 | GPT-5.5 Pro (OpenAI · 2 months ago) | 62 | — | 94 | 47 | 69 | — | 1 | $210.00 |
| 24 | Claude Sonnet 4.5 (Anthropic · 9 months ago) | 61 | 55 | 75 | 53 | 62 | 36 | 1 | $18.00 |
| 25 | Claude Sonnet 5 (Anthropic · 2 days ago) | 60 | 64 | 56 | 63 | 56 | 53 | 1 | $12.00 |
| 26 | Gemini 3.1 Flash Lite (Google · 2 months ago) | 58 | 61 | 70 | 45 | 64 | — | 0 | $1.75 |
| 27 | GLM 5 (Z.ai · 5 months ago) | 55 | 55 | 69 | 48 | 51 | — | 0 | $2.52 |
| 28 | GPT-5 Mini (OpenAI · 11 months ago) | 51 | 68 | 36 | 53 | 53 | — | 1 | $2.25 |
| 29 | DeepSeek V3.1 Terminus (DeepSeek · 9 months ago) | 40 | 41 | 36 | 37 | 52 | — | 0 | $1.22 |
| 30 | DeepSeek V3.2 (DeepSeek · 7 months ago) | 40 | 36 | 38 | 42 | 42 | — | 0 | $0.57 |
| 31 | Claude Haiku 4.5 (Anthropic · 9 months ago) | 39 | 29 | 44 | 42 | 34 | 30 | 0 | $6.00 |
| 32 | Grok 4.20 Beta (xAI · —) | 37 | 34 | 30 | 41 | 44 | — | 0 | $8.00 |
| 33 | Gemini 2.5 Pro (Google · 1 year ago) | 37 | — | — | — | 37 | 26 | 0 | $11.25 |
| 34 | Grok 4.20 (xAI · 3 months ago) | 36 | 33 | 24 | 39 | 53 | — | 0 | $3.75 |
| 35 | Mistral Medium 3.1 (Mistral · 11 months ago) | 29 | 19 | 31 | 24 | 44 | 15 | 0 | $2.40 |
| 36 | GPT-5.4 Nano (OpenAI · 4 months ago) | 23 | — | 37 | 20 | 20 | 38 | 0 | $1.45 |
| 37 | MiniMax M2.7 (MiniMax · 3 months ago) | 17 | 35 | 12 | 17 | 8 | 38 | 0 | $0.90 |
| 38 | Gemini 2.5 Flash (Google · 1 year ago) | 10 | — | — | — | 10 | — | 0 | $0.75 |
| 39 | o1 (OpenAI · 1.5 years ago) | 7 | — | — | — | 7 | — | 0 | $75.00 |
| 40 | Qwen3 Coder Flash (Qwen · 9 months ago) | 3 | — | — | — | 3 | — | 0 | $1.17 |
| 41 | GPT-4o (OpenAI · 2.1 years ago) | 2 | — | — | — | 2 | — | 0 | $12.50 |
| 42 | GPT-4o Mini (OpenAI · 2 years ago) | 0 | — | — | — | 0 | — | 0 | $0.75 |
Overall and area figures are mean percentile ranks across the task areas each model entered. Scores are rank-normalized before averaging, so collections scored on a 0–100 scale and collections scored by Elo are never mixed as raw values. Score bands: 85+ · 70–84 · 55–69 · 40–54 · <40. ★ = best in that area. †AA Intel = Artificial Analysis Intelligence Index, a third-party benchmark shown for reference only and never blended into our rankings. Price = blended input+output cost per 1M tokens.
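To make the rank-normalization concrete, here is a minimal sketch of how a mean percentile rank could be computed. It is not our production scoring code: the function names, the toy task areas, the tie handling (ties share the average position), and the single-entrant rule are all assumptions made for illustration.

```python
from statistics import mean


def percentile_ranks(scores: dict[str, float]) -> dict[str, float]:
    """Convert one task area's raw scores to 0-100 percentile ranks.

    Only the ordering matters, so a 0-100 grade and an Elo rating
    end up on the same scale. Ties share the average position (assumption).
    """
    n = len(scores)
    if n == 1:
        # Assumption: a lone entrant gets the top percentile.
        return {model: 100.0 for model in scores}
    values = list(scores.values())
    ranks = {}
    for model, score in scores.items():
        below = sum(1 for v in values if v < score)
        equal = sum(1 for v in values if v == score)
        position = below + (equal - 1) / 2  # 0-based, averaged over ties
        ranks[model] = 100.0 * position / (n - 1)
    return ranks


def overall(areas: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's percentile over the areas it actually entered."""
    per_model: dict[str, list[float]] = {}
    for area_scores in areas.values():
        for model, pct in percentile_ranks(area_scores).items():
            per_model.setdefault(model, []).append(pct)
    return {model: mean(pcts) for model, pcts in per_model.items()}


# Hypothetical data: one area graded 0-100, one area rated by Elo.
AREAS = {
    "code_review": {"model_a": 91.0, "model_b": 78.0, "model_c": 55.0},
    "cold_email": {"model_a": 1180.0, "model_b": 1240.0},
}

print(overall(AREAS))
```

In this toy data `model_c` skipped the second area, so its overall figure is averaged over one area only, which mirrors the "task areas each model entered" wording above.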
Quality vs. price — every model
Best value sits high and to the left: high quality at low price. The price axis is log-scaled.
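One way to read "high and to the left" is as a value frontier: the models that no cheaper model matches or beats on overall score. The sketch below is an illustration rather than how the chart is drawn. It uses a hand-picked subset of rows from the table above, and `value_frontier` and `x_position` are hypothetical helper names.

```python
import math
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    overall: int   # overall mean percentile, from the table
    price: float   # USD per 1M tokens, from the table


# Subset of the leaderboard rows, chosen for illustration only.
POINTS = [
    Point("GPT-5.5", 88, 35.0),
    Point("Claude Opus 4.8", 87, 30.0),
    Point("Kimi K2.6", 85, 4.07),
    Point("Qwen3.7 Max", 85, 5.0),
    Point("Gemini 3.1 Pro Preview", 84, 14.0),
    Point("GLM 5.2", 74, 3.93),
    Point("DeepSeek V4 Pro", 68, 1.3),
    Point("DeepSeek V3.2", 40, 0.57),
]


def x_position(price: float) -> float:
    """Horizontal coordinate on a log-scale price axis (base 10)."""
    return math.log10(price)


def value_frontier(points: list[Point]) -> list[str]:
    """Return models that beat every cheaper model on overall score.

    Walk from cheapest to most expensive and keep a model only if it
    scores strictly higher than everything cheaper than it.
    """
    frontier = []
    best_so_far = -1
    for p in sorted(points, key=lambda p: (p.price, -p.overall)):
        if p.overall > best_so_far:
            frontier.append(p.model)
            best_so_far = p.overall
    return frontier


print(value_frontier(POINTS))
```

On this subset, Qwen3.7 Max and Gemini 3.1 Pro Preview drop off the frontier because Kimi K2.6 matches or beats their overall score at a lower price.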
The grades behind the ranks
Every rank on this page traces back to graded runs like these — a real test case, a real model output, and an independent judge's unedited verdict.
“The response provides a perfect, minimal, and idiomatic TypeScript solution that safely narrows the union type using a null check and a discriminated union tag. The explanation is concise and directly addresses the prompt's requirements.”
“The model triggered a false-positive safety refusal and completely failed to perform the requested code review.”
“The response is an exceptional, production-ready cold email that perfectly captures the requested scrappy voice, leverages the context trigger brilliantly, integrates verified facts, and closes with a highly effective CTA while staying well under the word limit.”
“The model completely failed to generate the requested email, outputting only a single digit.”
“The model perfectly executed the task by acknowledging the lack of specific data and providing a highly structured, actionable, and professional template for the manager to use. It completely avoided fabricating any details, adhering strictly to the constraints.”
“The original judge response was completely malformed and contained no extractable scores.”
Frequently asked
What is the best AI model overall right now?
GPT-5.5 leads our leaderboard with an average percentile of 88 across 22 task areas, finishing top-3 in 7 of them.
What is the best value AI model?
GLM 5.2: within 13 percentile points of the leader at $3.93/1M tokens vs $35 for GPT-5.5.
What is the best cheap AI model (under $5/1M tokens)?
Kimi K2.6 — average percentile 85 across 6 task areas at $4.07/1M tokens.
Which model is best for code & data tasks?
Qwen3.7 Max leads the Code & data area group with an average percentile of 94 across those task areas.
Which model is best for writing & comms tasks?
GPT-5.5 leads the Writing & comms area group with an average percentile of 92 across those task areas.
Which model is best for business & strategy tasks?
Claude Opus 4.8 leads the Business & strategy area group with an average percentile of 90 across those task areas.
Which model is best for creative & visual tasks?
GPT-5.5 leads the Creative & visual area group with an average percentile of 96 across those task areas.
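The "best cheap model" answer above amounts to a filter plus a maximum. A minimal sketch follows, using a few rows copied from the table. The helper name `best_under` is hypothetical, and two choices are assumptions: "under $5" is read as strictly less than the cap, and the cheaper model wins a tie on score.

```python
from typing import Optional

# (model, overall percentile, USD per 1M tokens): a subset of the table rows.
ROWS = [
    ("GPT-5.5", 88, 35.0),
    ("Kimi K2.6", 85, 4.07),
    ("Qwen3.7 Max", 85, 5.0),
    ("Kimi K2.7 Code", 75, 4.24),
    ("GLM 5.2", 74, 3.93),
    ("DeepSeek V4 Pro", 68, 1.3),
]


def best_under(rows: list[tuple[str, int, float]], price_cap: float) -> Optional[str]:
    """Highest-scoring model priced strictly below price_cap, or None."""
    eligible = [row for row in rows if row[2] < price_cap]
    if not eligible:
        return None
    # Sort key: higher overall first; on a tie, the cheaper model wins.
    return max(eligible, key=lambda row: (row[1], -row[2]))[0]


print(best_under(ROWS, 5.0))
```

With a $5 cap this selects Kimi K2.6. Qwen3.7 Max has the same overall score but is priced at exactly $5.00, so the strict comparison excludes it.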
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.
Prompt × model results
12 test cases · 3 evals