42 models · 22 task areas · 37,277 graded runs

AI model leaderboard

Every model we test, ranked by how it actually performs on real work. Rankings are percentile-based across tasks, so no single benchmark dominates.

Podium: 1. GPT-5.5 (OpenAI) · 2. Claude Opus 4.8 (Anthropic) · 3. Kimi K2.6 (Moonshot)

Current leader: GPT-5.5 — top-3 in 7 of 22 task areas.

Code & data

Qwen3.7 Max

avg percentile 94 in this area

Writing & comms

GPT-5.5

avg percentile 92 in this area

Business & strategy

Claude Opus 4.8

avg percentile 90 in this area

Creative & visual

GPT-5.5

avg percentile 96 in this area

# | Model | Overall | Code & data | Writing & comms | Business & strategy | Creative & visual | AA Intel† | Top-3s | Price / 1M
1 GPT-5.5
OpenAI · 2 months ago
88 86 92 82 96 55 7 $35.0
2 Claude Opus 4.8
Anthropic · 1 month ago
87 84 87 90 82 56 7 $30.0
3 Kimi K2.6
MoonshotAI
85 89 88 88 69 43 1 $4.07
4 Qwen3.7 Max
Qwen · 1 month ago
85 94 89 76 87 46 5 $5.0
5 Gemini 3.1 Pro Preview
Google · 4 months ago
84 75 90 82 91 46 3 $14.0
6 Claude Opus 4.7
Anthropic · 3 months ago
84 90 91 69 54 1 $30.0
7 GPT-5.4
OpenAI · 4 months ago
83 82 87 82 80 51 3 $17.5
8 Claude Opus 4.5
Anthropic · 7 months ago
79 64 81 78 92 3 $30.0
9 Claude Sonnet 4.6
Anthropic · 4 months ago
76 57 72 85 83 47 4 $18.0
10 Gemini 3.5 Flash
Google · 1 month ago
75 78 80 65 87 50 0 $10.5
11 Kimi K2.7 Code
MoonshotAI · 20 days ago
75 72 68 84 67 42 1 $4.24
12 GLM 5.2
Z.ai · 16 days ago
74 81 81 72 63 51 0 $3.93
13 Claude Opus 4.6
Anthropic · 5 months ago
73 59 71 82 73 2 $30.0
14 Claude Fable 5
Anthropic · 23 days ago
68 40 73 73 81 60 6 $60.0
15 Qwen3.5 Plus 2026-02-15
Qwen · 4 months ago
68 76 72 66 61 1 $1.82
16 Kimi K2.5
MoonshotAI · 5 months ago
68 78 59 72 67 0 $2.4
17 DeepSeek V4 Pro
DeepSeek · 2 months ago
68 68 66 66 73 44 0 $1.3
18 Gemini 3.1 Flash Lite Preview
Google · 4 months ago
67 61 86 25 0 $1.75
19 MiniMax M3
MiniMax
66 63 54 77 64 44 3 $1.5
20 GPT-5.4 Mini
OpenAI · 4 months ago
64 73 75 61 47 40 2 $5.25
21 Gemini 3 Flash Preview
Google · 6 months ago
64 75 64 51 80 2 $3.5
22 GLM 5.1
Z.ai · 3 months ago
64 80 69 46 40 0 $4.0
23 GPT-5.5 Pro
OpenAI · 2 months ago
62 94 47 69 1 $210.0
24 Claude Sonnet 4.5
Anthropic · 9 months ago
61 55 75 53 62 36 1 $18.0
25 Claude Sonnet 5
Anthropic · 2 days ago
60 64 56 63 56 53 1 $12.0
26 Gemini 3.1 Flash Lite
Google · 2 months ago
58 61 70 45 64 0 $1.75
27 GLM 5
Z.ai · 5 months ago
55 55 69 48 51 0 $2.52
28 GPT-5 Mini
OpenAI · 11 months ago
51 68 36 53 53 1 $2.25
29 DeepSeek V3.1 Terminus
DeepSeek · 9 months ago
40 41 36 37 52 0 $1.22
30 DeepSeek V3.2
DeepSeek · 7 months ago
40 36 38 42 42 0 $0.57
31 Claude Haiku 4.5
Anthropic · 9 months ago
39 29 44 42 34 30 0 $6.0
32 Grok 4.20 Beta
xAI · —
37 34 30 41 44 0 $8.0
33 Gemini 2.5 Pro
Google · 1 year ago
37 37 26 0 $11.25
34 Grok 4.20
xAI · 3 months ago
36 33 24 39 53 0 $3.75
35 Mistral Medium 3.1
Mistral · 11 months ago
29 19 31 24 44 15 0 $2.4
36 GPT-5.4 Nano
OpenAI · 4 months ago
23 37 20 20 38 0 $1.45
37 MiniMax M2.7
MiniMax · 3 months ago
17 35 12 17 8 38 0 $0.9
38 Gemini 2.5 Flash
Google · 1 year ago
10 10 0 $0.75
39 o1
OpenAI · 1.5 years ago
7 7 0 $75.0
40 Qwen3 Coder Flash
Qwen · 9 months ago
3 3 0 $1.17
41 GPT-4o
OpenAI · 2.1 years ago
2 2 0 $12.5
42 GPT-4o Mini
OpenAI · 2 years ago
0 0 0 $0.75

Overall and area figures are mean percentile ranks across the task areas each model entered. Scores are rank-normalized, so collections graded 0–100 and collections rated by Elo are never mixed as raw numbers. Score bands: 85+, 70–84, 55–69, 40–54, <40. ★ = best in that area. †AA Intel = Artificial Analysis Intelligence Index, a third-party benchmark shown for reference only and never blended into our rankings. Price = blended input+output cost per 1M tokens.
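As a rough illustration of the rank normalization described above, here is a minimal Python sketch. The page does not publish its exact percentile formula, so the one used here (share of other models beaten, ties counted as half) is an assumption, and the model names and scores are hypothetical.

```python
from statistics import mean


def percentile_ranks(scores: dict[str, float]) -> dict[str, float]:
    """Map raw scores from one task area to 0-100 percentile ranks.

    Assumed formula: the share of *other* models a model beats, with a tie
    counted as half a win. This is an illustration, not the page's method.
    """
    names = list(scores)
    if len(names) < 2:
        # A lone entrant has nobody to be ranked against.
        return {name: 100.0 for name in names}
    ranks = {}
    for name in names:
        wins = 0.0
        for other in names:
            if other == name:
                continue
            if scores[name] > scores[other]:
                wins += 1.0
            elif scores[name] == scores[other]:
                wins += 0.5
        ranks[name] = 100.0 * wins / (len(names) - 1)
    return ranks


def overall(areas: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's percentile rank over only the areas it entered."""
    per_model: dict[str, list[float]] = {}
    for raw_scores in areas.values():
        for name, pct in percentile_ranks(raw_scores).items():
            per_model.setdefault(name, []).append(pct)
    return {name: mean(pcts) for name, pcts in per_model.items()}


# Hypothetical data: one area graded 0-100, one rated on an Elo scale.
areas = {
    "graded_0_100": {"model_a": 91.0, "model_b": 78.0, "model_c": 64.0},
    "elo_rated": {"model_a": 1180.0, "model_b": 1250.0},
}
result = overall(areas)
print(result)
```

Because only ranks are averaged, the Elo-scale numbers never meet the 0–100 grades directly, and `model_c` is averaged over the single area it entered.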

Quality vs. price — every model

Best value sits high and to the left. Log-scale price axis.
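One way to read "high and to the left" is as a value frontier: models that no cheaper model matches on overall percentile. The sketch below applies that idea to a subset of rows copied from the table above. The frontier rule and the helper names are assumptions for illustration and are not how the chart itself is drawn.

```python
import math

# (model, overall percentile, price per 1M tokens in USD), a subset of the table above.
MODELS = [
    ("GPT-5.5", 88, 35.0),
    ("Claude Opus 4.8", 87, 30.0),
    ("Kimi K2.6", 85, 4.07),
    ("Qwen3.7 Max", 85, 5.0),
    ("Gemini 3.1 Pro Preview", 84, 14.0),
    ("GLM 5.2", 74, 3.93),
    ("DeepSeek V4 Pro", 68, 1.3),
    ("DeepSeek V3.2", 40, 0.57),
]


def value_frontier(models):
    """Keep a model only if nothing cheaper (or equally priced) matches or beats it."""
    frontier = []
    best_quality = -math.inf
    # Walk from cheapest to most expensive; at equal price, higher quality first.
    for name, quality, price in sorted(models, key=lambda m: (m[2], -m[1])):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


def log_price(price: float) -> float:
    """x-coordinate on a log10 price axis, where equal price ratios are equal distances."""
    return math.log10(price)


frontier = value_frontier(MODELS)
print(frontier)
```

On this subset, Qwen3.7 Max and Gemini 3.1 Pro Preview fall off the frontier because Kimi K2.6 reaches the same or a higher overall percentile at a lower price. Running it over all 42 rows could give a different frontier.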

The grades behind the ranks

Every rank on this page traces back to graded runs like these — a real test case, a real model output, and an independent judge's unedited verdict.

Coding

Top score · kimi-k2.6-medium 100/100

“The response provides a perfect, minimal, and idiomatic TypeScript solution that safely narrows the union type using a null check and a discriminated union tag. The explanation is concise and directly addresses the prompt's requirements.”

Lowest score · gemini-3.1-pro-preview 0/100

“The model triggered a false-positive safety refusal and completely failed to perform the requested code review.”

Sales

Top score · claude-opus-4.5-max 95/100

“The response is an exceptional, production-ready cold email that perfectly captures the requested scrappy voice, leverages the context trigger brilliantly, integrates verified facts, and closes with a highly effective CTA while staying well under the word limit.”

Lowest score · glm-5 0/100

“The model completely failed to generate the requested email, outputting only a single digit.”

Legal & HR

Top score · minimax-m3-high 100/100

“The model perfectly executed the task by acknowledging the lack of specific data and providing a highly structured, actionable, and professional template for the manager to use. It completely avoided fabricating any details, adhering strictly to the constraints.”

Lowest score · qwen3.5-plus-02-15-medium 0/100

“The original judge response was completely malformed and contained no extractable scores.”

Looking for the best model for one task?

Frequently asked

What is the best AI model overall right now?

GPT-5.5 leads our leaderboard with an average percentile of 88 across 22 task areas, finishing top-3 in 7 of them.

What is the best value AI model?

GLM 5.2: within 13 percentile points of the leader at $3.93/1M tokens vs $35 for GPT-5.5.

What is the best cheap AI model (under $5/1M tokens)?

Kimi K2.6 — average percentile 85 across 6 task areas at $4.07/1M tokens.
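The "under $5" answer above can be reproduced from the table as a filter followed by a maximum. This is a minimal sketch using a handful of rows copied from the table. Treating "under" as strictly less than the cap and breaking ties by lower price are assumptions, and `best_under` is a hypothetical helper name.

```python
# (model, overall percentile, price per 1M tokens in USD), rows copied from the table above.
ROWS = [
    ("GPT-5.5", 88, 35.0),
    ("Kimi K2.6", 85, 4.07),
    ("Qwen3.7 Max", 85, 5.0),
    ("Kimi K2.7 Code", 75, 4.24),
    ("GLM 5.2", 74, 3.93),
    ("DeepSeek V4 Pro", 68, 1.3),
]


def best_under(rows, price_cap):
    """Return the highest-ranked row priced strictly below price_cap, or None."""
    eligible = [row for row in rows if row[2] < price_cap]
    if not eligible:
        return None
    # Highest overall percentile wins; the cheaper model wins a tie (assumed rule).
    return max(eligible, key=lambda row: (row[1], -row[2]))


pick = best_under(ROWS, 5.0)
print(pick)
```

Qwen3.7 Max has the same overall percentile but is priced at exactly $5.0, so a strict "under $5" cut excludes it.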

Which model is best for code & data tasks?

Qwen3.7 Max leads the Code & data area group with an average percentile of 94 across those task areas.

Which model is best for writing & comms tasks?

GPT-5.5 leads the Writing & comms area group with an average percentile of 92 across those task areas.

Which model is best for business & strategy tasks?

Claude Opus 4.8 leads the Business & strategy area group with an average percentile of 90 across those task areas.

Which model is best for creative & visual tasks?

GPT-5.5 leads the Creative & visual area group with an average percentile of 96 across those task areas.

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

  • Generate test cases from your prompt — no eval set required to start.
  • Compare models side by side with quality, cost and latency in one matrix.
  • Optimise the winner until the scores say it's ready to ship.
Experiment · Cold outreach email

Prompt × model results (12 test cases · 3 evals)

v1: Claude Opus 7.1 · GPT-5 6.8 · Gemini 7.4
v2: Claude Opus 8.3 · GPT-5 7.9 · Gemini 8.0
v3: Claude Opus 9.2 · GPT-5 8.6 · Gemini 8.4
Best combo: v3 × Claude Opus
9.2 quality · $0.004/run · 1.8s
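
The "best combo" in the sample matrix above is simply the highest-scoring prompt version and model pair. A minimal sketch of that selection, using the scores shown in the matrix. The function names are hypothetical and do not reflect Spring Prompt's actual API.

```python
# Quality scores from the sample experiment above: prompt version -> model -> score.
RESULTS = {
    "v1": {"Claude Opus": 7.1, "GPT-5": 6.8, "Gemini": 7.4},
    "v2": {"Claude Opus": 8.3, "GPT-5": 7.9, "Gemini": 8.0},
    "v3": {"Claude Opus": 9.2, "GPT-5": 8.6, "Gemini": 8.4},
}


def best_combo(results):
    """Return (version, model, score) for the highest-scoring cell in the matrix."""
    cells = (
        (version, model, score)
        for version, row in results.items()
        for model, score in row.items()
    )
    return max(cells, key=lambda cell: cell[2])


def improvement(results, model, first, last):
    """Score gained by one model between two prompt versions, rounded to 0.1."""
    return round(results[last][model] - results[first][model], 1)


best = best_combo(RESULTS)
print(best)
print(improvement(RESULTS, "Claude Opus", "v1", "v3"))
```

A real comparison would also weigh cost and latency, which the matrix reports alongside quality. This sketch ranks on quality alone.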