off · low · medium · high · max — tested on 22 task areas
Does more 'thinking' actually make AI models better?
Most models expose a reasoning dial — off, low, medium, high, max. Extra thinking tokens cost real money and latency, so we measured what they actually buy: the same models, same tests, dial turned up vs. off.
A thinking configuration holds the #1 spot in 14 of 22 task areas, but the median gain over reasoning-off is just +1.8 points, at a median 1.0× the cost.
Task areas where thinking clearly helps: 10 · a wash: 12 · actively hurts: 0
Where thinking pays — task by task
For every model tested in a task area both with reasoning off and at one or more thinking tiers, we compare its best thinking result against its off baseline. "Median lift" is the middle model's score change; "cost ×" is what that best thinking tier cost relative to off.
| Task area | #1 config uses | Median lift | Cost × | Models paired | Verdict |
|---|---|---|---|---|---|
| AI Strategy | medium | +2.8 pts | 1.1× | 34 | Helps |
| Chef / Home Cooking | max | +1.8 pts | 1.3× | 34 | Mixed |
| Coding | high | +2.9 pts | 1.0× | 29 | Helps |
| Content & Brand | off | +3.1 pts | 1.2× | 33 | Helps |
| Creative & Comedy | high | +9.9 pct | 1.0× | 26 | Helps |
| Customer Support | low | +1.6 pts | 1.0× | 28 | Mixed |
| Data & Analytics | off | +1.0 pts | 1.0× | 27 | Mixed |
| Executive Assistant | max | +2.8 pts | 1.2× | 29 | Helps |
| Frontend & Landing Pages | max | +1.6 pts | 1.1× | 26 | Mixed |
| Investor & Pitch | off | +0.1 pts | 1.1× | 14 | Mixed |
| Knowledge & Docs | max | +4.0 pts | 1.0× | 26 | Helps |
| Landing Pages | off | +2.5 pts | 1.0× | 14 | Helps |
| Legal & HR | high | +1.7 pts | 1.0× | 26 | Mixed |
| Presentations & Decks | high | +1.1 pts | 1.1× | 26 | Mixed |
| Product & Project Management | off | +2.2 pts | 1.1× | 26 | Helps |
| RAG, Safety & Grounding | low | +0.9 pts | 1.0× | 27 | Mixed |
| Research & Competitive Analysis | max | +7.3 pts | 1.1× | 26 | Helps |
| Sales | max | +2.2 pts | 1.0× | 26 | Helps |
| Structured Output | max | +1.4 pts | 1.0× | 27 | Mixed |
| Summarization & Meeting Notes | off | +1.1 pts | 1.0× | 26 | Mixed |
| Training & Education | off | +1.3 pts | 1.0× | 26 | Mixed |
| Translation & Localization | off | +1.0 pts | 1.0× | 26 | Mixed |
Score lifts are only compared within one task area (never across), and only for 0–100-scored collections; the Elo-scored collection is compared by rank percentile. "Too few pairs" = fewer than 3 models had both an off baseline and a thinking config.
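As a rough illustration of the pairing described above, the sketch below computes a median lift and a cost multiple for one task area. It is a minimal sketch under stated assumptions, not the page's actual pipeline: the function name `summarize_task_area`, the `runs` layout, the sample numbers, and the choice to take the median of per-model cost ratios are all hypothetical.

```python
from statistics import median

THINKING_TIERS = ("low", "medium", "high", "max")
MIN_PAIRS = 3  # below this, the page reports "Too few pairs"


def summarize_task_area(runs):
    """Summarize one task area.

    runs: {model: {tier: (score, cost)}} -- hypothetical layout.
    Only models with both an "off" run and a thinking run are paired.
    """
    lifts, cost_ratios = [], []
    for by_tier in runs.values():
        thinking = {t: by_tier[t] for t in THINKING_TIERS if t in by_tier}
        if "off" not in by_tier or not thinking:
            continue  # unpaired model: skip it
        off_score, off_cost = by_tier["off"]
        # Best thinking tier = highest-scoring tier for this model.
        best = max(thinking, key=lambda t: thinking[t][0])
        best_score, best_cost = thinking[best]
        lifts.append(best_score - off_score)
        cost_ratios.append(best_cost / off_cost)
    if len(lifts) < MIN_PAIRS:
        return None  # too few pairs to report
    return {
        "median_lift": median(lifts),
        "cost_x": median(cost_ratios),  # assumption: median of ratios
        "models_paired": len(lifts),
    }


# Made-up example data: (score, cost) per tier.
runs = {
    "model-a": {"off": (70.0, 1.00), "low": (71.0, 1.00), "high": (74.0, 1.20)},
    "model-b": {"off": (80.0, 2.00), "max": (81.0, 2.20)},
    "model-c": {"off": (60.0, 0.50), "medium": (59.0, 0.50)},
    "model-d": {"high": (90.0, 3.00)},  # no off baseline, so not paired
}
summary = summarize_task_area(runs)
```

With this sample data, three models are paired and `model-d` is dropped because it has no off baseline.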
Reasoning off vs. best thinking tier
Each point is one model, averaged across its paired task areas. Above the line = thinking helps.
Thinking buys the most for…
- GPT-5.5 Pro +14.4 pts (3/3 tasks)
- GLM 5.1 +8.0 pts (4/4 tasks)
- Gemini 3.1 Flash Lite Preview +7.9 pts (4/4 tasks)
- Claude Sonnet 4.6 +7.8 pts (6/6 tasks)
- Claude Opus 4.7 +5.0 pts (3/4 tasks)
…and can hurt these
- DeepSeek V3.1 Terminus -0.2 pts (worse in 11/20)
Turning reasoning up made the median result worse for this model: more tokens, lower scores.
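The per-model figures above (an average lift plus a count such as "3/3 tasks" or "worse in 11/20") could be derived from per-task lifts roughly as follows. This is a sketch under assumptions: `model_summary` and the sample lifts are hypothetical, and the page does not say how ties or zero lifts are counted.

```python
from statistics import mean


def model_summary(lifts_by_task):
    """lifts_by_task: {task_area: best_thinking_score - off_score}."""
    lifts = list(lifts_by_task.values())
    return {
        "avg_lift": mean(lifts),                    # averaged across paired tasks
        "helped": sum(1 for x in lifts if x > 0),   # tasks where thinking scored higher
        "hurt": sum(1 for x in lifts if x < 0),     # tasks where thinking scored lower
        "tasks": len(lifts),
    }


# Made-up lifts for one model across three task areas.
example = model_summary({"Coding": 3.0, "Sales": -1.0, "Legal & HR": 1.0})
```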
How this was measured
- Reasoning tiers: off · low · medium · high · max. On effort-based APIs these map to the provider's reasoning-effort levels; on budget-based APIs to increasing thinking-token budgets.
- Each comparison pairs the same model on the same task area: its reasoning-off runs vs. its best-ranked thinking tier. No cross-model or cross-task score mixing.
- Cross-task summaries use rank percentiles only — 0–100 scores and Elo never average together.
- Coverage disclosure: 2 task areas were tested on a subset of tiers (Investor & Pitch, Landing Pages).
- Full grading methodology lives on each task-area page, e.g. Coding.
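One way to read the rank-percentile rule in the list above: convert each task area's leaderboard to percentiles first, then average only those percentiles across tasks, so 0–100 scores and Elo ratings never mix directly. The sketch below assumes a simple linear percentile (best = 100, worst = 0) and ignores ties; the function names, config labels, and sample numbers are hypothetical, and the page's exact formula is not stated.

```python
from statistics import mean


def rank_percentiles(scores):
    """Map each config to a rank percentile within one task area.

    scores: {config: raw_score} on any scale (0-100 points or Elo).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 100.0}
    return {cfg: 100.0 * (n - 1 - i) / (n - 1) for i, cfg in enumerate(ordered)}


def cross_task_percentile(config, boards):
    """Average a config's percentile over the task areas it appears in."""
    return mean(rank_percentiles(b)[config] for b in boards if config in b)


# Made-up leaderboards: one 0-100 scored, one Elo scored.
coding = {"m/off": 71.0, "m/high": 74.0, "n/off": 65.0}
elo = {"m/off": 1480, "m/high": 1530, "n/off": 1510}

coding_pct = rank_percentiles(coding)
overall_off = cross_task_percentile("m/off", [coding, elo])
```

Because only ranks enter the cross-task average, the scale of the underlying scores does not matter.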
Frequently asked
Does more reasoning effort make AI models better?
It depends heavily on the task. A thinking configuration holds the #1 spot in 14 of 22 task areas, but the median gain over reasoning-off is just +1.8 points, at a median 1.0× the cost.
Which tasks benefit most from extended thinking?
The biggest median gains were in Creative & Comedy (+9.9), Research & Competitive Analysis (+7.3), and Knowledge & Docs (+4.0).
How much more does reasoning effort cost?
Across all paired runs, a model's best thinking tier cost a median 1.0× its reasoning-off runs.
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.