off · low · medium · high · max — tested on 22 task areas

Does more 'thinking' actually make AI models better?

Most models expose a reasoning dial — off, low, medium, high, max. Extra thinking tokens cost real money and latency, so we measured what they actually buy: the same models, same tests, dial turned up vs. off.

Thinking pays most for: OpenAI GPT-5.5 Pro · Zhipu GLM 5.1 · Gemini 3.1 Flash Lite Preview

A thinking configuration holds the #1 spot in 14 of 22 task areas, but the median gain over reasoning-off is just +1.8 points, at a median 1.0× the cost of reasoning-off.

Task areas where thinking clearly helps: 10 · mixed (a wash): 12 · actively hurts: 0

Where thinking pays — task by task

For every model tested both with reasoning off and at a thinking tier in a task area, we compare its best thinking result against its off baseline. "Median lift" is the middle model's score change; "cost ×" is what the winning tier cost relative to off.

Task area	#1 config uses	Median lift	Cost ×	Models paired	Verdict
AI Strategy	medium	+2.8 pts	1.1×	34	Helps
Chef / Home Cooking	max	+1.8 pts	1.3×	34	Mixed
Coding	high	+2.9 pts	1.0×	29	Helps
Content & Brand	off	+3.1 pts	1.2×	33	Helps
Creative & Comedy	high	+9.9 pct	1.0×	26	Helps
Customer Support	low	+1.6 pts	1.0×	28	Mixed
Data & Analytics	off	+1.0 pts	1.0×	27	Mixed
Executive Assistant	max	+2.8 pts	1.2×	29	Helps
Frontend & Landing Pages	max	+1.6 pts	1.1×	26	Mixed
Investor & Pitch	off	+0.1 pts	1.1×	14	Mixed
Knowledge & Docs	max	+4.0 pts	1.0×	26	Helps
Landing Pages	off	+2.5 pts	1.0×	14	Helps
Legal & HR	high	+1.7 pts	1.0×	26	Mixed
Presentations & Decks	high	+1.1 pts	1.1×	26	Mixed
Product & Project Management	off	+2.2 pts	1.1×	26	Helps
RAG, Safety & Grounding	low	+0.9 pts	1.0×	27	Mixed
Research & Competitive Analysis	max	+7.3 pts	1.1×	26	Helps
Sales	max	+2.2 pts	1.0×	26	Helps
Structured Output	max	+1.4 pts	1.0×	27	Mixed
Summarization & Meeting Notes	off	+1.1 pts	1.0×	26	Mixed
Training & Education	off	+1.3 pts	1.0×	26	Mixed
Translation & Localization	off	+1.0 pts	1.0×	26	Mixed

Score lifts are only compared within one task area (never across), and only for 0–100-scored collections; the Elo-scored collection is compared by rank percentile instead (shown as "pct" rather than "pts" in the table). "Too few pairs" = fewer than 3 models had both an off baseline and a thinking config.
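The pairing rule described above can be sketched in a few lines. This is a minimal illustration, not the page's actual pipeline: the record fields (`model`, `tier`, `score`, `cost`) and the function name `summarize_task_area` are hypothetical, and it assumes one run per model per tier and that "cost ×" is the median of each model's best-thinking cost divided by its off cost.

```python
from statistics import median

MIN_PAIRS = 3  # table note: fewer than 3 paired models means "Too few pairs"


def summarize_task_area(runs):
    """Summarize one task area from per-model runs.

    `runs` is a list of dicts with hypothetical keys:
    model, tier ("off", "low", ...), score (0-100), cost (any consistent unit).
    """
    by_model = {}
    for run in runs:
        by_model.setdefault(run["model"], []).append(run)

    lifts, cost_ratios = [], []
    for model_runs in by_model.values():
        off = [r for r in model_runs if r["tier"] == "off"]
        thinking = [r for r in model_runs if r["tier"] != "off"]
        if not off or not thinking:
            continue  # unpaired: needs both an off baseline and a thinking tier
        baseline = off[0]
        best = max(thinking, key=lambda r: r["score"])  # best thinking result
        lifts.append(best["score"] - baseline["score"])
        cost_ratios.append(best["cost"] / baseline["cost"])

    if len(lifts) < MIN_PAIRS:
        return {"pairs": len(lifts), "verdict": "Too few pairs"}
    return {
        "pairs": len(lifts),
        "median_lift": median(lifts),
        "median_cost_x": median(cost_ratios),
    }


# Made-up numbers for illustration only; not taken from the benchmark.
sample_runs = [
    {"model": "A", "tier": "off", "score": 70, "cost": 1.0},
    {"model": "A", "tier": "high", "score": 74, "cost": 1.2},
    {"model": "B", "tier": "off", "score": 60, "cost": 1.0},
    {"model": "B", "tier": "low", "score": 61, "cost": 1.0},
    {"model": "B", "tier": "max", "score": 59, "cost": 2.0},
    {"model": "C", "tier": "off", "score": 80, "cost": 2.0},
    {"model": "C", "tier": "medium", "score": 83, "cost": 2.2},
    {"model": "D", "tier": "high", "score": 90, "cost": 1.0},  # no off run
]
result = summarize_task_area(sample_runs)
print(result)
```

Model D is skipped because it has no off baseline, and model B is credited with its best thinking tier (low) rather than its most expensive one (max), which is why a high tier that scores worse does not drag the lift down.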

Reasoning off vs. best thinking tier

Each point is one model, averaged across its paired task areas. Above the line = thinking helps.

Thinking buys the most for…

GPT-5.5 Pro +14.4 pts (3/3 tasks)
GLM 5.1 +8.0 pts (4/4 tasks)
Gemini 3.1 Flash Lite Preview +7.9 pts (4/4 tasks)
Claude Sonnet 4.6 +7.8 pts (6/6 tasks)
Claude Opus 4.7 +5.0 pts (3/4 tasks)

…and can hurt these

DeepSeek V3.1 Terminus -0.2 pts (worse in 11/20)

Turning reasoning up lowered the median result for the model listed above: more tokens, lower scores.
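The per-model figures in this section (a lift plus a "3/4 tasks" style count) could be derived along these lines. This is a sketch under assumptions: it averages a model's lifts across its paired task areas, as the chart caption describes, and counts a task as improved when the lift is positive. The function name `model_summary` and the input shape are hypothetical, and the numbers are invented.

```python
from statistics import mean


def model_summary(lifts_by_task):
    """Summarize one model from {task_area: best-thinking score minus off score}."""
    lifts = list(lifts_by_task.values())
    return {
        "mean_lift": round(mean(lifts), 1),
        "improved": sum(1 for x in lifts if x > 0),
        "worse": sum(1 for x in lifts if x < 0),
        "tasks": len(lifts),
    }


# Invented lifts for a hypothetical model, for illustration only.
example = model_summary({"task-a": 4.0, "task-b": -1.0, "task-c": 3.0})
print(f"{example['mean_lift']:+.1f} pts ({example['improved']}/{example['tasks']} tasks)")
```

Reporting the count next to the average matters because a small average can hide a split record: a model that is worse in roughly half of its task areas can still land near zero overall.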

How this was measured

Reasoning tiers: off · low · medium · high · max. On effort-based APIs these map to the provider's reasoning-effort levels; on budget-based APIs to increasing thinking-token budgets.
Each comparison pairs the same model on the same task area: its reasoning-off runs vs. its best-ranked thinking tier. No cross-model or cross-task score mixing.
Cross-task summaries use rank percentiles only — 0–100 scores and Elo never average together.
Coverage disclosure: 2 task areas were tested on a subset of tiers (Investor & Pitch, Landing Pages).
Full grading methodology lives on each task-area page, e.g. Coding.
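To illustrate why cross-task summaries use rank percentiles, here is a minimal sketch of converting each task area's scores to percentiles before averaging. The exact percentile formula used for this page is not stated, so the mid-rank formula below is an assumption, and `rank_percentiles` and the sample values are hypothetical.

```python
def rank_percentiles(scores):
    """Map {config: score} to {config: percentile in [0, 100]}; higher is better.

    Works the same for 0-100 scores and Elo ratings because only the
    ordering within one task area is used. Ties share a mid-rank.
    """
    n = len(scores)
    if n == 1:
        return {name: 100.0 for name in scores}
    values = list(scores.values())
    result = {}
    for name, score in scores.items():
        below = sum(1 for v in values if v < score)
        equal = sum(1 for v in values if v == score)  # includes itself
        result[name] = 100.0 * (below + (equal - 1) / 2) / (n - 1)
    return result


# Invented values: one Elo-scored task area and one 0-100-scored task area.
elo_task = rank_percentiles({"a": 1200, "b": 1100, "c": 1000})
points_task = rank_percentiles({"a": 55, "b": 91, "c": 73})

# Percentiles share a scale, so they can be averaged across task areas.
cross_task = {k: (elo_task[k] + points_task[k]) / 2 for k in elo_task}
print(cross_task)
```

Averaging the raw values instead would let the Elo numbers, which sit on a scale of roughly a thousand, swamp the 0-100 scores entirely.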

Frequently asked

Does more reasoning effort make AI models better?

Sometimes, and it depends heavily on the task. A thinking configuration holds the #1 spot in 14 of 22 task areas, but the median gain over reasoning-off is just +1.8 points, at a median 1.0× the cost of reasoning-off.

Which tasks benefit most from extended thinking?

The biggest median gains were in Creative & Comedy (+9.9 pct), Research & Competitive Analysis (+7.3 pts), and Knowledge & Docs (+4.0 pts).

How much more does reasoning effort cost?

Across all paired runs, a model's best thinking tier cost a median 1.0× its reasoning-off runs, so roughly the same. Per task area, the median ranged from 1.0× to 1.3×.

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt version	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s