off · low · medium · high · max — tested on 22 task areas

Does more 'thinking' actually make AI models better?

Most models expose a reasoning dial — off, low, medium, high, max. Extra thinking tokens cost real money and latency, so we measured what they actually buy: the same models, same tests, dial turned up vs. off.

Thinking pays most for: OpenAI GPT-5.5 Pro · Zhipu GLM 5.1 · Google Gemini 3.1 Flash Lite Preview

A thinking configuration holds the #1 spot in 14 of 22 task areas — but the median gain over reasoning-off is just +1.8 points, at 1.0× the cost.

Task areas where thinking clearly helps: 10 · a wash: 12 · actively hurts: 0

Where thinking pays — task by task

For every model tested both with reasoning off and at a thinking tier in a task area, we compare its best thinking result against its off baseline. "Median lift" is the middle model's score change; "cost ×" is what the winning tier cost relative to off.

| Task area | #1 config uses | Median lift | Cost × | Models paired | Verdict |
|---|---|---|---|---|---|
| AI Strategy | medium | +2.8 pts | 1.1× | 34 | Helps |
| Chef / Home Cooking | max | +1.8 pts | 1.3× | 34 | Mixed |
| Coding | high | +2.9 pts | 1.0× | 29 | Helps |
| Content & Brand | off | +3.1 pts | 1.2× | 33 | Helps |
| Creative & Comedy | high | +9.9 pct | 1.0× | 26 | Helps |
| Customer Support | low | +1.6 pts | 1.0× | 28 | Mixed |
| Data & Analytics | off | +1.0 pts | 1.0× | 27 | Mixed |
| Executive Assistant | max | +2.8 pts | 1.2× | 29 | Helps |
| Frontend & Landing Pages | max | +1.6 pts | 1.1× | 26 | Mixed |
| Investor & Pitch | off | +0.1 pts | 1.1× | 14 | Mixed |
| Knowledge & Docs | max | +4.0 pts | 1.0× | 26 | Helps |
| Landing Pages | off | +2.5 pts | 1.0× | 14 | Helps |
| Legal & HR | high | +1.7 pts | 1.0× | 26 | Mixed |
| Presentations & Decks | high | +1.1 pts | 1.1× | 26 | Mixed |
| Product & Project Management | off | +2.2 pts | 1.1× | 26 | Helps |
| RAG, Safety & Grounding | low | +0.9 pts | 1.0× | 27 | Mixed |
| Research & Competitive Analysis | max | +7.3 pts | 1.1× | 26 | Helps |
| Sales | max | +2.2 pts | 1.0× | 26 | Helps |
| Structured Output | max | +1.4 pts | 1.0× | 27 | Mixed |
| Summarization & Meeting Notes | off | +1.1 pts | 1.0× | 26 | Mixed |
| Training & Education | off | +1.3 pts | 1.0× | 26 | Mixed |
| Translation & Localization | off | +1.0 pts | 1.0× | 26 | Mixed |

Score lifts are only compared within one task area (never across), and only for 0–100-scored collections; the Elo-scored collection is compared by rank percentile. "Too few pairs" = fewer than 3 models had both an off baseline and a thinking config.
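As a rough illustration of the pairing described above, the sketch below computes a median lift and cost multiple for one 0–100-scored task area. It is not the page's actual pipeline: the `Run` record, its field names, and the `summarize_task_area` helper are hypothetical, and taking the median of per-model cost ratios is an assumption about how "cost ×" is aggregated. The one rule taken directly from the text is the minimum of 3 paired models.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional

MIN_PAIRS = 3  # from the note above: fewer than 3 paired models = "Too few pairs"


@dataclass
class Run:
    """One model/tier result in a single task area (hypothetical record shape)."""
    model: str
    tier: str     # "off", "low", "medium", "high", or "max"
    score: float  # 0-100 score; Elo-scored collections are out of scope here
    cost: float   # cost per run, in any consistent unit


def summarize_task_area(runs: list[Run]) -> Optional[dict]:
    """Pair each model's best thinking run with its off baseline.

    Returns None when fewer than MIN_PAIRS models have both.
    """
    by_model: dict[str, list[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    lifts: list[float] = []
    cost_ratios: list[float] = []
    for model_runs in by_model.values():
        off = [r for r in model_runs if r.tier == "off"]
        thinking = [r for r in model_runs if r.tier != "off"]
        if not off or not thinking:
            continue  # model is unpaired in this task area
        baseline = off[0]
        best = max(thinking, key=lambda r: r.score)  # best thinking result
        lifts.append(best.score - baseline.score)
        if baseline.cost > 0:
            cost_ratios.append(best.cost / baseline.cost)

    if len(lifts) < MIN_PAIRS:
        return None
    return {
        "median_lift": median(lifts),
        # Assumption: "cost x" is the median of per-model cost ratios.
        "cost_x": median(cost_ratios) if cost_ratios else None,
        "models_paired": len(lifts),
    }
```

Because lifts are computed inside a single task area, summaries from different areas should be reported side by side rather than pooled into one number.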

Reasoning off vs. best thinking tier

Each point is one model, averaged across its paired task areas. Above the line = thinking helps.
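One way to read that chart rule, sketched under assumptions: the reasoning-off score is on the x-axis, the best thinking score is on the y-axis, and "the line" is the diagonal y = x. The function names and the input shape are hypothetical, and the example covers only 0–100-scored task areas.

```python
from statistics import mean


def model_point(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Average one model's (off_score, best_thinking_score) pairs across task areas."""
    x = mean(off for off, _ in pairs)
    y = mean(best for _, best in pairs)
    return x, y


def thinking_helps(point: tuple[float, float]) -> bool:
    """Assumed reading of the chart: above the diagonal y = x means thinking helps."""
    x, y = point
    return y > x


# Made-up scores for one model in three task areas, for illustration only.
point = model_point([(70.0, 74.0), (80.0, 79.0), (60.0, 63.0)])
print(point, thinking_helps(point))
```

A model can sit above the line overall while still losing in individual task areas, as the second pair in the example does.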

Thinking buys the most for…

…and can hurt these

Turning reasoning up made the median result worse for these models — more tokens, lower scores.

How this was measured

Frequently asked

Does more reasoning effort make AI models better?

It depends heavily on the task. A thinking configuration holds the #1 spot in 14 of 22 task areas, but the median gain over reasoning-off is just +1.8 points, at 1.0× the cost.

Which tasks benefit most from extended thinking?

The biggest median gains were in Creative & Comedy (+9.9 pct), Research & Competitive Analysis (+7.3 pts), and Knowledge & Docs (+4.0 pts).

How much more does reasoning effort cost?

Across all paired runs, a model's best thinking tier cost a median of 1.0× what its reasoning-off runs cost.

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

  • Generate test cases from your prompt — no eval set required to start.
  • Compare models side by side with quality, cost and latency in one matrix.
  • Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

| Prompt | Claude Opus | GPT-5 | Gemini |
|---|---|---|---|
| v1 | 7.1 | 6.8 | 7.4 |
| v2 | 8.3 | 7.9 | 8.0 |
| v3 | 9.2 | 8.6 | 8.4 |

Best combo: v3 × Claude Opus
9.2 quality · $0.004/run · 1.8s