Do benchmark scores predict real-world performance?

Standardized benchmarks like the Intelligence Index measure raw capability. We measure something different: how models perform on real business tasks. Here's how the two line up.

Each dot is one of the 16 models we've tested. Models above the trend over-deliver on real tasks relative to their headline score; those below under-deliver.

Punch above their benchmarks

Rank higher on our real tasks than their Intelligence Index suggests.

Gemini 3.1 Pro Preview +5 ranks
Gemini 3.1 Flash Lite Preview +5 ranks
Qwen3.7 Max +4 ranks
Gemini 2.5 Pro +3 ranks
GPT-5.5 +1 rank

Fall short of their benchmarks

Strong headline scores, weaker on our real tasks.

Claude Opus 4.7 −4 ranks
Gemini 3.5 Flash −3 ranks
GLM 5.1 −3 ranks
GPT-5.4 Nano −3 ranks
MiniMax M2.7 −3 ranks
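
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how a rank difference like the ones above could be computed. It is an illustration, not the exact method behind these numbers: the function names, the tie handling, and the sign convention (headline rank minus real-task rank, so a positive value means the model places higher on real tasks) are all assumptions, and the scores are hypothetical placeholders rather than real results.

```python
def rank_positions(scores):
    """Map each model to its rank (1 = best) by descending score.

    Ties are broken by insertion order here; that is an assumption,
    a real comparison might use averaged ranks instead.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_deltas(headline_scores, real_task_scores):
    """Return headline rank minus real-task rank for each model.

    Positive: the model ranks higher on real tasks than its headline
    score suggests. Negative: it ranks lower.
    """
    headline_rank = rank_positions(headline_scores)
    real_rank = rank_positions(real_task_scores)
    return {model: headline_rank[model] - real_rank[model] for model in headline_scores}


# Hypothetical placeholder scores, not taken from any real benchmark.
headline = {"model_a": 60, "model_b": 55, "model_c": 50, "model_d": 40}
real_task = {"model_a": 70, "model_b": 90, "model_c": 40, "model_d": 80}

deltas = rank_deltas(headline, real_task)
for model, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{model} {delta:+d}")
```

Because both rankings cover the same set of models, the differences always sum to zero: every place one model gains is a place another loses.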

Headline scores: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings). Real-task percentiles are Spring Prompt's own benchmarks.