Do benchmark scores predict real-world performance?
Standardized benchmarks like the Intelligence Index measure raw capability. We measure something different: how models perform on real business tasks. Here's how the two line up.
Each dot is one of the 16 models we've tested. Models above the trend over-deliver on real tasks relative to their headline score; those below under-deliver.
Punch above their benchmarks
Rank higher on our real tasks than their Intelligence Index suggests.
- Gemini 3.1 Pro Preview +5 ranks
- Gemini 3.1 Flash Lite Preview +5 ranks
- Qwen3.7 Max +4 ranks
- Gemini 2.5 Pro +3 ranks
- GPT-5.5 +1 rank
Fall short of their benchmarks
Strong headline scores, weaker on our real tasks.
- Claude Opus 4.7 −4 ranks
- Gemini 3.5 Flash −3 ranks
- GLM 5.1 −3 ranks
- GPT-5.4 Nano −3 ranks
- MiniMax M2.7 −3 ranks
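One way to read the rank differences above is as each model's position by headline score minus its position on real tasks, so a positive number means the model places higher on real tasks. The sketch below illustrates that reading only. The exact method behind the figures is not stated here, and the model names and scores in the example are hypothetical placeholders, not measurements.

```python
def ranks(scores):
    """Map each model to its 1-based rank, where rank 1 is the highest score.

    Ties are broken by the order in which models appear in the input
    (an assumption made for this sketch).
    """
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_deltas(headline_scores, real_task_scores):
    """Return headline rank minus real-task rank for every model.

    A positive delta means the model ranks higher on real tasks than its
    headline score suggests; a negative delta means it ranks lower.
    """
    headline_rank = ranks(headline_scores)
    real_rank = ranks(real_task_scores)
    return {name: headline_rank[name] - real_rank[name] for name in headline_rank}


# Hypothetical placeholder scores, not real measurements.
headline = {"model_a": 60, "model_b": 55, "model_c": 50, "model_d": 45}
real_tasks = {"model_a": 70, "model_b": 40, "model_c": 80, "model_d": 50}

deltas = rank_deltas(headline, real_tasks)
for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
```

Because both rankings cover the same set of models, the deltas always sum to zero: every place one model gains is a place another loses.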
Headline scores: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings). Real-task percentiles are Spring Prompt's own benchmarks.