Business · 16 tasks · 69 models
Best AI models for Data & Analytics
Which models can analyse business data correctly — right numbers, no false precision, no invented causation?
claude-opus-4.8-low leads Data & Analytics (excellent). For tighter budgets, deepseek-v3.2-low is competitive at about 32% of the cost.
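The "about 32% of the cost" figure can be checked against the per-run costs in the ranking table. The sketch below is only an illustration: the two costs are copied from the table, and the helper names are made up for this example. Because the table rounds to four decimals, the per-thousand figure comes out at $8.80 rather than the $8.81 quoted on this page, which presumably uses the unrounded cost.

```python
# Per-run costs (USD) copied from the ranking table on this page.
COST_PER_RUN = {
    "claude-opus-4.8-low": 0.0275,
    "deepseek-v3.2-low": 0.0088,
}

def cost_ratio(cheaper: str, pricier: str) -> float:
    """Cheaper model's per-run cost as a fraction of the pricier model's."""
    return COST_PER_RUN[cheaper] / COST_PER_RUN[pricier]

def cost_per_thousand(model: str) -> float:
    """Cost of 1,000 runs at the table's (rounded) per-run price."""
    return COST_PER_RUN[model] * 1000

ratio = cost_ratio("deepseek-v3.2-low", "claude-opus-4.8-low")
print(f"{ratio:.0%} of the leader's cost")                                # 32% of the leader's cost
print(f"${cost_per_thousand('deepseek-v3.2-low'):.2f} per 1,000 runs")   # $8.80 per 1,000 runs
```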
Top score: claude-opus-4.8-low (excellent)
Best value: deepseek-v3.2-low clears the quality bar at $8.81 per 1,000 runs
Fastest: gemini-3.1-flash-lite at ~11s per run, still strong
Quality vs. cost
Every model placed by what it delivers and what it costs. The best value sits high and to the left.
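One way to read "high and to the left" is as a Pareto frontier: a model is on the frontier if no other model is both cheaper and higher scoring. The sketch below applies that idea to a handful of rows copied from the ranking table. It is an illustration only, not how the chart on this page is drawn, and the frontier over all 69 rows may differ from the one computed for this subset.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    model: str
    score: float   # 0-100 quality score
    cost: float    # USD per run

# A small subset of rows copied from the ranking table.
ENTRIES = [
    Entry("claude-opus-4.8-low", 100.0, 0.0275),
    Entry("claude-sonnet-4.6-medium", 99.9, 0.0237),
    Entry("qwen3.7-max-max", 99.8, 0.0172),
    Entry("kimi-k2.5-medium", 99.6, 0.0138),
    Entry("gpt-5.4-mini-max", 96.2, 0.0115),
    Entry("deepseek-v3.2-low", 91.7, 0.0088),
    Entry("gpt-5.5-max", 97.2, 0.0316),
]

def pareto_frontier(entries):
    """Keep entries that no cheaper (or equally cheap) entry matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to priciest; break cost ties by higher score first.
    for entry in sorted(entries, key=lambda e: (e.cost, -e.score)):
        if entry.score > best_score:
            frontier.append(entry)
            best_score = entry.score
    return frontier

for entry in pareto_frontier(ENTRIES):
    print(f"{entry.model}: {entry.score} at ${entry.cost:.4f}/run")
```

In this subset, gpt-5.5-max is the only row that drops out: it costs more than claude-opus-4.8-low and scores lower.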
Full ranking
| # | Model | Score | Cost/run | Speed | Best for | Arena · dataviz |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4.8-low | 100.0 Excellent | $0.0275 | 17.7s | Best overall | 1277 |
| 2 | claude-opus-4.8-medium | 100.0 Excellent | $0.0279 | 24.7s | Best overall | 1277 |
| 3 | claude-opus-4.8-high | 100.0 Excellent | $0.0297 | 19.4s | Best overall | 1277 |
| 4 | claude-sonnet-4.6-max | 99.9 Excellent | $0.0250 | 33.3s | Best overall | 1318 |
| 5 | claude-opus-4.8-max | 99.9 Excellent | $0.0728 | 47.3s | Best overall | 1277 |
| 6 | claude-sonnet-4.6-medium | 99.9 Excellent | $0.0237 | 32.0s | Best overall | 1318 |
| 7 | qwen3.7-max-max | 99.8 Excellent | $0.0172 | 39.6s | Best overall | 1289 |
| 8 | claude-opus-4.5-medium | 99.8 Excellent | $0.0346 | 31.1s | Best overall | 1286 |
| 9 | kimi-k2.5-max | 99.7 Excellent | $0.0156 | 49.6s | Best overall | — |
| 10 | qwen3.7-max-medium | 99.7 Excellent | $0.0163 | 37.3s | Best overall | 1289 |
| 11 | kimi-k2.5-medium | 99.6 Excellent | $0.0138 | 51.2s | Best overall | — |
| 12 | qwen3.5-plus-02-15-medium | 99.3 Excellent | $0.0137 | 50.6s | Best overall | — |
| 13 | claude-opus-4.5-max | 99.3 Excellent | $0.0358 | 33.4s | Best overall | 1286 |
| 14 | gpt-5-mini-medium | 99.2 Excellent | $0.0140 | 32.0s | Best overall | — |
| 15 | claude-opus-4.6-medium | 99.1 Excellent | $0.0342 | 38.0s | Best overall | 1313 |
| 16 | kimi-k2.7-code-max | 98.9 Excellent | $0.0151 | 30.6s | Best overall | 1274 |
| 17 | claude-sonnet-4.5-max | 98.6 Excellent | $0.0237 | 33.4s | Best overall | 1203 |
| 18 | claude-opus-4.6-max | 98.6 Excellent | $0.0339 | 38.6s | Best overall | 1313 |
| 19 | gemini-3.5-flash-high | 98.5 Excellent | $0.0264 | 20.3s | Best overall | 1268 |
| 20 | glm-5-max | 97.7 Excellent | $0.0176 | 65.7s | Best overall | 1271 |
| 21 | gemini-3.1-flash-lite-max | 97.6 Excellent | $0.0147 | 24.4s | Best overall | — |
| 22 | gemini-3.1-pro-preview-medium | 97.3 Excellent | $0.0269 | 35.5s | Best overall | 1259 |
| 23 | gpt-5.5-max | 97.2 Excellent | $0.0316 | 28.6s | Best overall | 1302 |
| 24 | gpt-5.4-max | 97.2 Excellent | $0.0210 | 24.1s | Best overall | 1220 |
| 25 | claude-haiku-4.5-medium | 97.0 Excellent | $0.0172 | 23.8s | Best overall | 1168 |
| 26 | gpt-5.5-medium | 97.0 Excellent | $0.0249 | 25.8s | Best overall | 1302 |
| 27 | deepseek-v3.1-terminus-medium | 96.8 Excellent | $0.0121 | 35.8s | Best overall | 1208 |
| 28 | gemini-3-flash-preview-max | 96.4 Excellent | $0.0147 | 24.6s | Best overall | — |
| 29 | gemini-3.5-flash-medium | 96.3 Excellent | $0.0238 | 26.9s | Best overall | 1268 |
| 30 | gpt-5.4-mini-max | 96.2 Excellent | $0.0115 | 17.5s | Best overall | — |
| 31 | gemini-3.1-pro-preview-low | 96.0 Excellent | $0.0281 | 23.7s | Best overall | 1259 |
| 32 | claude-haiku-4.5-max | 95.8 Excellent | $0.0173 | 24.2s | Best overall | 1168 |
| 33 | qwen3.5-plus-02-15-max | 95.7 Excellent | $0.0161 | 62.6s | Best overall | — |
| 34 | claude-sonnet-4.5-medium | 95.6 Excellent | $0.0235 | 34.2s | Best overall | 1203 |
| 35 | gemini-3.1-pro-preview-max | 95.1 Excellent | $0.0269 | 33.6s | Best overall | 1259 |
| 36 | kimi-k2.7-code-medium | 94.8 Excellent | $0.0145 | 30.6s | Best overall | 1274 |
| 37 | glm-5-medium | 94.7 Excellent | $0.0163 | 67.6s | Best overall | 1271 |
| 38 | gpt-5-mini-max | 94.6 Excellent | $0.0189 | 60.3s | Best overall | — |
| 39 | gemini-3-flash-preview-medium | 94.4 Excellent | $0.0151 | 25.5s | Best overall | — |
| 40 | gpt-5.4-medium | 94.0 Excellent | $0.0175 | 20.0s | Best overall | 1220 |
| 41 | deepseek-v3.1-terminus-max | 93.6 Excellent | $0.0146 | 35.2s | Best overall | 1208 |
| 42 | grok-4.20-beta-max | 93.6 Excellent | $0.0153 | 20.2s | Best overall | 1244 |
| 43 | minimax-m2.7-medium | 93.5 Excellent | $0.0138 | 43.2s | Best overall | 1272 |
| 44 | deepseek-v3.2-max | 92.8 Excellent | $0.0137 | 38.6s | Best overall | 1200 |
| 45 | deepseek-v3.2-medium | 92.8 Excellent | $0.0138 | 32.3s | Best overall | 1200 |
| 46 | gemini-3.5-flash-low | 92.5 Excellent | $0.0210 | 17.0s | Best overall | 1268 |
| 47 | gpt-5.4-mini-medium | 92.2 Excellent | $0.0136 | 22.6s | Best overall | — |
| 48 | gemini-3.5-flash-max | 92.2 Excellent | $0.0256 | 28.9s | Best overall | 1268 |
| 49 | grok-4.20-beta-medium | 92.0 Excellent | $0.0175 | 21.7s | Best overall | 1244 |
| 50 | gemini-3.1-flash-lite-medium | 91.6 Excellent | $0.0146 | 24.1s | Best overall | — |
| 51 | grok-4.20-medium | 91.4 Excellent | $0.0145 | 21.5s | Best overall | — |
| 52 | gemini-3.1-flash-lite | 91.4 Excellent | $0.0124 | 10.8s | Best overall | — |
| 53 | mistral-medium-3.1-medium | 89.6 Strong | $0.0156 | 23.0s | Best overall | 1191 |
| 54 | grok-4.20-max | 86.4 Strong | $0.0139 | 19.8s | Best overall | — |
| 55 | minimax-m2.7-max | 85.1 Strong | $0.0145 | 40.7s | Best overall | 1272 |
| 56 | mistral-medium-3.1-max | 82.9 Strong | $0.0157 | 21.8s | Strong drafts | 1191 |
| 57 | claude-sonnet-4.6-high | 100.0 Excellent | $0.0243 | 26.6s | Best overall | 1318 |
| 58 | claude-opus-4.5-low | 100.0 Excellent | $0.0304 | 24.5s | Best overall | 1286 |
| 59 | claude-opus-4.6-low | 100.0 Excellent | $0.0339 | 32.1s | Best overall | 1313 |
| 60 | claude-sonnet-4.6-low | 99.6 Excellent | $0.0236 | 24.2s | Best overall | 1318 |
| 61 | qwen3.7-max-high | 96.0 Excellent | $0.0179 | 36.5s | Best overall | 1289 |
| 62 | gpt-5.5-low | 94.0 Excellent | $0.0227 | 15.7s | Best overall | 1302 |
| 63 | claude-sonnet-4.5-low | 93.2 Excellent | $0.0260 | 29.0s | Best overall | 1203 |
| 64 | qwen3.7-max-low | 92.8 Excellent | $0.0206 | 38.6s | Best overall | 1289 |
| 65 | deepseek-v3.2-low | 91.7 Excellent | $0.0088 | 26.8s | Best overall | 1200 |
| 66 | gpt-5-mini | 89.8 Strong | $0.0149 | 24.0s | Best overall | — |
| 67 | gpt-5.4-low | 85.0 Strong | $0.0162 | 12.1s | Best overall | 1220 |
| 68 | claude-sonnet-4.5-high | 83.8 Strong | $0.0242 | 25.0s | Strong drafts | 1203 |
| 69 | claude-haiku-4.5 | 83.2 Strong | $0.0139 | 13.2s | Strong drafts | 1168 |
“Arena · dataviz” is a third-party benchmark shown for context and is independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).
What separates the top models
Metric Calculation
Difficulty: medium. Tests correct denominators, percentage-vs-points, weighted averages, and base-rate reasoning, each with a single right answer.
Leader: claude-opus-4.8-medium
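To make the weighted-average trap concrete, here is a minimal sketch with made-up regional numbers (they are not taken from the benchmark). Averaging the two regional rates treats a 100-visitor region the same as a 1,000-visitor region; pooling the counts uses the correct denominator.

```python
# Hypothetical data: region -> (conversions, visitors).
REGIONS = {
    "north": (90, 1000),   # 9% conversion on a large base
    "south": (30, 100),    # 30% conversion on a small base
}

def average_of_rates(regions) -> float:
    """Unweighted mean of per-region rates (the trap)."""
    rates = [conversions / visitors for conversions, visitors in regions.values()]
    return sum(rates) / len(rates)

def pooled_rate(regions) -> float:
    """Total conversions over total visitors (the weighted average)."""
    total_conversions = sum(c for c, _ in regions.values())
    total_visitors = sum(v for _, v in regions.values())
    return total_conversions / total_visitors

print(f"average of rates: {average_of_rates(REGIONS):.1%}")  # 19.5%
print(f"pooled rate:      {pooled_rate(REGIONS):.1%}")       # 10.9%
```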
Spot the Misleading Stat
Difficulty: hard. Tests detection of Simpson's paradox, survivorship bias, peeking/p-hacking, and small-sample over-claiming.
Leader: qwen3.7-max-max
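Simpson's paradox is easier to see with numbers. The sketch below uses invented fill counts for two staffing pools (hypothetical, not the benchmark's test data): pool X wins within each shift type, yet pool Y wins overall because most of its shifts are the easy kind.

```python
# Hypothetical data: pool -> shift type -> (filled, offered).
FILLS = {
    "X": {"easy": (90, 100), "hard": (300, 500)},
    "Y": {"easy": (425, 500), "hard": (50, 100)},
}

def stratum_rate(pool: str, shift_type: str) -> float:
    """Fill rate for one pool within one shift type."""
    filled, offered = FILLS[pool][shift_type]
    return filled / offered

def overall_rate(pool: str) -> float:
    """Fill rate for one pool across all shift types combined."""
    filled = sum(f for f, _ in FILLS[pool].values())
    offered = sum(o for _, o in FILLS[pool].values())
    return filled / offered

for shift_type in ("easy", "hard"):
    print(shift_type, f"X={stratum_rate('X', shift_type):.0%}", f"Y={stratum_rate('Y', shift_type):.0%}")
print("overall", f"X={overall_rate('X'):.1%}", f"Y={overall_rate('Y'):.1%}")
# easy X=90% Y=85%
# hard X=60% Y=50%
# overall X=65.0% Y=79.2%
```

The reversal comes entirely from the mix: X handles 500 hard shifts out of 600, while Y handles only 100.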
SQL Reasoning
Difficulty: hard. Tests SQL correctness traps: JOIN fan-out double counting, NOT IN with NULL, inclusive date boundaries, and average-of-averages.
Leader: qwen3.7-max-medium
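Two of these traps can be reproduced in a few lines with an in-memory SQLite database. The schema and rows below are invented for illustration and are not the benchmark's test data. The first pair of queries shows `NOT IN` returning nothing once the subquery yields a NULL; the second shows a JOIN against a one-to-many table inflating a sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
CREATE TABLE shipments (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cy');
-- Order 12 has no customer recorded (NULL).
INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 50.0), (12, NULL, 25.0);
-- Order 10 shipped in two parcels, so it matches two shipment rows.
INSERT INTO shipments VALUES (100, 10), (101, 10), (102, 11);
""")

def rows(sql: str):
    return conn.execute(sql).fetchall()

# Trap 1: NOT IN with a NULL in the subquery evaluates to NULL, so no row passes.
not_in = rows("SELECT name FROM customers "
              "WHERE id NOT IN (SELECT customer_id FROM orders)")
not_exists = rows("SELECT name FROM customers c WHERE NOT EXISTS "
                  "(SELECT 1 FROM orders o WHERE o.customer_id = c.id)")
print(not_in)      # []
print(not_exists)  # [('Cy',)]

# Trap 2: joining orders to shipments repeats order 10, double counting its amount.
fanned_out = rows("SELECT SUM(o.amount) FROM orders o "
                  "JOIN shipments s ON s.order_id = o.id")[0][0]
deduplicated = rows("SELECT SUM(amount) FROM orders "
                    "WHERE id IN (SELECT order_id FROM shipments)")[0][0]
print(fanned_out)    # 250.0
print(deduplicated)  # 150.0
```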
Honest Communication
Difficulty: medium. Tests communicating data without false precision, fabricated confidence, or correlation-as-causation, and distinguishing statistical from practical significance.
Leader: qwen3.7-max-max
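The gap between statistical and practical significance shows up clearly with a very large sample. The sketch below runs a standard pooled two-proportion z-test on invented counts (hypothetical, not benchmark data): a lift of 0.1 percentage points is highly "significant" with two million users per arm, even though it may be too small to matter to the business.

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test; returns (difference, z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / standard_error
    # Two-sided p-value from the normal distribution: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p2 - p1, z, p_value

# Hypothetical A/B test: 10.0% vs 10.1% conversion, 2,000,000 users per arm.
diff, z, p_value = two_proportion_z(200_000, 2_000_000, 202_000, 2_000_000)
print(f"lift: {diff * 100:.1f} percentage points, z = {z:.2f}, p = {p_value:.4f}")
```

An honest summary reports both facts: the difference is very unlikely to be noise, and it is 0.1 percentage points.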
See the difference
Real model outputs and how our judge scored them.
Top score
qwen3.7-max-max
Correlation vs causation (Northwind)
100/100
“The model perfectly answers the prompt. It correctly identifies the logical fallacy of treating correlation as causation, provides highly relevant business examples of confounding variables, and clearly outlines the analytical methods required to actually prove a causal relationship.”
## No, the conclusion is not sound.

The report commits a classic logical error: **confusing correlation with causation**. Here's a breakdown of why.

---

### What the Data Actually Shows (Correlation)

There is an **observed association** between two variables:
- **Variable A:** Logging into the dashboard daily
- **Variable B:** 18% less detention

This simply means the two tend to occur together. It does **not** prove that A causes B.

---

### What the Report Claims (Causation)

The phrase *"daily dashboard use **cuts** detention by 18%"* asserts a direct causal mechanism — that the dashboard itself is responsible for the reduction, and that making *any* broker use it daily would produce the same 18% drop.

---

### Why the Leap Is Flawed

1. **Confounding Variables (Third Factors)**
   A hidden variable could be driving *both* behaviors. For example:
   - **Conscientiousness / Work Ethic:** Highly diligent brokers are more likely to log in daily *and* more likely to proactively manage shipments to avoid detention. Their discipline is the real cause.
   - **Experience:** Veteran brokers may both appreciate the dashboard's value *and* know how to avoid detention through skill.
   - **Training:** Better-trained brokers might use all available tools *and* manage detention more effectively.

2. **Reverse Causation**
   Brokers drowning in detention fires may simply have **no time** to log into a dashboard. In this case, high detention causes low dashboard use — not the other way around.

3. **Self-Selection Bias**
   Brokers who *choose* to log in daily are systematically different from those who don't. The comparison is not between equivalent groups.

---

### What Would Be Needed to Support the Causal Claim?

| Method | Description |
|---|---|
| **Randomized Controlled
Mid-pack
qwen3.5-plus-02-15-max
Percentage points vs percent (Northwind)
100/100
“The model correctly calculates both the absolute change in percentage points (+2) and the relative percentage change (+20%), distinguishing between the two perfectly. It also provides excellent context on why this dual framing is important for an executive audience.”
Quarter-over-quarter, the on-time pickup rate increased by **2 percentage points**, rising from **10% to 12%**. This represents a **20% relative improvement** in performance.
***
**Why this phrasing works for executives:**
* **Distinguishes Absolute vs. Relative:** It clarifies the absolute gain (2 percentage points) to avoid confusion with a 2% relative increase (which would be 10.2%).
* **Highlights Momentum:** It captures the 20% relative growth, which signals significant momentum, while keeping the low absolute base (12%) visible to maintain context on overall performance health.
* **Neutral & Precise:** It uses standard financial/operational terminology ("percentage points") to ensure mathematical accuracy without overstating the success.
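The arithmetic behind that answer is short enough to check directly. This sketch computes both framings for a rate that moves from 10% to 12%; the function name is invented for this example.

```python
def describe_change(old_rate: float, new_rate: float):
    """Return (change in percentage points, relative change in percent)."""
    points = (new_rate - old_rate) * 100
    relative = (new_rate - old_rate) / old_rate * 100
    return points, relative

points, relative = describe_change(0.10, 0.12)
print(f"{points:.1f} percentage points, {relative:.1f}% relative")  # 2.0 percentage points, 20.0% relative

# For contrast, a 2% *relative* increase from 10% lands at 10.2%, not 12%.
print(f"{0.10 * 1.02:.1%}")  # 10.2%
```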
Lowest score
minimax-m2.7-max
Simpson's paradox in fill rates (Tradewinds)
11/100
“The model severely failed by fabricating arithmetic and misunderstanding Simpson's paradox, leading to flawed business logic.”
**Short answer:**
- **On every “type‑of‑shift” level, Pool A fills a higher proportion of both the easy day‑shifts (93 % vs 87 %) and the hard late‑night shifts (73 % vs 69 %).**
- **Overall, however, Pool B looks better (82.6 % vs 78 %) because it has many more easy shifts (270 vs 87).**

This reversal is a textbook example of **Simpson’s paradox**: the overall numbers are driven by the *mix* of shifts, not by the underlying ability of the pools.

**What should Tradewinds conclude?**
1. **Do not rely on the overall fill‑rate alone** when the composition of the shifts differs between pools.
2. **Use a stratified (or weighted) view** of performance.
3. **If the company cares most about filling the hard‑to‑staff late‑night shifts, Pool A is the preferable pool**.
4. **If the schedule will continue to contain many more easy day‑shifts, the higher overall fill‑rate of Pool B may translate into more total filled shifts**, but that advantage is entirely due to the shift mix, not superior performance.

Below is a step‑by‑step walk‑through that supports these conclusions.

---

## 1. Raw numbers at a glance

| | **Pool A** | **Pool B** |
|---------------------|-----------|-----------|
| **Overall filled** | 273/350 (78.0 %) | 289/350 (82.6 %) |
| **Easy day shifts**| 81/87 (93.0 %) | 234/270 (86.7 %) |
| **Hard night shifts**| 192/263 (73.0 %) | 55/80 (68.8 %) |

*Note:* “Easy” = day‑shift; “Hard” = late‑night shift.

---

## 2. The paradox in plain English

- **Pool A beats Pool B on easy shifts (93 % > 87 %).**
- **Pool A also beats Pool B on hard shifts (73 % > 69 %).**
- **Yet Pool B has the higher overall rate.**

Why?
- Pool A works **263 hard** shifts but only **87 easy** ones.
- Pool B works **80 hard** shift
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for data & analytics?
In our benchmarks, claude-opus-4.8-low ranks first for data & analytics, with an excellent score of 100.0 across 16 test cases.
What is the cheapest good model for data & analytics?
deepseek-v3.2-low is the best value: it clears our quality bar for data & analytics at $8.81 per 1,000 runs.
Which model is fastest for data & analytics?
gemini-3.1-flash-lite is the fastest model that still performs well for data & analytics, at 10.8s per run with a score of 91.4.
How we test
Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 1000 model runs across 4 benchmarks · last tested 2026-06-30
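As a rough illustration of what "strict JSON judge plus deterministic heuristics, normalized to 0-100" could look like, here is a minimal sketch. Everything specific in it is an assumption made for the example and is not this page's actual pipeline: the `{"score", "reason"}` schema, the 0-10 judge scale, the substring heuristic, and the 80/20 blend are all hypothetical.

```python
import json

def parse_judge_reply(raw: str) -> float:
    """Strictly parse a judge reply of the assumed form {"score": 0-10, "reason": "..."}."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    if not isinstance(data, dict) or set(data) != {"score", "reason"}:
        raise ValueError("unexpected judge schema")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 10:
        raise ValueError("score out of range")
    return float(score)

def contains_expected(output: str, expected: str) -> bool:
    """Deterministic heuristic (assumed): does the output state the expected figure?"""
    return expected in output

def final_score(raw_judge: str, output: str, expected: str,
                heuristic_weight: float = 0.2) -> float:
    """Blend judge and heuristic (hypothetical weights) into a 0-100 score."""
    judge = parse_judge_reply(raw_judge) / 10
    heuristic = 1.0 if contains_expected(output, expected) else 0.0
    blended = (1 - heuristic_weight) * judge + heuristic_weight * heuristic
    return round(blended * 100, 1)

print(final_score('{"score": 9, "reason": "correct framing"}',
                  "The rate rose by 2 percentage points.",
                  "2 percentage points"))  # 92.0
```

Rejecting any reply that is not exactly the expected JSON shape keeps a malformed judge response from being silently scored.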
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.