Business · 16 tasks · 69 models

Best AI models for Data & Analytics

Name: Data & Analytics AI model benchmark
Creator: Spring Prompt

Which models can analyse business data correctly — right numbers, no false precision, no invented causation?

Top models

claude-opus-4.8-low (Anthropic)

claude-sonnet-4.6-max (Anthropic)

qwen3.7-max-max (Qwen)

claude-opus-4.8-low leads Data & Analytics (excellent). For tighter budgets, deepseek-v3.2-low is competitive at about 32% of the cost.

Best overall Excellent

claude-opus-4.8-low

Top score — excellent

100.0 score $0.0275/run 17.7s

Best value Excellent

deepseek-v3.2-low

Clears the quality bar at $8.81 per 1,000 runs

91.7 score $0.0088/run 26.8s

Fastest usable Excellent

gemini-3.1-flash-lite

~11s per run, still strong

91.4 score $0.0124/run 10.8s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
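One way to read "high and to the left" is as a Pareto frontier: a model is worth a look if no other model is both at least as good and at least as cheap. The sketch below is only an illustration of that idea on a handful of rows copied from the ranking on this page; the function name `pareto_frontier` is hypothetical, and it is not how the chart on this page is drawn.

```python
# (model, score, cost per run in USD): a small subset of the ranking table,
# not the full 69 models, so the frontier here is for this subset only.
rows = [
    ("claude-opus-4.8-low", 100.0, 0.0275),
    ("claude-sonnet-4.6-max", 99.9, 0.0250),
    ("qwen3.7-max-max", 99.8, 0.0172),
    ("gpt-5.5-max", 97.2, 0.0316),
    ("deepseek-v3.2-low", 91.7, 0.0088),
    ("gemini-3.1-flash-lite", 91.4, 0.0124),
]

def pareto_frontier(models):
    """Keep models that no other model beats on score and cost at once."""
    frontier = []
    for name, score, cost in models:
        dominated = any(
            other_score >= score
            and other_cost <= cost
            and (other_score > score or other_cost < cost)
            for _, other_score, other_cost in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

frontier = pareto_frontier(rows)
print(frontier)
```

In this subset, gpt-5.5-max drops out because claude-opus-4.8-low scores higher for less, and gemini-3.1-flash-lite drops out on cost alone, even though it remains the fastest option.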

Full ranking


#	Model	Score	Cost/run	Speed	Best for	Arena · dataviz
1	claude-opus-4.8-low	100.0 Excellent	$0.0275	17.7s	Best overall	1277
2	claude-opus-4.8-medium	100.0 Excellent	$0.0279	24.7s	Best overall	1277
3	claude-opus-4.8-high	100.0 Excellent	$0.0297	19.4s	Best overall	1277
4	claude-sonnet-4.6-max	99.9 Excellent	$0.0250	33.3s	Best overall	1318
5	claude-opus-4.8-max	99.9 Excellent	$0.0728	47.3s	Best overall	1277
6	claude-sonnet-4.6-medium	99.9 Excellent	$0.0237	32.0s	Best overall	1318
7	qwen3.7-max-max	99.8 Excellent	$0.0172	39.6s	Best overall	1289
8	claude-opus-4.5-medium	99.8 Excellent	$0.0346	31.1s	Best overall	1286
9	kimi-k2.5-max	99.7 Excellent	$0.0156	49.6s	Best overall	—
10	qwen3.7-max-medium	99.7 Excellent	$0.0163	37.3s	Best overall	1289
11	kimi-k2.5-medium	99.6 Excellent	$0.0138	51.2s	Best overall	—
12	qwen3.5-plus-02-15-medium	99.3 Excellent	$0.0137	50.6s	Best overall	—
13	claude-opus-4.5-max	99.3 Excellent	$0.0358	33.4s	Best overall	1286
14	gpt-5-mini-medium	99.2 Excellent	$0.0140	32.0s	Best overall	—
15	claude-opus-4.6-medium	99.1 Excellent	$0.0342	38.0s	Best overall	1313
16	kimi-k2.7-code-max	98.9 Excellent	$0.0151	30.6s	Best overall	1274
17	claude-sonnet-4.5-max	98.6 Excellent	$0.0237	33.4s	Best overall	1203
18	claude-opus-4.6-max	98.6 Excellent	$0.0339	38.6s	Best overall	1313
19	gemini-3.5-flash-high	98.5 Excellent	$0.0264	20.3s	Best overall	1268
20	glm-5-max	97.7 Excellent	$0.0176	65.7s	Best overall	1271
21	gemini-3.1-flash-lite-max	97.6 Excellent	$0.0147	24.4s	Best overall	—
22	gemini-3.1-pro-preview-medium	97.3 Excellent	$0.0269	35.5s	Best overall	1259
23	gpt-5.5-max	97.2 Excellent	$0.0316	28.6s	Best overall	1302
24	gpt-5.4-max	97.2 Excellent	$0.0210	24.1s	Best overall	1220
25	claude-haiku-4.5-medium	97.0 Excellent	$0.0172	23.8s	Best overall	1168
26	gpt-5.5-medium	97.0 Excellent	$0.0249	25.8s	Best overall	1302
27	deepseek-v3.1-terminus-medium	96.8 Excellent	$0.0121	35.8s	Best overall	1208
28	gemini-3-flash-preview-max	96.4 Excellent	$0.0147	24.6s	Best overall	—
29	gemini-3.5-flash-medium	96.3 Excellent	$0.0238	26.9s	Best overall	1268
30	gpt-5.4-mini-max	96.2 Excellent	$0.0115	17.5s	Best overall	—
31	gemini-3.1-pro-preview-low	96.0 Excellent	$0.0281	23.7s	Best overall	1259
32	claude-haiku-4.5-max	95.8 Excellent	$0.0173	24.2s	Best overall	1168
33	qwen3.5-plus-02-15-max	95.7 Excellent	$0.0161	62.6s	Best overall	—
34	claude-sonnet-4.5-medium	95.6 Excellent	$0.0235	34.2s	Best overall	1203
35	gemini-3.1-pro-preview-max	95.1 Excellent	$0.0269	33.6s	Best overall	1259
36	kimi-k2.7-code-medium	94.8 Excellent	$0.0145	30.6s	Best overall	1274
37	glm-5-medium	94.7 Excellent	$0.0163	67.6s	Best overall	1271
38	gpt-5-mini-max	94.6 Excellent	$0.0189	60.3s	Best overall	—
39	gemini-3-flash-preview-medium	94.4 Excellent	$0.0151	25.5s	Best overall	—
40	gpt-5.4-medium	94.0 Excellent	$0.0175	20.0s	Best overall	1220
41	deepseek-v3.1-terminus-max	93.6 Excellent	$0.0146	35.2s	Best overall	1208
42	grok-4.20-beta-max	93.6 Excellent	$0.0153	20.2s	Best overall	1244
43	minimax-m2.7-medium	93.5 Excellent	$0.0138	43.2s	Best overall	1272
44	deepseek-v3.2-max	92.8 Excellent	$0.0137	38.6s	Best overall	1200
45	deepseek-v3.2-medium	92.8 Excellent	$0.0138	32.3s	Best overall	1200
46	gemini-3.5-flash-low	92.5 Excellent	$0.0210	17.0s	Best overall	1268
47	gpt-5.4-mini-medium	92.2 Excellent	$0.0136	22.6s	Best overall	—
48	gemini-3.5-flash-max	92.2 Excellent	$0.0256	28.9s	Best overall	1268
49	grok-4.20-beta-medium	92.0 Excellent	$0.0175	21.7s	Best overall	1244
50	gemini-3.1-flash-lite-medium	91.6 Excellent	$0.0146	24.1s	Best overall	—
51	grok-4.20-medium	91.4 Excellent	$0.0145	21.5s	Best overall	—
52	gemini-3.1-flash-lite	91.4 Excellent	$0.0124	10.8s	Best overall	—
53	mistral-medium-3.1-medium	89.6 Strong	$0.0156	23.0s	Best overall	1191
54	grok-4.20-max	86.4 Strong	$0.0139	19.8s	Best overall	—
55	minimax-m2.7-max	85.1 Strong	$0.0145	40.7s	Best overall	1272
56	mistral-medium-3.1-max	82.9 Strong	$0.0157	21.8s	Strong drafts	1191
57	claude-sonnet-4.6-high	100.0 Excellent	$0.0243	26.6s	Best overall	1318
58	claude-opus-4.5-low	100.0 Excellent	$0.0304	24.5s	Best overall	1286
59	claude-opus-4.6-low	100.0 Excellent	$0.0339	32.1s	Best overall	1313
60	claude-sonnet-4.6-low	99.6 Excellent	$0.0236	24.2s	Best overall	1318
61	qwen3.7-max-high	96.0 Excellent	$0.0179	36.5s	Best overall	1289
62	gpt-5.5-low	94.0 Excellent	$0.0227	15.7s	Best overall	1302
63	claude-sonnet-4.5-low	93.2 Excellent	$0.0260	29.0s	Best overall	1203
64	qwen3.7-max-low	92.8 Excellent	$0.0206	38.6s	Best overall	1289
65	deepseek-v3.2-low	91.7 Excellent	$0.0088	26.8s	Best overall	1200
66	gpt-5-mini	89.8 Strong	$0.0149	24.0s	Best overall	—
67	gpt-5.4-low	85.0 Strong	$0.0162	12.1s	Best overall	1220
68	claude-sonnet-4.5-high	83.8 Strong	$0.0242	25.0s	Strong drafts	1203
69	claude-haiku-4.5	83.2 Strong	$0.0139	13.2s	Strong drafts	1168

“Arena · dataviz” is a third-party benchmark shown for context and is independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).

What separates the top models

Metric Calculation

medium

Tests correct denominators, percentage-vs-points, weighted averages, and base-rate reasoning — each with a single right answer.
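To make two of these traps concrete, here is a minimal sketch. The 10% to 12% figures come from the on-time pickup example quoted further down this page; the group counts used for the weighted average are invented for illustration, and the function names are hypothetical.

```python
def point_and_relative_change(old_rate, new_rate):
    """Return (change in percentage points, relative change in percent)."""
    points = (new_rate - old_rate) * 100
    relative = (new_rate - old_rate) / old_rate * 100
    return points, relative

def pooled_rate(groups):
    """Weighted rate: add up the counts first, then divide once."""
    successes = sum(s for s, _ in groups)
    total = sum(t for _, t in groups)
    return successes / total

def average_of_rates(groups):
    """The trap: averaging per-group rates ignores group size."""
    return sum(s / t for s, t in groups) / len(groups)

# A rate moving from 10% to 12% is +2 points but +20% relative.
points, relative = point_and_relative_change(0.10, 0.12)
print(round(points, 2), round(relative, 2))

# Hypothetical groups as (successes, total): one large, one small.
groups = [(90, 100), (5, 10)]
print(round(pooled_rate(groups), 3))       # counts pooled across groups
print(round(average_of_rates(groups), 3))  # small group gets equal weight
```

The two averages disagree because the small group counts as much as the large one in the second calculation, which is the same mistake as an average-of-averages in SQL.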

Leader: claude-opus-4.8-medium

Spot the Misleading Stat

hard

Tests detection of Simpson's paradox, survivorship bias, peeking/p-hacking, and small-sample over-claiming.
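For readers who have not met Simpson's paradox, the sketch below shows the reversal with made-up counts. These are not the numbers from any test case on this page; the pool names and shift types are purely illustrative.

```python
# Hypothetical (filled, offered) counts per pool and shift type.
data = {
    "A": {"easy": (18, 20), "hard": (56, 80)},
    "B": {"easy": (68, 80), "hard": (13, 20)},
}

def stratum_rate(pool, stratum):
    filled, offered = data[pool][stratum]
    return filled / offered

def overall_rate(pool):
    filled = sum(f for f, _ in data[pool].values())
    offered = sum(o for _, o in data[pool].values())
    return filled / offered

for stratum in ("easy", "hard"):
    print(stratum, stratum_rate("A", stratum), stratum_rate("B", stratum))

print("overall", overall_rate("A"), overall_rate("B"))
```

Pool A wins within each shift type, yet Pool B has the higher overall rate, because most of Pool B's shifts are the easy kind. A correct answer has to compare like with like before drawing a conclusion.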

Leader: qwen3.7-max-max

SQL Reasoning

hard

Tests SQL correctness traps: JOIN fan-out double counting, NOT IN with NULL, inclusive date boundaries, and average-of-averages.
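Two of these traps are easy to reproduce with Python's built-in `sqlite3` module. The schema and rows below are hypothetical and are not taken from the benchmark's test cases; the sketch only shows how the wrong and the safer query diverge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE shipments (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
-- Order 1 ships in two parts; shipment 3 has no order recorded.
INSERT INTO shipments VALUES (1, 1), (2, 1), (3, NULL);
""")

# JOIN fan-out: order 1 matches two shipment rows, so its amount is summed twice.
fanout_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o JOIN shipments s ON s.order_id = o.id"
).fetchone()[0]

# Summing the orders table directly avoids the duplication.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# NOT IN with NULL: the subquery returns a NULL, so the predicate is never
# true and the unshipped order silently disappears from the result.
not_in_rows = conn.execute(
    "SELECT id FROM orders WHERE id NOT IN (SELECT order_id FROM shipments)"
).fetchall()

# NOT EXISTS is unaffected by the NULL and finds the unshipped order.
not_exists_rows = conn.execute(
    "SELECT id FROM orders o WHERE NOT EXISTS "
    "(SELECT 1 FROM shipments s WHERE s.order_id = o.id)"
).fetchall()

print(fanout_total, true_total, not_in_rows, not_exists_rows)
conn.close()
```

Both wrong queries run without error, which is what makes them traps: the only sign of trouble is a number that looks plausible.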

Leader: qwen3.7-max-medium

Honest Communication

medium

Tests communicating data without false precision, fabricated confidence, or correlation-as-causation, and distinguishing statistical from practical significance.
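As one illustration of avoiding false precision, the sketch below reports a rate together with a rough margin of error and rounds both to whole numbers. It assumes a simple normal-approximation interval, which is only a rough guide for small samples or rates near 0% or 100%; the helper names and the sample of 12 out of 100 are hypothetical.

```python
import math

def proportion_with_margin(successes, n, z=1.96):
    """Return (rate, margin) using the normal approximation at ~95% confidence."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin

def report(successes, n):
    """Format the rate without decimals the sample size cannot support."""
    p, margin = proportion_with_margin(successes, n)
    return f"{p * 100:.0f}% +/- {margin * 100:.0f} points (n={n})"

print(report(12, 100))
```

Quoting "12.00%" from a sample of 100 would suggest a precision the data does not have; stating the sample size and an approximate margin lets the reader judge how much weight the figure can bear.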

Leader: qwen3.7-max-max

See the difference

Real model outputs and how our judge scored them.

Top score · qwen3.7-max-max · Correlation vs causation (Northwind)

100/100

“The model perfectly answers the prompt. It correctly identifies the logical fallacy of treating correlation as causation, provides highly relevant business examples of confounding variables, and clearly outlines the analytical methods required to actually prove a causal relationship.”

## No, the conclusion is not sound.

The report commits a classic logical error: **confusing correlation with causation**. Here's a breakdown of why.

---

### What the Data Actually Shows (Correlation)
There is an **observed association** between two variables:
- **Variable A:** Logging into the dashboard daily
- **Variable B:** 18% less detention

This simply means the two tend to occur together. It does **not** prove that A causes B.

---

### What the Report Claims (Causation)
The phrase *"daily dashboard use **cuts** detention by 18%"* asserts a direct causal mechanism — that the dashboard itself is responsible for the reduction, and that making *any* broker use it daily would produce the same 18% drop.

---

### Why the Leap Is Flawed

1. **Confounding Variables (Third Factors)**
A hidden variable could be driving *both* behaviors. For example:
- **Conscientiousness / Work Ethic:** Highly diligent brokers are more likely to log in daily *and* more likely to proactively manage shipments to avoid detention. Their discipline is the real cause.
- **Experience:** Veteran brokers may both appreciate the dashboard's value *and* know how to avoid detention through skill.
- **Training:** Better-trained brokers might use all available tools *and* manage detention more effectively.

2. **Reverse Causation**
Brokers drowning in detention fires may simply have **no time** to log into a dashboard. In this case, high detention causes low dashboard use — not the other way around.

3. **Self-Selection Bias**
Brokers who *choose* to log in daily are systematically different from those who don't. The comparison is not between equivalent groups.

---

### What Would Be Needed to Support the Causal Claim?

| Method | Description |
|---|---|
| **Randomized Controlled

Mid-pack · qwen3.5-plus-02-15-max · Percentage points vs percent (Northwind)

100/100

“The model correctly calculates both the absolute change in percentage points (+2) and the relative percentage change (+20%), distinguishing between the two perfectly. It also provides excellent context on why this dual framing is important for an executive audience.”

Quarter-over-quarter, the on-time pickup rate increased by **2 percentage points**, rising from **10% to 12%**. This represents a **20% relative improvement** in performance.

***

**Why this phrasing works for executives:**
*   **Distinguishes Absolute vs. Relative:** It clarifies the absolute gain (2 percentage points) to avoid confusion with a 2% relative increase (which would be 10.2%).
*   **Highlights Momentum:** It captures the 20% relative growth, which signals significant momentum, while keeping the low absolute base (12%) visible to maintain context on overall performance health.
*   **Neutral & Precise:** It uses standard financial/operational terminology ("percentage points") to ensure mathematical accuracy without overstating the success.

Lowest score · minimax-m2.7-max · Simpson's paradox in fill rates (Tradewinds)

11/100

“The model severely failed by fabricating arithmetic and misunderstanding Simpson's paradox, leading to flawed business logic.”

**Short answer:**  
- **On every “type‑of‑shift” level, Pool A fills a higher proportion of both the easy day‑shifts (93 % vs 87 %) and the hard late‑night shifts (73 % vs 69 %).**  
- **Overall, however, Pool B looks better (82.6 % vs 78 %) because it has many more easy shifts (270 vs 87).**  

This reversal is a textbook example of **Simpson’s paradox**: the overall numbers are driven by the *mix* of shifts, not by the underlying ability of the pools.  

**What should Tradewinds conclude?**  
1. **Do not rely on the overall fill‑rate alone** when the composition of the shifts differs between pools.  
2. **Use a stratified (or weighted) view** of performance.  
3. **If the company cares most about filling the hard‑to‑staff late‑night shifts, Pool A is the preferable pool**.  
4. **If the schedule will continue to contain many more easy day‑shifts, the higher overall fill‑rate of Pool B may translate into more total filled shifts**, but that advantage is entirely due to the shift mix, not superior performance.  

Below is a step‑by‑step walk‑through that supports these conclusions.

---

## 1.  Raw numbers at a glance

|                     | **Pool A** | **Pool B** |
|---------------------|-----------|-----------|
| **Overall filled**  | 273/350  (78.0 %) | 289/350  (82.6 %) |
| **Easy day shifts**| 81/87    (93.0 %) | 234/270  (86.7 %) |
| **Hard night shifts**| 192/263 (73.0 %) | 55/80    (68.8 %) |

*Note:* “Easy” = day‑shift; “Hard” = late‑night shift.

---

## 2.  The paradox in plain English  

- **Pool A beats Pool B on easy shifts (93 % > 87 %).**  
- **Pool A also beats Pool B on hard shifts (73 % > 69 %).**  
- **Yet Pool B has the higher overall rate.**  

Why?  
- Pool A works **263 hard** shifts but only **87 easy** ones.  
- Pool B works **80 hard** shift

Where models still fail

The most common problems we flagged across all models.

16 correlation as causation · 16 fabricated statistic · 13 wrong answer · 11 false precision · 5 join fanout · 3 wrong denominator · 2 not in null · 1 missing margin of error

Frequently asked

What is the best AI model for data & analytics?

In our benchmarks, claude-opus-4.8-low ranks first for data & analytics, scoring 100.0 (excellent) across 16 test cases.

What is the cheapest good model for data & analytics?

deepseek-v3.2-low is the best value: it clears our quality bar for data & analytics at $0.0088 per run, or about $8.81 per 1,000 runs.

Which model is fastest for data & analytics?

gemini-3.1-flash-lite is the fastest model that still performs well for data & analytics.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 1000 model runs across 4 benchmarks · last tested 2026-06-30

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results · 12 test cases · 3 evals

Claude Opus	GPT-5	Gemini
7.1	6.8	7.4
8.3	7.9	8.0
9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus (9.2 quality · $0.004/run · 1.8s)