Is claude-haiku-4.5-medium good at Data & Analytics?
claude-haiku-4.5-medium ranks #25 of 69 for Data & Analytics — excellent. The top pick for this task is claude-opus-4.8-low.
claude-haiku-4.5-medium on each Data & Analytics sub-task
| Sub-task | Score | Rank |
| --- | --- | --- |
| Spot the Misleading Stat | 98.5/100 | #45 |
| Metric Calculation | 98.5/100 | #20 |
| SQL Reasoning | 95.5/100 | #60 |
| Honest Communication | 95.5/100 | #26 |
Real examples, graded
Weak: Small-sample over-claim (Ferrovia), 36/100
“The model completely misses the primary statistical flaw (small-sample over-claiming). Instead of recognizing that n=8 is too small to trust the 50% difference (which could easily be noise or skewed by a single outlier), the model treats the 50% premium as a reliable fact. It then pivots to a completely different argument about unit economics versus total revenue, fabricating a hypothetical table to make its point. While the business questions it raises (CAC, LTV) are generally good, it fails the core analytical task of spotting the misleading statistic.”
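The flaw the grader describes can be illustrated with a minimal sketch. The figures below are hypothetical and are not the Ferrovia data; the sketch also assumes n=8 means eight observations per group, which the note does not state. It shows how one outlier in a group of eight can produce the entire 50% premium, and how far the estimate swings when any single observation is dropped.

```python
from statistics import mean

# Hypothetical values, chosen only to illustrate the point.
baseline = [100] * 8                                  # comparison group, n=8
premium_group = [100, 100, 100, 100, 100, 100, 100, 500]  # one outlier, n=8


def premium_pct(group, reference):
    """Percentage by which the mean of `group` exceeds the mean of `reference`."""
    return (mean(group) / mean(reference) - 1) * 100


# Headline figure: looks like a solid 50% premium.
headline = premium_pct(premium_group, baseline)

# Leave-one-out check: recompute the premium with each observation removed.
leave_one_out = [
    premium_pct(premium_group[:i] + premium_group[i + 1:], baseline)
    for i in range(len(premium_group))
]

print(f"headline premium: {headline:.1f}%")
print(f"leave-one-out range: {min(leave_one_out):.1f}% to {max(leave_one_out):.1f}%")
```

When dropping a single row moves the estimate from roughly 57% to 0%, the headline number is not reliable enough to report as a fact.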
Weak: Inclusive date boundary, 55/100
“The model correctly identifies the intraday boundary problem and timestamp coercion. However, its first proposed solution (Option 1) uses BETWEEN with '2024-02-01', which is inclusive and will incorrectly include events occurring exactly at midnight on February 1st. The model falsely claims this captures up to Jan 31 23:59:59. While Option 2 is correct, recommending Option 1 as a preferred solution introduces a new date boundary error.”
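The boundary error in this note can be sketched in a few lines. This is an illustration in plain Python rather than the SQL the model actually wrote: it assumes the window starts on 2024-01-01 (the note only gives the upper bound) and that a bare date such as '2024-02-01' is coerced to midnight when compared with a timestamp, as the note describes.

```python
from datetime import datetime

# Assumed window: all of January 2024. The start date is an assumption.
start = datetime(2024, 1, 1)
end = datetime(2024, 2, 1)  # a bare date coerced to a timestamp is midnight

# Hypothetical event timestamps around the upper boundary.
events = [
    datetime(2024, 1, 31, 23, 59, 59),
    datetime(2024, 2, 1, 0, 0, 0),   # exactly midnight on February 1st
    datetime(2024, 2, 1, 0, 0, 1),
]

# BETWEEN is inclusive on both ends: start <= ts <= end.
inclusive = [ts for ts in events if start <= ts <= end]

# Half-open interval: start <= ts < end.
half_open = [ts for ts in events if start <= ts < end]

print("inclusive:", [ts.isoformat() for ts in inclusive])
print("half-open:", [ts.isoformat() for ts in half_open])
```

The inclusive filter keeps the midnight event from February 1st, while the half-open filter stops at the last January timestamp.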
Frequently asked
Is claude-haiku-4.5-medium good at Data & Analytics?
claude-haiku-4.5-medium ranks #25 of the 69 models we tested for Data & Analytics, a result we rate as excellent.
What is claude-haiku-4.5-medium's strongest Data & Analytics skill?
Its highest-scoring sub-task here is Spot the Misleading Stat (98.5/100), tied on score with Metric Calculation.
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.