Anthropic · 9 models tested

Which Anthropic model should you use?

Q: What is the best Anthropic model?

Claude Opus 4.8 — the highest average percentile (87) across the task areas we benchmarked, with 7 top-3 finishes.

Q: What is the best cheap Anthropic model?

Claude Sonnet 4.6: about 11 percentile points behind Claude Opus 4.8 on the rounded overall scores (76 vs. 87), at $18 per 1M tokens instead of $30.

Q: Where does Anthropic lead every other lab?

An Anthropic model holds the outright #1 spot in 8 task areas: AI Strategy, Creative & Comedy, Executive Assistant, Investor & Pitch, Legal & HR, Sales, Summarization & Meeting Notes, and Training & Education.

Anthropic's strongest model is Claude Opus 4.8 (average percentile 87, top-3 in 7 of 22 task areas). On a budget, Claude Sonnet 4.6 stays close (average percentile 76) at $18 per 1M tokens.

The Anthropic lineup, ranked

Model	Tag	Task areas	Overall	Code & data	Writing & comms	Business & strategy	Creative & visual	AA Intel†	Top-3s	Price / 1M tokens
Claude Opus 4.8	Best overall	22	87	84★	87★	90★	82	56	7	$30.0
Claude Opus 4.7	—	6	84	—	90	91	69	54	1	$30.0
Claude Opus 4.5	—	22	79	64	81	78	92★	—	3	$30.0
Claude Sonnet 4.6	Best value	22	76	57	72	85	83	47	4	$18.0
Claude Opus 4.6	—	22	73	59	71	82	73	—	2	$30.0
Claude Fable 5	—	22	68	40	73	73	81	60	6	$60.0
Claude Sonnet 4.5	—	22	61	55	75	53	62	36	1	$18.0
Claude Sonnet 5	—	22	60	64	56	63	56	53	1	$12.0
Claude Haiku 4.5	—	22	39	29	44	42	34	30	0	$6.0

Area figures are mean percentile ranks, grouped into five bands: 85+, 70–84, 55–69, 40–54, and <40. ★ = the lineup's best in that area. †AA Intel = Artificial Analysis Intelligence Index, third-party reference only.
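
The page does not say how a rank is turned into a percentile, so the following is only a minimal sketch under one common convention: first place maps to 100, last place to 0, and an area figure is the plain mean across tasks. The function names `percentile_rank` and `mean_percentile` and the sample ranks are hypothetical, not taken from the benchmark itself.

```python
def percentile_rank(rank: int, field_size: int) -> float:
    """Map a 1-based rank to a 0-100 percentile (assumed convention)."""
    if field_size < 2:
        raise ValueError("need at least two models to rank")
    if not 1 <= rank <= field_size:
        raise ValueError("rank must be between 1 and field_size")
    # Rank 1 -> 100, last place -> 0, linear in between.
    return 100.0 * (field_size - rank) / (field_size - 1)


def mean_percentile(results: list[tuple[int, int]]) -> float:
    """Average the percentile over (rank, field_size) pairs, one per task."""
    if not results:
        raise ValueError("no results to average")
    return sum(percentile_rank(r, n) for r, n in results) / len(results)


# Hypothetical ranks for one model across three tasks in a single area.
example = [(1, 107), (2, 107), (3, 115)]
area_score = mean_percentile(example)
print(f"area score: {area_score:.1f}")
```

Under this convention a model only scores in the 85+ band if it sits near the top of the field on most tasks, which is why a single weak task can pull an area figure down noticeably.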

Best Anthropic model, task by task

The lineup's strongest model in each task area — and where it lands against every model we test, not just Anthropic's. #1 means it beats every rival lab too.

Task area	Their best here	Rank vs. everyone	Rating
AI Strategy	Claude Sonnet 5	#1 / 126	Strong
Creative & Comedy	Claude Fable 5	#1 / 107	Elo
Executive Assistant	Claude Sonnet 4.6	#1 / 112	Strong
Investor & Pitch	Claude Fable 5	#1 / 63	Strong
Legal & HR	Claude Fable 5	#1 / 107	Excellent
Sales	Claude Opus 4.8	#1 / 107	Strong
Summarization & Meeting Notes	Claude Opus 4.5	#1 / 107	Excellent
Training & Education	Claude Opus 4.6	#1 / 107	Excellent
Data & Analytics	Claude Opus 4.8	#2 / 110	Excellent
Frontend & Landing Pages	Claude Opus 4.5	#2 / 106	Usable
Knowledge & Docs	Claude Opus 4.8	#2 / 107	Excellent
Product & Project Management	Claude Opus 4.8	#2 / 107	Excellent
Translation & Localization	Claude Opus 4.6	#2 / 107	Excellent
Chef / Home Cooking	Claude Fable 5	#3 / 126	Strong
Coding	Claude Opus 4.8	#3 / 115	Excellent
Research & Competitive Analysis	Claude Fable 5	#3 / 107	Excellent
Customer Support	Claude Opus 4.8	#7 / 113	Strong
Content & Brand	Claude Fable 5	#8 / 124	Excellent
Landing Pages	Claude Sonnet 4.5	#9 / 69	Usable
RAG, Safety & Grounding	Claude Opus 4.8	#11 / 110	Excellent
Presentations & Decks	Claude Fable 5	#18 / 107	Excellent
Structured Output	Claude Opus 4.8	#56 / 110	Excellent

Where another lab clearly wins

Honesty corner: the task areas where even Anthropic's best model ranks furthest from the top. See who actually leads there.

Structured Output: their best ranks #56
Presentations & Decks: their best ranks #18
RAG, Safety & Grounding: their best ranks #11

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results (12 test cases · 3 evals)

Prompt version	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2★	8.6	8.4

Best combo: v3 × Claude Opus (9.2 quality · $0.004/run · 1.8s)
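
Picking the best combo from a prompt × model matrix amounts to taking the highest-scoring cell. Here is a minimal sketch using the nine scores from the experiment above, read row by row under the three model columns. The row labels v1 and v2 are assumptions (only v3 is named in the source), and `best_combo` is a hypothetical helper, not Spring Prompt's API.

```python
# Scores from the "Cold outreach email" experiment.
# Row labels "v1" and "v2" are assumed; the source only names "v3".
scores = {
    "v1": {"Claude Opus": 7.1, "GPT-5": 6.8, "Gemini": 7.4},
    "v2": {"Claude Opus": 8.3, "GPT-5": 7.9, "Gemini": 8.0},
    "v3": {"Claude Opus": 9.2, "GPT-5": 8.6, "Gemini": 8.4},
}


def best_combo(matrix: dict[str, dict[str, float]]) -> tuple[str, str, float]:
    """Return (prompt_version, model, score) for the highest-scoring cell."""
    cells = (
        (version, model, score)
        for version, row in matrix.items()
        for model, score in row.items()
    )
    return max(cells, key=lambda cell: cell[2])


version, model, score = best_combo(scores)
print(f"Best combo: {version} x {model} ({score} quality)")
```

This ranks on quality alone. A real comparison would likely also weigh cost and latency, which the experiment reports only for the winning cell.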