Business · 14 tasks · 44 models

Smartest AI models for Sales

Name: Sales AI model benchmark
Creator: Spring Prompt

Which models write outbound, follow-ups, discovery, and objection responses that a real buyer would respond to?

Top models

claude-opus-4.6-low (Anthropic)

claude-opus-4.8-low (Anthropic)

claude-opus-4.5-low (Anthropic)

The highest-quality model for Sales is claude-opus-4.6-low, with a score of 84.6 (Strong).

Best overall ★ Strong

claude-opus-4.6-low

Top score — strong

84.6 score · $0.0506/run · 44.0s

Best value Usable

gemini-3.1-flash-lite

Clears the quality bar at $0.023/run

73.8 score · $0.0233/run · 17.7s
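One way to read "clears the quality bar" is to take the cheapest model whose score reaches the Usable tier. The sketch below assumes a cut-off of 70 for Usable, which is inferred from the full ranking on this page (69.9 is labelled Needs editing, 71.2 is labelled Usable) and is not a published threshold. The function name `best_value` and the handful of rows are illustrative only.

```python
# Assumed cut-off for the "Usable" tier, inferred from the ranking table.
USABLE_MIN = 70.0

# (model, score, cost per run in USD): a few rows copied from the full ranking.
value_rows = [
    ("deepseek-v3.2-low", 66.9, 0.0198),
    ("glm-5", 69.9, 0.0232),
    ("gemini-3.1-flash-lite", 73.8, 0.0233),
    ("kimi-k2.5", 81.9, 0.0238),
    ("deepseek-v3.2", 71.2, 0.0240),
    ("claude-opus-4.6-low", 84.6, 0.0506),
]


def best_value(rows, quality_bar=USABLE_MIN):
    """Return the cheapest row whose score clears the quality bar, or None."""
    eligible = [row for row in rows if row[1] >= quality_bar]
    if not eligible:
        return None
    # Lowest cost wins; on a cost tie, prefer the higher score.
    return min(eligible, key=lambda row: (row[2], -row[1]))


print(best_value(value_rows))  # ('gemini-3.1-flash-lite', 73.8, 0.0233)
```

Under that assumed cut-off, the two cheaper models (deepseek-v3.2-low and glm-5) are filtered out because they fall below the bar, which matches the pick shown in the card above.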

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
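"High and to the left" can be made concrete as a Pareto frontier: a model is on it when no other model is both at least as cheap and at least as good. The sketch below is a minimal illustration over a subset of rows from the full ranking. The function name `pareto_frontier` is hypothetical, and this is not necessarily how the chart on this page is drawn.

```python
# (model, score, cost per run in USD): a subset of the full ranking.
chart_rows = [
    ("claude-opus-4.6-low", 84.6, 0.0506),
    ("claude-opus-4.8-low", 83.8, 0.0399),
    ("claude-sonnet-4.5-low", 81.7, 0.0340),
    ("kimi-k2.5", 81.9, 0.0238),
    ("gemini-3.1-flash-lite", 73.8, 0.0233),
    ("glm-5", 69.9, 0.0232),
    ("gpt-5.5-high", 79.3, 0.0582),
    ("deepseek-v3.2-low", 66.9, 0.0198),
]


def pareto_frontier(rows):
    """Return models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Sweep from cheapest to most expensive; at equal cost, higher score first.
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


print(pareto_frontier(chart_rows))
# ['deepseek-v3.2-low', 'glm-5', 'gemini-3.1-flash-lite', 'kimi-k2.5',
#  'claude-opus-4.8-low', 'claude-opus-4.6-low']
```

In this subset, claude-sonnet-4.5-low and gpt-5.5-high drop out because a cheaper model scores higher than each of them.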

Full ranking

#	Model	Score (tier)	Cost/run	Speed	Best for
1	claude-opus-4.6-low	84.6 Strong	$0.0506	44.0s	Strong drafts
2	claude-opus-4.8-low	83.8 Strong	$0.0399	25.1s	Strong drafts
3	claude-opus-4.5-low	83.7 Strong	$0.0553	40.0s	Strong drafts
4	claude-opus-4.5-high	83.1 Strong	$0.0579	41.8s	Strong drafts
5	claude-sonnet-4.6-low	82.9 Strong	$0.0438	43.0s	Strong drafts
6	claude-sonnet-4.6-high	82.6 Strong	$0.0471	45.1s	Strong drafts
7	claude-sonnet-4.5-high	82.4 Strong	$0.0481	50.1s	Strong drafts
8	kimi-k2.5	81.9 Strong	$0.0238	65.3s	Strong drafts
9	claude-opus-4.8-high	81.9 Strong	$0.0417	27.2s	Strong drafts
10	claude-sonnet-4.5-low	81.7 Strong	$0.0340	36.9s	Strong drafts
11	claude-opus-4.5	81.5 Strong	$0.0401	32.5s	Strong drafts
12	qwen3.7-max-low	81.3 Strong	$0.0333	62.7s	Strong drafts
13	claude-opus-4.6-high	81.3 Strong	$0.0570	49.1s	Strong drafts
14	qwen3.5-plus-02-15	81.0 Strong	$0.0248	71.7s	Strong drafts
15	claude-sonnet-4.5	80.4 Strong	$0.0315	32.8s	Strong drafts
16	claude-opus-4.6	80.3 Strong	$0.0457	40.7s	Strong drafts
17	gemini-3.1-pro-preview-high	80.1 Strong	$0.0411	33.7s	Strong drafts
18	gemini-3.5-flash-high	80.1 Strong	$0.0396	28.4s	Strong drafts
19	gemini-3-flash-preview	79.9 Usable	$0.0273	25.6s	Strong drafts
20	qwen3.7-max-high	79.7 Usable	$0.0333	65.6s	Strong drafts
21	gemini-3.1-pro-preview	79.4 Usable	$0.0463	34.4s	Strong drafts
22	gpt-5.5-high	79.3 Usable	$0.0582	35.8s	Strong drafts
23	kimi-k2.7-code	78.7 Usable	$0.0303	48.1s	Strong drafts
24	gemini-3.1-pro-preview-low	78.6 Usable	$0.0393	31.5s	Strong drafts
25	gpt-5.4	78.2 Usable	$0.0342	24.8s	Strong drafts
26	qwen3.7-max	77.9 Usable	$0.0335	65.0s	Strong drafts
27	gpt-5.4-high	77.4 Usable	$0.0483	33.0s	Strong drafts
28	gemini-3.5-flash-low	77.1 Usable	$0.0322	24.1s	Strong drafts
29	grok-4.20-beta	76.3 Usable	$0.0272	21.3s	Strong drafts
30	gpt-5.5	75.1 Usable	$0.0479	33.5s	Strong drafts
31	gpt-5.4-low	75.0 Usable	$0.0302	21.7s	Strong drafts
32	claude-haiku-4.5	74.0 Usable	$0.0247	21.7s	Needs review
33	gemini-3.1-flash-lite	73.8 Usable	$0.0233	17.7s	Needs review
34	gpt-5.4-mini	72.4 Usable	$0.0269	20.8s	Needs review
35	grok-4.20	72.2 Usable	$0.0270	21.2s	Needs review
36	gpt-5.5-low	71.3 Usable	$0.0395	24.5s	Needs review
37	deepseek-v3.2	71.2 Usable	$0.0240	26.1s	Needs review
38	glm-5	69.9 Needs editing	$0.0232	57.7s	Needs review
39	deepseek-v3.2-high	69.9 Needs editing	$0.0240	26.6s	Needs review
40	deepseek-v3.1-terminus	68.4 Needs editing	$0.0241	35.5s	Needs review
41	deepseek-v3.2-low	66.9 Needs editing	$0.0198	21.6s	Needs review
42	mistral-medium-3.1	65.9 Needs editing	$0.0281	27.6s	Needs review
43	gpt-5-mini	64.3 Needs editing	$0.0288	32.6s	Needs review
44	minimax-m2.7	59.6 Weak	$0.0286	73.2s	Needs review

What separates the top models

Cold Outbound Email

medium

Tests whether the model writes a first-touch cold email that leads with the prospect's problem, stays short, and makes one low-friction ask.

Leader: kimi-k2.5

Follow-up & Break-up

medium

Tests multi-touch follow-ups that add new value instead of 'just checking in', and a genuine pressure-free break-up email.

Leader: claude-opus-4.5-high

Discovery Call Prep

hard

Tests preparation of a discovery call: open problem-centric questions, MEDDIC-style qualification, and a value hypothesis.

Leader: claude-opus-4.6-low

Objection Handling

hard

Tests honest, non-manipulative responses to real objections — probing the concern instead of countering, with no false urgency or reflexive discounting.

Leader: grok-4.20-beta

See the difference

Real model outputs and how our judge scored them.

Top score · claude-haiku-4.5 · Outcome-sceptic clinician

95/100

“The response perfectly executes the instructions by validating the skepticism, explicitly disclaiming clinical outcomes, and effectively reframing the value around verified operational and reimbursement benefits.”

You're right to be sceptical. We don't claim Lumen reduces readmissions or improves outcomes—we can't, because that's not what we've proven, and honestly, vendors who make those promises without evidence aren't doing anyone a favour.

What we *do* is handle the mechanics of remote monitoring in a way that's actually usable. Your care coordinators see alerts on a single dashboard instead of juggling vendor portals and phone calls. The ECG and BP readings stream in automatically. And we manage the entire reimbursement workflow—the documentation, the coding, the billing—so you're not leaving money on the table or drowning in compliance work.

The real value is operational: you can staff remote monitoring efficiently, you get paid for it properly, and your clinicians spend time on patient decisions instead than on administrative friction.

Whether *that* translates to better outcomes depends on how you use it—and that's a clinical question you'd need to answer with your own data. We're a tool that makes it feasible to monitor more patients consistently. What you do with that data is up to you.

We work with 90 cardiology clinics. If you want to talk to one about whether the workflow actually works in practice, we can arrange that. But we'll never tell you it's a readmission silver bullet.

Mid-pack · grok-4.20 · Insight-led touch to a finance buyer

81/100

“The model followed all instructions perfectly, including the negative constraints and the strict word count limit. The insight provided is well-articulated and relevant. The score is kept from being perfect only due to a slightly abrupt transition into the product pitch and a CTA that could be even lower friction for a C-level executive.”

**Subject:** Duplicate-invoice risk after acquisitions

Dear [First Name],

When companies consolidate entities post-acquisition, duplicate invoices often rise sharply. Different ERP instances, retained legacy vendors, and currency variations make it easy for the same invoice to be paid twice before the finance team can catch it.

Ferrovia connects read-only to your ERP and AP systems, flags duplicates and anomalies across subsidiaries and currencies, and provides an audit-ready trail for every flag. 160 enterprises currently use us for this.

Would you be open to a 15-minute screen share next week?

Best regards,  
[Your Name]

Lowest score · glm-5 · Restaurant GM, shift gaps

0/100

“The model completely failed to generate the requested email, outputting only a single digit.”

Where models still fail

The most common problems we flagged across all models.

4 empty response · 4 constraint failure · 2 unprompted meta commentary · 2 major task miss · 2 disguised pitch · 2 high friction cta · 2 overlong · 2 constraint violation

Frequently asked

What is the best AI model for sales?

In our benchmarks, claude-opus-4.6-low ranks first for sales, with a score of 84.6 (Strong) across 14 test cases.

What is the cheapest good model for sales?

gemini-3.1-flash-lite is the best value: it clears our quality bar for sales at $0.023 per run.

Which model is fastest for sales?

gemini-3.1-flash-lite is the fastest model in the ranking at 17.7s per run, and it still clears our quality bar for sales (73.8, Usable).

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
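The pipeline described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the judge reply schema (`score` on a 0-10 scale plus `rationale`), the two heuristics (empty output and word limit, chosen because "empty response" and "overlong" appear among the flagged problems), the penalty sizes, and all function names. It is not Spring Prompt's actual implementation.

```python
import json


def parse_judge_reply(raw, judge_max=10.0):
    """Strictly parse the judge reply; reject anything outside the assumed schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict) or set(data) != {"score", "rationale"}:
        raise ValueError("expected an object with exactly 'score' and 'rationale'")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    if not 0 <= score <= judge_max:
        raise ValueError("'score' is outside the judge scale")
    return float(score)


def heuristic_penalty(output, max_words):
    """Deterministic checks; returns a penalty on the 0-100 scale (assumed sizes)."""
    if not output.strip():
        return 100.0  # empty response: nothing to score
    penalty = 0.0
    if len(output.split()) > max_words:
        penalty += 10.0  # overlong for the task's word limit
    return penalty


def final_score(raw_judge_reply, output, max_words=120, judge_max=10.0):
    """Normalize the judge score to 0-100, subtract penalties, and clamp."""
    judge = parse_judge_reply(raw_judge_reply, judge_max)
    normalized = 100.0 * judge / judge_max
    return max(0.0, min(100.0, normalized - heuristic_penalty(output, max_words)))


reply = '{"score": 9.5, "rationale": "Follows every constraint."}'
print(final_score(reply, "A short, problem-led email body."))  # 95.0
print(final_score(reply, ""))  # 0.0
```

Keeping the heuristics deterministic means an empty or overlong output is penalized the same way on every run, whatever the judge returns.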

Judge: gemini-3.1-pro-preview · 700 model runs across 4 benchmarks · last tested 2026-06-30

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s
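Picking the best combination from a prompt × model matrix like the one in this example amounts to taking the highest-scoring cell. In the sketch below, only the label "v3" comes from the page; "v1" and "v2" are assumed names for the other two prompt versions, and `best_combo` is a hypothetical helper.

```python
# Quality scores per prompt version and model, taken from the example above.
# "v1" and "v2" are assumed labels; only "v3" is named on the page.
matrix = {
    "v1": {"Claude Opus": 7.1, "GPT-5": 6.8, "Gemini": 7.4},
    "v2": {"Claude Opus": 8.3, "GPT-5": 7.9, "Gemini": 8.0},
    "v3": {"Claude Opus": 9.2, "GPT-5": 8.6, "Gemini": 8.4},
}


def best_combo(scores):
    """Return (prompt, model, score) for the highest-scoring cell."""
    cells = (
        (prompt, model, score)
        for prompt, row in scores.items()
        for model, score in row.items()
    )
    return max(cells, key=lambda cell: cell[2])


print(best_combo(matrix))  # ('v3', 'Claude Opus', 9.2)
```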