Business · 26 tasks · 43 models

Best AI models for Customer Support

Name: Customer Support AI model benchmark
Creator: Spring Prompt

Which models resolve customer issues with empathy without inventing policy, over-promising, or fabricating account facts?

Top models

Google · gemini-3.1-pro-preview-low

Qwen · qwen3.7-max-high

OpenAI · gpt-5.4-low

gemini-3.1-pro-preview-low leads Customer Support with a score of 89.8 (strong). For tighter budgets, gpt-5.4-mini is competitive at 83.3 for about 60% of the cost ($0.0154 vs. $0.0257 per run).

Best overall · Strong

gemini-3.1-pro-preview-low

Top score: strong

89.8 score · $0.0257/run · 20.7s

Best value · Strong

gpt-5.4-mini

Clears the quality bar at $0.015/run

83.3 score · $0.0154/run · 13.5s

Fastest usable · Usable

grok-4.20

~12s per run, still usable

79.9 score · $0.0178/run · 12.4s
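The three picks above can be reproduced from the ranking data with a simple selection rule. The sketch below is an assumption about how such picks could be derived, not Spring Prompt's actual logic: it treats "best value" as the cheapest model rated Strong and "fastest usable" as the quickest model rated Usable or better. The rows are a small subset copied from the full ranking; `Row` and `pick_highlights` are hypothetical names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    rating: str     # rating label as shown in the ranking table
    cost: float     # USD per run
    seconds: float  # latency per run


# Subset of rows copied from the full ranking table.
ROWS = [
    Row("gemini-3.1-pro-preview-low", 89.8, "Strong", 0.0257, 20.7),
    Row("gpt-5.4-low", 87.1, "Strong", 0.0187, 13.8),
    Row("glm-5", 84.2, "Strong", 0.0160, 34.1),
    Row("gpt-5.4-mini", 83.3, "Strong", 0.0154, 13.5),
    Row("grok-4.20", 79.9, "Usable", 0.0178, 12.4),
    Row("gemini-3.1-flash-lite", 79.9, "Usable", 0.0175, 13.4),
    Row("gpt-5-mini", 68.4, "Needs editing", 0.0189, 24.4),
]


def pick_highlights(rows):
    """Assumed rule: filter by rating label, then optimize one metric."""
    strong = [r for r in rows if r.rating == "Strong"]
    usable = [r for r in rows if r.rating in ("Strong", "Usable")]
    return {
        "best_overall": max(rows, key=lambda r: r.score).model,
        "best_value": min(strong, key=lambda r: r.cost).model,
        "fastest_usable": min(usable, key=lambda r: r.seconds).model,
    }


highlights = pick_highlights(ROWS)
```

Filtering on the rating label rather than a numeric cutoff avoids guessing where the Strong and Usable thresholds sit, since the page does not state them.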

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
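One way to read "high and to the left" is as a Pareto frontier: a model is on the frontier if no other model is both cheaper and higher scoring. The following is a minimal sketch of that idea using a handful of rows copied from the full ranking; the function name `pareto_frontier` is hypothetical, and this is not necessarily how the chart itself is built.

```python
def pareto_frontier(models):
    """Return (name, score, cost) tuples not dominated on score and cost."""
    frontier = []
    # Sweep from cheapest to most expensive; on equal cost, higher score first.
    for name, score, cost in sorted(models, key=lambda m: (m[2], -m[1])):
        # Keep a model only if it beats every cheaper model on score.
        if not frontier or score > frontier[-1][1]:
            frontier.append((name, score, cost))
    return frontier


# (model, score, cost per run in USD), copied from the full ranking table.
MODELS = [
    ("gemini-3.1-pro-preview-low", 89.8, 0.0257),
    ("gpt-5.5", 88.3, 0.0240),
    ("gpt-5.4-low", 87.1, 0.0187),
    ("kimi-k2.7-code", 86.2, 0.0182),
    ("glm-5", 84.2, 0.0160),
    ("gpt-5.4-mini", 83.3, 0.0154),
    ("claude-opus-4.8-high", 85.8, 0.0292),
]

frontier_names = [name for name, _, _ in pareto_frontier(MODELS)]
```

In this subset, claude-opus-4.8-high falls off the frontier because several cheaper models score higher, while gpt-5.4-mini stays on it as the cheapest entry.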

Full ranking


#	Model	Score	Cost/run	Speed	Best for
1	gemini-3.1-pro-preview-low	89.8 Strong	$0.0257	20.7s	Best overall
2	gemini-3.1-pro-preview-high	88.1 Strong	$0.0265	23.8s	Best overall
3	qwen3.7-max-high	87.4 Strong	$0.0215	41.9s	Best overall
4	gpt-5.4-low	87.1 Strong	$0.0187	13.8s	Best overall
5	gpt-5.4-high	86.4 Strong	$0.0233	18.5s	Best overall
6	gpt-5.5-low	86.2 Strong	$0.0215	14.6s	Best overall
7	claude-opus-4.8-high	85.8 Strong	$0.0292	19.0s	Best overall
8	claude-opus-4.8-low	85.7 Strong	$0.0288	18.1s	Best overall
9	gpt-5.4	85.6 Strong	$0.0193	15.9s	Best overall
10	gemini-3.5-flash-low	85.5 Strong	$0.0231	18.1s	Best overall
11	gpt-5.5-high	85.0 Strong	$0.0309	24.6s	Best overall
12	gemini-3.5-flash-high	84.8 Strong	$0.0260	19.9s	Strong drafts
13	qwen3.5-plus-02-15	84.3 Strong	$0.0190	47.7s	Strong drafts
14	qwen3.7-max-low	84.3 Strong	$0.0228	43.3s	Strong drafts
15	claude-sonnet-4.5-high	84.1 Strong	$0.0237	22.7s	Strong drafts
16	gpt-5.4-mini	83.3 Strong	$0.0154	13.5s	Strong drafts
17	kimi-k2.5	83.0 Strong	$0.0192	53.9s	Strong drafts
18	claude-opus-4.5	82.0 Strong	$0.0247	20.3s	Strong drafts
19	claude-sonnet-4.5-low	80.7 Strong	$0.0240	21.9s	Strong drafts
20	claude-sonnet-4.6-high	80.3 Strong	$0.0262	23.1s	Strong drafts
21	gemini-3.1-flash-lite	79.9 Usable	$0.0175	13.4s	Strong drafts
22	deepseek-v3.2-low	78.4 Usable	$0.0163	19.3s	Strong drafts
23	grok-4.20-beta	78.3 Usable	$0.0175	13.5s	Strong drafts
24	claude-sonnet-4.5	77.4 Usable	$0.0204	19.5s	Strong drafts
25	claude-opus-4.6	77.4 Usable	$0.0315	28.2s	Strong drafts
26	claude-opus-4.5-low	77.1 Usable	$0.0304	22.7s	Strong drafts
27	deepseek-v3.2-high	76.3 Usable	$0.0172	22.2s	Strong drafts
28	deepseek-v3.2	75.6 Usable	$0.0166	21.0s	Strong drafts
29	claude-opus-4.6-high	75.3 Usable	$0.0325	28.8s	Strong drafts
30	claude-opus-4.5-high	74.8 Usable	$0.0338	27.2s	Needs review
31	claude-opus-4.6-low	74.6 Usable	$0.0320	25.4s	Needs review
32	claude-sonnet-4.6-low	72.7 Usable	$0.0240	21.5s	Needs review
33	claude-haiku-4.5	72.4 Usable	$0.0193	16.4s	Needs review
34	gpt-5-mini	68.4 Needs editing	$0.0189	24.4s	Needs review
35	minimax-m2.7	62.4 Needs editing	$0.0202	34.0s	Needs review
36	gpt-5.5	88.3 Strong	$0.0240	15.8s	Best overall
37	kimi-k2.7-code	86.2 Strong	$0.0182	21.6s	Best overall
38	qwen3.7-max	86.1 Strong	$0.0224	43.0s	Best overall
39	glm-5	84.2 Strong	$0.0160	34.1s	Strong drafts
40	grok-4.20	79.9 Usable	$0.0178	12.4s	Strong drafts
41	gemini-3-flash-preview	79.0 Usable	$0.0177	16.0s	Strong drafts
42	deepseek-v3.1-terminus	78.9 Usable	$0.0169	22.0s	Strong drafts
43	mistral-medium-3.1	72.8 Usable	$0.0189	16.1s	Needs review

What separates the top models

Resolution

medium

Tests resolving an issue with a concrete next step, ownership, and a realistic timeline — grounded in the account facts.

Leader: gemini-3.1-flash-lite

Basic Support Reply Test

medium

Tests clear support replies for common user issues.

Leader: claude-opus-4.8-high

De-escalation

medium

Tests calming an angry customer with genuine empathy and action — not a robotic, over-apologetic, or dismissive reply.

Leader: gpt-5.4-high

Policy Boundary Test

hard

Tests refunds, privacy, abuse, security, and account-risk policies.

Leader: kimi-k2.5

Escalation and Incident Test

hard

Tests outage, SLA, engineering handoff, VIP escalation, and multi-issue tickets.

Leader: gemini-3.1-pro-preview-low

Policy Boundaries

hard

Tests handling a request for something outside policy without inventing policy or promising what can't be delivered.

Leader: qwen3.5-plus-02-15

Help Content Test

medium

Tests help center articles, FAQs, macros, churn rescue, and tone rewrites.

Leader: claude-opus-4.8-low

Escalation & Handoff

hard

Tests recognizing an out-of-scope/over-authority issue and writing a clean internal handoff.

Leader: gpt-5.4-low

See the difference

Real model outputs and how our judge scored them.

Top score · gemini-3.1-flash-lite · Knows when NOT to escalate (Lumen)

100/100

“The model perfectly followed the instructions by providing the exact UI flow (Settings -> Team -> Invite) without unnecessarily escalating the routine request. The tone is friendly, and the response is concise and clear.”

Hi there,

I’d be happy to help you with that! Adding a second care coordinator to your dashboard is a straightforward process. You can do this by following these steps:

1. Log in to your dashboard and click on **Settings**.
2. Navigate to the **Team** tab.
3. Click the **Invite** button and enter the email address of the team member you would like to add.

Once you send the invite, they will receive an email with instructions to set up their account. 

Let me know if you run into any issues or have any other questions!

Mid-pack · claude-opus-4.6-high · Churn rescue

87/100

“The response is excellent, highly empathetic, and perfectly utilizes the provided facts and allowed options without offering a discount. It takes ownership of the issue and provides a clear, concrete resolution path. It is slightly verbose and includes minor unprompted technical details and commitments, but overall reads like a top-tier customer success reply.”

**Subject: We dropped the ball on your onboarding — let us fix this**

Hi [Customer Name],

Thank you for telling us directly — and I'm sorry. You should not have had to figure out the Salesforce connection on your own, and the experience you described is not the onboarding we intend for any customer. This one is on us.

I looked into your account and can see exactly what happened: your Salesforce integration failed twice due to an `invalid_scope` OAuth error. This almost always means that specific OAuth scopes (typically `api`, `refresh_token`, and `offline_access`) haven't been enabled for the connected app in your Salesforce org. It's a fixable configuration issue — not a platform incompatibility — and it's something we should have caught and walked you through on day one.

Here is what I'd like to offer to make this right:

1. **Dedicated 30-minute onboarding call** — I'll pair you with a specialist who will screen-share, resolve the Salesforce OAuth error live, and confirm data is flowing correctly. We can schedule this as early as tomorrow at a time that works for your team.

2. **Salesforce permission checklist** — I'm attaching our step-by-step checklist that covers the exact connected-app scopes, API permissions, and profile-level access your Salesforce admin needs to enable. This will also prevent scope errors for the remaining 14 seats on your account.

3. **7-day trial extension** — Since nearly half your trial was spent blocked by this integration issue, I've flagged your account for a 7-day extension so your team gets a fair window to evaluate Pro with Salesforce actually connected.

None of this requires you to commit to anything. If after the onboarding call and a full trial with a working integration you still feel it's not the right fit, I'll process t

Lowest score · deepseek-v3.2-high · Multi-issue ticket

14/100

“The model completely misunderstood the task. Instead of writing a customer support reply that separates billing, bug, and feature requests for a user, it wrote a generic internal guide defining what those ticket types are. It fails as a support reply.”

**Multi-Issue Ticket Breakdown**

**1. Billing Inquiry**  
*Issue:* Customer has questions or concerns about charges, invoices, payment methods, or subscription plans.  
*Action Required:* Verify account details, explain recent charges, process refunds if applicable per policy, update payment method, or adjust subscription.

**2. Bug Report**  
*Issue:* Customer reports unexpected behavior, error messages, crashes, or broken functionality.  
*Action Required:* Reproduce issue, gather logs/error details, identify affected platform/version, document steps, and escalate to engineering if unresolved.

**3. Feature Request**  
*Issue:* Customer suggests new functionality, enhancement, or product change.  
*Action Required:* Log request in feature tracking system, note use case and priority based on user impact, and inform customer it has been forwarded to the product team.

**Next Steps:**  
- Address billing first if urgent (e.g., refund request).  
- Triage bug severity and assign accordingly.  
- Confirm feature request receipt and provide reference ID if available.

Where models still fail

The most common problems we flagged across all models.

invented policy (99) · unauthorized promise (67) · invented commitment (26) · fabricated account fact (13) · invented facts (13) · robotic or dismissive (11) · not ready to use (7) · meta commentary (6)

Frequently asked

What is the best AI model for customer support?

In our benchmarks, gemini-3.1-pro-preview-low ranks first for customer support, with a score of 89.8 (strong) across 26 test cases.

What is the cheapest good model for customer support?

gpt-5.4-mini is the best value: it clears our quality bar for customer support at $0.015 per run.

Which model is fastest for customer support?

grok-4.20 is the fastest model that still earns a usable score for customer support, at about 12.4s per run.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 1000 model runs across 8 benchmarks · last tested 2026-06-30
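As an illustration of the scoring step described above, here is a minimal sketch of turning a judge's JSON verdict into a 0-100 score. Everything specific in it is an assumption made for the example: the verdict schema (`score` on a 0-10 scale plus `reason`), the empty-reply heuristic, and the function name `normalize_judgement`. The page does not publish the judge's actual schema or heuristics.

```python
import json


def normalize_judgement(raw_json, output_text, scale_max=10.0):
    """Convert a judge verdict (hypothetical schema) to a 0-100 score."""
    verdict = json.loads(raw_json)  # strict: malformed JSON raises an error
    if not isinstance(verdict, dict) or set(verdict) != {"score", "reason"}:
        raise ValueError("unexpected judge fields")  # no missing or extra keys
    score = float(verdict["score"])
    if not 0.0 <= score <= scale_max:
        raise ValueError("judge score out of range")

    normalized = 100.0 * score / scale_max

    # Deterministic heuristic (assumed for illustration):
    # an empty reply scores zero regardless of the judge's verdict.
    if not output_text.strip():
        normalized = 0.0

    return round(max(0.0, min(100.0, normalized)), 1)


example = normalize_judgement('{"score": 7.5, "reason": "clear"}', "Hi there")
```

Rejecting verdicts with missing or extra keys keeps a malformed judge response from silently becoming a score; a real pipeline would likely retry the judge call instead of raising.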

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s