Business · 10 tasks · 44 models

Best AI models for Translation & Localization

Name: Translation & Localization AI model benchmark
Creator: Spring Prompt

Which models translate and localize accurately — right register, intact placeholders/brands, correct locale formats — without false friends or translationese?

Top models

Qwen: qwen3.7-max

Anthropic: claude-opus-4.6-low

OpenAI: gpt-5.5-high

qwen3.7-max leads Translation & Localization with a score of 100.0 (Excellent). For tighter budgets, gpt-5-mini still rates Excellent at 90.9 and costs $0.0118/run, about 83% of qwen3.7-max's $0.0142/run.

Best overall (Excellent)

qwen3.7-max

Top score, excellent

100.0 score · $0.0142/run · 26.2s

Best value (Excellent)

gpt-5-mini

Clears the quality bar at $0.012/run

90.9 score · $0.0118/run · 16.7s

Fastest usable (Strong)

grok-4.20

~11s per run, still strong

87.1 score · $0.0136/run · 10.6s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
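One way to read "high and to the left" is as a Pareto frontier: a model sits on it when no other model is at least as good and at least as cheap while being strictly better on one of the two. The following is a minimal sketch under that reading, using six rows copied from the full ranking. The function name `pareto_frontier` and the choice of subset are illustrative assumptions, not Spring Prompt's own tooling.

```python
# Rows copied from the ranking table: (model, score, cost per run in USD).
ROWS = [
    ("qwen3.7-max", 100.0, 0.0142),
    ("claude-opus-4.6-low", 100.0, 0.0258),
    ("gpt-5.5-high", 99.6, 0.0227),
    ("gpt-5-mini", 90.9, 0.0118),
    ("grok-4.20", 87.1, 0.0136),
    ("minimax-m2.7", 83.8, 0.0124),
]


def pareto_frontier(rows):
    """Return the models that no other row beats on both score and cost."""
    frontier = []
    for name, score, cost in rows:
        # A row is dominated if some row is no worse on both axes
        # and strictly better on at least one of them.
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for _, s, c in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(ROWS))
```

On this subset the function returns qwen3.7-max and gpt-5-mini. Running it over all 44 rows could surface additional models.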

Full ranking


#	Model	Score	Cost/run	Speed	Best for
1	qwen3.7-max	100.0 Excellent	$0.0142	26.2s	Best overall
2	claude-opus-4.6-low	100.0 Excellent	$0.0258	21.4s	Best overall
3	gpt-5.5-high	99.6 Excellent	$0.0227	16.9s	Best overall
4	claude-opus-4.5-low	99.6 Excellent	$0.0290	21.5s	Best overall
5	claude-opus-4.6-high	99.6 Excellent	$0.0301	25.3s	Best overall
6	gpt-5.5	99.4 Excellent	$0.0194	15.7s	Best overall
7	gpt-5.5-low	98.9 Excellent	$0.0172	13.2s	Best overall
8	claude-sonnet-4.6-low	98.9 Excellent	$0.0209	21.0s	Best overall
9	gemini-3.1-pro-preview-high	98.9 Excellent	$0.0259	24.9s	Best overall
10	claude-opus-4.5	98.7 Excellent	$0.0167	17.3s	Best overall
11	gpt-5.4-high	98.6 Excellent	$0.0226	18.0s	Best overall
12	gemini-3.1-pro-preview	98.5 Excellent	$0.0312	26.0s	Best overall
13	glm-5	97.5 Excellent	$0.0164	79.4s	Best overall
14	kimi-k2.5	95.5 Excellent	$0.0163	63.9s	Best overall
15	gemini-3.1-pro-preview-low	95.5 Excellent	$0.0232	19.4s	Best overall
16	qwen3.7-max-high	95.2 Excellent	$0.0132	26.0s	Best overall
17	claude-sonnet-4.5-high	94.7 Excellent	$0.0229	22.5s	Best overall
18	kimi-k2.7-code	94.5 Excellent	$0.0151	34.1s	Best overall
19	claude-opus-4.8-high	94.5 Excellent	$0.0246	15.9s	Best overall
20	claude-sonnet-4.6-high	94.3 Excellent	$0.0227	23.3s	Best overall
21	mistral-medium-3.1	94.2 Excellent	$0.0144	13.1s	Best overall
22	claude-sonnet-4.5	94.2 Excellent	$0.0180	17.3s	Best overall
23	claude-opus-4.6	94.1 Excellent	$0.0185	17.6s	Best overall
24	qwen3.5-plus-02-15	94.0 Excellent	$0.0186	93.1s	Best overall
25	gemini-3.5-flash-low	93.3 Excellent	$0.0195	16.2s	Best overall
26	gemini-3.5-flash-high	92.4 Excellent	$0.0232	18.5s	Best overall
27	gemini-3-flash-preview	92.1 Excellent	$0.0166	15.5s	Best overall
28	claude-sonnet-4.5-low	92.0 Excellent	$0.0224	20.8s	Best overall
29	claude-opus-4.5-high	91.6 Excellent	$0.0285	20.9s	Best overall
30	claude-haiku-4.5	91.0 Excellent	$0.0152	12.4s	Best overall
31	gpt-5-mini	90.9 Excellent	$0.0118	16.7s	Best overall
32	qwen3.7-max-low	90.9 Excellent	$0.0169	28.1s	Best overall
33	gpt-5.4-low	90.7 Excellent	$0.0140	11.7s	Best overall
34	gemini-3.1-flash-lite	90.4 Excellent	$0.0164	12.5s	Best overall
35	claude-opus-4.8-low	90.2 Excellent	$0.0211	14.1s	Best overall
36	gpt-5.4	89.5 Strong	$0.0149	13.6s	Best overall
37	deepseek-v3.2-high	89.4 Strong	$0.0122	14.6s	Best overall
38	grok-4.20-beta	88.4 Strong	$0.0132	10.8s	Best overall
39	deepseek-v3.2-low	88.3 Strong	$0.0130	17.5s	Best overall
40	gpt-5.4-mini	87.6 Strong	$0.0141	12.8s	Best overall
41	grok-4.20	87.1 Strong	$0.0136	10.6s	Best overall
42	deepseek-v3.2	86.9 Strong	$0.0150	17.9s	Best overall
43	deepseek-v3.1-terminus	85.1 Strong	$0.0151	21.4s	Best overall
44	minimax-m2.7	83.8 Strong	$0.0124	25.5s	Strong drafts

What separates the top models

Business Translation

Difficulty: medium

Tests translating marketing/UI copy while preserving meaning, brand names, and placeholders, in natural (non-literal) target language.

Leader: gemini-3.1-flash-lite
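Placeholder preservation is the kind of property a deterministic check can verify. The sketch below assumes placeholders use single curly braces, such as `{name}`. The pattern and the helper name `placeholder_diff` are hypothetical and do not describe the benchmark's actual heuristics.

```python
import re
from collections import Counter

# Assumption: placeholders look like {name}. Adjust the pattern for other
# styles such as %s or {{var}}.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def placeholder_diff(source, translation):
    """Return (missing, unexpected) placeholders found in the translation."""
    src = Counter(PLACEHOLDER.findall(source))
    dst = Counter(PLACEHOLDER.findall(translation))
    # Counter subtraction keeps only positive counts, so repeated
    # placeholders are compared by how many times they occur.
    missing = sorted((src - dst).elements())
    unexpected = sorted((dst - src).elements())
    return missing, unexpected


print(placeholder_diff("Welcome, {name}!", "¡Bienvenido, {nombre}!"))
```

The sample pair is the one from the find-the-errors example later on this page, where `{name}` was translated to `{nombre}`. The check reports `{name}` as missing and `{nombre}` as unexpected.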

Localization

Difficulty: hard

Tests adapting dates, numbers, currency, and units to the target locale (converted, not relabeled).

Leader: claude-sonnet-4.6-low
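To illustrate "converted, not relabeled", here is a small standard-library sketch. It assumes a US English source and a German (de-DE) target, which is an assumption made for the example only; the helper names are hypothetical. A CLDR-backed library such as Babel would normally replace these hand-rolled helpers.

```python
from datetime import datetime


def us_date_to_de(text):
    """Read a US MM/DD/YYYY date and render it as German DD.MM.YYYY."""
    return datetime.strptime(text, "%m/%d/%Y").strftime("%d.%m.%Y")


def group_de(value):
    """Format an integer with German thousands separators (1500 -> 1.500)."""
    return f"{value:,}".replace(",", ".")


def miles_to_km(miles):
    """Convert the quantity itself; relabeling '5 mi' as '5 km' is the error."""
    return round(miles * 1.609344, 1)


print(us_date_to_de("03/04/2025"))  # day and month swap places
print(group_de(1500))
print(miles_to_km(5))
```

The date helper only works when the source order is known. A string like 03/04/2025 is ambiguous on its own, which is one reason unlocalized formats are easy to miss in review.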

Register & Formality

Difficulty: hard

Tests using the correct, consistent formality for a business context: the T-V distinction (for example German du vs. Sie) and keigo direction in Japanese.

Leader: gpt-5-mini
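Consistency of formal address can be roughly screened with a word-list heuristic. The sketch below covers German only and is deliberately crude: it is an illustrative assumption, not a grammar check and not the benchmark's scoring method, and the name `register_of` is hypothetical.

```python
import re

# Informal second-person forms, matched as whole words in any casing.
INFORMAL = re.compile(r"\b(du|dich|dir|dein\w*)\b", re.IGNORECASE)
# Capitalised Sie/Ihnen/Ihr* usually signals formal address. At the start of
# a sentence "Sie" can also mean "she" or "they", so this may over-report.
FORMAL = re.compile(r"\b(Sie|Ihnen|Ihr\w*)\b")


def register_of(text):
    """Classify German text as formal, informal, mixed, or unknown."""
    informal = bool(INFORMAL.search(text))
    formal = bool(FORMAL.search(text))
    if informal and formal:
        return "mixed"
    if informal:
        return "informal"
    if formal:
        return "formal"
    return "unknown"


print(register_of("Könnten Sie bitte Ihren Termin bestätigen?"))
```

A "mixed" result is the interesting case for this benchmark, since it flags a translation that switches between du and Sie within one message.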

Catch the Translation Error

Difficulty: hard

Tests finding, classifying, and fixing errors in a flawed translation — false friends, register, broken placeholders, locale formats.

Leader: qwen3.7-max-high

See the difference

Real model outputs and how our judge scored them.

Top score · claude-haiku-4.5 · Find the register error (EN→German translation)

100/100

“The model successfully caught the planted register error, classified it correctly, and provided an accurate and natural fix without hallucinating other errors.”

# Analysis

**Issue Identified:** ❌ Not appropriate for formal B2B

The translation uses **"du" (informal you)**, which is incorrect for formal business communication in German.

## Corrected Version

**"Könnten Sie bitte Ihren Termin bestätigen?"**

Or alternatively:

**"Können Sie bitte Ihren Termin bestätigen?"**

## Key Differences

| Aspect | Inappropriate | Appropriate |
|--------|---------------|-------------|
| Pronoun | du/dein | Sie/Ihr |
| Tone | Casual, friendly | Professional, respectful |
| Context | Colleagues/friends | Business clients, partners |

## Why This Matters

In German business communication, **"Sie"** is the standard formal address. Using "du" in B2B contexts can appear unprofessional or disrespectful unless there's an established informal relationship. The corrected version maintains appropriate professional distance and respect.

Mid-pack · kimi-k2.5 · Support reply with a false-friend trap (EN→German)

100/100

“The translation is perfectly accurate, fluent, and correctly captures the tone and meaning of the source text without using false friends. The additional notes provided by the model are accurate and add value.”

 **Die Verzögerung ist uns peinlich, und wir werden Ihre Rückerstattung heute noch überweisen.**

**Important note on "embarrassed":**
Avoid translating this as **"verlegen"** — while "verlegen" can mean "embarrassed," it also means "delayed" or "misplaced" in German, which would create unfortunate confusion in this context!

**Alternative options depending on tone:**
- More formal: *"Die Verzögerung ist uns höchst unangenehm..."* (The delay is extremely unpleasant to us...)
- More apologetic: *"Es ist uns peinlich, dass es zu dieser Verzögerung gekommen ist..."* (It is embarrassing to us that this delay occurred...)

**Note:** German business communication often uses "bedauern" (regret) or "entschuldigen" (apologize) instead, but "peinlich" accurately conveys the specific sentiment of embarrassment/awkwardness about the situation.

Lowest score · grok-4.20 · Find the errors (EN→Spanish translation)

30/100

“While the model successfully caught the critical false friend 'embarazada', it failed the task by missing the other two major planted errors: the broken placeholder and the unlocalized date/currency formats. Additionally, it over-edited the text by flagging grammatically correct and natural phrasing as errors.”

**Error Analysis**

### 1. **Critical Lexical Error (False Friend)**
- **Source:** "I was embarrassed."
- **Translation:** "Estaba embarazada."
- **Classification:** False friend / Lexical error (Mistranslation)
- **Explanation:** "Embarrassed" means *avergonzado/a* (feeling shame). "Embarazada" means *pregnant*. This is a classic and very serious false friend in Spanish.
- **Fix:** "Me sentí avergonzado/a." or "Estaba avergonzado/a."

### 2. **Verb Tense / Aspect Error**
- **Source:** "I was embarrassed."
- **Translation:** "Estaba embarazada."
- **Classification:** Tense/Aspect error
- **Explanation:** The English uses a stative past ("I was embarrassed") describing a feeling at a specific moment. The Spanish "Estaba" (imperfect) is not wrong grammatically, but the most natural equivalent in this context is the preterite or a reflexive construction.
- **Fix:** "Me sentí avergonzado/a." (most natural) or "Estuve avergonzado/a."

### 3. **Agreement / Gender Error**
- **Source:** "Welcome, {name}!"
- **Translation:** "¡Bienvenido, {nombre}!"
- **Classification:** Gender agreement error
- **Explanation:** "Bienvenido" is masculine. The placeholder `{name}` can refer to any gender. In Spanish customer communications, the inclusive/neutral form is preferred.
- **Fix:** "¡Bienvenido/a, {nombre}!" or better: "**Te damos la bienvenida**, {nombre}!" (avoids gendering the adjective entirely).

### 4. **Minor Stylistic / Naturalness Issue**
- **Source:** "Your order of $1,500 ships on 03/04/2025."
- **Translation:** "Su pedido de $1,500 se envía el 03/04/2025."
- **Classification:** Register/Naturalness issue
- **Explanation:** While grammatically correct, "se envía" sounds slightly impersonal. In customer service Spanish, it's more common and natural to use the active voice or a

Where models still fail

The most common problems we flagged across all models.

55 source locale format
38 omission or addition
15 translationese
9 broken placeholder
8 mistranslation
6 wrong register

Frequently asked

What is the best AI model for translation & localization?

In our benchmarks, qwen3.7-max ranks first for translation & localization with a score of 100.0 (Excellent) across 10 test cases. claude-opus-4.6-low ties on score at a higher cost per run ($0.0258 vs. $0.0142).

What is the cheapest good model for translation & localization?

gpt-5-mini is the best value: it clears our quality bar for translation & localization at $0.012 per run.

Which model is fastest for translation & localization?

grok-4.20 is the fastest model that still performs well for translation & localization: 10.6s per run with a score of 87.1 (Strong).

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 460 model runs across 4 benchmarks · last tested 2026-06-29
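The page does not publish the judge schema or the normalization formula, so the following is only a sketch of what "strict JSON judge, plus heuristics, normalized to 0-100" could look like. The keys `score` and `rationale`, the 0-10 judge scale, and the penalty mechanism are all assumptions made for illustration.

```python
import json

REQUIRED_KEYS = {"score", "rationale"}  # hypothetical judge schema


def parse_judge_reply(raw):
    """Reject any reply that is not exactly the expected JSON object."""
    data = json.loads(raw)  # raises ValueError on non-JSON text
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError("judge reply does not match the expected schema")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    return data


def final_score(raw, heuristic_penalty=0.0, judge_max=10.0):
    """Scale the judge score to 0-100, subtract penalties, then clamp."""
    data = parse_judge_reply(raw)
    scaled = 100.0 * data["score"] / judge_max
    return max(0.0, min(100.0, scaled - heuristic_penalty))


print(final_score('{"score": 9, "rationale": "ok"}', heuristic_penalty=15))
```

Here `heuristic_penalty` stands in for deterministic checks such as a broken-placeholder or unlocalized-date detector. Parsing strictly means a chatty or malformed judge reply fails loudly instead of being scored by guesswork.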

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.

