Business · 10 tasks · 44 models
Best AI models for Translation & Localization
Which models translate and localize accurately — right register, intact placeholders/brands, correct locale formats — without false friends or translationese?
qwen3.7-max leads Translation & Localization (excellent). For tighter budgets, gpt-5-mini is competitive at about 83% of the cost.
Top score: qwen3.7-max (excellent)
Best value: gpt-5-mini clears the quality bar at $0.012/run
Fastest: grok-4.20 at ~11s per run, still strong
Quality vs. cost
Every model is placed by what it delivers and what it costs. The best value sits high and to the left.
Full ranking
| # | Model | Score | Cost/run | Speed | Best for |
|---|---|---|---|---|---|
| 1 | qwen3.7-max | 100.0 Excellent | $0.0142 | 26.2s | Best overall |
| 2 | claude-opus-4.6-low | 100.0 Excellent | $0.0258 | 21.4s | Best overall |
| 3 | gpt-5.5-high | 99.6 Excellent | $0.0227 | 16.9s | Best overall |
| 4 | claude-opus-4.5-low | 99.6 Excellent | $0.0290 | 21.5s | Best overall |
| 5 | claude-opus-4.6-high | 99.6 Excellent | $0.0301 | 25.3s | Best overall |
| 6 | gpt-5.5 | 99.4 Excellent | $0.0194 | 15.7s | Best overall |
| 7 | gpt-5.5-low | 98.9 Excellent | $0.0172 | 13.2s | Best overall |
| 8 | claude-sonnet-4.6-low | 98.9 Excellent | $0.0209 | 21.0s | Best overall |
| 9 | gemini-3.1-pro-preview-high | 98.9 Excellent | $0.0259 | 24.9s | Best overall |
| 10 | claude-opus-4.5 | 98.7 Excellent | $0.0167 | 17.3s | Best overall |
| 11 | gpt-5.4-high | 98.6 Excellent | $0.0226 | 18.0s | Best overall |
| 12 | gemini-3.1-pro-preview | 98.5 Excellent | $0.0312 | 26.0s | Best overall |
| 13 | glm-5 | 97.5 Excellent | $0.0164 | 79.4s | Best overall |
| 14 | kimi-k2.5 | 95.5 Excellent | $0.0163 | 63.9s | Best overall |
| 15 | gemini-3.1-pro-preview-low | 95.5 Excellent | $0.0232 | 19.4s | Best overall |
| 16 | qwen3.7-max-high | 95.2 Excellent | $0.0132 | 26.0s | Best overall |
| 17 | claude-sonnet-4.5-high | 94.7 Excellent | $0.0229 | 22.5s | Best overall |
| 18 | kimi-k2.7-code | 94.5 Excellent | $0.0151 | 34.1s | Best overall |
| 19 | claude-opus-4.8-high | 94.5 Excellent | $0.0246 | 15.9s | Best overall |
| 20 | claude-sonnet-4.6-high | 94.3 Excellent | $0.0227 | 23.3s | Best overall |
| 21 | mistral-medium-3.1 | 94.2 Excellent | $0.0144 | 13.1s | Best overall |
| 22 | claude-sonnet-4.5 | 94.2 Excellent | $0.0180 | 17.3s | Best overall |
| 23 | claude-opus-4.6 | 94.1 Excellent | $0.0185 | 17.6s | Best overall |
| 24 | qwen3.5-plus-02-15 | 94.0 Excellent | $0.0186 | 93.1s | Best overall |
| 25 | gemini-3.5-flash-low | 93.3 Excellent | $0.0195 | 16.2s | Best overall |
| 26 | gemini-3.5-flash-high | 92.4 Excellent | $0.0232 | 18.5s | Best overall |
| 27 | gemini-3-flash-preview | 92.1 Excellent | $0.0166 | 15.5s | Best overall |
| 28 | claude-sonnet-4.5-low | 92.0 Excellent | $0.0224 | 20.8s | Best overall |
| 29 | claude-opus-4.5-high | 91.6 Excellent | $0.0285 | 20.9s | Best overall |
| 30 | claude-haiku-4.5 | 91.0 Excellent | $0.0152 | 12.4s | Best overall |
| 31 | gpt-5-mini | 90.9 Excellent | $0.0118 | 16.7s | Best overall |
| 32 | qwen3.7-max-low | 90.9 Excellent | $0.0169 | 28.1s | Best overall |
| 33 | gpt-5.4-low | 90.7 Excellent | $0.0140 | 11.7s | Best overall |
| 34 | gemini-3.1-flash-lite | 90.4 Excellent | $0.0164 | 12.5s | Best overall |
| 35 | claude-opus-4.8-low | 90.2 Excellent | $0.0211 | 14.1s | Best overall |
| 36 | gpt-5.4 | 89.5 Strong | $0.0149 | 13.6s | Best overall |
| 37 | deepseek-v3.2-high | 89.4 Strong | $0.0122 | 14.6s | Best overall |
| 38 | grok-4.20-beta | 88.4 Strong | $0.0132 | 10.8s | Best overall |
| 39 | deepseek-v3.2-low | 88.3 Strong | $0.0130 | 17.5s | Best overall |
| 40 | gpt-5.4-mini | 87.6 Strong | $0.0141 | 12.8s | Best overall |
| 41 | grok-4.20 | 87.1 Strong | $0.0136 | 10.6s | Best overall |
| 42 | deepseek-v3.2 | 86.9 Strong | $0.0150 | 17.9s | Best overall |
| 43 | deepseek-v3.1-terminus | 85.1 Strong | $0.0151 | 21.4s | Best overall |
| 44 | minimax-m2.7 | 83.8 Strong | $0.0124 | 25.5s | Strong drafts |
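The "about 83% of the cost" figure and the best-value pick can be reproduced from the table. The following is a minimal sketch that uses a handful of rows copied from the ranking. The page does not publish its exact quality bar, so the threshold of 90 is an assumption, and `QUALITY_BAR`, `best_value`, and `cost_ratio` are illustrative names rather than anything Spring Prompt exposes.

```python
# A few rows copied from the ranking table: (model, score, cost per run in USD).
ROWS = [
    ("qwen3.7-max", 100.0, 0.0142),
    ("gpt-5.5-low", 98.9, 0.0172),
    ("gpt-5-mini", 90.9, 0.0118),
    ("deepseek-v3.2-high", 89.4, 0.0122),
    ("minimax-m2.7", 83.8, 0.0124),
]

# Assumption: the page does not state its quality bar; 90 is a placeholder.
QUALITY_BAR = 90.0


def best_value(rows, bar=QUALITY_BAR):
    """Return the cheapest row whose score is at or above the bar."""
    eligible = [row for row in rows if row[1] >= bar]
    return min(eligible, key=lambda row: row[2])


def cost_ratio(rows, model, baseline):
    """Cost of `model` as a fraction of the cost of `baseline`."""
    costs = {name: cost for name, _score, cost in rows}
    return costs[model] / costs[baseline]


winner = best_value(ROWS)
ratio = cost_ratio(ROWS, "gpt-5-mini", "qwen3.7-max")
print(winner[0], f"{ratio:.0%}")  # gpt-5-mini 83%
```

With a different threshold the best-value pick may change, which is why the bar is kept as an explicit parameter.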
What separates the top models
Business Translation
Difficulty: medium. Tests translating marketing/UI copy while preserving meaning, brand names, and placeholders, in natural (non-literal) target language.
Leader: gemini-3.1-flash-lite
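Placeholder preservation is one of the few parts of this task that can be checked mechanically. The sketch below assumes placeholders use the `{name}` brace style that appears in the sample outputs on this page; the regex and the `placeholder_diff` helper are hypothetical and are not the benchmark's own checker.

```python
import re
from collections import Counter

# Assumption: placeholders look like {name}, as in the sample outputs.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def placeholder_diff(source: str, translation: str):
    """Return (missing, unexpected) placeholders, counting repeats."""
    src = Counter(PLACEHOLDER.findall(source))
    dst = Counter(PLACEHOLDER.findall(translation))
    missing = sorted((src - dst).elements())      # in source, lost in translation
    unexpected = sorted((dst - src).elements())   # invented or renamed
    return missing, unexpected


missing, unexpected = placeholder_diff("Welcome, {name}!", "¡Bienvenido, {nombre}!")
print(missing, unexpected)  # ['{name}'] ['{nombre}']
```

A translated placeholder such as `{nombre}` shows up as both a missing and an unexpected token, which is usually enough to flag the string for review.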
Localization
Difficulty: hard. Tests adapting dates, numbers, currency, and units to the target locale (converted, not relabeled).
Leader: claude-sonnet-4.6-low
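To make the format side of localization concrete, here is a small sketch that rewrites a US-style date and amount into German (de-DE) conventions: day.month.year dates, a period as the thousands separator, and a comma as the decimal mark. It assumes the source date is MM/DD/YYYY, and it covers formats only; converting currency or units would need exchange rates or conversion factors that this page does not provide. The function names are illustrative.

```python
from datetime import datetime


def us_date_to_de(us_date: str) -> str:
    """Convert MM/DD/YYYY (en-US) to DD.MM.YYYY (de-DE)."""
    parsed = datetime.strptime(us_date, "%m/%d/%Y")
    return parsed.strftime("%d.%m.%Y")


def us_number_to_de(amount: float) -> str:
    """Format with '.' as the thousands separator and ',' as the decimal mark."""
    us_style = f"{amount:,.2f}"  # e.g. 1,500.00
    # Swap the two separators in a single pass.
    return us_style.translate(str.maketrans(",.", ".,"))


print(us_date_to_de("03/04/2025"))  # 04.03.2025
print(us_number_to_de(1500))        # 1.500,00
```

The date case shows why relabeling is not enough: left unchanged, 03/04/2025 would be read as 3 April by a German reader instead of March 4.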
Register & Formality
Difficulty: hard. Tests using the correct, consistent formality (T-V / keigo direction) for a business context.
Leader: gpt-5-mini
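A rough first-pass register check for German can be done with a word list. The sketch below flags informal "du"-family pronouns in text that is supposed to use the formal "Sie" form. It is a hypothetical heuristic, not the benchmark's scoring logic, and it cannot see informal verb endings or judge context. The informal sentence is an invented example; the formal one is taken from the sample output on this page.

```python
import re

# Informal singular forms only. The informal plural "ihr" is left out on
# purpose because it collides with the formal possessive "Ihr".
INFORMAL_DE = re.compile(
    r"\b(du|dich|dir|dein(?:e|en|em|er|es)?)\b", re.IGNORECASE
)


def informal_hits(text: str):
    """Return every informal pronoun found, in order of appearance."""
    return [match.group(0) for match in INFORMAL_DE.finditer(text)]


informal = informal_hits("Könntest du bitte deinen Termin bestätigen?")
formal = informal_hits("Könnten Sie bitte Ihren Termin bestätigen?")
print(informal)  # ['du', 'deinen']
print(formal)    # []
```

A check like this is best treated as a tripwire that routes a string to a human or an LLM judge, since consistent formality also depends on verb forms and greetings.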
Catch the Translation Error
Difficulty: hard. Tests finding, classifying, and fixing errors in a flawed translation: false friends, register, broken placeholders, locale formats.
Leader: qwen3.7-max-high
See the difference
Real model outputs and how our judge scored them.
Top score
claude-haiku-4.5
Find the register error (EN→German translation)
100/100
“The model successfully caught the planted register error, classified it correctly, and provided an accurate and natural fix without hallucinating other errors.”
# Analysis

**Issue Identified:** ❌ Not appropriate for formal B2B

The translation uses **"du" (informal you)**, which is incorrect for formal business communication in German.

## Corrected Version

**"Könnten Sie bitte Ihren Termin bestätigen?"**

Or alternatively: **"Können Sie bitte Ihren Termin bestätigen?"**

## Key Differences

| Aspect | Inappropriate | Appropriate |
|--------|---------------|-------------|
| Pronoun | du/dein | Sie/Ihr |
| Tone | Casual, friendly | Professional, respectful |
| Context | Colleagues/friends | Business clients, partners |

## Why This Matters

In German business communication, **"Sie"** is the standard formal address. Using "du" in B2B contexts can appear unprofessional or disrespectful unless there's an established informal relationship. The corrected version maintains appropriate professional distance and respect.
Mid-pack
kimi-k2.5
Support reply with a false-friend trap (EN→German)
100/100
“The translation is perfectly accurate, fluent, and correctly captures the tone and meaning of the source text without using false friends. The additional notes provided by the model are accurate and add value.”
**Die Verzögerung ist uns peinlich, und wir werden Ihre Rückerstattung heute noch überweisen.**

**Important note on "embarrassed":** Avoid translating this as **"verlegen"** — while "verlegen" can mean "embarrassed," it also means "delayed" or "misplaced" in German, which would create unfortunate confusion in this context!

**Alternative options depending on tone:**

- More formal: *"Die Verzögerung ist uns höchst unangenehm..."* (The delay is extremely unpleasant to us...)
- More apologetic: *"Es ist uns peinlich, dass es zu dieser Verzögerung gekommen ist..."* (It is embarrassing to us that this delay occurred...)

**Note:** German business communication often uses "bedauern" (regret) or "entschuldigen" (apologize) instead, but "peinlich" accurately conveys the specific sentiment of embarrassment/awkwardness about the situation.
Lowest score
grok-4.20
Find the errors (EN→Spanish translation)
30/100
“While the model successfully caught the critical false friend 'embarazada', it failed the task by missing the other two major planted errors: the broken placeholder and the unlocalized date/currency formats. Additionally, it over-edited the text by flagging grammatically correct and natural phrasing as errors.”
**Error Analysis**
### 1. **Critical Lexical Error (False Friend)**
- **Source:** "I was embarrassed."
- **Translation:** "Estaba embarazada."
- **Classification:** False friend / Lexical error (Mistranslation)
- **Explanation:** "Embarrassed" means *avergonzado/a* (feeling shame). "Embarazada" means *pregnant*. This is a classic and very serious false friend in Spanish.
- **Fix:** "Me sentí avergonzado/a." or "Estaba avergonzado/a."
### 2. **Verb Tense / Aspect Error**
- **Source:** "I was embarrassed."
- **Translation:** "Estaba embarazada."
- **Classification:** Tense/Aspect error
- **Explanation:** The English uses a stative past ("I was embarrassed") describing a feeling at a specific moment. The Spanish "Estaba" (imperfect) is not wrong grammatically, but the most natural equivalent in this context is the preterite or a reflexive construction.
- **Fix:** "Me sentí avergonzado/a." (most natural) or "Estuve avergonzado/a."
### 3. **Agreement / Gender Error**
- **Source:** "Welcome, {name}!"
- **Translation:** "¡Bienvenido, {nombre}!"
- **Classification:** Gender agreement error
- **Explanation:** "Bienvenido" is masculine. The placeholder `{name}` can refer to any gender. In Spanish customer communications, the inclusive/neutral form is preferred.
- **Fix:** "¡Bienvenido/a, {nombre}!" or better: "**Te damos la bienvenida**, {nombre}!" (avoids gendering the adjective entirely).
### 4. **Minor Stylistic / Naturalness Issue**
- **Source:** "Your order of $1,500 ships on 03/04/2025."
- **Translation:** "Su pedido de $1,500 se envía el 03/04/2025."
- **Classification:** Register/Naturalness issue
- **Explanation:** While grammatically correct, "se envía" sounds slightly impersonal. In customer service Spanish, it's more common and natural to use the active voice or a
Frequently asked
What is the best AI model for translation & localization?
In our benchmarks, qwen3.7-max ranks first for translation & localization, scoring 100.0 (excellent) across 10 test cases.
What is the cheapest good model for translation & localization?
gpt-5-mini is the best value: it clears our quality bar for translation & localization at $0.012 per run.
Which model is fastest for translation & localization?
grok-4.20 is the fastest model that still performs well for translation & localization.
How we test
Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics. The result is then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 460 model runs across 4 benchmarks · last tested 2026-06-29
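The page does not publish the judge's JSON schema or how heuristics are weighted, so the following is only a sketch of what such a pipeline might look like. The schema (`score` on a 0-10 scale plus `rationale`), the penalty parameter, and both function names are assumptions made for illustration.

```python
import json


def parse_judge_reply(raw: str) -> dict:
    """Parse a judge reply, rejecting anything outside the assumed schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON replies
    if not isinstance(data, dict) or set(data) != {"score", "rationale"}:
        raise ValueError("unexpected keys in judge reply")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 10:
        raise ValueError("score out of range")
    return data


def normalize(judge_score: float, heuristic_penalty: float = 0.0) -> float:
    """Map a 0-10 judge score to 0-100, subtract a penalty, clamp to range."""
    value = judge_score * 10.0 - heuristic_penalty
    return max(0.0, min(100.0, value))


reply = parse_judge_reply('{"score": 9, "rationale": "Placeholder intact."}')
final = normalize(reply["score"], heuristic_penalty=5.0)
print(final)  # 85.0
```

Rejecting malformed replies outright, instead of trying to salvage a number from free text, is what makes a "strict JSON" judge reproducible across runs.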
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.