Business · 31 tasks · 52 models
Best AI models for Coding
Which models fix the root cause, catch the real security bug, and don't write code that's subtly wrong or hallucinated?
claude-opus-4.8-medium leads Coding with a score of 92.3 (Excellent). For tighter budgets, glm-5-max is competitive at about 32% of the cost ($0.0172 vs. $0.0535 per run).
Top score: claude-opus-4.8-medium · 92.3 (Excellent)
Best value: glm-5-max · clears the quality bar at $0.017/run
Fastest: grok-4.20-beta-max · ~13s per run, still strong
Quality vs. cost
Every model placed by what it delivers and what it costs. The best value sits high and to the left.
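One way to read "high and to the left" is as a Pareto frontier: a model stays in consideration if no other model is at least as good on both score and cost and strictly better on one. The sketch below applies that idea to a handful of rows copied from the ranking table. The `Model` type and `pareto_frontier` helper are illustrative names, not part of the benchmark tooling.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # benchmark score, 0-100
    cost: float   # USD per run


# A subset of rows copied from the full ranking table.
MODELS = [
    Model("claude-opus-4.8-medium", 92.3, 0.0535),
    Model("gpt-5.5-medium", 92.2, 0.0545),
    Model("kimi-k2.5-max", 89.7, 0.0251),
    Model("gemini-3-flash-preview-medium", 85.3, 0.0216),
    Model("gemini-3.1-flash-lite-max", 81.8, 0.0178),
    Model("glm-5-max", 80.6, 0.0172),
    Model("claude-opus-4.8-max", 91.3, 0.1343),
]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep models that no other model beats on score and cost together."""
    frontier = []
    for m in models:
        dominated = any(
            o.score >= m.score
            and o.cost <= m.cost
            and (o.score > m.score or o.cost < m.cost)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost)


frontier = pareto_frontier(MODELS)
# glm-5-max cost relative to claude-opus-4.8-medium (about 0.32)
budget_ratio = MODELS[5].cost / MODELS[0].cost
```

On this subset the frontier, cheapest first, is glm-5-max, gemini-3.1-flash-lite-max, gemini-3-flash-preview-medium, kimi-k2.5-max, and claude-opus-4.8-medium. Running it over all 52 rows could change the result, since the subset leaves out other low-cost models.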
Full ranking
| # | Model | Score | Cost/run | Speed | Best for | AA Coding |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4.8-medium | 92.3 Excellent | $0.0535 | 27.2s | Best overall | 74.3 |
| 2 | gpt-5.5-medium | 92.2 Excellent | $0.0545 | 27.7s | Best overall | 74.9 |
| 3 | gpt-5.5-max | 91.5 Excellent | $0.0935 | 51.0s | Best overall | 74.9 |
| 4 | claude-opus-4.8-max | 91.3 Excellent | $0.1343 | 65.1s | Best overall | 74.3 |
| 5 | kimi-k2.5-max | 89.7 Strong | $0.0251 | 81.9s | Best overall | — |
| 6 | gemini-3.5-flash-medium | 87.8 Strong | $0.0281 | 19.6s | Best overall | 70.1 |
| 7 | claude-sonnet-4.6-max | 87.0 Strong | $0.0513 | 42.6s | Best overall | 63 |
| 8 | gpt-5.4-mini-max | 86.5 Strong | $0.0257 | 25.2s | Best overall | 56.1 |
| 9 | qwen3.7-max-max | 86.0 Strong | $0.0277 | 55.8s | Best overall | 66 |
| 10 | gpt-5.4-medium | 85.8 Strong | $0.0367 | 21.6s | Best overall | 71.1 |
| 11 | qwen3.7-max-medium | 85.8 Strong | $0.0255 | 53.8s | Best overall | 66 |
| 12 | gemini-3-flash-preview-medium | 85.3 Strong | $0.0216 | 19.5s | Best overall | — |
| 13 | kimi-k2.7-code-max | 85.2 Strong | $0.0255 | 37.6s | Best overall | 60.8 |
| 14 | claude-opus-4.6-medium | 84.5 Strong | $0.0675 | 42.7s | Strong drafts | — |
| 15 | kimi-k2.7-code-medium | 84.0 Strong | $0.0243 | 42.7s | Strong drafts | 60.8 |
| 16 | claude-opus-4.6-max | 84.0 Strong | $0.0696 | 43.9s | Strong drafts | — |
| 17 | claude-opus-4.5-max | 83.3 Strong | $0.1034 | 53.4s | Strong drafts | — |
| 18 | gemini-3.1-flash-lite-medium | 82.8 Strong | $0.0205 | 19.9s | Strong drafts | — |
| 19 | claude-sonnet-4.5-max | 82.1 Strong | $0.0513 | 42.0s | Strong drafts | — |
| 20 | gemini-3.1-flash-lite-max | 81.8 Strong | $0.0178 | 16.8s | Strong drafts | — |
| 21 | claude-opus-4.5-medium | 81.6 Strong | $0.0931 | 47.9s | Strong drafts | — |
| 22 | gemini-3.1-pro-preview-max | 81.3 Strong | $0.0361 | 25.3s | Strong drafts | 68.8 |
| 23 | gpt-5.4-mini-medium | 81.0 Strong | $0.0209 | 18.2s | Strong drafts | 56.1 |
| 24 | glm-5-max | 80.6 Strong | $0.0172 | 68.2s | Strong drafts | — |
| 25 | qwen3.5-plus-02-15-max | 80.5 Strong | $0.0220 | 63.5s | Strong drafts | — |
| 26 | gemini-3.5-flash-max | 80.2 Strong | $0.0418 | 27.6s | Strong drafts | 70.1 |
| 27 | gpt-5.4-max | 80.0 Usable | $0.0472 | 26.8s | Strong drafts | 71.1 |
| 28 | claude-sonnet-4.6-medium | 79.3 Usable | $0.0552 | 45.2s | Strong drafts | 63 |
| 29 | gemini-3.1-flash-lite-preview-max | 79.0 Usable | $0.0198 | 18.6s | Strong drafts | 34.7 |
| 30 | glm-5-medium | 79.0 Usable | $0.0200 | 64.8s | Strong drafts | — |
| 31 | gemini-3-flash-preview-max | 78.5 Usable | $0.0224 | 21.8s | Strong drafts | — |
| 32 | kimi-k2.5-medium | 78.2 Usable | $0.0234 | 71.8s | Strong drafts | — |
| 33 | grok-4.20-medium | 78.2 Usable | $0.0201 | 15.2s | Strong drafts | — |
| 34 | qwen3.5-plus-02-15-medium | 77.7 Usable | $0.0235 | 66.6s | Strong drafts | — |
| 35 | grok-4.20-max | 77.3 Usable | $0.0205 | 15.2s | Strong drafts | — |
| 36 | claude-sonnet-4.5-medium | 77.1 Usable | $0.0456 | 37.8s | Strong drafts | — |
| 37 | claude-haiku-4.5-max | 76.8 Usable | $0.0343 | 32.0s | Strong drafts | 43.9 |
| 38 | gemini-3.1-flash-lite-preview-medium | 76.7 Usable | $0.0189 | 18.3s | Strong drafts | 34.7 |
| 39 | deepseek-v3.1-terminus-medium | 76.0 Usable | $0.0177 | 39.2s | Strong drafts | — |
| 40 | gpt-5-mini-medium | 75.7 Usable | $0.0208 | 28.4s | Strong drafts | — |
| 41 | deepseek-v3.2-medium | 73.4 Usable | $0.0246 | 42.8s | Needs review | — |
| 42 | gpt-5-mini-max | 73.0 Usable | $0.0257 | 50.6s | Needs review | — |
| 43 | grok-4.20-beta-max | 72.8 Usable | $0.0181 | 13.1s | Needs review | — |
| 44 | gemini-3.1-pro-preview-medium | 72.6 Usable | $0.0353 | 25.1s | Needs review | 68.8 |
| 45 | mistral-medium-3.1-medium | 72.5 Usable | $0.0208 | 20.8s | Needs review | — |
| 46 | claude-haiku-4.5-medium | 72.2 Usable | $0.0312 | 29.3s | Needs review | 43.9 |
| 47 | mistral-medium-3.1-max | 71.5 Usable | $0.0213 | 19.7s | Needs review | — |
| 48 | deepseek-v3.1-terminus-max | 68.8 Needs editing | $0.0182 | 44.7s | Needs review | — |
| 49 | deepseek-v3.2-max | 68.6 Needs editing | $0.0162 | 43.2s | Needs review | — |
| 50 | minimax-m2.7-max | 67.9 Needs editing | $0.0210 | 42.9s | Needs review | 52.6 |
| 51 | grok-4.20-beta-medium | 63.8 Needs editing | $0.0184 | 12.6s | Needs review | — |
| 52 | minimax-m2.7-medium | 62.8 Needs editing | $0.0229 | 45.7s | Needs review | 52.6 |
“AA Coding” is a third-party benchmark shown for context, independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).
What separates the top models
Bug Fixing
Medium · Tests diagnosing and fixing the root cause of a bug with a minimal change, handling the edge case without changing the public API.
Code Review and Risk Test
Hard · Tests security, migration, race-condition, and operational review quality.
Code Review & Security
Hard · Tests surfacing the real high-severity issue (security, concurrency, or operational) above style nits.
Leader: qwen3.5-plus-02-15-medium
Secure Implementation
Hard · Tests implementing security-sensitive code correctly: parameterized queries, constant-time compares, replay protection.
Leader: gpt-5.5-max
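To make two of those requirements concrete, here is a minimal sketch of webhook verification with a constant-time compare and replay protection. It is not the benchmark's reference solution. The header names (`X-Timestamp`, `X-Event-Id`, `X-Signature`), the signing scheme, and the 300-second window are all assumptions chosen for illustration, and the in-memory store is neither thread-safe nor shared across processes.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional

TOLERANCE_SECONDS = 300  # assumed freshness window
_seen_event_ids: Dict[str, float] = {}  # event id -> time first seen (in-memory only)


def sign(secret: bytes, timestamp: str, event_id: str, body: bytes) -> str:
    """Assumed scheme: hex HMAC-SHA256 over "<timestamp>.<event_id>." + body."""
    message = f"{timestamp}.{event_id}.".encode() + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(
    headers: Dict[str, str],
    body: bytes,
    secret: bytes,
    now: Optional[float] = None,
) -> bool:
    """Return True only for a fresh, correctly signed, first-time event."""
    now = time.time() if now is None else now
    timestamp = headers.get("X-Timestamp", "")
    event_id = headers.get("X-Event-Id", "")
    signature = headers.get("X-Signature", "")
    if not (timestamp and event_id and signature):
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale or far-future timestamps.
    if abs(now - sent_at) > TOLERANCE_SECONDS:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    expected = sign(secret, timestamp, event_id, body)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    # Replay protection: drop expired ids, then refuse any id already seen.
    expired = [k for k, seen in _seen_event_ids.items() if now - seen > TOLERANCE_SECONDS]
    for key in expired:
        del _seen_event_ids[key]
    if event_id in _seen_event_ids:
        return False
    _seen_event_ids[event_id] = now
    return True
```

The event id is part of the signed message in this sketch. If it were not, an attacker could replay a captured request under a fresh id and slip past the replay check.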
API and Data Code Test
Medium · Tests practical API, SQL, validation, and data-manipulation code.
Leader: gpt-5.5-medium
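As a rough illustration of the kind of code this benchmark covers, the sketch below validates inputs and then runs a parameterized query with Python's standard `sqlite3` module. The `orders` table, its columns, and the validation limits are hypothetical and not taken from the test suite.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_email TEXT, total_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_email, total_cents) VALUES (?, ?)",
        [("ada@example.com", 1200), ("ada@example.com", 800), ("bob@example.com", 500)],
    )
    return conn


def find_orders(
    conn: sqlite3.Connection, customer_email: str, limit: int = 50
) -> List[Tuple[int, int]]:
    """Return (id, total_cents) rows for one customer, newest first."""
    # Validate inputs before they reach the database.
    if "@" not in customer_email or len(customer_email) > 254:
        raise ValueError("invalid email")
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    # Placeholders (?) keep user input out of the SQL text entirely.
    cursor = conn.execute(
        "SELECT id, total_cents FROM orders "
        "WHERE customer_email = ? ORDER BY id DESC LIMIT ?",
        (customer_email, limit),
    )
    return cursor.fetchall()
```

Because the email is bound as a parameter, a value such as `a@b.c' OR '1'='1` is compared as a literal string and matches no rows.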
Refactoring
Medium · Tests behavior-preserving refactoring that reduces complexity without changing observable behavior or mixing in features.
Leader: kimi-k2.7-code-medium
Code Quality and Testing Test
Hard · Tests refactoring, unit tests, type safety, performance, and multi-file planning.
Leader: claude-opus-4.8-medium
See the difference
Real model outputs and how our judge scored them.
Top score
grok-4.20-max
TypeScript type narrowing
100/100
“The model provides a flawless, production-ready solution. It correctly uses idiomatic TypeScript discriminated union narrowing to handle the null and bank cases without using `any`. The explanation is concise, accurate, and directly addresses the prompt's requirements.”
**Fixed Implementation:**
```ts
type Card = { kind: "card"; last4: string; brand: string };
type Bank = { kind: "bank"; accountLast4: string; routingLast4?: string };
type PaymentMethod = Card | Bank | null;

export function label(method: PaymentMethod): string {
  if (method === null) {
    return "No payment method";
  }
  if (method.kind === "bank") {
    return "Bank ending " + method.accountLast4;
  }
  // TypeScript now knows `method` is `Card`
  return method.brand + " ending " + method.last4;
}
```
### How it handles null and bank accounts
- **Null**: Explicitly checked first and returns a safe fallback string. This eliminates the `null` case from the union.
- **Bank**: Narrowed using the discriminant `method.kind === "bank"`. This gives TypeScript a precise `Bank` type, allowing safe access to `accountLast4` (the correct field for banks) instead of the non-existent `last4` or `brand`.
- **Card**: After the two checks, TypeScript’s control-flow narrowing guarantees `method` is a `Card`, so `method.brand` and `method.last4` are used without error.
This solution uses **discriminated union narrowing** — the clean, idiomatic TypeScript approach — without `any`, type assertions, or forced property access.
Mid-pack
claude-sonnet-4.6-max
Refactor long function
89/100
“The refactoring is correct, minimal, and perfectly preserves all behavior and edge cases. Code quality is good but could be slightly improved by using a generator expression for the subtotal and adding type hints (as type safety is mentioned in the benchmark description). No tests were provided, though they were not explicitly requested in the task prompt.”
## Refactored Function
```python
def _calc_subtotal(lines):
    subtotal = 0
    for line in lines:
        if line["qty"] <= 0:
            continue
        subtotal += line["qty"] * line["unit_cents"]
    return subtotal


def _calc_discount(subtotal, coupon):
    if not coupon:
        return 0
    if coupon["type"] == "percent":
        return int(subtotal * coupon["value"] / 100)
    if coupon["type"] == "fixed":
        return coupon["value"]
    return 0


def _calc_tax(taxable, customer):
    if customer["country"] == "US" and customer.get("state") in ["CA", "NY"]:
        return int(taxable * 0.0825)
    return 0


def invoice_total(lines, customer, coupon):
    subtotal = _calc_subtotal(lines)
    discount = _calc_discount(subtotal, coupon)
    taxable = max(0, subtotal - discount)
    tax = _calc_tax(taxable, customer)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "tax": tax,
        "total": taxable + tax,
    }
```
---
### What changed and why
| Helper | Responsibility extracted |
|---|---|
| `_calc_subtotal` | Iterates lines, skips invalid qty, sums line totals |
| `_calc_discount` | Handles `percent` / `fixed` coupon logic and the `None` guard |
| `_calc_tax` | Encapsulates US state tax rule |
| `invoice_total` | Now reads as a clear four-step pipeline; public API is identical |
### What did **not** change
- Function signature: `invoice_total(lines, customer, coupon)`
- Return dict keys and values: `subtotal`, `discount`, `tax`, `total`
- All edge-case behavior: zero/negative qty skipped, `max(0, …)` floor on taxable amount, integer rounding on percent discounts and tax
Lowest score
gemini-3.1-pro-preview-medium
HMAC webhook verification with replay protection
0/100
“The model's response is incomplete and cuts off before implementing the actual webhook verification logic, including the timestamp check, signature verification, and replay protection. As a result, it fails all requirements.”
Here is a complete, production-ready Python implementation using only the standard library.
It includes a thread-safe, self-cleaning TTL cache to prevent replay attacks, strict timestamp validation to prevent delayed attacks, and `hmac.compare_digest` to prevent timing attacks.
```python
import hmac
import hashlib
import time
import threading
from typing import Dict

# --- Custom Exceptions ---
class WebhookError(Exception): pass
class InvalidHeaderError(WebhookError): pass
class TimestampExpiredError(WebhookError): pass
class ReplayAttackError(WebhookError): pass
class SignatureMismatchError(WebhookError): pass

# --- Replay Prevention Cache ---
class EventIdCache:
    """A thread-safe, self-cleaning cache to store seen Event IDs."""
    def __init__(self, ttl_seconds: int = 300):
        self.cache: Dict[str, float] = {}
        self.ttl = ttl_seconds
        self.lock = threading.Lock()

    def check_and_add(self, event_id: str) -> bool:
        """Returns True if the event is new and added, False if it's a replay."""
        current_time = time.time()
        with self.lock:
            # Lazy cleanup of expired events to prevent memory leaks
            expired_keys = [k for k, v in self.cache.items() if current_time - v > self.ttl]
            for k in expired_keys:
                del self.cache[k]
            # Check for replay
            if event_id in self.cache:
                return False
            # Add new event
            self.cache[event_id] = current_time
            return True

# Initialize a global cache (5 minutes TTL matches our timestamp tolerance)
seen_events_cache = EventIdCache(ttl_seconds=300)

# --- Webhook Verification ---
def verify_webhook(
    headers: Dict[str, str],
    raw_body: bytes,
    secret: byte
```
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for coding?
In our benchmarks, claude-opus-4.8-medium ranks first for coding, scoring 92.3 (Excellent) across 31 test cases.
What is the cheapest good model for coding?
glm-5-max is the best value: it clears our quality bar for coding at $0.017 per run.
Which model is fastest for coding?
grok-4.20-beta-max is the fastest model that still performs well for coding.
How we test
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 1000 model runs across 7 benchmarks · last tested 2026-06-30
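The page does not publish the judge's JSON schema, the heuristics, or how the two are weighted, so the sketch below is only a guess at the general shape of such a pipeline. The `criteria` field, the 0-10 rating scale, the pass/fail heuristic checks, and the 20% heuristic weight are all assumptions.

```python
import json
from typing import Mapping


def normalize_score(
    judge_json: str,
    heuristic_checks: Mapping[str, bool],
    heuristic_weight: float = 0.2,  # assumed weighting, not a published value
) -> float:
    """Blend a judge verdict with pass/fail heuristics into a 0-100 score."""
    try:
        verdict = json.loads(judge_json)
        # Assumed shape: {"criteria": {"correctness": 0-10, ...}}
        ratings = [float(v) for v in verdict["criteria"].values()]
    except (ValueError, KeyError, TypeError, AttributeError):
        # "Strict" is read here as: reject malformed verdicts instead of guessing.
        raise ValueError("judge output is not valid JSON in the expected shape") from None
    if not ratings or any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be non-empty and within 0-10")
    judge_part = sum(ratings) / len(ratings) / 10  # scaled to 0..1
    if heuristic_checks:
        passed = sum(1 for ok in heuristic_checks.values() if ok)
        heuristic_part = passed / len(heuristic_checks)
    else:
        heuristic_part = judge_part  # no heuristics: the judge decides alone
    blended = (1 - heuristic_weight) * judge_part + heuristic_weight * heuristic_part
    return round(100 * blended, 1)
```

For example, judge ratings of 10 and 8 with one of two heuristic checks passing come out at 82.0 under these assumed weights.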
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.