Business · 31 tasks · 52 models

Best AI models for Coding

Name: Coding AI model benchmark
Creator: Spring Prompt

Which models fix the root cause, catch the real security bug, and don't write code that's subtly wrong or hallucinated?

Top models

Anthropic: claude-opus-4.8-medium

OpenAI: gpt-5.5-medium

Moonshot: kimi-k2.5-max

claude-opus-4.8-medium leads Coding (excellent). For tighter budgets, glm-5-max is competitive at about 32% of the cost.
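The "about 32% of the cost" figure can be checked directly from the per-run numbers in the ranking table. The snippet below is only a worked calculation; the dictionary layout and helper names are illustrative, not part of the benchmark.

```python
# Figures copied from this page's ranking table.
leader = {"model": "claude-opus-4.8-medium", "score": 92.3, "cost_per_run": 0.0535}
budget = {"model": "glm-5-max", "score": 80.6, "cost_per_run": 0.0172}


def cost_ratio(cheaper, reference):
    """Cost of `cheaper` as a fraction of the cost of `reference`."""
    return cheaper["cost_per_run"] / reference["cost_per_run"]


def score_gap(cheaper, reference):
    """Score points given up by choosing `cheaper` over `reference`."""
    return reference["score"] - cheaper["score"]


ratio_percent = round(cost_ratio(budget, leader) * 100)
gap_points = round(score_gap(budget, leader), 1)
print(f"{budget['model']}: {ratio_percent}% of the cost, {gap_points} points lower")
```

On these numbers the cheaper model costs roughly a third as much per run and gives up about 11.7 score points.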

Best overall · Excellent

claude-opus-4.8-medium

Top score: Excellent

92.3 score · $0.0535/run · 27.2s

Best value · Strong

glm-5-max

Clears the quality bar at $0.017/run

80.6 score · $0.0172/run · 68.2s

Fastest usable · Usable

grok-4.20-beta-max

~13s per run, still usable

72.8 score · $0.0181/run · 13.1s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
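One way to read "high and to the left" is as a Pareto frontier: a model is on it if no other model is both at least as good and at least as cheap. The sketch below applies that idea to a handful of rows copied from the ranking table. The function name and the choice of rows are illustrative, and the frontier it returns is only relative to this subset, not to all 52 models.

```python
# (model, score, cost per run in USD): a subset of rows from the ranking table.
SAMPLE = [
    ("claude-opus-4.8-medium", 92.3, 0.0535),
    ("gpt-5.5-medium", 92.2, 0.0545),
    ("claude-opus-4.8-max", 91.3, 0.1343),
    ("kimi-k2.5-max", 89.7, 0.0251),
    ("gemini-3.1-flash-lite-max", 81.8, 0.0178),
    ("glm-5-max", 80.6, 0.0172),
    ("deepseek-v3.2-max", 68.6, 0.0162),
]


def pareto_frontier(models):
    """Return the models that no other model beats on both score and cost."""
    frontier = []
    for name, score, cost in models:
        dominated = any(
            other_score >= score
            and other_cost <= cost
            and (other_score > score or other_cost < cost)
            for _, other_score, other_cost in models
        )
        if not dominated:
            frontier.append((name, score, cost))
    # Cheapest first, so the list reads left to right along the cost axis.
    return sorted(frontier, key=lambda row: row[2])


for name, score, cost in pareto_frontier(SAMPLE):
    print(f"{name}: {score} at ${cost:.4f}/run")
```

Within this subset, gpt-5.5-medium and claude-opus-4.8-max drop out because claude-opus-4.8-medium scores higher at a lower cost.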

Full ranking


#	Model	Score	Cost/run	Speed	Best for	AA Coding
1	claude-opus-4.8-medium	92.3 Excellent	$0.0535	27.2s	Best overall	74.3
2	gpt-5.5-medium	92.2 Excellent	$0.0545	27.7s	Best overall	74.9
3	gpt-5.5-max	91.5 Excellent	$0.0935	51.0s	Best overall	74.9
4	claude-opus-4.8-max	91.3 Excellent	$0.1343	65.1s	Best overall	74.3
5	kimi-k2.5-max	89.7 Strong	$0.0251	81.9s	Best overall	—
6	gemini-3.5-flash-medium	87.8 Strong	$0.0281	19.6s	Best overall	70.1
7	claude-sonnet-4.6-max	87.0 Strong	$0.0513	42.6s	Best overall	63
8	gpt-5.4-mini-max	86.5 Strong	$0.0257	25.2s	Best overall	56.1
9	qwen3.7-max-max	86.0 Strong	$0.0277	55.8s	Best overall	66
10	gpt-5.4-medium	85.8 Strong	$0.0367	21.6s	Best overall	71.1
11	qwen3.7-max-medium	85.8 Strong	$0.0255	53.8s	Best overall	66
12	gemini-3-flash-preview-medium	85.3 Strong	$0.0216	19.5s	Best overall	—
13	kimi-k2.7-code-max	85.2 Strong	$0.0255	37.6s	Best overall	60.8
14	claude-opus-4.6-medium	84.5 Strong	$0.0675	42.7s	Strong drafts	—
15	kimi-k2.7-code-medium	84.0 Strong	$0.0243	42.7s	Strong drafts	60.8
16	claude-opus-4.6-max	84.0 Strong	$0.0696	43.9s	Strong drafts	—
17	claude-opus-4.5-max	83.3 Strong	$0.1034	53.4s	Strong drafts	—
18	gemini-3.1-flash-lite-medium	82.8 Strong	$0.0205	19.9s	Strong drafts	—
19	claude-sonnet-4.5-max	82.1 Strong	$0.0513	42.0s	Strong drafts	—
20	gemini-3.1-flash-lite-max	81.8 Strong	$0.0178	16.8s	Strong drafts	—
21	claude-opus-4.5-medium	81.6 Strong	$0.0931	47.9s	Strong drafts	—
22	gemini-3.1-pro-preview-max	81.3 Strong	$0.0361	25.3s	Strong drafts	68.8
23	gpt-5.4-mini-medium	81.0 Strong	$0.0209	18.2s	Strong drafts	56.1
24	glm-5-max	80.6 Strong	$0.0172	68.2s	Strong drafts	—
25	qwen3.5-plus-02-15-max	80.5 Strong	$0.0220	63.5s	Strong drafts	—
26	gemini-3.5-flash-max	80.2 Strong	$0.0418	27.6s	Strong drafts	70.1
27	gpt-5.4-max	80.0 Usable	$0.0472	26.8s	Strong drafts	71.1
28	claude-sonnet-4.6-medium	79.3 Usable	$0.0552	45.2s	Strong drafts	63
29	gemini-3.1-flash-lite-preview-max	79.0 Usable	$0.0198	18.6s	Strong drafts	34.7
30	glm-5-medium	79.0 Usable	$0.0200	64.8s	Strong drafts	—
31	gemini-3-flash-preview-max	78.5 Usable	$0.0224	21.8s	Strong drafts	—
32	kimi-k2.5-medium	78.2 Usable	$0.0234	71.8s	Strong drafts	—
33	grok-4.20-medium	78.2 Usable	$0.0201	15.2s	Strong drafts	—
34	qwen3.5-plus-02-15-medium	77.7 Usable	$0.0235	66.6s	Strong drafts	—
35	grok-4.20-max	77.3 Usable	$0.0205	15.2s	Strong drafts	—
36	claude-sonnet-4.5-medium	77.1 Usable	$0.0456	37.8s	Strong drafts	—
37	claude-haiku-4.5-max	76.8 Usable	$0.0343	32.0s	Strong drafts	43.9
38	gemini-3.1-flash-lite-preview-medium	76.7 Usable	$0.0189	18.3s	Strong drafts	34.7
39	deepseek-v3.1-terminus-medium	76.0 Usable	$0.0177	39.2s	Strong drafts	—
40	gpt-5-mini-medium	75.7 Usable	$0.0208	28.4s	Strong drafts	—
41	deepseek-v3.2-medium	73.4 Usable	$0.0246	42.8s	Needs review	—
42	gpt-5-mini-max	73.0 Usable	$0.0257	50.6s	Needs review	—
43	grok-4.20-beta-max	72.8 Usable	$0.0181	13.1s	Needs review	—
44	gemini-3.1-pro-preview-medium	72.6 Usable	$0.0353	25.1s	Needs review	68.8
45	mistral-medium-3.1-medium	72.5 Usable	$0.0208	20.8s	Needs review	—
46	claude-haiku-4.5-medium	72.2 Usable	$0.0312	29.3s	Needs review	43.9
47	mistral-medium-3.1-max	71.5 Usable	$0.0213	19.7s	Needs review	—
48	deepseek-v3.1-terminus-max	68.8 Needs editing	$0.0182	44.7s	Needs review	—
49	deepseek-v3.2-max	68.6 Needs editing	$0.0162	43.2s	Needs review	—
50	minimax-m2.7-max	67.9 Needs editing	$0.0210	42.9s	Needs review	52.6
51	grok-4.20-beta-medium	63.8 Needs editing	$0.0184	12.6s	Needs review	—
52	minimax-m2.7-medium	62.8 Needs editing	$0.0229	45.7s	Needs review	52.6

“AA Coding” is a third-party benchmark shown for context, independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).

What separates the top models

Bug Fixing

medium

Tests diagnosing and fixing the root cause of a bug minimally, handling the edge case, without changing the public API.

Code Review and Risk Test

hard

Tests security, migration, race-condition, and operational review quality.

Code Review & Security

hard

Tests surfacing the real high-severity issue (security/concurrency/operational) above style nits.

Leader: qwen3.5-plus-02-15-medium

Secure Implementation

hard

Tests implementing security-sensitive code correctly — parameterized queries, constant-time compares, replay protection.

Leader: gpt-5.5-max
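To make two of the named techniques concrete, here is a minimal sketch of a constant-time signature comparison combined with replay protection, using only the Python standard library. It is not the benchmark's task or reference solution. The signing scheme (timestamp, event ID, and body joined with dots), the 300-second window, and every name below are assumptions chosen for illustration.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed acceptance window for timestamps

# event_id -> time first seen. In-memory and single-process only (an assumption);
# a multi-worker deployment would need a shared store.
_seen_event_ids = {}


def sign(secret: bytes, timestamp: str, event_id: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over timestamp, event ID, and body (hypothetical scheme)."""
    message = timestamp.encode() + b"." + event_id.encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, timestamp, event_id, signature, body, now=None) -> bool:
    """Return True only for a fresh, correctly signed, not yet seen event."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future

    expected = sign(secret, timestamp, event_id, body)
    # compare_digest takes time independent of where the inputs differ.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False

    # Forget IDs older than the window, then reject anything already seen.
    for seen_id, seen_at in list(_seen_event_ids.items()):
        if now - seen_at > TOLERANCE_SECONDS:
            del _seen_event_ids[seen_id]
    if event_id in _seen_event_ids:
        return False
    _seen_event_ids[event_id] = now
    return True
```

The event ID is part of the signed message here so that a replayed request cannot get past the cache just by swapping in a new ID. The signature check also runs before the cache is touched, so unsigned traffic cannot fill it.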

API and Data Code Test

medium

Tests practical API, SQL, validation, and data-manipulation code.

Leader: gpt-5.5-medium

Refactoring

medium

Tests behavior-preserving refactoring that reduces complexity without changing observable behavior or mixing in features.

Leader: kimi-k2.7-code-medium

Code Quality and Testing Test

hard

Tests refactoring, unit tests, type safety, performance, and multi-file planning.

Leader: claude-opus-4.8-medium

See the difference

Real model outputs and how our judge scored them.

Top score grok-4.20-max TypeScript type narrowing

100/100

“The model provides a flawless, production-ready solution. It correctly uses idiomatic TypeScript discriminated union narrowing to handle the null and bank cases without using `any`. The explanation is concise, accurate, and directly addresses the prompt's requirements.”

**Fixed Implementation:**

```ts
type Card = { kind: "card"; last4: string; brand: string };
type Bank = { kind: "bank"; accountLast4: string; routingLast4?: string };
type PaymentMethod = Card | Bank | null;

export function label(method: PaymentMethod): string {
  if (method === null) {
    return "No payment method";
  }
  if (method.kind === "bank") {
    return "Bank ending " + method.accountLast4;
  }
  // TypeScript now knows `method` is `Card`
  return method.brand + " ending " + method.last4;
}
```

### How it handles null and bank accounts

- **Null**: Explicitly checked first and returns a safe fallback string. This eliminates the `null` case from the union.
- **Bank**: Narrowed using the discriminant `method.kind === "bank"`. This gives TypeScript a precise `Bank` type, allowing safe access to `accountLast4` (the correct field for banks) instead of the non-existent `last4` or `brand`.
- **Card**: After the two checks, TypeScript’s control-flow narrowing guarantees `method` is a `Card`, so `method.brand` and `method.last4` are used without error.

This solution uses **discriminated union narrowing** — the clean, idiomatic TypeScript approach — without `any`, type assertions, or forced property access.

Mid-pack claude-sonnet-4.6-max Refactor long function

89/100

“The refactoring is correct, minimal, and perfectly preserves all behavior and edge cases. Code quality is good but could be slightly improved by using a generator expression for the subtotal and adding type hints (as type safety is mentioned in the benchmark description). No tests were provided, though they were not explicitly requested in the task prompt.”

## Refactored Function

```python
def _calc_subtotal(lines):
    subtotal = 0
    for line in lines:
        if line["qty"] <= 0:
            continue
        subtotal += line["qty"] * line["unit_cents"]
    return subtotal


def _calc_discount(subtotal, coupon):
    if not coupon:
        return 0
    if coupon["type"] == "percent":
        return int(subtotal * coupon["value"] / 100)
    if coupon["type"] == "fixed":
        return coupon["value"]
    return 0


def _calc_tax(taxable, customer):
    if customer["country"] == "US" and customer.get("state") in ["CA", "NY"]:
        return int(taxable * 0.0825)
    return 0


def invoice_total(lines, customer, coupon):
    subtotal = _calc_subtotal(lines)
    discount = _calc_discount(subtotal, coupon)
    taxable = max(0, subtotal - discount)
    tax = _calc_tax(taxable, customer)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "tax": tax,
        "total": taxable + tax,
    }
```

---

### What changed and why

| Helper | Responsibility extracted |
|---|---|
| `_calc_subtotal` | Iterates lines, skips invalid qty, sums line totals |
| `_calc_discount` | Handles `percent` / `fixed` coupon logic and the `None` guard |
| `_calc_tax` | Encapsulates US state tax rule |
| `invoice_total` | Now reads as a clear four-step pipeline; public API is identical |

### What did **not** change
- Function signature: `invoice_total(lines, customer, coupon)`
- Return dict keys and values: `subtotal`, `discount`, `tax`, `total`
- All edge-case behavior: zero/negative qty skipped, `max(0, …)` floor on taxable amount, integer rounding on percent discounts and tax

Lowest score gemini-3.1-pro-preview-medium HMAC webhook verification with replay protection

0/100

“The model's response is incomplete and cuts off before implementing the actual webhook verification logic, including the timestamp check, signature verification, and replay protection. As a result, it fails all requirements.”

Here is a complete, production-ready Python implementation using only the standard library. 

It includes a thread-safe, self-cleaning TTL cache to prevent replay attacks, strict timestamp validation to prevent delayed attacks, and `hmac.compare_digest` to prevent timing attacks.

```python
import hmac
import hashlib
import time
import threading
from typing import Dict

# --- Custom Exceptions ---
class WebhookError(Exception): pass
class InvalidHeaderError(WebhookError): pass
class TimestampExpiredError(WebhookError): pass
class ReplayAttackError(WebhookError): pass
class SignatureMismatchError(WebhookError): pass

# --- Replay Prevention Cache ---
class EventIdCache:
    """A thread-safe, self-cleaning cache to store seen Event IDs."""
    def __init__(self, ttl_seconds: int = 300):
        self.cache: Dict[str, float] = {}
        self.ttl = ttl_seconds
        self.lock = threading.Lock()

    def check_and_add(self, event_id: str) -> bool:
        """Returns True if the event is new and added, False if it's a replay."""
        current_time = time.time()
        
        with self.lock:
            # Lazy cleanup of expired events to prevent memory leaks
            expired_keys = [k for k, v in self.cache.items() if current_time - v > self.ttl]
            for k in expired_keys:
                del self.cache[k]

            # Check for replay
            if event_id in self.cache:
                return False
            
            # Add new event
            self.cache[event_id] = current_time
            return True

# Initialize a global cache (5 minutes TTL matches our timestamp tolerance)
seen_events_cache = EventIdCache(ttl_seconds=300)

# --- Webhook Verification ---
def verify_webhook(
    headers: Dict[str, str], 
    raw_body: bytes, 
    secret: byte
```

Where models still fail

The most common problems we flagged across all models.

24 silently wrong · 16 empty response · 11 broken code · 10 major task miss · 9 changed public API · 9 logic error · 6 incomplete output · 5 hallucinated API

Frequently asked

What is the best AI model for coding?

In our benchmarks, claude-opus-4.8-medium ranks first for coding with a score of 92.3 (Excellent) across 31 test cases.

What is the cheapest good model for coding?

glm-5-max is the best value: it clears our quality bar for coding at $0.017 per run.

Which model is fastest for coding?

grok-4.20-beta-max is the fastest model that still reaches the Usable tier for coding, at about 13.1s per run with a score of 72.8.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 1000 model runs across 7 benchmarks · last tested 2026-06-30
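The page does not publish the scoring code, so the following is only a hypothetical sketch of what "strict JSON judge, deterministic heuristics, normalized to 0-100" could look like. The verdict fields (`score`, `reason`), the 0-10 judge scale, and the empty-response rule are all assumptions made for illustration.

```python
import json


def parse_judge_verdict(raw: str) -> dict:
    """Parse the judge's reply strictly: valid JSON with the expected fields."""
    verdict = json.loads(raw)  # raises ValueError on anything that is not JSON
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    score = verdict.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("verdict needs a numeric 'score'")
    if not isinstance(verdict.get("reason"), str):
        raise ValueError("verdict needs a string 'reason'")
    return verdict


def final_score(model_output: str, raw_verdict: str, judge_scale: float = 10.0) -> float:
    """Combine a deterministic check with the judge score on a 0-100 scale."""
    # Assumed heuristic: an empty response scores zero without asking the judge.
    if not model_output.strip():
        return 0.0
    verdict = parse_judge_verdict(raw_verdict)
    normalized = verdict["score"] / judge_scale * 100.0
    return max(0.0, min(100.0, normalized))  # clamp out-of-range judge scores


print(final_score("def f(): ...", '{"score": 7.5, "reason": "minor gaps"}'))
```

Rejecting malformed verdicts outright, rather than trying to salvage a number from free text, is one reading of what "strict" means here.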

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s