Business · 31 tasks · 52 models

Best AI models for Coding

Which models fix the root cause, catch the real security bug, and don't write code that's subtly wrong or hallucinated?

Top models: claude-opus-4.8-medium (Anthropic) · gpt-5.5-medium (OpenAI) · kimi-k2.5-max (Moonshot)

claude-opus-4.8-medium leads Coding (excellent). For tighter budgets, glm-5-max is competitive at about 32% of the cost.

Best overall (Excellent): claude-opus-4.8-medium
Top score, rated excellent.
92.3 score · $0.0535/run · 27.2s

Best value (Strong): glm-5-max
Clears the quality bar at $0.017/run.
80.6 score · $0.0172/run · 68.2s

Fastest usable (Usable): grok-4.20-beta-max
~13s per run, still usable.
72.8 score · $0.0181/run · 13.1s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.

| # | Model | Score | Rating |
|---|---|---|---|
| 1 | claude-opus-4.8-medium | 92.3 | Excellent |
| 2 | gpt-5.5-medium | 92.2 | Excellent |
| 3 | gpt-5.5-max | 91.5 | Excellent |
| 4 | claude-opus-4.8-max | 91.3 | Excellent |
| 5 | kimi-k2.5-max | 89.7 | Strong |
| 6 | gemini-3.5-flash-medium | 87.8 | Strong |
| 7 | claude-sonnet-4.6-max | 87.0 | Strong |
| 8 | gpt-5.4-mini-max | 86.5 | Strong |

“AA Coding” is a third-party benchmark shown for context and is independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).
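
One way to read "high and to the left" is as a Pareto comparison: a model is a value candidate if no other model scores at least as well for no more money. The sketch below is illustrative only. It uses the three score and cost pairs quoted in the summary cards on this page, and the `Model` class and `pareto_front` helper are hypothetical names, not part of Spring Prompt.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    name: str
    score: float         # 0-100 quality score
    cost_per_run: float  # USD per run


# Figures copied from the summary cards on this page.
MODELS = [
    Model("claude-opus-4.8-medium", 92.3, 0.0535),
    Model("glm-5-max", 80.6, 0.0172),
    Model("grok-4.20-beta-max", 72.8, 0.0181),
]


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.cost_per_run <= b.cost_per_run
    strictly_better = a.score > b.score or a.cost_per_run < b.cost_per_run
    return no_worse and strictly_better


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


front = pareto_front(MODELS)
cost_ratio = MODELS[1].cost_per_run / MODELS[0].cost_per_run

print([m.name for m in front])  # ['claude-opus-4.8-medium', 'glm-5-max']
print(f"{cost_ratio:.0%}")      # 32%
```

On these three points, grok-4.20-beta-max drops out on score and cost alone because glm-5-max is both cheaper and higher scoring. The sketch ignores latency, which is the axis grok-4.20-beta-max is listed for.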

What separates the top models

Bug Fixing

medium

Tests diagnosing and fixing the root cause of a bug minimally, handling the edge case, without changing the public API.

Code Review and Risk Test

hard

Tests security, migration, race-condition, and operational review quality.

Code Review & Security

hard

Tests surfacing the real high-severity issue (security/concurrency/operational) above style nits.

Leader: qwen3.5-plus-02-15-medium

Secure Implementation

hard

Tests implementing security-sensitive code correctly — parameterized queries, constant-time compares, replay protection.

Leader: gpt-5.5-max
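
For readers unfamiliar with two of the techniques named above, here is a minimal sketch of a constant-time compare combined with replay protection for a signed webhook. It is not the benchmark's reference solution. The signing scheme (HMAC-SHA256 over the timestamp, a dot, and the body), the 300-second window, and the in-memory cache keyed by signature are all assumptions made for illustration.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional

TOLERANCE_SECONDS = 300  # assumed freshness window

# Assumed in-memory replay cache: accepted signature -> time it was accepted.
# Not thread-safe; a real service would guard it with a lock or use a shared store.
_accepted: Dict[str, float] = {}


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hypothetical scheme: hex HMAC-SHA256 over b"<timestamp>." + body."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now

    # 1. Reject timestamps outside the freshness window.
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False

    # 2. Compare in constant time so timing does not leak the expected value.
    expected = sign(secret, timestamp, body)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False

    # 3. Replay protection: drop expired entries, then refuse a repeat.
    for sig, accepted_at in list(_accepted.items()):
        if now - accepted_at > TOLERANCE_SECONDS:
            del _accepted[sig]
    if expected in _accepted:
        return False
    _accepted[expected] = now
    return True
```

Keying the cache on the verified signature means the timestamp is covered by the same check: an attacker cannot change it to slip past the cache without invalidating the signature.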

API and Data Code Test

medium

Tests practical API, SQL, validation, and data-manipulation code.

Leader: gpt-5.5-medium

Refactoring

medium

Tests behavior-preserving refactoring that reduces complexity without changing observable behavior or mixing in features.

Leader: kimi-k2.7-code-medium

Code Quality and Testing Test

hard

Tests refactoring, unit tests, type safety, performance, and multi-file planning.

Leader: claude-opus-4.8-medium

See the difference

Real model outputs and how our judge scored them.

Top score · grok-4.20-max · TypeScript type narrowing · 100/100

“The model provides a flawless, production-ready solution. It correctly uses idiomatic TypeScript discriminated union narrowing to handle the null and bank cases without using `any`. The explanation is concise, accurate, and directly addresses the prompt's requirements.”

**Fixed Implementation:**

```ts
type Card = { kind: "card"; last4: string; brand: string };
type Bank = { kind: "bank"; accountLast4: string; routingLast4?: string };
type PaymentMethod = Card | Bank | null;

export function label(method: PaymentMethod): string {
  if (method === null) {
    return "No payment method";
  }
  if (method.kind === "bank") {
    return "Bank ending " + method.accountLast4;
  }
  // TypeScript now knows `method` is `Card`
  return method.brand + " ending " + method.last4;
}
```

### How it handles null and bank accounts

- **Null**: Explicitly checked first and returns a safe fallback string. This eliminates the `null` case from the union.
- **Bank**: Narrowed using the discriminant `method.kind === "bank"`. This gives TypeScript a precise `Bank` type, allowing safe access to `accountLast4` (the correct field for banks) instead of the non-existent `last4` or `brand`.
- **Card**: After the two checks, TypeScript’s control-flow narrowing guarantees `method` is a `Card`, so `method.brand` and `method.last4` are used without error.

This solution uses **discriminated union narrowing** — the clean, idiomatic TypeScript approach — without `any`, type assertions, or forced property access.

Mid-pack · claude-sonnet-4.6-max · Refactor long function · 89/100

“The refactoring is correct, minimal, and perfectly preserves all behavior and edge cases. Code quality is good but could be slightly improved by using a generator expression for the subtotal and adding type hints (as type safety is mentioned in the benchmark description). No tests were provided, though they were not explicitly requested in the task prompt.”

## Refactored Function

```python
def _calc_subtotal(lines):
    subtotal = 0
    for line in lines:
        if line["qty"] <= 0:
            continue
        subtotal += line["qty"] * line["unit_cents"]
    return subtotal


def _calc_discount(subtotal, coupon):
    if not coupon:
        return 0
    if coupon["type"] == "percent":
        return int(subtotal * coupon["value"] / 100)
    if coupon["type"] == "fixed":
        return coupon["value"]
    return 0


def _calc_tax(taxable, customer):
    if customer["country"] == "US" and customer.get("state") in ["CA", "NY"]:
        return int(taxable * 0.0825)
    return 0


def invoice_total(lines, customer, coupon):
    subtotal = _calc_subtotal(lines)
    discount = _calc_discount(subtotal, coupon)
    taxable = max(0, subtotal - discount)
    tax = _calc_tax(taxable, customer)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "tax": tax,
        "total": taxable + tax,
    }
```

---

### What changed and why

| Helper | Responsibility extracted |
|---|---|
| `_calc_subtotal` | Iterates lines, skips invalid qty, sums line totals |
| `_calc_discount` | Handles `percent` / `fixed` coupon logic and the `None` guard |
| `_calc_tax` | Encapsulates US state tax rule |
| `invoice_total` | Now reads as a clear four-step pipeline; public API is identical |

### What did **not** change
- Function signature: `invoice_total(lines, customer, coupon)`
- Return dict keys and values: `subtotal`, `discount`, `tax`, `total`
- All edge-case behavior: zero/negative qty skipped, `max(0, …)` floor on taxable amount, integer rounding on percent discounts and tax
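
Editorial aside, not part of the scored output: the judge's two suggestions, a generator expression for the subtotal and type hints, might look like the sketch below. The `Line`, `Coupon`, and `Customer` type names and their field types are assumptions inferred from how the refactored code reads its inputs.

```python
from typing import Dict, List, Optional, TypedDict


class Line(TypedDict):
    qty: int
    unit_cents: int


class Coupon(TypedDict):
    type: str   # "percent" or "fixed"; anything else yields no discount
    value: int


class Customer(TypedDict, total=False):
    country: str
    state: str


def _calc_subtotal(lines: List[Line]) -> int:
    # Generator expression; lines with qty <= 0 are still skipped.
    return sum(line["qty"] * line["unit_cents"] for line in lines if line["qty"] > 0)


def _calc_discount(subtotal: int, coupon: Optional[Coupon]) -> int:
    if not coupon:
        return 0
    if coupon["type"] == "percent":
        return int(subtotal * coupon["value"] / 100)
    if coupon["type"] == "fixed":
        return coupon["value"]
    return 0


def _calc_tax(taxable: int, customer: Customer) -> int:
    if customer["country"] == "US" and customer.get("state") in ["CA", "NY"]:
        return int(taxable * 0.0825)
    return 0


def invoice_total(lines: List[Line], customer: Customer,
                  coupon: Optional[Coupon]) -> Dict[str, int]:
    subtotal = _calc_subtotal(lines)
    discount = _calc_discount(subtotal, coupon)
    taxable = max(0, subtotal - discount)
    tax = _calc_tax(taxable, customer)
    return {"subtotal": subtotal, "discount": discount, "tax": tax, "total": taxable + tax}


example = invoice_total(
    [{"qty": 2, "unit_cents": 1000}, {"qty": 0, "unit_cents": 500}],
    {"country": "US", "state": "CA"},
    {"type": "percent", "value": 10},
)
print(example)  # {'subtotal': 2000, 'discount': 200, 'tax': 148, 'total': 1948}
```

The control flow is the same as in the model's version, so a few assertions like the example call double as a quick check that behavior is preserved.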

Lowest score · gemini-3.1-pro-preview-medium · HMAC webhook verification with replay protection · 0/100

“The model's response is incomplete and cuts off before implementing the actual webhook verification logic, including the timestamp check, signature verification, and replay protection. As a result, it fails all requirements.”

Here is a complete, production-ready Python implementation using only the standard library. 

It includes a thread-safe, self-cleaning TTL cache to prevent replay attacks, strict timestamp validation to prevent delayed attacks, and `hmac.compare_digest` to prevent timing attacks.

```python
import hmac
import hashlib
import time
import threading
from typing import Dict

# --- Custom Exceptions ---
class WebhookError(Exception): pass
class InvalidHeaderError(WebhookError): pass
class TimestampExpiredError(WebhookError): pass
class ReplayAttackError(WebhookError): pass
class SignatureMismatchError(WebhookError): pass

# --- Replay Prevention Cache ---
class EventIdCache:
    """A thread-safe, self-cleaning cache to store seen Event IDs."""
    def __init__(self, ttl_seconds: int = 300):
        self.cache: Dict[str, float] = {}
        self.ttl = ttl_seconds
        self.lock = threading.Lock()

    def check_and_add(self, event_id: str) -> bool:
        """Returns True if the event is new and added, False if it's a replay."""
        current_time = time.time()
        
        with self.lock:
            # Lazy cleanup of expired events to prevent memory leaks
            expired_keys = [k for k, v in self.cache.items() if current_time - v > self.ttl]
            for k in expired_keys:
                del self.cache[k]

            # Check for replay
            if event_id in self.cache:
                return False
            
            # Add new event
            self.cache[event_id] = current_time
            return True

# Initialize a global cache (5 minutes TTL matches our timestamp tolerance)
seen_events_cache = EventIdCache(ttl_seconds=300)

# --- Webhook Verification ---
def verify_webhook(
    headers: Dict[str, str], 
    raw_body: bytes, 
    secret: byte
```

Where models still fail

The most common problems we flagged across all models.

24 silently wrong · 16 empty response · 11 broken code · 10 major task miss · 9 changed public API · 9 logic error · 6 incomplete output · 5 hallucinated API

Frequently asked

What is the best AI model for coding?

In our benchmarks, claude-opus-4.8-medium ranks first for coding with a score of 92.3 (Excellent) across 31 tasks.

What is the cheapest good model for coding?

glm-5-max is the best value: it clears our quality bar for coding at $0.017 per run.

Which model is fastest for coding?

grok-4.20-beta-max is the fastest model that still clears the usable bar for coding: 13.1s per run with a score of 72.8.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 1000 model runs across 7 benchmarks · last tested 2026-06-30
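
The scoring pipeline itself is not published on this page, so the following is only a rough sketch of what "strict JSON judge, deterministic heuristics, normalized to 0-100" could mean in practice. The verdict shape (`score`, `max_score`, `reason`), the empty-response rule, and both function names are hypothetical.

```python
import json
import math
from typing import Optional


def parse_judge_verdict(raw: str) -> Optional[dict]:
    """Strict parse: return None unless `raw` matches the assumed verdict shape exactly."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict) or set(verdict) != {"score", "max_score", "reason"}:
        return None
    score, max_score = verdict["score"], verdict["max_score"]
    for value in (score, max_score):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not math.isfinite(value):
            return None
    if max_score <= 0 or not 0 <= score <= max_score:
        return None
    if not isinstance(verdict["reason"], str):
        return None
    return verdict


def final_score(model_output: str, judge_raw: str) -> Optional[float]:
    """Combine one assumed heuristic with the judge verdict, on a 0-100 scale."""
    # Assumed deterministic heuristic: an empty response scores 0 whatever the judge says.
    if not model_output.strip():
        return 0.0
    verdict = parse_judge_verdict(judge_raw)
    if verdict is None:
        return None  # the caller would retry the judge or flag the run
    return round(100.0 * verdict["score"] / verdict["max_score"], 1)
```

Returning `None` for a malformed verdict, instead of guessing a score, keeps judge failures visible and separate from genuinely low-scoring outputs.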

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

  • Generate test cases from your prompt — no eval set required to start.
  • Compare models side by side with quality, cost and latency in one matrix.
  • Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results · 12 test cases · 3 evals

| Prompt | Claude Opus | GPT-5 | Gemini |
|---|---|---|---|
| v1 | 7.1 | 6.8 | 7.4 |
| v2 | 8.3 | 7.9 | 8.0 |
| v3 | 9.2 | 8.6 | 8.4 |

Best combo: v3 × Claude Opus
9.2 quality · $0.004/run · 1.8s