Business · 28 tasks · 43 models

Best AI models for Structured Output

Which models produce valid, schema-correct JSON with grounded values — and use null instead of inventing data when the input is missing it?

Top models: qwen3.7-max-low (Qwen) · gpt-5.5-low (OpenAI) · gpt-5.4-low (OpenAI)

qwen3.7-max-low leads Structured Output (excellent). For tighter budgets, grok-4.20 is competitive at about 19% of the cost.
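
The "about 19%" figure can be reproduced from the per-run costs this page lists for the two models ($0.0166 and $0.0031). The snippet below is only a sketch of that arithmetic; the function and constant names are illustrative and not part of any published tooling.

```python
# Per-run costs in USD, copied from the figures on this page.
COST_PER_RUN = {"qwen3.7-max-low": 0.0166, "grok-4.20": 0.0031}


def cost_ratio(cheaper: str, pricier: str) -> float:
    """Fraction of the pricier model's per-run cost that the cheaper one costs."""
    return COST_PER_RUN[cheaper] / COST_PER_RUN[pricier]


def cost_per_thousand_runs(model: str) -> float:
    """Scale a per-run cost up to 1,000 runs."""
    return COST_PER_RUN[model] * 1000


if __name__ == "__main__":
    print(f"{cost_ratio('grok-4.20', 'qwen3.7-max-low'):.1%}")
    print(f"${cost_per_thousand_runs('grok-4.20'):.2f} per 1,000 runs")
```

With the rounded costs shown here the ratio comes out near 18.7%, and the cost of 1,000 runs near $3.10. The page's $3.08 figure presumably comes from an unrounded per-run cost.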

Best overall · Excellent
qwen3.7-max-low

Top score: Excellent

99.7 score · $0.0166/run · 36.1s

Best value · Strong
grok-4.20

Clears the quality bar at $3.08 per 1,000 runs

88.5 score · $0.0031/run · 4.3s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.

#  Model             Score  Rating
1  qwen3.7-max-low   99.7   Excellent
2  gpt-5.5-low       99.6   Excellent
3  qwen3.7-max-high  99.2   Excellent
4  gpt-5.4-low       98.8   Excellent
5  gpt-5.5-high      98.6   Excellent
6  gpt-5.4-high      97.9   Excellent
7  gpt-5-mini        97.9   Excellent
8  gpt-5.4-mini      97.7   Excellent

“AA Coding” is a third-party benchmark shown for context, independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).

What separates the top models

Schema Adherence

medium

Tests producing valid JSON that exactly matches a required schema (types, required fields, enums, no extras).
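
As a rough illustration of what such a check involves, the sketch below validates a model response against a small hand-written schema using only the Python standard library. The schema, field names, enum values, and the `schema_errors` helper are all hypothetical; the benchmark's actual schemas and scoring code are not shown on this page.

```python
import json

# Hypothetical schema: field name -> (expected type, required?, allowed values or None).
SCHEMA = {
    "customer_id": (str, True, None),
    "plan": (str, True, {"Free", "Pro", "Enterprise"}),
    "seats": (int, True, None),
    "active": (bool, True, None),
    "notes": (str, False, None),
}


def schema_errors(raw: str, schema=SCHEMA) -> list:
    """Return a list of violations; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for name, (expected, required, enum) in schema.items():
        if name not in data:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            errors.append(f"wrong type for {name}")
        elif enum is not None and value not in enum:
            errors.append(f"value not in enum for {name}")

    # "No extras": any key the schema does not name is a violation.
    for name in data:
        if name not in schema:
            errors.append(f"unexpected extra field: {name}")
    return errors


if __name__ == "__main__":
    print(schema_errors('{"customer_id": "C-42", "plan": "Gold", "seats": "12", "extra": 1}'))
```

A production setup would more likely use a full JSON Schema validator; the hand-rolled version is only meant to make the four criteria (types, required fields, enums, no extras) concrete.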

Leader: gemini-3.1-flash-lite

Extraction

hard

Tests extracting the correct values from messy input into a schema — right instance, granularity, and normalization.

Leader: qwen3.7-max-low

Missing & Ambiguous Data

hard

Tests using null/sentinels for absent values instead of fabricating them, and not dropping fields.
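
One way to picture this category is a per-field audit that separates values supported by the input from nulls, dropped fields, and values that appear nowhere in the source. The sketch below is an assumption about how such a check could look, not the benchmark's method: `audit_missing_data` is a hypothetical helper, and its case-insensitive substring test is a deliberately crude stand-in for real grounding logic (it would misjudge normalized values such as reformatted dates or numbers).

```python
def audit_missing_data(source_text: str, record: dict, expected_fields: list) -> dict:
    """Label each expected field as 'grounded', 'null', 'dropped', or 'ungrounded'."""
    lowered = source_text.lower()
    report = {}
    for field in expected_fields:
        if field not in record:
            # The field was omitted entirely instead of being set to null.
            report[field] = "dropped"
        elif record[field] is None:
            # Explicit null: the model declined to invent a value.
            report[field] = "null"
        elif str(record[field]).lower() in lowered:
            report[field] = "grounded"
        else:
            # A value with no support in the input is a candidate fabrication.
            report[field] = "ungrounded"
    return report


if __name__ == "__main__":
    # Hypothetical input text and model output.
    source = "Jordan from Brightline asked about pricing."
    output = {"company": "Brightline", "contact_name": "Jordan", "email": "jordan@brightline.com"}
    print(audit_missing_data(source, output, ["company", "contact_name", "email", "budget_usd"]))
```

In this made-up example the email would be flagged as ungrounded and `budget_usd` as dropped, which are the two failure modes the category description names.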

Leader: gemini-3.1-flash-lite

Transformation

medium

Tests reshaping/transforming structured data into a target schema without losing or inventing values.

Leader: gemini-3.1-pro-preview-low

Noisy Structured Output Test

hard

Tests noisy OCR, ambiguity, PII, multi-document extraction, and JSON repair.

Leader: gpt-5.5

See the difference

Real model outputs and how our judge scored them.

Top score · gemini-3.1-flash-lite · JSON repair · 100/100

“The model perfectly executed the task, repairing the malformed JSON by adding missing quotes and removing trailing commas, while strictly adhering to the constraint of returning only the JSON without any markdown or conversational filler.”

{
  "customer_id": "C-42",
  "plan": "Pro",
  "seats": 12,
  "tags": [
    "priority",
    "renewal"
  ],
  "active": true,
  "notes": "renewal call on Friday"
}
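
The judge's comment mentions two repairs: adding missing quotes and removing trailing commas. The sketch below covers only the second, as a minimal illustration of why repair has to respect string boundaries. The malformed input is invented for the example (the page does not show the original prompt), and `strip_trailing_commas` is a hypothetical helper.

```python
import json


def strip_trailing_commas(text: str) -> str:
    """Drop commas that directly precede a closing } or ], ignoring string contents."""
    out = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
        elif ch == ",":
            # Look past whitespace to see whether a closing bracket follows.
            rest = text[i + 1:].lstrip()
            if rest[:1] in ("}", "]"):
                continue  # trailing comma: skip it
            out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    broken = '{"tags": ["priority", "renewal",], "notes": "a, }",}'
    print(json.loads(strip_trailing_commas(broken)))
```

Tracking whether the scanner is inside a string keeps the comma in the `notes` value intact, which a plain regular-expression substitution would not guarantee.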

Mid-pack · gpt-5-mini · Ambiguous value handling · 94/100

“The model followed the instructions exceptionally well, outputting strict JSON without markdown wrappers and structuring the confidence mapping logically. It correctly nulled the explicitly unconfirmed email, though arguably 'maybe' could also trigger a null value for plan_interest under a strict interpretation of the prompt.”

{
  "company": "Brightline",
  "contact_name": "Jordan",
  "email": null,
  "plan_interest": "Enterprise",
  "budget_usd": 50000,
  "confidence": {
    "company": "high",
    "contact_name": "high",
    "email": "low",
    "plan_interest": "low",
    "budget_usd": "low"
  }
}

Lowest score · minimax-m2.7 · PII redaction JSON · 5/100

“The model response is severely truncated, resulting in invalid JSON and completely missing the required redactions array. It fails the core task.”

```json
{
  "redacted_text": "Contact [NAME] at [EMAIL] or [PHONE]. Ship samples to [STREET_ADDRESS], [
```

Where models still fail

The most common problems we flagged across all models.

  • wrapper text: 396
  • invalid json: 37
  • fabricated value: 14
  • wrong instance: 12
  • hard constraint failure: 11
  • constraint failure: 9
  • extraneous text: 9
  • schema violation: 7
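
Wrapper text is by far the most frequent flag, so a caller may want to detect it before parsing. The sketch below sorts a raw response into three buckets whose names are borrowed from the list above; the classification rules themselves are an assumption and may differ from how the judge assigns these flags. `classify_response` is a hypothetical helper.

```python
import json
import re

# Build the fence marker programmatically so this example stays easy to embed.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def classify_response(raw: str):
    """Return (label, parsed): label is 'clean', 'wrapper text', or 'invalid json'."""
    text = raw.strip()
    try:
        return "clean", json.loads(text)
    except json.JSONDecodeError:
        pass

    # Prefer the contents of a markdown fence if one is present and closed.
    match = FENCED_BLOCK.search(text)
    candidate = match.group(1) if match else None
    if candidate is None:
        # Otherwise fall back to the outermost pair of braces.
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            candidate = text[start:end + 1]

    if candidate is not None:
        try:
            return "wrapper text", json.loads(candidate)
        except json.JSONDecodeError:
            pass
    return "invalid json", None


if __name__ == "__main__":
    fenced = FENCE + 'json\n{"a": 1}\n' + FENCE
    print(classify_response(fenced))
    print(classify_response('Here is the JSON: {"a": 1} Hope that helps!'))
```

A truncated response such as the lowest-scoring example above, which opens a fence and an object but closes neither, falls through to the invalid json bucket.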

Frequently asked

What is the best AI model for structured output?

In our benchmarks, qwen3.7-max-low ranks first for structured output, with a score of 99.7 (Excellent) across 28 test cases.

What is the cheapest good model for structured output?

grok-4.20 is the best value: it clears our quality bar for structured output at $3.08 per 1,000 runs.

Which model is fastest for structured output?

grok-4.20 is the fastest model that still performs well for structured output, at 4.3s per run.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 1000 model runs across 5 benchmarks · last tested 2026-06-30

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

  • Generate test cases from your prompt — no eval set required to start.
  • Compare models side by side with quality, cost and latency in one matrix.
  • Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results · 12 test cases · 3 evals

     Claude Opus  GPT-5  Gemini
v1   7.1          6.8    7.4
v2   8.3          7.9    8.0
v3   9.2          8.6    8.4

Best combo: v3 × Claude Opus · 9.2 quality · $0.004/run · 1.8s