Business · 28 tasks · 43 models

Best AI models for Structured Output

Name: Structured Output AI model benchmark
Creator: Spring Prompt

Which models produce valid, schema-correct JSON with grounded values — and use null instead of inventing data when the input is missing it?

Top models: qwen3.7-max-low (Qwen) · gpt-5.5-low (OpenAI) · gpt-5.4-low (OpenAI)

qwen3.7-max-low leads Structured Output with a score of 99.7 (Excellent). For tighter budgets, grok-4.20 (88.5, Strong) is competitive at about 19% of the cost per run.
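The "about 19%" figure can be reproduced from the per-run costs in the ranking table. A minimal sketch, assuming the rounded table values are the inputs; the variable names are illustrative only:

```python
# Per-run costs copied from the ranking table on this page (rounded as displayed).
COST_PER_RUN = {
    "qwen3.7-max-low": 0.0166,
    "grok-4.20": 0.0031,
}

# Share of the leader's per-run cost that the budget option costs.
ratio = COST_PER_RUN["grok-4.20"] / COST_PER_RUN["qwen3.7-max-low"]
cost_share_pct = round(ratio * 100)

print(f"grok-4.20 costs about {cost_share_pct}% of qwen3.7-max-low per run")
```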

Best overall · Excellent

qwen3.7-max-low

Top score, rated Excellent

99.7 score · $0.0166/run · 36.1s

Best value · Strong

grok-4.20

Clears the quality bar at $3.08 per 1,000 runs

88.5 score · $0.0031/run · 4.3s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
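One way to read "high and to the left" is as a Pareto frontier: a model is on the frontier if no other model is at least as good on both score and cost and strictly better on one. The sketch below applies that idea to a handful of rows copied from the ranking table. The `Result` type and `pareto_frontier` function are illustrative names, not part of the benchmark, and the frontier is computed over this subset only.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    score: float  # 0-100, higher is better
    cost: float   # USD per run, lower is better


# A subset of rows from the ranking table on this page.
RESULTS = [
    Result("qwen3.7-max-low", 99.7, 0.0166),
    Result("gpt-5.5-low", 99.6, 0.0174),
    Result("gpt-5.4-low", 98.8, 0.0131),
    Result("gpt-5.4-mini", 97.7, 0.0124),
    Result("gpt-5.4", 96.5, 0.0123),
    Result("grok-4.20", 88.5, 0.0031),
    Result("claude-opus-4.5-high", 86.0, 0.0337),
]


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep results that no other result beats on both score and cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost <= r.cost
            and (o.score > r.score or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, which also orders the frontier by rising score.
    return sorted(frontier, key=lambda r: r.cost)


frontier_models = [r.model for r in pareto_frontier(RESULTS)]
print(frontier_models)
```

In this subset, gpt-5.5-low drops out because qwen3.7-max-low scores higher at a lower cost, and claude-opus-4.5-high drops out because several models are both cheaper and higher scoring.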

Full ranking


#	Model	Score	Cost/run	Speed	Best for	AA Coding
1	qwen3.7-max-low	99.7 Excellent	$0.0166	36.1s	Best overall	66
2	gpt-5.5-low	99.6 Excellent	$0.0174	12.4s	Best overall	74.9
3	qwen3.7-max-high	99.2 Excellent	$0.0187	39.2s	Best overall	66
4	gpt-5.4-low	98.8 Excellent	$0.0131	10.2s	Best overall	71.1
5	gpt-5.5-high	98.6 Excellent	$0.0210	16.1s	Best overall	74.9
6	gpt-5.4-high	97.9 Excellent	$0.0192	14.2s	Best overall	71.1
7	gpt-5-mini	97.9 Excellent	$0.0134	16.0s	Best overall	—
8	gpt-5.4-mini	97.7 Excellent	$0.0124	10.8s	Best overall	56.1
9	gemini-3.1-pro-preview-low	97.4 Excellent	$0.0222	17.9s	Best overall	68.8
10	gemini-3.1-pro-preview-high	96.6 Excellent	$0.0232	20.3s	Best overall	68.8
11	gpt-5.4	96.5 Excellent	$0.0123	10.1s	Best overall	71.1
12	gemini-3.5-flash-low	96.3 Excellent	$0.0191	15.4s	Best overall	70.1
13	gemini-3.5-flash-high	96.1 Excellent	$0.0206	15.8s	Best overall	70.1
14	gemini-3.1-flash-lite	95.5 Excellent	$0.0128	9.6s	Best overall	—
15	kimi-k2.5	95.5 Excellent	$0.0153	53.5s	Best overall	—
16	qwen3.5-plus-02-15	93.9 Excellent	$0.0177	64.6s	Best overall	—
17	minimax-m2.7	91.9 Excellent	$0.0132	18.7s	Best overall	52.6
18	grok-4.20-beta	89.8 Strong	$0.0143	11.3s	Best overall	—
19	deepseek-v3.2-high	89.8 Strong	$0.0161	17.8s	Best overall	—
20	claude-opus-4.8-high	88.9 Strong	$0.0222	15.2s	Best overall	74.3
21	deepseek-v3.2	88.5 Strong	$0.0145	16.0s	Best overall	—
22	claude-opus-4.8-low	88.2 Strong	$0.0196	14.2s	Best overall	74.3
23	claude-sonnet-4.5-low	88.1 Strong	$0.0236	20.8s	Best overall	—
24	claude-sonnet-4.5	86.3 Strong	$0.0179	15.0s	Best overall	—
25	claude-opus-4.5-high	86.0 Strong	$0.0337	21.4s	Best overall	—
26	claude-opus-4.5	86.0 Strong	$0.0183	14.1s	Best overall	—
27	deepseek-v3.2-low	84.6 Strong	$0.0141	14.5s	Strong drafts	—
28	claude-sonnet-4.5-high	84.0 Strong	$0.0280	25.0s	Strong drafts	—
29	claude-opus-4.5-low	84.0 Strong	$0.0307	20.4s	Strong drafts	—
30	claude-opus-4.6	82.5 Strong	$0.0211	16.5s	Strong drafts	—
31	claude-opus-4.6-low	78.6 Usable	$0.0272	20.5s	Strong drafts	—
32	claude-sonnet-4.6-high	78.5 Usable	$0.0234	19.5s	Strong drafts	63
33	claude-opus-4.6-high	76.7 Usable	$0.0266	20.9s	Strong drafts	—
34	claude-haiku-4.5	76.6 Usable	$0.0183	16.3s	Strong drafts	43.9
35	claude-sonnet-4.6-low	75.9 Usable	$0.0255	20.9s	Strong drafts	63
36	gpt-5.5	99.0 Excellent	$0.0180	11.0s	Best overall	74.9
37	qwen3.7-max	97.5 Excellent	$0.0274	57.0s	Best overall	66
38	kimi-k2.7-code	96.0 Excellent	$0.0180	33.5s	Best overall	60.8
39	gemini-3-flash-preview	93.5 Excellent	$0.0218	20.6s	Best overall	—
40	grok-4.20	88.5 Strong	$0.0031	4.3s	Best overall	—
41	deepseek-v3.1-terminus	88.5 Strong	$0.0190	16.8s	Best overall	—
42	mistral-medium-3.1	87.5 Strong	$0.0202	15.4s	Best overall	—
43	glm-5	76.5 Usable	$0.0257	86.5s	Strong drafts	—

“AA Coding” is a third-party benchmark shown for context, independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).

What separates the top models

Schema Adherence

medium

Tests producing valid JSON that exactly matches a required schema (types, required fields, enums, no extras).
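The four checks named here (types, required fields, enums, no extras) can be illustrated with a small hand-rolled validator. This is only a sketch: the `SCHEMA` layout, the enum values, and `schema_errors` are assumptions made for the example, not the benchmark's actual schema or checker. A real project would more likely use a JSON Schema library.

```python
from typing import Any

# Hypothetical schema: field -> (allowed Python types, optional set of allowed values).
SCHEMA = {
    "customer_id": ((str,), None),
    "plan": ((str,), {"Free", "Pro", "Enterprise"}),
    "seats": ((int,), None),
    "active": ((bool,), None),
}


def schema_errors(obj: dict[str, Any], schema: dict = SCHEMA) -> list[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    errors = []
    for field in schema:
        if field not in obj:
            errors.append(f"missing required field: {field}")
    for field in obj:
        if field not in schema:
            errors.append(f"unexpected extra field: {field}")
    for field, (types, enum) in schema.items():
        if field not in obj:
            continue
        value = obj[field]
        # bool is a subclass of int in Python, so reject True/False for int-only fields.
        is_stray_bool = isinstance(value, bool) and bool not in types
        if not isinstance(value, types) or is_stray_bool:
            errors.append(f"wrong type for {field}")
        elif enum is not None and value not in enum:
            errors.append(f"value not in enum for {field}")
    return errors


good = {"customer_id": "C-42", "plan": "Pro", "seats": 12, "active": True}
bad = {"customer_id": "C-42", "plan": "Gold", "seats": "12", "note": "hi"}

print(schema_errors(good))
print(schema_errors(bad))
```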

Leader: gemini-3.1-flash-lite

Extraction

hard

Tests extracting the correct values from messy input into a schema — right instance, granularity, and normalization.

Leader: qwen3.7-max-low

Missing & Ambiguous Data

hard

Tests using null/sentinels for absent values instead of fabricating them, and not dropping fields.
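A rough illustration of the two failure modes described here, dropped fields and fabricated values. The field list, the source text, and the verbatim-substring grounding test are all assumptions for this sketch. The benchmark's real inputs and checks are not shown on this page, and a substring match would miss legitimately normalized values.

```python
from typing import Any

REQUIRED_FIELDS = ["company", "contact_name", "email"]  # hypothetical schema


def missing_data_problems(source_text: str, output: dict[str, Any]) -> list[str]:
    """Flag fields that were dropped, or string values not found in the source."""
    problems = []
    haystack = source_text.lower()
    for field in REQUIRED_FIELDS:
        if field not in output:
            # The field should be present even when its value is unknown.
            problems.append(f"dropped field: {field}")
            continue
        value = output[field]
        if value is None:
            continue  # an explicit null is the expected way to mark an absent value
        if isinstance(value, str) and value.lower() not in haystack:
            problems.append(f"possibly fabricated value: {field}")
    return problems


# Hypothetical input text; no email address appears in it.
SOURCE = "Jordan from Brightline called about pricing. No email was given."

grounded = {"company": "Brightline", "contact_name": "Jordan", "email": None}
invented = {"company": "Brightline", "email": "jordan@brightline.com"}

print(missing_data_problems(SOURCE, grounded))
print(missing_data_problems(SOURCE, invented))
```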

Leader: gemini-3.1-flash-lite

Transformation

medium

Tests reshaping/transforming structured data into a target schema without losing or inventing values.

Leader: gemini-3.1-pro-preview-low

Noisy Structured Output Test

hard

Tests noisy OCR, ambiguity, PII, multi-document extraction, and JSON repair.

Leader: gpt-5.5

See the difference

Real model outputs and how our judge scored them.

Top score · gemini-3.1-flash-lite · JSON repair

100/100

“The model perfectly executed the task, repairing the malformed JSON by adding missing quotes and removing trailing commas, while strictly adhering to the constraint of returning only the JSON without any markdown or conversational filler.”

{
  "customer_id": "C-42",
  "plan": "Pro",
  "seats": 12,
  "tags": [
    "priority",
    "renewal"
  ],
  "active": true,
  "notes": "renewal call on Friday"
}
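The two repairs the judge mentions, adding missing quotes and removing trailing commas, can be sketched with two regular expressions. This is a deliberately naive illustration: the malformed input below is made up (the page does not show the original), `naive_repair` is a hypothetical helper, and the patterns would corrupt string values that themselves contain text such as `, word:` or `,}`.

```python
import json
import re


def naive_repair(text: str) -> str:
    """Quote bare object keys, then drop trailing commas. Not string-aware."""
    # {key: ...} or , key: ...  ->  {"key": ...} or , "key": ...
    text = re.sub(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)', r'\1"\2"\3', text)
    # Remove a comma that directly precedes a closing brace or bracket.
    text = re.sub(r',\s*([}\]])', r'\1', text)
    return text


# Hypothetical malformed input with bare keys and trailing commas.
broken = '{customer_id: "C-42", "plan": "Pro", seats: 12, "tags": ["priority", "renewal",],}'

repaired = json.loads(naive_repair(broken))
print(repaired)
```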

Mid-pack · gpt-5-mini · Ambiguous value handling

94/100

“The model followed the instructions exceptionally well, outputting strict JSON without markdown wrappers and structuring the confidence mapping logically. It correctly nulled the explicitly unconfirmed email, though arguably 'maybe' could also trigger a null value for plan_interest under a strict interpretation of the prompt.”

{
  "company": "Brightline",
  "contact_name": "Jordan",
  "email": null,
  "plan_interest": "Enterprise",
  "budget_usd": 50000,
  "confidence": {
    "company": "high",
    "contact_name": "high",
    "email": "low",
    "plan_interest": "low",
    "budget_usd": "low"
  }
}

Lowest score · minimax-m2.7 · PII redaction JSON

5/100

“The model response is severely truncated, resulting in invalid JSON and completely missing the required redactions array. It fails the core task.”

```json
{
  "redacted_text": "Contact [NAME] at [EMAIL] or [PHONE]. Ship samples to [STREET_ADDRESS], [
```

Where models still fail

The most common problems we flagged across all models.

wrapper text: 396 · invalid JSON: 37 · fabricated value: 14 · wrong instance: 12 · hard constraint failure: 11 · constraint failure: 9 · extraneous text: 9 · schema violation: 7
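Wrapper text, meaning markdown fences or conversational filler around the JSON, is by far the most common flag. Callers who cannot prevent it can often recover the payload. The sketch below is one possible approach, not the benchmark's own detection logic, and `extract_json` is a hypothetical helper: it tries a strict parse first, then a fenced block, then the outermost braces, and reports whether a wrapper was present.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built at runtime
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(raw: str):
    """Return (parsed_value, had_wrapper); raise json.JSONDecodeError if nothing parses."""
    stripped = raw.strip()
    try:
        return json.loads(stripped), False  # clean output, no wrapper
    except json.JSONDecodeError:
        pass
    match = FENCE_RE.search(stripped)
    if match:
        candidate = match.group(1)
    else:
        # Fall back to the span between the first "{" and the last "}".
        candidate = stripped[stripped.find("{"): stripped.rfind("}") + 1]
    return json.loads(candidate), True


fenced = "Here is the JSON:\n" + FENCE + 'json\n{"seats": 12}\n' + FENCE
chatty = 'Sure! {"seats": 12} Hope this helps.'

print(extract_json('{"seats": 12}'))
print(extract_json(fenced))
print(extract_json(chatty))
```

A truncated response like the lowest-scoring example above would still raise `json.JSONDecodeError`, since there is no complete object to recover.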

Frequently asked

What is the best AI model for structured output?

In our benchmarks, qwen3.7-max-low ranks first for structured output, with a score of 99.7 (Excellent) across 28 test cases.

What is the cheapest good model for structured output?

grok-4.20 is the best value: it clears our quality bar for structured output at $3.08 per 1,000 runs.

Which model is fastest for structured output?

grok-4.20 is the fastest model that still performs well for structured output, at 4.3s per run with a score of 88.5 (Strong).

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
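The page does not say how the judge score and the deterministic heuristics are combined. Purely as an illustration of the general pattern, the sketch below assumes a judge that returns a raw score on some scale and a single heuristic (does the output parse as JSON?) that caps the score when it fails. The function names, the 0-10 judge scale in the usage, and the cap of 20 are all hypothetical.

```python
import json


def is_valid_json(output: str) -> bool:
    """Deterministic heuristic: does the raw output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def normalized_score(
    judge_raw: float,
    judge_max: float,
    output: str,
    invalid_cap: float = 20.0,  # hypothetical ceiling for unparseable output
) -> float:
    """Scale the judge's raw score to 0-100, capped if the heuristic fails."""
    score = max(0.0, min(100.0, judge_raw / judge_max * 100.0))
    if not is_valid_json(output):
        score = min(score, invalid_cap)
    return round(score, 1)


print(normalized_score(9, 10, '{"a": 1}'))
print(normalized_score(9, 10, '{"a": '))
```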

Judge: gemini-3.1-pro-preview · 1000 model runs across 5 benchmarks · last tested 2026-06-30

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results · 12 test cases · 3 evals

Claude Opus	GPT-5	Gemini
7.1	6.8	7.4
8.3	7.9	8.0
9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus · 9.2 quality · $0.004/run · 1.8s