Business · 28 tasks · 43 models
Cheapest AI models for Structured Output
Which models produce valid, schema-correct JSON with grounded values — and use null instead of inventing data when the input is missing it?
The cheapest capable model for Structured Output is grok-4.20, at $3.08 per 1,000 runs (about $0.0031 per run), and it still clears our quality bar.
Top score — excellent
Clears the quality bar at $3.08 per 1,000 runs
Quality vs. cost
Every model placed by what it delivers and what it costs. The best value sits high and to the left.
Full ranking
| # | Model | Score | Cost/run | Speed | Best for | AA Coding |
|---|---|---|---|---|---|---|
| 1 | grok-4.20 | 88.5 Strong | $0.0031 | 4.3s | Best overall | — |
| 2 | gpt-5.4 | 96.5 Excellent | $0.0123 | 10.1s | Best overall | 71.1 |
| 3 | gpt-5.4-mini | 97.7 Excellent | $0.0124 | 10.8s | Best overall | 56.1 |
| 4 | gemini-3.1-flash-lite | 95.5 Excellent | $0.0128 | 9.6s | Best overall | — |
| 5 | gpt-5.4-low | 98.8 Excellent | $0.0131 | 10.2s | Best overall | 71.1 |
| 6 | minimax-m2.7 | 91.9 Excellent | $0.0132 | 18.7s | Best overall | 52.6 |
| 7 | gpt-5-mini | 97.9 Excellent | $0.0134 | 16.0s | Best overall | — |
| 8 | deepseek-v3.2-low | 84.6 Strong | $0.0141 | 14.5s | Strong drafts | — |
| 9 | grok-4.20-beta | 89.8 Strong | $0.0143 | 11.3s | Best overall | — |
| 10 | deepseek-v3.2 | 88.5 Strong | $0.0145 | 16.0s | Best overall | — |
| 11 | kimi-k2.5 | 95.5 Excellent | $0.0153 | 53.5s | Best overall | — |
| 12 | deepseek-v3.2-high | 89.8 Strong | $0.0161 | 17.8s | Best overall | — |
| 13 | qwen3.7-max-low | 99.7 Excellent | $0.0166 | 36.1s | Best overall | 66 |
| 14 | gpt-5.5-low | 99.6 Excellent | $0.0174 | 12.4s | Best overall | 74.9 |
| 15 | qwen3.5-plus-02-15 | 93.9 Excellent | $0.0177 | 64.6s | Best overall | — |
| 16 | claude-sonnet-4.5 | 86.3 Strong | $0.0179 | 15.0s | Best overall | — |
| 17 | gpt-5.5 | 99.0 Excellent | $0.0180 | 11.0s | Best overall | 74.9 |
| 18 | kimi-k2.7-code | 96.0 Excellent | $0.0180 | 33.5s | Best overall | 60.8 |
| 19 | claude-haiku-4.5 | 76.6 Usable | $0.0183 | 16.3s | Strong drafts | 43.9 |
| 20 | claude-opus-4.5 | 86.0 Strong | $0.0183 | 14.1s | Best overall | — |
| 21 | qwen3.7-max-high | 99.2 Excellent | $0.0187 | 39.2s | Best overall | 66 |
| 22 | deepseek-v3.1-terminus | 88.5 Strong | $0.0190 | 16.8s | Best overall | — |
| 23 | gemini-3.5-flash-low | 96.3 Excellent | $0.0191 | 15.4s | Best overall | 70.1 |
| 24 | gpt-5.4-high | 97.9 Excellent | $0.0192 | 14.2s | Best overall | 71.1 |
| 25 | claude-opus-4.8-low | 88.2 Strong | $0.0196 | 14.2s | Best overall | 74.3 |
| 26 | mistral-medium-3.1 | 87.5 Strong | $0.0202 | 15.4s | Best overall | — |
| 27 | gemini-3.5-flash-high | 96.1 Excellent | $0.0206 | 15.8s | Best overall | 70.1 |
| 28 | gpt-5.5-high | 98.6 Excellent | $0.0210 | 16.1s | Best overall | 74.9 |
| 29 | claude-opus-4.6 | 82.5 Strong | $0.0211 | 16.5s | Strong drafts | — |
| 30 | gemini-3-flash-preview | 93.5 Excellent | $0.0218 | 20.6s | Best overall | — |
| 31 | claude-opus-4.8-high | 88.9 Strong | $0.0222 | 15.2s | Best overall | 74.3 |
| 32 | gemini-3.1-pro-preview-low | 97.4 Excellent | $0.0222 | 17.9s | Best overall | 68.8 |
| 33 | gemini-3.1-pro-preview-high | 96.6 Excellent | $0.0232 | 20.3s | Best overall | 68.8 |
| 34 | claude-sonnet-4.6-high | 78.5 Usable | $0.0234 | 19.5s | Strong drafts | 63 |
| 35 | claude-sonnet-4.5-low | 88.1 Strong | $0.0236 | 20.8s | Best overall | — |
| 36 | claude-sonnet-4.6-low | 75.9 Usable | $0.0255 | 20.9s | Strong drafts | 63 |
| 37 | glm-5 | 76.5 Usable | $0.0257 | 86.5s | Strong drafts | — |
| 38 | claude-opus-4.6-high | 76.7 Usable | $0.0266 | 20.9s | Strong drafts | — |
| 39 | claude-opus-4.6-low | 78.6 Usable | $0.0272 | 20.5s | Strong drafts | — |
| 40 | qwen3.7-max | 97.5 Excellent | $0.0274 | 57.0s | Best overall | 66 |
| 41 | claude-sonnet-4.5-high | 84.0 Strong | $0.0280 | 25.0s | Strong drafts | — |
| 42 | claude-opus-4.5-low | 84.0 Strong | $0.0307 | 20.4s | Strong drafts | — |
| 43 | claude-opus-4.5-high | 86.0 Strong | $0.0337 | 21.4s | Best overall | — |
“AA Coding” is a third-party benchmark shown for context — independent of our tests. Source: Artificial Analysis (artificialanalysis.ai) via OpenRouter (openrouter.ai/rankings). · Source: Design Arena (www.designarena.ai) via OpenRouter (openrouter.ai/rankings).
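The "cheapest capable model" rule can be sketched as a filter followed by a minimum: keep the models at or above a quality bar, then take the lowest cost per run. This is a minimal sketch, not the page's actual selection logic. The rows are copied from the table above, but `QUALITY_BAR = 85.0` is a hypothetical threshold, since the page does not state the real one, and `cheapest_above` is an illustrative name.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    model: str
    score: float     # quality score on the page's 0-100 scale
    cost_usd: float  # cost per run in USD, as rounded in the table


# A few rows copied from the ranking table above.
ROWS = [
    Result("grok-4.20", 88.5, 0.0031),
    Result("gpt-5.4", 96.5, 0.0123),
    Result("gpt-5.4-mini", 97.7, 0.0124),
    Result("qwen3.7-max-low", 99.7, 0.0166),
    Result("claude-haiku-4.5", 76.6, 0.0183),
]

QUALITY_BAR = 85.0  # hypothetical: the page does not publish its threshold


def cheapest_above(rows: list[Result], quality_bar: float) -> Optional[Result]:
    """Return the lowest-cost row whose score meets the bar, or None."""
    eligible = [r for r in rows if r.score >= quality_bar]
    return min(eligible, key=lambda r: r.cost_usd) if eligible else None


pick = cheapest_above(ROWS, QUALITY_BAR)
if pick is not None:
    # Cost per 1,000 runs is the per-run cost times 1,000.
    print(pick.model, f"${pick.cost_usd * 1000:.2f} per 1,000 runs")
```

Because the table rounds cost per run to four decimals, the per-1,000 figure computed here can differ by a few cents from the $3.08 quoted on the page. Raising the bar changes the answer: with these rows, a bar of 99.0 leaves only qwen3.7-max-low.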
What separates the top models
Schema Adherence
Medium · Tests producing valid JSON that exactly matches a required schema (types, required fields, enums, no extras).
Leader: gemini-3.1-flash-lite
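To make the four schema-adherence checks concrete (types, required fields, enums, no extras), here is a minimal sketch using only the standard library. The `SCHEMA` layout, the enum values for `plan`, and the `schema_errors` helper are assumptions made for illustration; they are not the benchmark's schema or checker.

```python
import json

# Hypothetical schema: field name -> (expected type, required?, allowed values or None)
SCHEMA = {
    "customer_id": (str, True, None),
    "plan": (str, True, {"Free", "Pro", "Enterprise"}),  # assumed enum values
    "seats": (int, True, None),
    "active": (bool, True, None),
    "notes": (str, False, None),
}


def schema_errors(raw: str, schema: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for name, (expected, required, allowed) in schema.items():
        if name not in data:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"wrong type for {name}")
        elif allowed is not None and value not in allowed:
            errors.append(f"value not in enum for {name}")
    # "No extras": any key outside the schema is a violation.
    for name in data:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


good = '{"customer_id": "C-42", "plan": "Pro", "seats": 12, "active": true}'
bad = '{"customer_id": "C-42", "plan": "Gold", "seats": "12", "extra": 1}'
print(schema_errors(good, SCHEMA))
print(schema_errors(bad, SCHEMA))
```

In practice a JSON Schema validator would likely replace the hand-rolled loop; the point here is only to show what each of the four checks looks for.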
Extraction
Hard · Tests extracting the correct values from messy input into a schema: right instance, granularity, and normalization.
Leader: qwen3.7-max-low
Missing & Ambiguous Data
Hard · Tests using null/sentinels for absent values instead of fabricating them, and not dropping fields.
Leader: gemini-3.1-flash-lite
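The three outcomes this benchmark distinguishes (a field dropped, a field set to null, a value invented) can be told apart with a rough audit like the one below. This is a sketch under stated assumptions: `audit_missing` is a hypothetical helper, the source text is made up for the example, and "grounded" is approximated as "the value appears verbatim in the source", which is far cruder than what a judge model can assess.

```python
import json


def audit_missing(raw: str, required_keys: list[str], source_text: str) -> dict:
    """Classify each required key as dropped, nulled, or ungrounded (possibly invented)."""
    data = json.loads(raw)
    haystack = source_text.lower()
    report = {"dropped": [], "nulled": [], "ungrounded": []}
    for key in required_keys:
        if key not in data:
            report["dropped"].append(key)       # field omitted entirely
        elif data[key] is None:
            report["nulled"].append(key)        # explicit null: the desired behavior
        elif str(data[key]).lower() not in haystack:
            report["ungrounded"].append(key)    # value not found in the source text
    return report


# Made-up input for illustration; not taken from the benchmark.
source = "Jordan from Brightline called; no email on file."
output = '{"company": "Brightline", "contact_name": "Jordan", "email": "jordan@brightline.com"}'
keys = ["company", "contact_name", "email", "budget_usd"]
print(audit_missing(output, keys, source))
```

Here `email` is flagged as ungrounded because the address never appears in the source, and `budget_usd` is flagged as dropped. Returning `"email": null, "budget_usd": null` would move both into the `nulled` list.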
Transformation
Medium · Tests reshaping/transforming structured data into a target schema without losing or inventing values.
Leader: gemini-3.1-pro-preview-low
Noisy Structured Output Test
Hard · Tests noisy OCR, ambiguity, PII, multi-document extraction, and JSON repair.
Leader: gpt-5.5
See the difference
Real model outputs and how our judge scored them.
Top score
gemini-3.1-flash-lite
JSON repair
100/100
“The model perfectly executed the task, repairing the malformed JSON by adding missing quotes and removing trailing commas, while strictly adhering to the constraint of returning only the JSON without any markdown or conversational filler.”
{
"customer_id": "C-42",
"plan": "Pro",
"seats": 12,
"tags": [
"priority",
"renewal"
],
"active": true,
"notes": "renewal call on Friday"
}
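For context on the two repairs the judge mentions (adding missing quotes and removing trailing commas), here is a deliberately naive sketch. The malformed input is not shown on the page, so the `broken` string is an invented example, and `naive_repair` is a hypothetical helper. Its regular expressions do not understand string literals, so text such as `", note: x"` inside a value would be corrupted; treat it as an illustration of the task, not a robust repair tool.

```python
import json
import re


def naive_repair(text: str) -> str:
    """Quote bare object keys and drop trailing commas. Not safe for arbitrary input."""
    # {plan: ...} or , plan: ...  ->  {"plan": ...} or , "plan": ...
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
    # Remove a comma that directly precedes a closing brace or bracket.
    text = re.sub(r',\s*([}\]])', r'\1', text)
    return text


# Invented malformed input: one unquoted key and two trailing commas.
broken = '{customer_id: "C-42", "seats": 12, "tags": ["priority", "renewal",], "active": true,}'
repaired = json.loads(naive_repair(broken))
print(json.dumps(repaired, indent=2))
```

The final `json.loads` call doubles as the validity check: if the repair leaves anything malformed, it raises `json.JSONDecodeError` instead of returning silently.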
Mid-pack
gpt-5-mini
Ambiguous value handling
94/100
“The model followed the instructions exceptionally well, outputting strict JSON without markdown wrappers and structuring the confidence mapping logically. It correctly nulled the explicitly unconfirmed email, though arguably 'maybe' could also trigger a null value for plan_interest under a strict interpretation of the prompt.”
{
"company": "Brightline",
"contact_name": "Jordan",
"email": null,
"plan_interest": "Enterprise",
"budget_usd": 50000,
"confidence": {
"company": "high",
"contact_name": "high",
"email": "low",
"plan_interest": "low",
"budget_usd": "low"
}
}
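One way to sanity-check a confidence map like the one in this output is to confirm that every field has a recognized level and that no null value is marked as high confidence. The sketch below is an assumption-laden illustration: the prompt's real rules are not shown on the page, `ALLOWED_LEVELS` includes a "medium" level that does not appear in the sample, and `confidence_issues` is a hypothetical helper.

```python
import json

# Assumption: only "high" and "low" appear in the sample; "medium" is a guess.
ALLOWED_LEVELS = {"high", "medium", "low"}


def confidence_issues(raw: str) -> list[str]:
    """Flag fields with no valid confidence level, or nulls marked as high confidence."""
    data = json.loads(raw)
    confidence = data.get("confidence", {})
    issues = []
    for field, value in data.items():
        if field == "confidence":
            continue
        level = confidence.get(field)
        if level not in ALLOWED_LEVELS:
            issues.append(f"{field}: missing or unknown confidence level")
        elif value is None and level == "high":
            issues.append(f"{field}: null value marked high confidence")
    return issues


sample = """{
  "company": "Brightline", "contact_name": "Jordan", "email": null,
  "plan_interest": "Enterprise", "budget_usd": 50000,
  "confidence": {"company": "high", "contact_name": "high", "email": "low",
                 "plan_interest": "low", "budget_usd": "low"}
}"""
print(confidence_issues(sample))
```

This check passes the output above. It cannot settle the judge's open question of whether a "maybe" should null `plan_interest`, since that depends on how the prompt defines ambiguity.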
Lowest score
minimax-m2.7
PII redaction JSON
5/100
“The model response is severely truncated, resulting in invalid JSON and completely missing the required redactions array. It fails the core task.”
```json
{
"redacted_text": "Contact [NAME] at [EMAIL] or [PHONE]. Ship samples to [STREET_ADDRESS], [
```
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for structured output?
In our benchmarks, qwen3.7-max-low ranks first for structured output, scoring 99.7 (Excellent) across 28 test cases.
What is the cheapest good model for structured output?
grok-4.20 is the best value: it clears our quality bar for structured output at $3.08 per 1,000 runs (about $0.0031 per run).
Which model is fastest for structured output?
grok-4.20 is the fastest model that still performs well for structured output, at 4.3s per run.
How we test
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 1000 model runs across 5 benchmarks · last tested 2026-06-30
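How a judge score and deterministic heuristics might combine into a 0-100 number can be sketched as follows. Everything specific here is an assumption: the page does not say which heuristics it runs or how they are weighted, so the single parse-based heuristic, the 0.5 penalty for a markdown fence, the 0.8 judge weight, and the function names are all hypothetical.

```python
import json


def heuristic_score(raw: str) -> float:
    """1.0 for bare valid JSON, 0.5 for valid JSON inside a markdown fence, else 0.0."""
    text = raw.strip()
    fenced = text.startswith("```")
    if fenced:
        lines = text.splitlines()
        # Drop the opening fence line and, if present, the closing fence line.
        body = lines[1:-1] if lines[-1].strip() == "```" else lines[1:]
        text = "\n".join(body)
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return 0.0  # truncated or otherwise invalid output
    return 0.5 if fenced else 1.0


def final_score(judge_score: float, raw: str, judge_weight: float = 0.8) -> float:
    """Blend a 0-1 judge score with the heuristic, then scale to 0-100."""
    blended = judge_weight * judge_score + (1 - judge_weight) * heuristic_score(raw)
    return round(100 * blended, 1)


truncated = '```json\n{\n"redacted_text": "Contact [NAME] at [EMAIL] or [PHONE]'
print(heuristic_score('{"a": 1}'), heuristic_score(truncated))
print(final_score(0.5, "not json"))
```

A deterministic check like this catches failures that need no judgment, such as truncated output or a markdown wrapper around the JSON, and leaves grounding and interpretation to the judge model.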
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.