Business · 17 tasks · 44 models
Smartest AI models for Summarization & Meeting Notes
Which models summarize meetings faithfully — capturing real outcomes without hallucinating decisions, owners, or deadlines?
The highest-quality model for Summarization & Meeting Notes is claude-opus-4.5, with a score of 99.9 (excellent).
Top score: claude-opus-4.5 (excellent)
Best value: gemini-3.1-flash-lite clears the quality bar at $0.013/run
Quality vs. cost
Every model is placed by what it delivers and what it costs. The best value sits high and to the left.
Full ranking
| # | Model | Score | Cost/run | Speed | Best for |
|---|---|---|---|---|---|
| 1 | claude-opus-4.5 | 99.9 Excellent | $0.0204 | 14.9s | Best overall |
| 2 | claude-sonnet-4.5 | 99.5 Excellent | $0.0182 | 15.8s | Best overall |
| 3 | gemini-3.1-pro-preview-low | 99.4 Excellent | $0.0212 | 18.8s | Best overall |
| 4 | claude-sonnet-4.5-low | 99.4 Excellent | $0.0222 | 18.8s | Best overall |
| 5 | gemini-3.1-pro-preview-high | 99.3 Excellent | $0.0224 | 18.9s | Best overall |
| 6 | claude-opus-4.5-low | 99.2 Excellent | $0.0262 | 17.7s | Best overall |
| 7 | gpt-5.5 | 99.0 Excellent | $0.0242 | 15.9s | Best overall |
| 8 | gpt-5.4-high | 98.9 Excellent | $0.0223 | 15.4s | Best overall |
| 9 | qwen3.7-max | 98.9 Excellent | $0.0199 | 33.8s | Best overall |
| 10 | gemini-3.1-pro-preview | 98.2 Excellent | $0.0263 | 19.3s | Best overall |
| 11 | gpt-5.4-mini | 98.1 Excellent | $0.0168 | 13.3s | Best overall |
| 12 | gpt-5.4-low | 98.1 Excellent | $0.0190 | 13.7s | Best overall |
| 13 | gpt-5.4 | 97.9 Excellent | $0.0197 | 14.3s | Best overall |
| 14 | gemini-3.5-flash-high | 97.6 Excellent | $0.0225 | 17.2s | Best overall |
| 15 | gpt-5.5-high | 96.9 Excellent | $0.0264 | 14.9s | Best overall |
| 16 | claude-sonnet-4.5-high | 96.8 Excellent | $0.0234 | 20.7s | Best overall |
| 17 | deepseek-v3.2 | 96.7 Excellent | $0.0138 | 17.6s | Best overall |
| 18 | deepseek-v3.1-terminus | 96.7 Excellent | $0.0146 | 23.6s | Best overall |
| 19 | claude-opus-4.6-high | 96.7 Excellent | $0.0258 | 19.6s | Best overall |
| 20 | claude-opus-4.8-high | 96.4 Excellent | $0.0286 | 16.3s | Best overall |
| 21 | qwen3.7-max-low | 96.1 Excellent | $0.0182 | 30.9s | Best overall |
| 22 | gemini-3.5-flash-low | 96.0 Excellent | $0.0191 | 15.2s | Best overall |
| 23 | qwen3.5-plus-02-15 | 95.9 Excellent | $0.0160 | 36.5s | Best overall |
| 24 | kimi-k2.7-code | 95.9 Excellent | $0.0169 | 26.6s | Best overall |
| 25 | claude-opus-4.5-high | 95.8 Excellent | $0.0285 | 19.4s | Best overall |
| 26 | gemini-3.1-flash-lite | 95.0 Excellent | $0.0133 | 10.3s | Best overall |
| 27 | deepseek-v3.2-low | 94.5 Excellent | $0.0139 | 14.2s | Best overall |
| 28 | mistral-medium-3.1 | 94.1 Excellent | $0.0160 | 15.2s | Best overall |
| 29 | gemini-3-flash-preview | 93.6 Excellent | $0.0182 | 17.7s | Best overall |
| 30 | qwen3.7-max-high | 93.5 Excellent | $0.0183 | 31.1s | Best overall |
| 31 | gpt-5.5-low | 93.3 Excellent | $0.0234 | 14.2s | Best overall |
| 32 | claude-opus-4.6 | 93.1 Excellent | $0.0252 | 19.3s | Best overall |
| 33 | claude-haiku-4.5 | 92.3 Excellent | $0.0141 | 11.8s | Best overall |
| 34 | deepseek-v3.2-high | 91.6 Excellent | $0.0157 | 16.9s | Best overall |
| 35 | claude-opus-4.6-low | 91.3 Excellent | $0.0232 | 17.5s | Best overall |
| 36 | claude-opus-4.8-low | 91.2 Excellent | $0.0262 | 15.2s | Best overall |
| 37 | claude-sonnet-4.6-high | 90.9 Excellent | $0.0204 | 16.6s | Best overall |
| 38 | grok-4.20-beta | 90.7 Excellent | $0.0167 | 11.8s | Best overall |
| 39 | gpt-5-mini | 88.2 Strong | $0.0174 | 18.9s | Best overall |
| 40 | claude-sonnet-4.6-low | 88.0 Strong | $0.0227 | 18.7s | Best overall |
| 41 | grok-4.20 | 87.7 Strong | $0.0149 | 11.4s | Best overall |
| 42 | glm-5 | 87.5 Strong | $0.0174 | 44.9s | Best overall |
| 43 | kimi-k2.5 | 86.7 Strong | $0.0180 | 29.1s | Best overall |
| 44 | minimax-m2.7 | 86.6 Strong | $0.0161 | 21.5s | Best overall |
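The "best value" and "fastest" picks can be reproduced from the table with a simple filter-then-minimise step. The sketch below is illustrative only: it copies a handful of rows from the table, and `QUALITY_BAR = 90.0` is an assumption (the page does not state its exact threshold; 90 is simply where the "Excellent" band appears to begin). The names `Result`, `cheapest`, `fastest`, and `best_quality` are hypothetical helpers, not part of any published tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float     # quality score on the 0-100 scale
    cost_usd: float  # cost per run in US dollars
    seconds: float   # latency per run


# A few rows copied from the ranking table (not the full list).
RESULTS = [
    Result("claude-opus-4.5", 99.9, 0.0204, 14.9),
    Result("gpt-5.4-mini", 98.1, 0.0168, 13.3),
    Result("deepseek-v3.2", 96.7, 0.0138, 17.6),
    Result("gemini-3.1-flash-lite", 95.0, 0.0133, 10.3),
    Result("grok-4.20", 87.7, 0.0149, 11.4),
    Result("glm-5", 87.5, 0.0174, 44.9),
]

# Assumption: the page does not publish its quality bar; 90 is a guess.
QUALITY_BAR = 90.0


def clears_bar(results, bar=QUALITY_BAR):
    """Keep only the models whose score meets the quality bar."""
    return [r for r in results if r.score >= bar]


def cheapest(results, bar=QUALITY_BAR):
    """Lowest cost per run among models that clear the bar."""
    return min(clears_bar(results, bar), key=lambda r: r.cost_usd)


def fastest(results, bar=QUALITY_BAR):
    """Lowest latency among models that clear the bar."""
    return min(clears_bar(results, bar), key=lambda r: r.seconds)


def best_quality(results):
    """Highest score, ignoring cost and speed."""
    return max(results, key=lambda r: r.score)


if __name__ == "__main__":
    print("best quality:", best_quality(RESULTS).model)
    print("cheapest above bar:", cheapest(RESULTS).model)
    print("fastest above bar:", fastest(RESULTS).model)
```

Filtering before minimising matters: without the bar, a cheap or fast model with a weak score could win the "value" slot. Raising or lowering the assumed bar changes which models are eligible, so the picks should be read relative to that threshold.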
What separates the top models
Executive Summary
Difficulty: medium. Tests whether the model compresses a transcript into an outcome-first executive summary with clean buckets, without rehashing or editorialising.
Leader: claude-opus-4.6-low
Action-Item Extraction
Difficulty: medium. Tests extraction of action items with owner, due date, and deliverable, marking missing fields as 'not stated' rather than inventing them, and excluding idle discussion.
Leader: claude-sonnet-4.5-high
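One way to picture the "not stated" rule is a record type whose owner and due-date fields default to that marker, so a missing value is never silently filled in. This is a minimal sketch under assumptions: `ActionItem`, `NOT_STATED`, and `render` are hypothetical names, and the two items are illustrative, loosely based on the sample output quoted later on this page rather than on the benchmark's actual answer key.

```python
from dataclasses import dataclass

NOT_STATED = "not stated"


@dataclass
class ActionItem:
    deliverable: str
    owner: str = NOT_STATED  # filled only if the transcript names an owner
    due: str = NOT_STATED    # filled only if the transcript gives a date


def render(items):
    """Format action items as a bullet list, one item per line."""
    return "\n".join(
        f"- {item.deliverable} (owner: {item.owner}; due: {item.due})"
        for item in items
    )


# Illustrative items: the second has no due date in the source material.
items = [
    ActionItem("Send the current SOC 2 report to Omar", owner="Yuki", due="Monday"),
    ActionItem("Build the storage-cost model", owner="Frank"),
]

if __name__ == "__main__":
    print(render(items))
```

Making the marker the default means an extractor has to find evidence to overwrite it, which matches the behaviour this task rewards.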
Faithfulness Under Pressure
Difficulty: hard. Tests whether the model preserves deferred decisions, dissent, and caveats, refusing to manufacture consensus or state tentative outcomes as final.
Leader: gemini-3.1-flash-lite
Messy Transcript
Difficulty: hard. Tests handling of tangents, contradictions, sarcasm, and unresolved threads, using a parking lot / open-questions bucket instead of forcing closure or taking jokes literally.
Leader: gemini-3.1-flash-lite
Transcript Q&A
Difficulty: hard. Targeted questions over a long, dense transcript, each with a single correct answer, designed to catch reversed decisions, corrected numbers, misattribution, and unresolved threads.
Leader: gemini-3.5-flash-high
See the difference
Real model outputs and how our judge scored them.
Top score
gemini-3.1-flash-lite
Did the team agree to bundle the dwell-benchmark module i...
100/100
“The model's answer perfectly matches the correct answer. It correctly identifies that no decision was made to bundle, captures Marcus's explicit rejection of that framing, and notes the agreement to revisit the decision once the cost model is available.”
No, the team did not agree to bundle the module. Marcus stated there was no packaging decision made and explicitly clarified, "Not aligned to bundle." The team agreed only to revisit the decision next month once the cost model is available.
Mid-pack
claude-opus-4.5-high
Ferrovia — overstated consensus trap
100/100
“The summary is flawless. It perfectly captures the deferred packaging decision, explicitly noting that the team did not align to bundle, and accurately records the SOC 2 blocker, the action item for Yuki, and the separation of the Meridian deals.”
## Decisions Made
- **SOC 2 report**: Yuki will send the current report to Omar by Monday to unblock the Meridian deal
- **Packaging (audit-trail module)**: Deferred to next QBR—no decision to bundle. The team is only aligned to revisit with data, not to bundle.
- **Meridian accounts**: Keep the renewal and the new deal separate

## Still Open
- **Bundle vs. add-on**: Pending Frank's storage-cost model before any decision
- **Meridian deal signing**: Blocked until their security team receives the current SOC 2 report
Lowest score
glm-5
Glow & Grain — tangents, sarcasm, contradiction
0/100
“The model completely failed to generate a summary, outputting only the number '1'. It captures absolutely none of the transcript's content, decisions, or open items.”
1
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for summarization & meeting notes?
In our benchmarks, claude-opus-4.5 ranks first for summarization & meeting notes, with a score of 99.9 (excellent) across 17 test cases.
What is the cheapest good model for summarization & meeting notes?
gemini-3.1-flash-lite is the best value: it clears our quality bar for summarization & meeting notes at $0.013 per run.
Which model is fastest for summarization & meeting notes?
gemini-3.1-flash-lite is the fastest model that still performs well for summarization & meeting notes, at 10.3s per run with a score of 95.0.
How we test
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 850 model runs across 5 benchmarks · last tested 2026-06-30
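As a rough illustration of that pipeline, the sketch below parses a judge verdict strictly, rescales it to 0-100, and lets a deterministic check override the judge when the output is effectively empty. Everything specific here is an assumption: the page does not publish the judge's JSON schema, so the `score`, `max_score`, and `rationale` fields, the `min_words` threshold, and the function names are all hypothetical.

```python
import json

# Assumption: a hypothetical verdict schema; the real one is not published.
REQUIRED_FIELDS = {"score", "max_score", "rationale"}


def parse_judge_verdict(raw: str) -> dict:
    """Strictly parse the judge's reply; reject anything off-schema."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(verdict, dict):
        raise ValueError("judge verdict must be a JSON object")
    missing = REQUIRED_FIELDS - verdict.keys()
    if missing:
        raise ValueError(f"judge verdict is missing fields: {sorted(missing)}")
    return verdict


def heuristic_cap(output: str, min_words: int = 5) -> float:
    """Deterministic check: a near-empty output cannot score above 0."""
    return 0.0 if len(output.split()) < min_words else 100.0


def normalize(verdict: dict, output: str) -> float:
    """Rescale the judge's score to 0-100, then apply the heuristic cap."""
    max_score = float(verdict["max_score"])
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    scaled = 100.0 * float(verdict["score"]) / max_score
    clamped = max(0.0, min(100.0, scaled))
    return min(clamped, heuristic_cap(output))


if __name__ == "__main__":
    raw = '{"score": 4, "max_score": 5, "rationale": "One caveat was dropped."}'
    verdict = parse_judge_verdict(raw)
    print(normalize(verdict, "No, the team did not agree to bundle the module."))
    print(normalize(verdict, "1"))
```

Rejecting malformed verdicts outright, rather than guessing at a score, keeps a judge failure from being recorded as a model failure. The cap plays the complementary role: an output such as a bare "1" scores zero regardless of what the judge returns.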
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.