Business · 17 tasks · 44 models

Smartest AI models for Summarization & Meeting Notes

Name: Summarization & Meeting Notes AI model benchmark
Creator: Spring Prompt

Which models summarize meetings faithfully — capturing real outcomes without hallucinating decisions, owners, or deadlines?

Top models

claude-opus-4.5 (Anthropic)

claude-sonnet-4.5 (Anthropic)

gemini-3.1-pro-preview-low (Google)

The highest-quality model for Summarization & Meeting Notes is claude-opus-4.5 (excellent).

Best overall ★ Excellent

claude-opus-4.5

Top score — excellent

99.9 score · $0.0204/run · 14.9s

Best value Excellent

gemini-3.1-flash-lite

Clears the quality bar at $0.013/run

95.0 score · $0.0133/run · 10.3s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
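As a rough sketch of how "high and to the left" can be computed rather than eyeballed, the Python below takes a few rows copied from the ranking table on this page and finds the models that no other model beats on both score and cost. The `Model` type, the function names, and the 90-point quality bar are assumptions for illustration; the page does not state the exact bar it uses.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    score: float  # benchmark score, 0-100
    cost: float   # USD per run


# A handful of rows copied from the ranking table on this page.
MODELS = [
    Model("claude-opus-4.5", 99.9, 0.0204),
    Model("claude-sonnet-4.5", 99.5, 0.0182),
    Model("gpt-5.4-mini", 98.1, 0.0168),
    Model("deepseek-v3.2", 96.7, 0.0138),
    Model("gemini-3.1-flash-lite", 95.0, 0.0133),
    Model("claude-opus-4.8-high", 96.4, 0.0286),
    Model("glm-5", 87.5, 0.0174),
]


def pareto_front(models: list[Model]) -> list[Model]:
    """Keep models that no other model matches or beats on both axes."""
    front = []
    for m in models:
        dominated = any(
            o.score >= m.score
            and o.cost <= m.cost
            and (o.score > m.score or o.cost < m.cost)
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m.cost)


def best_value(models: list[Model], quality_bar: float) -> Optional[Model]:
    """Cheapest model whose score meets the quality bar (the bar is an assumption)."""
    eligible = [m for m in models if m.score >= quality_bar]
    return min(eligible, key=lambda m: m.cost) if eligible else None


if __name__ == "__main__":
    for m in pareto_front(MODELS):
        print(f"{m.name}: {m.score} at ${m.cost:.4f}/run")
    print("best value:", best_value(MODELS, quality_bar=90.0).name)
```

On this sample, the frontier drops claude-opus-4.8-high and glm-5, because each is both lower-scoring and more expensive than some other model in the list. Picking the cheapest model above an assumed bar of 90 selects gemini-3.1-flash-lite.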

Full ranking


#	Model	Score	Cost/run	Speed	Best for
1	claude-opus-4.5	99.9 Excellent	$0.0204	14.9s	Best overall
2	claude-sonnet-4.5	99.5 Excellent	$0.0182	15.8s	Best overall
3	gemini-3.1-pro-preview-low	99.4 Excellent	$0.0212	18.8s	Best overall
4	claude-sonnet-4.5-low	99.4 Excellent	$0.0222	18.8s	Best overall
5	gemini-3.1-pro-preview-high	99.3 Excellent	$0.0224	18.9s	Best overall
6	claude-opus-4.5-low	99.2 Excellent	$0.0262	17.7s	Best overall
7	gpt-5.5	99.0 Excellent	$0.0242	15.9s	Best overall
8	gpt-5.4-high	98.9 Excellent	$0.0223	15.4s	Best overall
9	qwen3.7-max	98.9 Excellent	$0.0199	33.8s	Best overall
10	gemini-3.1-pro-preview	98.2 Excellent	$0.0263	19.3s	Best overall
11	gpt-5.4-mini	98.1 Excellent	$0.0168	13.3s	Best overall
12	gpt-5.4-low	98.1 Excellent	$0.0190	13.7s	Best overall
13	gpt-5.4	97.9 Excellent	$0.0197	14.3s	Best overall
14	gemini-3.5-flash-high	97.6 Excellent	$0.0225	17.2s	Best overall
15	gpt-5.5-high	96.9 Excellent	$0.0264	14.9s	Best overall
16	claude-sonnet-4.5-high	96.8 Excellent	$0.0234	20.7s	Best overall
17	deepseek-v3.2	96.7 Excellent	$0.0138	17.6s	Best overall
18	deepseek-v3.1-terminus	96.7 Excellent	$0.0146	23.6s	Best overall
19	claude-opus-4.6-high	96.7 Excellent	$0.0258	19.6s	Best overall
20	claude-opus-4.8-high	96.4 Excellent	$0.0286	16.3s	Best overall
21	qwen3.7-max-low	96.1 Excellent	$0.0182	30.9s	Best overall
22	gemini-3.5-flash-low	96.0 Excellent	$0.0191	15.2s	Best overall
23	qwen3.5-plus-02-15	95.9 Excellent	$0.0160	36.5s	Best overall
24	kimi-k2.7-code	95.9 Excellent	$0.0169	26.6s	Best overall
25	claude-opus-4.5-high	95.8 Excellent	$0.0285	19.4s	Best overall
26	gemini-3.1-flash-lite	95.0 Excellent	$0.0133	10.3s	Best overall
27	deepseek-v3.2-low	94.5 Excellent	$0.0139	14.2s	Best overall
28	mistral-medium-3.1	94.1 Excellent	$0.0160	15.2s	Best overall
29	gemini-3-flash-preview	93.6 Excellent	$0.0182	17.7s	Best overall
30	qwen3.7-max-high	93.5 Excellent	$0.0183	31.1s	Best overall
31	gpt-5.5-low	93.3 Excellent	$0.0234	14.2s	Best overall
32	claude-opus-4.6	93.1 Excellent	$0.0252	19.3s	Best overall
33	claude-haiku-4.5	92.3 Excellent	$0.0141	11.8s	Best overall
34	deepseek-v3.2-high	91.6 Excellent	$0.0157	16.9s	Best overall
35	claude-opus-4.6-low	91.3 Excellent	$0.0232	17.5s	Best overall
36	claude-opus-4.8-low	91.2 Excellent	$0.0262	15.2s	Best overall
37	claude-sonnet-4.6-high	90.9 Excellent	$0.0204	16.6s	Best overall
38	grok-4.20-beta	90.7 Excellent	$0.0167	11.8s	Best overall
39	gpt-5-mini	88.2 Strong	$0.0174	18.9s	Best overall
40	claude-sonnet-4.6-low	88.0 Strong	$0.0227	18.7s	Best overall
41	grok-4.20	87.7 Strong	$0.0149	11.4s	Best overall
42	glm-5	87.5 Strong	$0.0174	44.9s	Best overall
43	kimi-k2.5	86.7 Strong	$0.0180	29.1s	Best overall
44	minimax-m2.7	86.6 Strong	$0.0161	21.5s	Best overall

What separates the top models

Executive Summary

medium

Tests whether the model compresses a transcript into an outcome-first executive summary with clean buckets, without rehashing or editorialising.

Leader: claude-opus-4.6-low

Action-Item Extraction

medium

Tests extraction of action items with owner, due date, and deliverable — marking 'not stated' rather than inventing, and excluding idle discussion.

Leader: claude-sonnet-4.5-high
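To make the "mark 'not stated' rather than invent" rule concrete, here is a minimal Python sketch of an action-item record whose owner and due date default to "not stated", plus a naive check for fields that do not appear in the transcript. The `ActionItem` class, the `find_invented_fields` helper, and the sample transcript lines are all hypothetical; the benchmark's real checks are not published on this page, and a verbatim substring match is only a crude stand-in for them.

```python
from dataclasses import dataclass

NOT_STATED = "not stated"


@dataclass(frozen=True)
class ActionItem:
    deliverable: str
    owner: str = NOT_STATED  # default to "not stated" instead of guessing
    due: str = NOT_STATED


def find_invented_fields(item: ActionItem, transcript: str) -> list[str]:
    """Return owner/due fields whose value is neither 'not stated'
    nor found (case-insensitively) in the transcript text."""
    text = transcript.lower()
    invented = []
    for field_name in ("owner", "due"):
        value = getattr(item, field_name)
        if value != NOT_STATED and value.lower() not in text:
            invented.append(field_name)
    return invented


# Hypothetical transcript lines, written for this example only.
TRANSCRIPT = (
    "Yuki: I'll send the current SOC 2 report to Omar by Monday.\n"
    "Frank: I'll build the storage-cost model."
)

grounded = ActionItem("Send SOC 2 report to Omar", owner="Yuki", due="Monday")
honest = ActionItem("Build storage-cost model", owner="Frank")
guessed = ActionItem("Build storage-cost model", owner="Frank", due="Friday")

if __name__ == "__main__":
    for item in (grounded, honest, guessed):
        print(item.deliverable, "->", find_invented_fields(item, TRANSCRIPT))
```

The third item is the failure mode this task targets: the transcript never gives Frank a deadline, so a due date of "Friday" is flagged, while leaving the field at "not stated" passes.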

Faithfulness Under Pressure

hard

Tests whether the model preserves deferred decisions, dissent, and caveats — refusing to manufacture consensus or state tentative outcomes as final.

Leader: gemini-3.1-flash-lite

Messy Transcript

hard

Tests handling of tangents, contradictions, sarcasm, and unresolved threads — using a parking lot / open-questions bucket instead of forcing closure or taking jokes literally.

Leader: gemini-3.1-flash-lite

Transcript Q&A

hard

Targeted questions over a long, dense transcript with a single correct answer each — catching reversed decisions, corrected numbers, misattribution, and unresolved threads.

Leader: gemini-3.5-flash-high

See the difference

Real model outputs and how our judge scored them.

Top score gemini-3.1-flash-lite Did the team agree to bundle the dwell-benchmark module i...

100/100

“The model's answer perfectly matches the correct answer. It correctly identifies that no decision was made to bundle, captures Marcus's explicit rejection of that framing, and notes the agreement to revisit the decision once the cost model is available.”

No, the team did not agree to bundle the module. Marcus stated there was no packaging decision made and explicitly clarified, "Not aligned to bundle." The team agreed only to revisit the decision next month once the cost model is available.

Mid-pack claude-opus-4.5-high Ferrovia — overstated consensus trap

100/100

“The summary is flawless. It perfectly captures the deferred packaging decision, explicitly noting that the team did not align to bundle, and accurately records the SOC 2 blocker, the action item for Yuki, and the separation of the Meridian deals.”

## Decisions Made
- **SOC 2 report**: Yuki will send the current report to Omar by Monday to unblock the Meridian deal
- **Packaging (audit-trail module)**: Deferred to next QBR—no decision to bundle. The team is only aligned to revisit with data, not to bundle.
- **Meridian accounts**: Keep the renewal and the new deal separate

## Still Open
- **Bundle vs. add-on**: Pending Frank's storage-cost model before any decision
- **Meridian deal signing**: Blocked until their security team receives the current SOC 2 report

Lowest score glm-5 Glow & Grain — tangents, sarcasm, contradiction

0/100

“The model completely failed to generate a summary, outputting only the number '1'. It captures absolutely none of the transcript's content, decisions, or open items.”

Where models still fail

The most common problems we flagged across all models.

36 invented owner or date · 19 misattribution · 15 fabricated decision · 11 dropped caveat · 4 forced closure · 3 wrong answer · 2 hallucination · 2 invented owner

Frequently asked

What is the best AI model for summarization & meeting notes?

In our benchmarks, claude-opus-4.5 ranks first for summarization & meeting notes, with a score of 99.9 (excellent) across 17 test cases.

What is the cheapest good model for summarization & meeting notes?

gemini-3.1-flash-lite is the best value: it clears our quality bar for summarization & meeting notes at $0.013 per run.

Which model is fastest for summarization & meeting notes?

gemini-3.1-flash-lite is the fastest model that still performs well for summarization & meeting notes, at 10.3s per run.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 850 model runs across 5 benchmarks · last tested 2026-06-30
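One way to read that pipeline is sketched in Python below: parse the judge's reply strictly as JSON, run a deterministic check on the model output, then rescale to 0-100. Everything specific here is an assumption for illustration: the `score` and `rationale` keys, the 0-10 raw scale, the minimum-length heuristic, and the function names. The page does not publish the judge's actual schema or heuristics.

```python
import json

RAW_MAX = 10  # assumed raw judge scale; not stated on this page


def parse_verdict(raw: str) -> dict:
    """Strict parse: the reply must be a JSON object with exactly the expected keys."""
    verdict = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(verdict, dict) or set(verdict) != {"score", "rationale"}:
        raise ValueError("verdict must be an object with exactly 'score' and 'rationale'")
    score = verdict["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= RAW_MAX:
        raise ValueError(f"score must be between 0 and {RAW_MAX}")
    return verdict


def passes_heuristics(output: str, min_chars: int = 20) -> bool:
    """Deterministic check (hypothetical): reject empty or near-empty outputs."""
    return len(output.strip()) >= min_chars


def final_score(raw_verdict: str, output: str) -> float:
    """Combine the heuristic gate with the judge verdict, normalized to 0-100."""
    if not passes_heuristics(output):
        return 0.0
    verdict = parse_verdict(raw_verdict)
    return round(100 * verdict["score"] / RAW_MAX, 1)


if __name__ == "__main__":
    reply = '{"score": 9, "rationale": "Captures the deferred decision."}'
    print(final_score(reply, "No, the team did not agree to bundle the module."))
    print(final_score(reply, "1"))
```

Running the deterministic gate before the judge means a degenerate output, such as the single-character response quoted in the examples on this page, scores zero regardless of what the judge says. Rejecting non-JSON replies outright keeps a chatty judge from being silently misread.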

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s