Business · 12 tasks · 50 models
Best AI models for Executive Assistant
Which models reduce cognitive load without creating extra work or risky communication?
claude-opus-4.5-high leads Executive Assistant with a score of 87.3 (strong). For tighter budgets, glm-5.1 (75.2, usable) costs about 32% as much per run.
Top score: claude-opus-4.5-high (strong)
Best value: glm-5.1 clears the quality bar at $0.012/run
Fastest: grok-4.20-beta at ~14s per run, still performs well
Quality vs. cost
Every model is placed by what it delivers (score) and what it costs (cost per run). The best value sits high and to the left.
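One way to read "high and to the left" is as a Pareto frontier: the models that no cheaper model matches or beats on score. The sketch below is illustrative only and is not the page's own charting code. The `Model` type and `pareto_frontier` function are hypothetical names, and the sample rows are copied from the ranking table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # 0-100 quality score
    cost: float   # US dollars per run


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep each model that scores higher than every model at or below its cost."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, higher score first.
    for m in sorted(models, key=lambda m: (m.cost, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# A few rows copied from the ranking table on this page.
SAMPLE = [
    Model("claude-opus-4.5-high", 87.3, 0.0370),
    Model("claude-sonnet-4.6-high", 86.7, 0.0244),
    Model("gpt-5.5-high", 86.0, 0.0380),
    Model("gpt-5.4-low", 83.5, 0.0226),
    Model("glm-5.1", 75.2, 0.0120),
    Model("gpt-5.5-pro", 77.9, 0.2074),
]

for m in pareto_frontier(SAMPLE):
    print(f"{m.name}: {m.score} at ${m.cost:.4f}/run")
```

On this sample, gpt-5.5-high and gpt-5.5-pro drop out because a cheaper model already scores higher. The remaining models are each the best score available at or below their price.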
Full ranking
| # | Model | Score | Cost/run | Speed | Best for |
|---|---|---|---|---|---|
| 1 | claude-opus-4.5-high | 87.3 Strong | $0.0370 | 30.2s | Best overall |
| 2 | claude-sonnet-4.6-high | 86.7 Strong | $0.0244 | 24.8s | Best overall |
| 3 | claude-opus-4.5-low | 86.2 Strong | $0.0329 | 28.6s | Best overall |
| 4 | gpt-5.5-high | 86.0 Strong | $0.0380 | 24.5s | Best overall |
| 5 | claude-sonnet-4.6-low | 85.6 Strong | $0.0235 | 23.7s | Best overall |
| 6 | gpt-5.5-low | 85.4 Strong | $0.0327 | 22.4s | Best overall |
| 7 | claude-opus-4.6-low | 85.2 Strong | $0.0404 | 37.7s | Best overall |
| 8 | gemini-3.1-pro-preview-high | 85.0 Strong | $0.0338 | 29.6s | Best overall |
| 9 | kimi-k2.7-code | 84.9 Strong | $0.0233 | 39.6s | Strong drafts |
| 10 | claude-opus-4.5 | 84.5 Strong | $0.0296 | 26.0s | Strong drafts |
| 11 | gpt-5.4-high | 84.5 Strong | $0.0334 | 24.5s | Strong drafts |
| 12 | gemini-3.1-pro-preview-low | 84.3 Strong | $0.0329 | 28.1s | Strong drafts |
| 13 | claude-sonnet-4.5-low | 84.0 Strong | $0.0281 | 28.1s | Strong drafts |
| 14 | gpt-5.4-low | 83.5 Strong | $0.0226 | 16.8s | Strong drafts |
| 15 | qwen3.7-max-low | 82.8 Strong | $0.0281 | 57.7s | Strong drafts |
| 16 | qwen3.7-max-high | 82.8 Strong | $0.0258 | 51.6s | Strong drafts |
| 17 | qwen3.7-max | 82.7 Strong | $0.0219 | 52.1s | Strong drafts |
| 18 | claude-sonnet-4.5-high | 82.2 Strong | $0.0290 | 29.8s | Strong drafts |
| 19 | claude-opus-4.8-low | 82.0 Strong | $0.0284 | 20.7s | Strong drafts |
| 20 | gpt-5.5 | 82.0 Strong | $0.0312 | 23.1s | Strong drafts |
| 21 | claude-opus-4.7 | 81.9 Strong | $0.0292 | 23.2s | Strong drafts |
| 22 | gemini-3.1-pro-preview | 81.9 Strong | $0.0301 | 24.9s | Strong drafts |
| 23 | claude-opus-4.6-high | 81.8 Strong | $0.0295 | 30.7s | Strong drafts |
| 24 | claude-opus-4.8 | 81.8 Strong | $0.0285 | 20.7s | Strong drafts |
| 25 | gemini-3.5-flash-low | 81.6 Strong | $0.0277 | 22.2s | Strong drafts |
| 26 | claude-sonnet-4.5 | 81.2 Strong | $0.0263 | 27.0s | Strong drafts |
| 27 | claude-opus-4.8-high | 81.2 Strong | $0.0284 | 19.4s | Strong drafts |
| 28 | claude-sonnet-4.6 | 80.8 Strong | $0.0218 | 23.5s | Strong drafts |
| 29 | claude-opus-4.6 | 80.6 Strong | $0.0295 | 30.0s | Strong drafts |
| 30 | claude-haiku-4.5 | 80.1 Strong | $0.0217 | 20.4s | Strong drafts |
| 31 | deepseek-v3.2-high | 79.2 Usable | $0.0196 | 22.7s | Strong drafts |
| 32 | qwen3.5-plus-02-15 | 79.2 Usable | $0.0143 | 46.4s | Strong drafts |
| 33 | glm-5 | 78.9 Usable | $0.0172 | 50.3s | Strong drafts |
| 34 | gpt-5.4 | 78.2 Usable | $0.0203 | 19.5s | Strong drafts |
| 35 | gpt-5.5-pro | 77.9 Usable | $0.2074 | 48.9s | Strong drafts |
| 36 | grok-4.20 | 77.2 Usable | $0.0219 | 18.3s | Strong drafts |
| 37 | deepseek-v3.2-low | 77.1 Usable | $0.0169 | 22.8s | Strong drafts |
| 38 | gpt-5.4-mini | 76.7 Usable | $0.0157 | 14.3s | Strong drafts |
| 39 | gemini-3.5-flash-high | 76.4 Usable | $0.0262 | 18.6s | Strong drafts |
| 40 | gemini-3-flash-preview | 75.8 Usable | $0.0214 | 21.6s | Strong drafts |
| 41 | glm-5.1 | 75.2 Usable | $0.0120 | 42.9s | Strong drafts |
| 42 | deepseek-v3.1-terminus | 74.3 Usable | $0.0219 | 37.8s | Needs review |
| 43 | grok-4.20-beta | 73.2 Usable | $0.0131 | 13.5s | Needs review |
| 44 | mistral-medium-3.1 | 73.2 Usable | $0.0248 | 27.5s | Bulk baseline |
| 45 | deepseek-v3.2 | 72.1 Usable | $0.0131 | 21.1s | Needs review |
| 46 | gemini-3.1-flash-lite | 71.5 Usable | $0.0184 | 15.6s | Needs review |
| 47 | kimi-k2.5 | 71.2 Usable | $0.0131 | 45.7s | Needs review |
| 48 | gpt-5-mini | 70.2 Usable | $0.0243 | 30.0s | Needs review |
| 49 | minimax-m2.7 | 68.5 Needs editing | $0.0135 | 45.3s | Needs review |
| 50 | gpt-5.4-nano | 66.7 Needs editing | $0.0155 | 15.6s | Needs review |
What separates the top models
Tactful Rewrite Test
Easy · Tests whether the model can turn an angry or overly blunt message into something calm, clear, and still honest.
Leader: gpt-5.5-high
Message Risk Review
Medium · Tests whether the model can identify why a message may cause problems before rewriting it.
Leader: claude-opus-4.5-low
Useful in Five Minutes
Medium · Tests whether the model can produce a useful briefing for a busy person with little time.
Leader: claude-opus-4.6-low
Priority Triage Test
Medium · Tests whether the model can separate urgent, important, delegable, and ignorable tasks.
Leader: claude-opus-4.6-low
See the difference
Real model outputs and how our judge scored them.
Top score
gpt-5.4-high
Travel day with urgent client issues
95/100
“The candidate perfectly categorizes and prioritizes the tasks, providing highly realistic, actionable steps and sound delegation choices in the requested table format.”
| Item | Priority | Reason | Next action | Delegate? |
|---|---|---|---|---|
| Flight boards in 45 minutes | **P0 / Critical** | Hard deadline; missing it disrupts the whole day and reduces your ability to handle everything else | Head to the gate now, get boarding sorted, and send a quick note to team: “Boarding now, reachable intermittently.” | **No** |
| Major client reports login issue | **P0 / Critical** | Active customer problem with revenue/reputation risk | Send immediate acknowledgment, open an incident thread, assign eng/support lead to assess scope, impact, and workaround | **Yes** |
| Team asks whether to deploy a risky fix | **P1 / High** | Could resolve the client issue, but bad deployment could make things worse | Ask for a 5-minute summary: root cause, blast radius, testing, rollback plan. Approve only if risk is understood and reversible | **No** |
| Investor requests deck by tonight | **P1 / High** | Important relationship, but deadline is later today, not immediate | Reply with ETA tonight, have someone assemble latest deck/metrics now, and do final review later | **Yes** |
| Hotel asks for check-in time | **P2 / Medium** | Quick operational task; prevents arrival friction | Reply with ETA or request late check-in | **Yes** |
| Newsletter draft due tomorrow | **P3 / Low** | Important but not urgent today | Ask marketing/comms to tighten the draft and leave your comments tomorrow | **Yes** |
| Bank sends promo email | **P4 / Ignore** | No meaningful value today | Archive/delete or unsubscribe | **No** |
Mid-pack
gpt-5.5
Defensive client email
83/100
“The model perfectly followed all instructions, including the negative constraint on accountability and the word count limit, delivering a highly professional and commercially astute rewrite.”
**Main risks**
- Sounds defensive and dismissive (“not sure why this is coming up again”).
- Places blame too directly on the client, which may escalate frustration.
- “Can’t promise anything” weakens confidence and gives no clear next step.
- Does not acknowledge the client’s concern or the project’s importance.
- Lacks a specific plan, timeline, or accountability from your side.

**Safer rewrite**
“Thanks for raising this — we understand the delay is frustrating, and we appreciate how important the timeline is. As noted in the proposal, the schedule depended on receiving the required assets by the agreed dates; some of those arrived later than planned, which affected the timeline. That said, we also recognize our responsibility to keep momentum and communicate clearly. We’re reviewing the remaining work today and will share an updated delivery plan by [time/date]. Our goal is to complete this week if feasible, and we’ll flag any risks as soon as they are clear.”

**Communication strategy**
Acknowledge the frustration, clarify the shared causes factually, and shift the conversation toward a concrete recovery plan.
Lowest score
glm-5.1
Monday morning Slack digest
10/100
“The model response is severely truncated, providing only the header and a partial first row of the requested table, failing the task completely.”
| Item | Priority | Reason | Next Action | Delegate? |
| :--- | :--- | :--- | :--- | :--- |
| **Production alert (7am)** | 🔴 Critical | Active revenue/user risk; time-sensitive | Verify severity with Eng Lead immediately | Yes (to Eng Lead) |
| **Finance invoice (due today)** | 🟠 High | Hard deadline today; late payment risk | Review and approve immediately | No |
| **Enterprise prospect (security docs)** | 🟠 High | Direct revenue impact; speed wins enterprise deals | Send docs or approve release; add brief personal note | Yes (delivery to Sales) |
| **Recruiter (role status)** | 🟡 Medium | Hiring pipeline momentum, but no hard deadline today | Send quick status update to keep process moving | Yes (to internal HR/Ops) |
| **Designer feedback (by Thu)** | 🟢 Low | Explicitly non-urgent; deadline is days away | Block 30 mins on Wed calendar to review | No |
| **Community thanks** | 🟢 Low | Positive sentiment, zero urgency, no action required | React with a 🙌 or ❤️ emoji | No |
| **Teammate weekend idea** | 🟢 Low | Culture-building, but zero business urgency | React with a 👀 or 🔥 emoji | No |
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for executive assistant work?
In our benchmarks, claude-opus-4.5-high ranks first for executive assistant work, scoring 87.3 (strong) across 12 test cases.
What is the cheapest good model for executive assistant work?
glm-5.1 is the best value: it clears our quality bar for executive assistant work at $0.012 per run.
Which model is fastest for executive assistant work?
grok-4.20-beta is the fastest model that still performs well for executive assistant work, at 13.5s per run.
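The three answers above amount to simple filters over the ranking table. Here is a minimal sketch, assuming a quality bar of 70 (the page does not publish its actual threshold, so `QUALITY_BAR` is a guess). The helper `pick` is a hypothetical name, and the rows are copied from the table.

```python
# (name, score, cost in USD per run, seconds per run), copied from the ranking table.
ROWS = [
    ("claude-opus-4.5-high", 87.3, 0.0370, 30.2),
    ("gpt-5.4-mini", 76.7, 0.0157, 14.3),
    ("glm-5.1", 75.2, 0.0120, 42.9),
    ("grok-4.20-beta", 73.2, 0.0131, 13.5),
    ("minimax-m2.7", 68.5, 0.0135, 45.3),
    ("gpt-5.4-nano", 66.7, 0.0155, 15.6),
]

QUALITY_BAR = 70.0  # assumption: the page does not state its actual bar


def pick(rows, key_index, bar=QUALITY_BAR):
    """Return the row at or above the bar with the smallest value in column key_index."""
    eligible = [r for r in rows if r[1] >= bar]
    return min(eligible, key=lambda r: r[key_index])


best = max(ROWS, key=lambda r: r[1])   # highest score
cheapest = pick(ROWS, 2)               # lowest cost among eligible rows
fastest = pick(ROWS, 3)                # lowest latency among eligible rows
cost_ratio = cheapest[2] / best[2]     # value pick's cost relative to the leader

print(best[0], cheapest[0], fastest[0], f"{cost_ratio:.0%}")
# prints: claude-opus-4.5-high glm-5.1 grok-4.20-beta 32%
```

Raising the assumed bar changes the answers. At 75, for instance, grok-4.20-beta is no longer eligible and gpt-5.4-mini becomes the fastest pick.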
How we test
Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, and the result is normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 684 model runs across 4 benchmarks · last tested 2026-06-30
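The scoring pipeline itself is not published, so the following is only a rough sketch of how a JSON judge verdict, deterministic checks, and normalization could fit together. Everything in it is an assumption: the judge's 1-10 scale, the `score` field, the word-limit penalty, and the function names. The band thresholds (80 and 70) are inferred from where the labels change in the ranking table.

```python
import json


def normalize(raw: float, lo: float, hi: float) -> float:
    """Linearly rescale raw from [lo, hi] to [0, 100], clamping out-of-range values."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (raw - lo) / (hi - lo) * 100.0
    return max(0.0, min(100.0, scaled))


def heuristic_penalty(output: str, max_words: int) -> float:
    """Hypothetical deterministic checks: empty output and word-limit overrun."""
    if not output.strip():
        return 100.0  # nothing to grade
    return 10.0 if len(output.split()) > max_words else 0.0


def score_output(judge_reply: str, output: str, max_words: int = 200) -> float:
    """Combine a JSON judge verdict with heuristic penalties into a 0-100 score."""
    verdict = json.loads(judge_reply)   # strict: malformed JSON raises an error
    raw = float(verdict["score"])       # assumed judge scale: 1-10
    base = normalize(raw, lo=1.0, hi=10.0)
    return max(0.0, base - heuristic_penalty(output, max_words))


def band(score: float) -> str:
    """Label thresholds inferred from the ranking table, not confirmed by the page."""
    if score >= 80.0:
        return "Strong"
    if score >= 70.0:
        return "Usable"
    return "Needs editing"


result = score_output('{"score": 9, "reason": "clear and calm"}', "Short, calm reply.")
print(round(result, 1), band(result))
```

Keeping the heuristics deterministic means a truncated or over-length output is penalized the same way on every run, regardless of how the judge words its verdict.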
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.