business benchmark collection

Executive Assistant

Benchmarks for testing whether models can brief, prioritise, rewrite, and communicate in ways that reduce executive workload.

Which models reduce cognitive load without creating extra work or risky communication?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.8-low

82.0

Lowest cost / eval

glm-5.1

$0.0115

Median rank score

78.5

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across shades of green, and anything below the top ten is red.

Rank	Model	Overall	Tactful Rewrite Test	Message Risk Review	Useful in Five Minutes	Priority Triage Test
1	claude-opus-4.8-low 12 scored tests	82.0	76.3	83.3	83.7	84.7
2	claude-opus-4.7 12 scored tests	81.9	74.0	84.3	84.3	85.0
3	gemini-3.1-pro-preview 12 scored tests	81.9	78.0	82.7	83.7	83.3
4	claude-opus-4.8 12 scored tests	81.8	74.0	83.7	84.7	84.7
5	claude-opus-4.6-high 12 scored tests	81.8	79.0	83.0	81.3	84.0
6	qwen3.7-max 12 scored tests	81.6	80.0	82.0	82.3	82.0
7	claude-opus-4.8-high 12 scored tests	81.2	74.0	82.0	84.3	84.7
8	claude-sonnet-4.6 12 scored tests	80.8	74.0	82.0	83.7	83.7
9	claude-opus-4.6 12 scored tests	80.6	71.3	83.0	83.7	84.3
10	gpt-5.5 12 scored tests	80.0	82.7	79.0	80.7	77.7
11	qwen3.5-plus-02-15 12 scored tests	79.2	77.0	80.0	77.0	82.7
12	gemini-3-flash-preview 12 scored tests	78.5	77.3	80.7	72.3	83.7
13	gpt-5.4 12 scored tests	78.2	77.3	78.7	77.3	79.3
14	gpt-5.5-pro 12 scored tests	77.9	82.7	81.3	65.0	82.7
15	gpt-5.4-mini 12 scored tests	76.7	77.0	78.3	77.7	73.7
16	gemini-3.5-flash-high 12 scored tests	76.4	78.3	81.7	76.3	69.3
17	glm-5 12 scored tests	76.1	72.3	78.7	71.7	81.7
18	glm-5.1 12 scored tests	75.2	80.0	80.7	81.3	58.7
19	grok-4.20-beta 12 scored tests	73.2	74.0	75.0	65.7	78.3
20	deepseek-v3.2 12 scored tests	72.1	79.0	75.0	64.0	70.3
21	kimi-k2.5 12 scored tests	71.2	77.7	82.7	67.0	57.3
22	minimax-m2.7 12 scored tests	68.5	81.3	66.7	72.0	54.0
23	gpt-5.4-nano 12 scored tests	66.7	61.3	78.3	73.0	54.0
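
The Overall column appears to be the plain mean of the four benchmark columns, and the cell colors depend on each model's rank within a column. The sketch below reproduces both for a few rows copied from the table. The function names, the tie handling (tied scores share a rank), and the single "green" bucket are assumptions for illustration; the page does not say how ties are resolved or how many shades of green are used.

```python
from statistics import mean

# Rows copied from the heatmap table, in column order:
# Tactful Rewrite, Message Risk Review, Useful in Five Minutes, Priority Triage
SCORES = {
    "claude-opus-4.8-low": (76.3, 83.3, 83.7, 84.7),
    "claude-opus-4.7": (74.0, 84.3, 84.3, 85.0),
    "gpt-5.5": (82.7, 79.0, 80.7, 77.7),
    "gpt-5.4-nano": (61.3, 78.3, 73.0, 54.0),
}


def overall(benchmark_scores):
    """Mean of the benchmark scores, rounded to one decimal like the table."""
    return round(mean(benchmark_scores), 1)


def rank_in_benchmark(scores, column):
    """Rank (1 = best) of each model within one benchmark column.

    Assumption: tied scores share the same rank.
    """
    values = [row[column] for row in scores.values()]
    return {
        model: 1 + sum(v > row[column] for v in values)
        for model, row in scores.items()
    }


def cell_color(rank, top_n=10):
    """Assumed two-bucket coloring: green inside the top N, red below it."""
    return "green" if rank <= top_n else "red"


print(overall(SCORES["claude-opus-4.8-low"]))  # 82.0
print(rank_in_benchmark(SCORES, 0)["gpt-5.5"])  # 1 (best Tactful Rewrite score in this subset)
```

Because the published benchmark columns are already rounded, recomputed averages can differ from the published Overall value in the last digit for some rows.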

Full leaderboard

Quality, cost, and speed

Model	Score	Tests	Avg cost / task	Avg seconds / task	Frequent problems
claude-opus-4.8-low	82.0 Strong	12/12	$0.0284	20.7s	Wrapper text, Over word count
claude-opus-4.7	81.92 Strong	12/12	$0.0292	23.1s	Wrapper text
gemini-3.1-pro-preview	81.92 Strong	12/12	$0.0301	24.9s	Wrapper text
claude-opus-4.6-high	81.83 Strong	12/12	$0.0295	30.7s	Over word count, Wrapper text
claude-opus-4.8	81.75 Strong	12/12	$0.0285	20.7s	Wrapper text
qwen3.7-max	81.58 Strong	12/12	$0.0126	50.9s	Unsupported invention
claude-opus-4.8-high	81.25 Strong	12/12	$0.0284	19.4s	Wrapper text
claude-sonnet-4.6	80.83 Strong	12/12	$0.0218	23.5s	Wrapper text
claude-opus-4.6	80.58 Strong	12/12	$0.0295	30.0s	Wrapper text, Over word count, Unsupported invention
gpt-5.5	80.0 Strong	12/12	$0.0272	19.8s	-
qwen3.5-plus-02-15	79.17 Usable	12/12	$0.0132	46.9s	Unsupported invention
gemini-3-flash-preview	78.5 Usable	12/12	$0.0156	17.1s	Unsupported invention, Over word count, Wrapper text
gpt-5.4	78.17 Usable	12/12	$0.0203	19.5s	Wrapper text
gpt-5.5-pro	77.92 Usable	12/12	$0.2074	48.9s	Incomplete output, Missing required element
gpt-5.4-mini	76.67 Usable	12/12	$0.0157	14.3s	Wrapper text
gemini-3.5-flash-high	76.42 Usable	12/12	$0.0261	17.9s	Incomplete output, Unsupported invention, Wrapper text
glm-5	76.08 Usable	12/12	$0.0121	45.6s	Unsupported invention, Wrapper text, Incomplete output, Malformed output
glm-5.1	75.17 Usable	12/12	$0.0115	45.3s	Incomplete output, Malformed output
grok-4.20-beta	73.25 Usable	12/12	$0.0131	13.5s	Wrapper text, Unsupported invention
deepseek-v3.2	72.08 Usable	12/12	$0.0131	21.1s	Unsupported invention, Unsafe or misleading
kimi-k2.5	71.17 Usable	12/12	$0.0119	53.9s	Incomplete output, Unsupported invention, Malformed output, Unsafe or misleading
minimax-m2.7	68.5 Needs editing	12/12	$0.0134	46.5s	Incomplete output, Missing required element, Unsupported invention
gpt-5.4-nano	66.67 Needs editing	12/12	$0.0155	15.6s	Wrapper text, Missing required element

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

Test	Benchmark	Avg	Max	Min	Top model	Lowest model	Frequent problems
Angry contractor follow-up exec_rewrite_001	Tactful Rewrite Test	77.4	85.0	74.0	gpt-5.5 · 85	gemini-3.5-flash-high · 74	Wrapper text ×13 Over word count ×2
Investor disagreement exec_rewrite_002	Tactful Rewrite Test	74.3	82.0	52.0	glm-5.1 · 82	gpt-5.4-nano · 52	Wrapper text ×6 Unsupported invention ×5 Incomplete output ×1
Client scope creep response exec_rewrite_003	Tactful Rewrite Test	77.7	83.0	56.0	kimi-k2.5 · 83	gpt-5.4-nano · 56	Wrapper text ×7 Over word count ×2 Incomplete output ×1
Defensive client email exec_risk_001	Message Risk Review	82.0	84.0	77.0	claude-opus-4.7 · 84	grok-4.20-beta · 77	-
Too-blunt team feedback exec_risk_002	Message Risk Review	77.7	86.0	56.0	claude-opus-4.7 · 86	minimax-m2.7 · 56	Unsupported invention ×2 Wrapper text ×2
Overpromising sales reply exec_risk_003	Message Risk Review	80.7	84.0	64.0	kimi-k2.5 · 84	minimax-m2.7 · 64	Wrapper text ×2 Unsupported invention ×1
Client escalation prep exec_5min_001	Useful in Five Minutes	75.7	86.0	38.0	claude-opus-4.6 · 86	kimi-k2.5 · 38	Unsupported invention ×5 Unsafe or misleading ×1 Wrapper text ×1
Investor call prep exec_5min_002	Useful in Five Minutes	73.0	86.0	34.0	claude-sonnet-4.6 · 86	gpt-5.5-pro · 34	Unsupported invention ×4 Unsafe or misleading ×1 Incomplete output ×1
Internal conflict meeting exec_5min_003	Useful in Five Minutes	82.5	86.0	74.0	kimi-k2.5 · 86	gpt-5.4-mini · 74	Wrapper text ×1
Noisy founder inbox exec_triage_001	Priority Triage Test	75.4	85.0	18.0	claude-opus-4.7 · 85	kimi-k2.5 · 18	Incomplete output ×2 Wrapper text ×1
Monday morning Slack digest exec_triage_002	Priority Triage Test	75.9	84.0	10.0	claude-opus-4.7 · 84	glm-5.1 · 10	Incomplete output ×2 Malformed output ×2
Travel day with urgent client issues exec_triage_003	Priority Triage Test	78.2	86.0	30.0	claude-opus-4.7 · 86	minimax-m2.7 · 30	Incomplete output ×1 Missing required element ×1

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.8-low

12 scored tests · Strong

82.0

Highest traits

structure 8.5

usefulness 8.5

action quality 8.47

commercial judgement 8.47

judgement 8.43

Lowest traits

directness 7.83

preserves intent 8.0

human tone 8.1

boundary preservation 8.13

tone 8.2

claude-opus-4.7

12 scored tests · Strong

81.92

Highest traits

judgement 8.57

usefulness 8.53

commercial judgement 8.53

structure 8.5

risk handling 8.5

Lowest traits

directness 7.97

preserves intent 8.07

human tone 8.17

boundary preservation 8.17

reduces heat 8.27

gemini-3.1-pro-preview

12 scored tests · Strong

81.92

Highest traits

usefulness 8.57

risk handling 8.53

judgement 8.43

action quality 8.4

commercial judgement 8.4

Lowest traits

human tone 7.43

concision 8.06

directness 8.07

reduces heat 8.1

tone 8.13

claude-opus-4.6-high

12 scored tests · Strong

81.83

Highest traits

judgement 8.57

commercial judgement 8.5

action quality 8.47

risk detection 8.3

usefulness 8.27

Lowest traits

directness 7.97

concision 8.03

human tone 8.03

structure 8.07

preserves intent 8.07

claude-opus-4.8

12 scored tests · Strong

81.75

Highest traits

usefulness 8.6

risk handling 8.53

structure 8.5

judgement 8.47

commercial judgement 8.47

Lowest traits

human tone 7.87

directness 8.13

preserves intent 8.13

reduces heat 8.2

risk detection 8.33

qwen3.7-max

12 scored tests · Strong

81.58

Highest traits

commercial judgement 8.4

risk handling 8.37

usefulness 8.3

action quality 8.27

judgement 8.23

Lowest traits

preserves intent 7.7

human tone 7.87

concision 8.03

directness 8.07

tone 8.1

claude-opus-4.8-high

12 scored tests · Strong

81.25

Highest traits

judgement 8.57

usefulness 8.57

risk handling 8.5

action quality 8.43

prioritisation 8.4

Lowest traits

preserves intent 8.13

tone 8.17

risk detection 8.17

directness 8.17

human tone 8.17

claude-sonnet-4.6

12 scored tests · Strong

80.83

Highest traits

judgement 8.53

usefulness 8.5

action quality 8.4

prioritisation 8.35

risk handling 8.3

Lowest traits

directness 7.53

human tone 7.57

preserves intent 8.0

uncertainty handling 8.1

boundary preservation 8.1

claude-opus-4.6

12 scored tests · Strong

80.58

Highest traits

judgement 8.57

risk handling 8.57

usefulness 8.5

structure 8.47

action quality 8.43

Lowest traits

preserves intent 6.9

directness 7.73

human tone 7.73

boundary preservation 7.83

reduces heat 8.0

gpt-5.5

12 scored tests · Strong

80.0

Highest traits

boundary preservation 8.37

directness 8.3

reduces heat 8.27

risk handling 8.23

preserves intent 8.23

Lowest traits

judgement 7.57

action quality 7.6

rewrite quality 7.63

structure 7.83

tone 7.9

qwen3.5-plus-02-15

12 scored tests · Usable

79.17

Highest traits

structure 8.5

judgement 8.2

action quality 8.17

risk detection 8.17

concision 8.13

Lowest traits

preserves intent 7.0

risk handling 7.43

uncertainty handling 7.43

usefulness 7.57

human tone 7.7

gemini-3-flash-preview

12 scored tests · Usable

78.5

Highest traits

judgement 8.5

action quality 8.37

structure 8.23

commercial judgement 8.2

risk detection 8.13

Lowest traits

uncertainty handling 6.67

risk handling 6.8

usefulness 7.17

human tone 7.33

directness 7.67

Frequent problems

Where models break

Wrapper text 33

Unsupported invention 17

Incomplete output 8

Over word count 4

Malformed output 3

Missing required element 3

Unsafe or misleading 2
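
The totals above can be cross-checked against the per-test counts in the test case table. The sketch below sums those counts with a `Counter`; the dictionary is transcribed from the "Frequent problems" column, and the variable names are illustrative. It assumes each table row lists only its most frequent problems, which would explain why the rarer labels (Malformed output, Missing required element) sum to less than the totals shown here.

```python
from collections import Counter

# Per-test problem counts transcribed from the test case table.
PER_TEST_FLAGS = {
    "exec_rewrite_001": {"Wrapper text": 13, "Over word count": 2},
    "exec_rewrite_002": {"Wrapper text": 6, "Unsupported invention": 5, "Incomplete output": 1},
    "exec_rewrite_003": {"Wrapper text": 7, "Over word count": 2, "Incomplete output": 1},
    "exec_risk_001": {},
    "exec_risk_002": {"Unsupported invention": 2, "Wrapper text": 2},
    "exec_risk_003": {"Wrapper text": 2, "Unsupported invention": 1},
    "exec_5min_001": {"Unsupported invention": 5, "Unsafe or misleading": 1, "Wrapper text": 1},
    "exec_5min_002": {"Unsupported invention": 4, "Unsafe or misleading": 1, "Incomplete output": 1},
    "exec_5min_003": {"Wrapper text": 1},
    "exec_triage_001": {"Incomplete output": 2, "Wrapper text": 1},
    "exec_triage_002": {"Incomplete output": 2, "Malformed output": 2},
    "exec_triage_003": {"Incomplete output": 1, "Missing required element": 1},
}

totals = Counter()
for flags in PER_TEST_FLAGS.values():
    totals.update(flags)  # adds each label's count to the running total

print(totals["Wrapper text"])           # 33
print(totals["Unsupported invention"])  # 17
print(totals["Incomplete output"])      # 8
```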

Example outputs

Winner, middle, failure

winner

claude-opus-4.7 · 86

The model demonstrates exceptional, expert-level judgment by factoring in signal loss during the flight, providing highly practical next actions, and adding a perfectly timed execution plan for the 45 minutes before boarding.

# Travel Day Triage

| Priority | Item | Reason | Next Action | Delegate |
|----------|------|--------|-------------|----------|
| **P0** | Team: deploy risky fix? | Irreversible risk; needs founder call before you lose signal on the flight | Reply NO-GO until you land, or set clear go/no-go criteria in one message | No |
| **P0** | Client login issue | Major client + production issue = revenue and trust risk | Ack client personally, hand to support/eng lead with owner + ETA | Yes (execution) |
| **P1** | Flight in 45 min | Hard deadline; missing it cascades into everything else | Head to gate now; handle items below from phone/lounge | No |
| **P1** | Investor deck tonight | Time-boxed commitment to a key stakeholder | Send existing draft + note on updates, or have COS/associate polish v2 | Yes (polish) |
| **P2** | Hotel check-in time | Low effort, avoids arrival friction | One-line reply with ETA | Yes (assistant) |
| **P3** | Newsletter draft | Due tomorrow; can be done in-flight | Draft on plane; review on landing | Yes (draft) |
| **P4** | Bank promo email | Noise | Archive / filter rule | Yes (auto) |

**Suggested order in the next 45 min:**
1. (2 min) Message team: hold the deploy.
2. (3 min) Reply to client + loop in support owner.
3. (1 min) Hotel ETA reply.
4. (2 min) Forward deck draft to whoever is polishing, with notes.
5. Walk to gate. Newsletter on the plane.

failure

glm-5.1 · 10

The model response is severely truncated, providing only the header and a partial first row of the requested table, failing the task completely.

| Item | Priority | Reason | Next Action | Delegate? |
| :--- | :--- | :--- | :--- | :--- |
| **Production alert (7am)** | P0 (Critical

mid performer

qwen3.5-plus-02-15 · 81

The model perfectly followed all instructions, including the negative constraint on shared accountability and the word count limit, providing a highly usable rewrite and accurate risk assessment without any fluff.

**Main Risks**
*   **Relationship Damage:** Accusatory language ("your team didn't") threatens a commercially vital partnership.
*   **Perceived Unprofessionalism:** Dismissive phrasing ("not sure why this is coming up again") appears defensive and rude.
*   **Eroded Trust:** Stating you "can't promise anything" creates ambiguity and reduces client confidence in your delivery.

**Safer Rewrite**
I understand your frustration regarding the timeline. While the initial proposal outlined the schedule, we recognize that recent delays stemmed from both the late arrival of assets and our internal scheduling adjustments. We value this partnership and want to resolve this quickly. Our team is prioritizing your project now that we have the necessary materials. We are targeting completion by the end of this week. I will send a confirmation update by Thursday afternoon to lock in the delivery date. Thank you for your patience as we work to get this across the line.

**Communication Strategy**
The strategy prioritizes de-escalation by validating the client's frustration, neutrally sharing factual causes for the delay, and replacing ambiguity with a specific commitment.

Methodology

How scores are produced

Each model output is scored by an LLM judge that must return strict JSON, supported by deterministic heuristics, and the result is then normalized to a 0-100 score.

LLM judge

A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
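
As a rough illustration of the judge step, the sketch below parses a strict-JSON reply of trait scores and maps them to a 0-100 score. The reply shape (a flat object of trait name to number), the function names, and the "mean trait score times ten" normalization are all assumptions; the page does not publish the actual schema, trait weights, or how heuristic results are combined with the judge's scores.

```python
import json
from statistics import mean


def parse_judge_reply(reply: str) -> dict[str, float]:
    """Parse a judge reply, rejecting anything that is not strict JSON.

    Assumed shape: {"trait name": score from 0 to 10, ...}.
    """
    data = json.loads(reply)  # raises a ValueError subclass on non-JSON text
    if not isinstance(data, dict) or not data:
        raise ValueError("expected a non-empty JSON object of trait scores")
    for trait, value in data.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"trait {trait!r} must be a number from 0 to 10")
    return {trait: float(value) for trait, value in data.items()}


def normalized_score(traits: dict[str, float]) -> float:
    """Assumed normalization: unweighted mean of 0-10 traits, scaled to 0-100."""
    return round(mean(traits.values()) * 10, 1)


# Hypothetical judge reply, not taken from the published results.
reply = '{"judgement": 8.0, "usefulness": 9.0, "structure": 7.0}'
print(normalized_score(parse_judge_reply(reply)))  # 80.0
```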

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
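
A minimal sketch of what such deterministic checks could look like follows. The flag names reuse the problem labels shown on this page where one fits ("Banned phrase" is a made-up label), but the detection rules, parameters, and function name are assumptions rather than the benchmark's actual implementation. Safety flags are omitted.

```python
def heuristic_flags(output, max_words, required_sections=(), banned_phrases=()):
    """Return problem labels raised by simple, deterministic checks."""
    flags = []
    lowered = output.lower()

    # Length: whitespace-separated word count against the prompt's limit.
    if len(output.split()) > max_words:
        flags.append("Over word count")

    # Banned phrases: case-insensitive substring match.
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        flags.append("Banned phrase")  # hypothetical label

    # Required sections: every expected heading must appear somewhere.
    if any(section.lower() not in lowered for section in required_sections):
        flags.append("Missing required element")

    # Format validity: every markdown table row must be closed with a pipe.
    rows = [line.strip() for line in output.splitlines() if line.strip().startswith("|")]
    if any(not row.endswith("|") for row in rows):
        flags.append("Malformed output")

    return flags


# The truncated table from the failure example on this page.
truncated = (
    "| Item | Priority | Reason | Next Action | Delegate? |\n"
    "| :--- | :--- | :--- | :--- | :--- |\n"
    "| **Production alert (7am)** | P0 (Critical"
)
print(heuristic_flags(truncated, max_words=200))  # ['Malformed output']
```

Checks like these are cheap and repeatable, which is presumably why they sit alongside the judge: they catch mechanical failures without relying on the judge model to notice them.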

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.