business benchmark collection

Executive Assistant

Benchmarks for testing whether models can brief, prioritise, rewrite, and communicate in ways that reduce executive workload.

Which models reduce cognitive load without creating extra work or risky communication?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.8-low

82.0

Lowest cost / eval

glm-5.1

$0.0115

Median rank score

78.5

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. The X axis is the average cost per benchmark task, including both model and judge cost; the Y axis is the average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.
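The coloring rule can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the function name `rank_cells`, the `top_n` parameter, and the band labels `"green"` and `"red"` are all assumptions, and ties are simply left in sorted order.

```python
def rank_cells(scores, top_n=10):
    """Rank models within one benchmark (1 = best) and assign a color band.

    Assumption: anything ranked within top_n gets the green band and
    everything else gets red. The real page splits the top ten across
    several shades of green, which is not modeled here.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    cells = {}
    for position, (model, score) in enumerate(ordered, start=1):
        band = "green" if position <= top_n else "red"
        cells[model] = (position, band)
    return cells


# Priority Triage Test scores for four models, taken from the heatmap data.
priority_triage = {
    "claude-opus-4.7": 85.0,
    "gemini-3.1-pro-preview": 83.3,
    "gpt-5.5": 77.7,
    "glm-5.1": 58.7,
}

# top_n=2 only to keep the example small; the page uses the top ten.
cells = rank_cells(priority_triage, top_n=2)
print(cells["claude-opus-4.7"])  # (1, 'green')
print(cells["glm-5.1"])          # (4, 'red')
```

Because ranks are computed per benchmark, a model can be green in one column and red in another, as glm-5.1 is in the published heatmap.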

| Rank | Model | Scored tests | Overall | Tactful Rewrite Test | Message Risk Review | Useful in Five Minutes | Priority Triage Test |
|------|-------|--------------|---------|----------------------|---------------------|------------------------|----------------------|
| 1 | claude-opus-4.8-low | 12 | 82.0 | 76.3 | 83.3 | 83.7 | 84.7 |
| 2 | claude-opus-4.7 | 12 | 81.9 | 74.0 | 84.3 | 84.3 | 85.0 |
| 3 | gemini-3.1-pro-preview | 12 | 81.9 | 78.0 | 82.7 | 83.7 | 83.3 |
| 4 | claude-opus-4.8 | 12 | 81.8 | 74.0 | 83.7 | 84.7 | 84.7 |
| 5 | claude-opus-4.6-high | 12 | 81.8 | 79.0 | 83.0 | 81.3 | 84.0 |
| 6 | qwen3.7-max | 12 | 81.6 | 80.0 | 82.0 | 82.3 | 82.0 |
| 7 | claude-opus-4.8-high | 12 | 81.2 | 74.0 | 82.0 | 84.3 | 84.7 |
| 8 | claude-sonnet-4.6 | 12 | 80.8 | 74.0 | 82.0 | 83.7 | 83.7 |
| 9 | claude-opus-4.6 | 12 | 80.6 | 71.3 | 83.0 | 83.7 | 84.3 |
| 10 | gpt-5.5 | 12 | 80.0 | 82.7 | 79.0 | 80.7 | 77.7 |
| 11 | qwen3.5-plus-02-15 | 12 | 79.2 | 77.0 | 80.0 | 77.0 | 82.7 |
| 12 | gemini-3-flash-preview | 12 | 78.5 | 77.3 | 80.7 | 72.3 | 83.7 |
| 13 | gpt-5.4 | 12 | 78.2 | 77.3 | 78.7 | 77.3 | 79.3 |
| 14 | gpt-5.5-pro | 12 | 77.9 | 82.7 | 81.3 | 65.0 | 82.7 |
| 15 | gpt-5.4-mini | 12 | 76.7 | 77.0 | 78.3 | 77.7 | 73.7 |
| 16 | gemini-3.5-flash-high | 12 | 76.4 | 78.3 | 81.7 | 76.3 | 69.3 |
| 17 | glm-5 | 12 | 76.1 | 72.3 | 78.7 | 71.7 | 81.7 |
| 18 | glm-5.1 | 12 | 75.2 | 80.0 | 80.7 | 81.3 | 58.7 |
| 19 | grok-4.20-beta | 12 | 73.2 | 74.0 | 75.0 | 65.7 | 78.3 |
| 20 | deepseek-v3.2 | 12 | 72.1 | 79.0 | 75.0 | 64.0 | 70.3 |
| 21 | kimi-k2.5 | 12 | 71.2 | 77.7 | 82.7 | 67.0 | 57.3 |
| 22 | minimax-m2.7 | 12 | 68.5 | 81.3 | 66.7 | 72.0 | 54.0 |
| 23 | gpt-5.4-nano | 12 | 66.7 | 61.3 | 78.3 | 73.0 | 54.0 |
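The Overall column appears consistent with an unweighted mean of the four benchmark columns. The page does not state this explicitly, so treat the sketch below as an assumption checked against a few published rows; `overall_score` is a hypothetical helper name. Since the benchmark columns are themselves rounded to one decimal, some rows may differ from the published value by 0.1.

```python
def overall_score(benchmark_scores):
    """Unweighted mean of per-benchmark scores, rounded to one decimal.

    Assumption: the published Overall value weights all benchmarks equally.
    """
    return round(sum(benchmark_scores) / len(benchmark_scores), 1)


# Benchmark scores in column order: Tactful Rewrite Test, Message Risk Review,
# Useful in Five Minutes, Priority Triage Test.
rows = {
    "claude-opus-4.8-low": [76.3, 83.3, 83.7, 84.7],
    "gpt-5.5": [82.7, 79.0, 80.7, 77.7],
    "glm-5.1": [80.0, 80.7, 81.3, 58.7],
}

for model, scores in rows.items():
    print(model, overall_score(scores))
# claude-opus-4.8-low 82.0
# gpt-5.5 80.0
# glm-5.1 75.2
```

The glm-5.1 row shows why a single average can hide a weak benchmark: three scores at or above 80 are pulled down to 75.2 by a 58.7 on Priority Triage Test.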

Full leaderboard

Quality, cost, and speed

| Model | Score | Rating | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|-------|-------|--------|-------|-----------------|--------------------|-------------------|
| claude-opus-4.8-low | 82.0 | Strong | 12/12 | $0.0284 | 20.7s | Wrapper text, Over word count |
| claude-opus-4.7 | 81.92 | Strong | 12/12 | $0.0292 | 23.1s | Wrapper text |
| gemini-3.1-pro-preview | 81.92 | Strong | 12/12 | $0.0301 | 24.9s | Wrapper text |
| claude-opus-4.6-high | 81.83 | Strong | 12/12 | $0.0295 | 30.7s | Over word count, Wrapper text |
| claude-opus-4.8 | 81.75 | Strong | 12/12 | $0.0285 | 20.7s | Wrapper text |
| qwen3.7-max | 81.58 | Strong | 12/12 | $0.0126 | 50.9s | Unsupported invention |
| claude-opus-4.8-high | 81.25 | Strong | 12/12 | $0.0284 | 19.4s | Wrapper text |
| claude-sonnet-4.6 | 80.83 | Strong | 12/12 | $0.0218 | 23.5s | Wrapper text |
| claude-opus-4.6 | 80.58 | Strong | 12/12 | $0.0295 | 30.0s | Wrapper text, Over word count, Unsupported invention |
| gpt-5.5 | 80.0 | Strong | 12/12 | $0.0272 | 19.8s | - |
| qwen3.5-plus-02-15 | 79.17 | Usable | 12/12 | $0.0132 | 46.9s | Unsupported invention |
| gemini-3-flash-preview | 78.5 | Usable | 12/12 | $0.0156 | 17.1s | Unsupported invention, Over word count, Wrapper text |
| gpt-5.4 | 78.17 | Usable | 12/12 | $0.0203 | 19.5s | Wrapper text |
| gpt-5.5-pro | 77.92 | Usable | 12/12 | $0.2074 | 48.9s | Incomplete output, Missing required element |
| gpt-5.4-mini | 76.67 | Usable | 12/12 | $0.0157 | 14.3s | Wrapper text |
| gemini-3.5-flash-high | 76.42 | Usable | 12/12 | $0.0261 | 17.9s | Incomplete output, Unsupported invention, Wrapper text |
| glm-5 | 76.08 | Usable | 12/12 | $0.0121 | 45.6s | Unsupported invention, Wrapper text, Incomplete output, Malformed output |
| glm-5.1 | 75.17 | Usable | 12/12 | $0.0115 | 45.3s | Incomplete output, Malformed output |
| grok-4.20-beta | 73.25 | Usable | 12/12 | $0.0131 | 13.5s | Wrapper text, Unsupported invention |
| deepseek-v3.2 | 72.08 | Usable | 12/12 | $0.0131 | 21.1s | Unsupported invention, Unsafe or misleading |
| kimi-k2.5 | 71.17 | Usable | 12/12 | $0.0119 | 53.9s | Incomplete output, Unsupported invention, Malformed output, Unsafe or misleading |
| minimax-m2.7 | 68.5 | Needs editing | 12/12 | $0.0134 | 46.5s | Incomplete output, Missing required element, Unsupported invention |
| gpt-5.4-nano | 66.67 | Needs editing | 12/12 | $0.0155 | 15.6s | Wrapper text, Missing required element |

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

| Test | Test ID | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|------|---------|-----------|-----|-----|-----|-----------|--------------|-------------------|
| Angry contractor follow-up | exec_rewrite_001 | Tactful Rewrite Test | 77.4 | 85.0 | 74.0 | gpt-5.5 · 85 | gemini-3.5-flash-high · 74 | Wrapper text ×13, Over word count ×2 |
| Investor disagreement | exec_rewrite_002 | Tactful Rewrite Test | 74.3 | 82.0 | 52.0 | glm-5.1 · 82 | gpt-5.4-nano · 52 | Wrapper text ×6, Unsupported invention ×5, Incomplete output ×1 |
| Client scope creep response | exec_rewrite_003 | Tactful Rewrite Test | 77.7 | 83.0 | 56.0 | kimi-k2.5 · 83 | gpt-5.4-nano · 56 | Wrapper text ×7, Over word count ×2, Incomplete output ×1 |
| Defensive client email | exec_risk_001 | Message Risk Review | 82.0 | 84.0 | 77.0 | claude-opus-4.7 · 84 | grok-4.20-beta · 77 | - |
| Too-blunt team feedback | exec_risk_002 | Message Risk Review | 77.7 | 86.0 | 56.0 | claude-opus-4.7 · 86 | minimax-m2.7 · 56 | Unsupported invention ×2, Wrapper text ×2 |
| Overpromising sales reply | exec_risk_003 | Message Risk Review | 80.7 | 84.0 | 64.0 | kimi-k2.5 · 84 | minimax-m2.7 · 64 | Wrapper text ×2, Unsupported invention ×1 |
| Client escalation prep | exec_5min_001 | Useful in Five Minutes | 75.7 | 86.0 | 38.0 | claude-opus-4.6 · 86 | kimi-k2.5 · 38 | Unsupported invention ×5, Unsafe or misleading ×1, Wrapper text ×1 |
| Investor call prep | exec_5min_002 | Useful in Five Minutes | 73.0 | 86.0 | 34.0 | claude-sonnet-4.6 · 86 | gpt-5.5-pro · 34 | Unsupported invention ×4, Unsafe or misleading ×1, Incomplete output ×1 |
| Internal conflict meeting | exec_5min_003 | Useful in Five Minutes | 82.5 | 86.0 | 74.0 | kimi-k2.5 · 86 | gpt-5.4-mini · 74 | Wrapper text ×1 |
| Noisy founder inbox | exec_triage_001 | Priority Triage Test | 75.4 | 85.0 | 18.0 | claude-opus-4.7 · 85 | kimi-k2.5 · 18 | Incomplete output ×2, Wrapper text ×1 |
| Monday morning Slack digest | exec_triage_002 | Priority Triage Test | 75.9 | 84.0 | 10.0 | claude-opus-4.7 · 84 | glm-5.1 · 10 | Incomplete output ×2, Malformed output ×2 |
| Travel day with urgent client issues | exec_triage_003 | Priority Triage Test | 78.2 | 86.0 | 30.0 | claude-opus-4.7 · 86 | minimax-m2.7 · 30 | Incomplete output ×1, Missing required element ×1 |

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.8-low

12 scored tests · Strong

82.0

Highest traits

structure 8.5
usefulness 8.5
action quality 8.47
commercial judgement 8.47
judgement 8.43

Lowest traits

directness 7.83
preserves intent 8.0
human tone 8.1
boundary preservation 8.13
tone 8.2

claude-opus-4.7

12 scored tests · Strong

81.92

Highest traits

judgement 8.57
usefulness 8.53
commercial judgement 8.53
structure 8.5
risk handling 8.5

Lowest traits

directness 7.97
preserves intent 8.07
human tone 8.17
boundary preservation 8.17
reduces heat 8.27

gemini-3.1-pro-preview

12 scored tests · Strong

81.92

Highest traits

usefulness 8.57
risk handling 8.53
judgement 8.43
action quality 8.4
commercial judgement 8.4

Lowest traits

human tone 7.43
concision 8.06
directness 8.07
reduces heat 8.1
tone 8.13

claude-opus-4.6-high

12 scored tests · Strong

81.83

Highest traits

judgement 8.57
commercial judgement 8.5
action quality 8.47
risk detection 8.3
usefulness 8.27

Lowest traits

directness 7.97
concision 8.03
human tone 8.03
structure 8.07
preserves intent 8.07

claude-opus-4.8

12 scored tests · Strong

81.75

Highest traits

usefulness 8.6
risk handling 8.53
structure 8.5
judgement 8.47
commercial judgement 8.47

Lowest traits

human tone 7.87
directness 8.13
preserves intent 8.13
reduces heat 8.2
risk detection 8.33

qwen3.7-max

12 scored tests · Strong

81.58

Highest traits

commercial judgement 8.4
risk handling 8.37
usefulness 8.3
action quality 8.27
judgement 8.23

Lowest traits

preserves intent 7.7
human tone 7.87
concision 8.03
directness 8.07
tone 8.1

claude-opus-4.8-high

12 scored tests · Strong

81.25

Highest traits

judgement 8.57
usefulness 8.57
risk handling 8.5
action quality 8.43
prioritisation 8.4

Lowest traits

preserves intent 8.13
tone 8.17
risk detection 8.17
directness 8.17
human tone 8.17

claude-sonnet-4.6

12 scored tests · Strong

80.83

Highest traits

judgement 8.53
usefulness 8.5
action quality 8.4
prioritisation 8.35
risk handling 8.3

Lowest traits

directness 7.53
human tone 7.57
preserves intent 8.0
uncertainty handling 8.1
boundary preservation 8.1

claude-opus-4.6

12 scored tests · Strong

80.58

Highest traits

judgement 8.57
risk handling 8.57
usefulness 8.5
structure 8.47
action quality 8.43

Lowest traits

preserves intent 6.9
directness 7.73
human tone 7.73
boundary preservation 7.83
reduces heat 8.0

gpt-5.5

12 scored tests · Strong

80.0

Highest traits

boundary preservation 8.37
directness 8.3
reduces heat 8.27
risk handling 8.23
preserves intent 8.23

Lowest traits

judgement 7.57
action quality 7.6
rewrite quality 7.63
structure 7.83
tone 7.9

qwen3.5-plus-02-15

12 scored tests · Usable

79.17

Highest traits

structure 8.5
judgement 8.2
action quality 8.17
risk detection 8.17
concision 8.13

Lowest traits

preserves intent 7.0
risk handling 7.43
uncertainty handling 7.43
usefulness 7.57
human tone 7.7

gemini-3-flash-preview

12 scored tests · Usable

78.5

Highest traits

judgement 8.5
action quality 8.37
structure 8.23
commercial judgement 8.2
risk detection 8.13

Lowest traits

uncertainty handling 6.67
risk handling 6.8
usefulness 7.17
human tone 7.33
directness 7.67

Frequent problems

Where models break

Wrapper text 33
Unsupported invention 17
Incomplete output 8
Over word count 4
Malformed output 3
Missing required element 3
Unsafe or misleading 2

Example outputs

Winner, middle, failure

winner

claude-opus-4.7 · 86

The model demonstrates exceptional, expert-level judgment by factoring in signal loss during the flight, providing highly practical next actions, and adding a perfectly timed execution plan for the 45 minutes before boarding.

# Travel Day Triage

| Priority | Item | Reason | Next Action | Delegate |
|----------|------|--------|-------------|----------|
| **P0** | Team: deploy risky fix? | Irreversible risk; needs founder call before you lose signal on the flight | Reply NO-GO until you land, or set clear go/no-go criteria in one message | No |
| **P0** | Client login issue | Major client + production issue = revenue and trust risk | Ack client personally, hand to support/eng lead with owner + ETA | Yes (execution) |
| **P1** | Flight in 45 min | Hard deadline; missing it cascades into everything else | Head to gate now; handle items below from phone/lounge | No |
| **P1** | Investor deck tonight | Time-boxed commitment to a key stakeholder | Send existing draft + note on updates, or have COS/associate polish v2 | Yes (polish) |
| **P2** | Hotel check-in time | Low effort, avoids arrival friction | One-line reply with ETA | Yes (assistant) |
| **P3** | Newsletter draft | Due tomorrow; can be done in-flight | Draft on plane; review on landing | Yes (draft) |
| **P4** | Bank promo email | Noise | Archive / filter rule | Yes (auto) |

**Suggested order in the next 45 min:**
1. (2 min) Message team: hold the deploy.
2. (3 min) Reply to client + loop in support owner.
3. (1 min) Hotel ETA reply.
4. (2 min) Forward deck draft to whoever is polishing, with notes.
5. Walk to gate. Newsletter on the plane.

failure

glm-5.1 · 10

The model response is severely truncated, providing only the header and a partial first row of the requested table, failing the task completely.

| Item | Priority | Reason | Next Action | Delegate? |
| :--- | :--- | :--- | :--- | :--- |
| **Production alert (7am)** | P0 (Critical

mid performer

qwen3.5-plus-02-15 · 81

The model perfectly followed all instructions, including the negative constraint on shared accountability and the word count limit, providing a highly usable rewrite and accurate risk assessment without any fluff.

**Main Risks**
*   **Relationship Damage:** Accusatory language ("your team didn't") threatens a commercially vital partnership.
*   **Perceived Unprofessionalism:** Dismissive phrasing ("not sure why this is coming up again") appears defensive and rude.
*   **Eroded Trust:** Stating you "can't promise anything" creates ambiguity and reduces client confidence in your delivery.

**Safer Rewrite**
I understand your frustration regarding the timeline. While the initial proposal outlined the schedule, we recognize that recent delays stemmed from both the late arrival of assets and our internal scheduling adjustments. We value this partnership and want to resolve this quickly. Our team is prioritizing your project now that we have the necessary materials. We are targeting completion by the end of this week. I will send a confirmation update by Thursday afternoon to lock in the delivery date. Thank you for your patience as we work to get this across the line.

**Communication Strategy**
The strategy prioritizes de-escalation by validating the client's frustration, neutrally sharing factual causes for the delay, and replacing ambiguity with a specific commitment.

Methodology

How scores are produced

Each model output is scored by an LLM judge that must return strict JSON, checked against deterministic heuristics, and the result is then normalized to a 0-100 score.

LLM judge

A stable judge model scores each benchmark-specific trait from 0 to 10.
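The page does not say how trait scores become a 0-100 score. One plausible reading, sketched below purely as an assumption, is to average the 0-10 trait scores and multiply by ten. The function name `normalize_traits` and the trait values in the example are illustrative, not taken from any published run.

```python
def normalize_traits(trait_scores):
    """Map 0-10 trait scores to a single 0-100 score.

    Assumption: every trait is weighted equally. The real pipeline may
    weight traits or apply heuristic penalties before normalizing.
    """
    if not trait_scores:
        raise ValueError("at least one trait score is required")
    for name, value in trait_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} is outside the 0-10 range")
    mean = sum(trait_scores.values()) / len(trait_scores)
    return round(mean * 10, 1)


# Illustrative trait values only.
example = {"judgement": 8.5, "usefulness": 8.0, "directness": 7.5}
print(normalize_traits(example))  # 80.0
```

Under this reading, a model whose traits cluster around 8.2 would land near 82, which is at least in line with the trait averages and overall scores shown in the model profiles.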

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
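A few of these checks can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `run_heuristics` and its parameters are hypothetical, the "Banned phrase" label is invented for the example, and only "Over word count" and "Missing required element" reuse labels that appear on this page.

```python
def run_heuristics(output, max_words, banned_phrases, required_sections):
    """Return a list of problem flags raised by simple deterministic checks."""
    flags = []

    # Length check: whitespace-separated tokens stand in for a word count.
    if len(output.split()) > max_words:
        flags.append("Over word count")

    # Banned phrases: case-insensitive substring match.
    lowered = output.lower()
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        flags.append("Banned phrase")  # hypothetical label

    # Required sections: every heading must appear verbatim.
    if any(section not in output for section in required_sections):
        flags.append("Missing required element")

    return flags


sample = (
    "**Main Risks**\nTone is accusatory.\n\n"
    "**Safer Rewrite**\nI understand your frustration."
)
sections = ["**Main Risks**", "**Safer Rewrite**", "**Communication Strategy**"]

print(run_heuristics(sample, 150, ["as an ai"], sections))
# ['Missing required element']
```

Checks like these are cheap and repeatable, which makes them a reasonable backstop for an LLM judge that might overlook a missing section or a blown word limit.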

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.