business benchmark collection

AI Strategy

Benchmarks for testing whether models can evaluate AI initiatives, vendor claims, implementation risks, and production readiness.

Which models can separate useful AI strategy from hype, theatre, and fragile pilots?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model: claude-opus-4.7 (83.83)

Lowest cost / eval: deepseek-v3.2 ($0.0129)

Median rank score: 78.5

Last refresh: 2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.

Color scale: Below top 10 → #1

Every model has 12 scored tests.

| Rank | Model | Overall | Vendor Claim Review | AI Strategy Reality Check | Pilot-to-Production Test | Data Readiness Test |
|------|-------|---------|---------------------|---------------------------|--------------------------|---------------------|
| 1 | claude-opus-4.7 | 83.8 | 83.3 | 85.0 | 83.7 | 83.3 |
| 2 | kimi-k2.5 | 82.3 | 83.3 | 82.7 | 80.3 | 83.0 |
| 3 | claude-opus-4.8-low | 82.3 | 82.0 | 83.3 | 81.3 | 82.7 |
| 4 | claude-opus-4.8 | 82.2 | 81.3 | 83.7 | 82.0 | 81.7 |
| 5 | claude-opus-4.8-high | 82.0 | 83.0 | 81.0 | 82.3 | 81.7 |
| 6 | gpt-5.5 | 81.8 | 84.0 | 83.3 | 82.3 | 77.3 |
| 7 | qwen3.7-max | 81.2 | 82.0 | 82.7 | 79.7 | 80.3 |
| 8 | grok-4.20-beta | 80.7 | 82.7 | 81.7 | 79.3 | 79.0 |
| 9 | qwen3.5-plus-02-15 | 80.3 | 81.3 | 81.3 | 78.7 | 80.0 |
| 10 | gemini-3.1-pro-preview | 80.2 | 81.7 | 80.7 | 78.3 | 80.3 |
| 11 | gemini-3-flash-preview | 80.1 | 82.7 | 79.3 | 78.7 | 79.7 |
| 12 | gpt-5.4-nano | 78.5 | 80.0 | 75.7 | 78.0 | 80.3 |
| 13 | gpt-5.4-mini | 78.3 | 79.0 | 80.0 | 78.0 | 76.3 |
| 14 | glm-5.1 | 78.2 | 80.7 | 79.7 | 76.3 | 76.3 |
| 15 | glm-5 | 78.2 | 81.0 | 78.7 | 78.7 | 74.7 |
| 16 | claude-sonnet-4.6 | 78.1 | 81.7 | 71.3 | 76.7 | 82.7 |
| 17 | deepseek-v3.2 | 77.4 | 77.3 | 78.7 | 76.3 | 77.3 |
| 18 | gpt-5.4 | 77.0 | 80.7 | 71.3 | 77.0 | 79.0 |
| 19 | claude-opus-4.6 | 76.6 | 84.0 | 78.3 | 70.7 | 73.3 |
| 20 | claude-opus-4.6-high | 75.8 | 83.3 | 76.3 | 75.3 | 68.0 |
| 21 | gemini-3.5-flash-high | 65.4 | 73.3 | 52.7 | 67.0 | 68.7 |
| 22 | gpt-5.5-pro | 63.0 | 72.0 | 41.0 | 67.7 | 71.3 |
| 23 | minimax-m2.7 | 45.2 | 40.3 | 34.0 | 62.3 | 44.0 |
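The rank-based coloring described for this heatmap can be sketched in a few lines. This is an illustration only, not the site's implementation: the function names, the `green-N` bucket labels, and the tie handling (ties keep input order) are assumptions. The three scores are Vendor Claim Review values from the heatmap, and `top_n` is lowered to 2 so the small sample shows both outcomes.

```python
def rank_bucket(rank: int, top_n: int = 10) -> str:
    """Map a 1-based rank to a color bucket: a green shade for the top N, red otherwise."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    if rank > top_n:
        return "red"
    return f"green-{rank}"  # hypothetical label: green-1 is the #1 cell


def color_by_rank(scores: dict[str, float], top_n: int = 10) -> dict[str, str]:
    """Rank models within one benchmark (highest score first) and assign a bucket to each."""
    # sorted() is stable, so tied scores keep their input order (an assumption).
    ordered = sorted(scores, key=lambda model: scores[model], reverse=True)
    return {model: rank_bucket(i, top_n) for i, model in enumerate(ordered, start=1)}


# Vendor Claim Review scores for three models from the heatmap.
vendor_claim = {"gpt-5.5": 84.0, "claude-opus-4.7": 83.3, "minimax-m2.7": 40.3}
colors = color_by_rank(vendor_claim, top_n=2)
```

Because the color depends on rank within a benchmark rather than on the raw score, a red cell means "outside the top ten on this benchmark", not "below a fixed score threshold".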

Full leaderboard

Quality, cost, and speed

| Model | Score | Rating | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|-------|-------|--------|-------|-----------------|--------------------|-------------------|
| claude-opus-4.7 | 83.83 | Strong | 12/12 | $0.0570 | 40.1s | - |
| kimi-k2.5 | 82.33 | Strong | 12/12 | $0.0151 | 65.4s | - |
| claude-opus-4.8-low | 82.33 | Strong | 12/12 | $0.0559 | 34.9s | - |
| claude-opus-4.8 | 82.17 | Strong | 12/12 | $0.0543 | 33.6s | - |
| claude-opus-4.8-high | 82.0 | Strong | 12/12 | $0.0558 | 35.1s | - |
| gpt-5.5 | 81.75 | Strong | 12/12 | $0.1089 | 67.8s | - |
| qwen3.7-max | 81.17 | Strong | 12/12 | $0.0155 | 62.6s | - |
| grok-4.20-beta | 80.67 | Strong | 12/12 | $0.0155 | 20.1s | Wrapper text |
| qwen3.5-plus-02-15 | 80.33 | Strong | 12/12 | $0.0141 | 54.6s | - |
| gemini-3.1-pro-preview | 80.25 | Strong | 12/12 | $0.0412 | 33.5s | Incomplete output |
| gemini-3-flash-preview | 80.08 | Strong | 12/12 | $0.0184 | 20.4s | - |
| gpt-5.4-nano | 78.5 | Usable | 12/12 | $0.0194 | 25.5s | Incomplete output |
| gpt-5.4-mini | 78.33 | Usable | 12/12 | $0.0220 | 21.5s | - |
| glm-5.1 | 78.25 | Usable | 12/12 | $0.0141 | 79.7s | Incomplete output |
| glm-5 | 78.25 | Usable | 12/12 | $0.0143 | 62.0s | Incomplete output |
| claude-sonnet-4.6 | 78.08 | Usable | 12/12 | $0.0475 | 57.5s | Incomplete output, Missing required element |
| deepseek-v3.2 | 77.42 | Usable | 12/12 | $0.0129 | 43.7s | - |
| gpt-5.4 | 77.0 | Usable | 12/12 | $0.0518 | 49.8s | Incomplete output |
| claude-opus-4.6 | 76.58 | Usable | 12/12 | $0.0763 | 73.4s | Incomplete output, Missing required element |
| claude-opus-4.6-high | 75.75 | Usable | 12/12 | $0.0759 | 74.6s | Incomplete output, Missing required element |
| gemini-3.5-flash-high | 65.42 | Needs editing | 12/12 | $0.0348 | 22.3s | Incomplete output, Missing required element |
| gpt-5.5-pro | 63.0 | Needs editing | 12/12 | $0.4489 | 76.0s | Incomplete output, Missing required element |
| minimax-m2.7 | 45.17 | Weak | 12/12 | $0.0129 | 45.5s | Incomplete output, Missing required element, Malformed output |

Test cases

Where the scores come from

Each row is one prompt, with its score distribution (average, max, min), its top and lowest performers, and the problems judges flagged most often.

| Test | ID | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|------|----|-----------|-----|-----|-----|-----------|--------------|-------------------|
| AI vendor landing page claims | strategy_vendor_001 | Vendor Claim Review | 80.5 | 85.0 | 73.0 | gpt-5.5 · 85 | gemini-3.5-flash-high · 73 | Incomplete output ×4, Missing required element ×1 |
| AI SDR vendor | strategy_vendor_002 | Vendor Claim Review | 78.0 | 85.0 | 10.0 | kimi-k2.5 · 85 | minimax-m2.7 · 10 | Incomplete output ×3, Missing required element ×1 |
| AI document analysis vendor | strategy_vendor_003 | Vendor Claim Review | 79.0 | 85.0 | 33.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 33 | Incomplete output ×4, Missing required element ×1 |
| Vague AI roadmap | strategy_reality_001 | AI Strategy Reality Check | 78.1 | 86.0 | 38.0 | claude-opus-4.7 · 86 | gpt-5.5-pro · 38 | Incomplete output ×4, Missing required element ×1 |
| Department-level AI adoption memo | strategy_reality_002 | AI Strategy Reality Check | 75.1 | 85.0 | 13.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 13 | Incomplete output ×8, Missing required element ×3 |
| Board request for AI cost savings | strategy_reality_003 | AI Strategy Reality Check | 71.4 | 84.0 | 9.0 | claude-opus-4.7 · 84 | minimax-m2.7 · 9 | Incomplete output ×7, Missing required element ×4, Malformed output ×1 |
| Internal RAG pilot | strategy_pilot_001 | Pilot-to-Production Test | 77.2 | 85.0 | 62.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 62 | Incomplete output ×5, Missing required element ×1 |
| AI sales email pilot | strategy_pilot_002 | Pilot-to-Production Test | 77.0 | 84.0 | 48.0 | gemini-3.1-pro-preview · 84 | gpt-5.5-pro · 48 | Incomplete output ×7, Missing required element ×4 |
| Automated invoice processing pilot | strategy_pilot_003 | Pilot-to-Production Test | 76.7 | 86.0 | 50.0 | claude-opus-4.7 · 86 | gemini-3.5-flash-high · 50 | Incomplete output ×4, Missing required element ×3, Wrapper text ×1 |
| AI support bot with messy docs | strategy_data_001 | Data Readiness Test | 81.6 | 86.0 | 61.0 | claude-opus-4.7 · 86 | minimax-m2.7 · 61 | Incomplete output ×4, Missing required element ×1 |
| Contract review with inconsistent templates | strategy_data_002 | Data Readiness Test | 74.7 | 84.0 | 34.0 | claude-opus-4.7 · 84 | minimax-m2.7 · 34 | Incomplete output ×5, Missing required element ×1 |
| Sales forecasting with poor CRM hygiene | strategy_data_003 | Data Readiness Test | 73.4 | 83.0 | 37.0 | kimi-k2.5 · 83 | minimax-m2.7 · 37 | Incomplete output ×8, Missing required element ×4 |

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.7

12 scored tests · Strong

83.83

Highest traits

clarity: 8.63
practicality: 8.5
metric quality: 8.48
rollout practicality: 8.47
implementation awareness: 8.43

Lowest traits

ownership clarity: 7.83
prioritisation: 8.17
claim scepticism: 8.3
question quality: 8.3
data risk detection: 8.33

kimi-k2.5

12 scored tests · Strong

82.33

Highest traits

specificity: 8.43
buyer relevance: 8.4
source of truth quality: 8.33
implementation awareness: 8.33
question quality: 8.33

Lowest traits

ownership clarity: 7.8
evaluation awareness: 7.93
production risk detection: 8.07
metric quality: 8.1
prioritisation: 8.13

claude-opus-4.8-low

12 scored tests · Strong

82.33

Highest traits

implementation awareness: 8.5
source of truth quality: 8.47
clarity: 8.4
practicality: 8.4
claim scepticism: 8.3

Lowest traits

ownership clarity: 7.83
specificity: 8.1
evaluation awareness: 8.13
prioritisation: 8.17
data risk detection: 8.17

claude-opus-4.8

12 scored tests · Strong

82.17

Highest traits

clarity: 8.47
practicality: 8.37
production risk detection: 8.3
implementation awareness: 8.27
missing elements: 8.27

Lowest traits

ownership clarity: 8.0
evaluation awareness: 8.07
practical recommendation: 8.07
specificity: 8.1
data risk detection: 8.13

claude-opus-4.8-high

12 scored tests · Strong

82.0

Highest traits

implementation awareness: 8.33
production risk detection: 8.33
buyer relevance: 8.33
claim scepticism: 8.3
source of truth quality: 8.27

Lowest traits

ownership clarity: 7.93
missing elements: 8.03
specificity: 8.07
data risk detection: 8.1
metric quality: 8.12

gpt-5.5

12 scored tests · Strong

81.75

Highest traits

buyer relevance: 8.53
claim scepticism: 8.5
practical recommendation: 8.47
clarity: 8.43
practicality: 8.43

Lowest traits

specificity: 7.6
prioritisation: 7.7
data risk detection: 7.8
implementation awareness: 7.83
source of truth quality: 7.93

qwen3.7-max

12 scored tests · Strong

81.17

Highest traits

clarity: 8.47
practicality: 8.37
buyer relevance: 8.3
practical recommendation: 8.27
question quality: 8.23

Lowest traits

ownership clarity: 7.77
evaluation awareness: 7.8
production risk detection: 8.0
specificity: 8.03
prioritisation: 8.03

grok-4.20-beta

12 scored tests · Strong

80.67

Highest traits

buyer relevance: 8.4
question quality: 8.33
practical recommendation: 8.33
claim scepticism: 8.3
clarity: 8.23

Lowest traits

prioritisation: 7.83
implementation awareness: 7.83
ownership clarity: 7.83
specificity: 7.9
data risk detection: 7.93

qwen3.5-plus-02-15

12 scored tests · Strong

80.33

Highest traits

buyer relevance: 8.2
practical recommendation: 8.2
clarity: 8.17
claim scepticism: 8.17
prioritisation: 8.13

Lowest traits

ownership clarity: 7.63
evaluation awareness: 7.7
production risk detection: 7.87
implementation awareness: 7.97
specificity: 8.0

gemini-3.1-pro-preview

12 scored tests · Strong

80.25

Highest traits

question quality: 8.3
buyer relevance: 8.23
practicality: 8.2
practical recommendation: 8.17
data risk detection: 8.13

Lowest traits

ownership clarity: 7.6
evaluation awareness: 7.77
metric quality: 7.88
prioritisation: 7.97
rollout practicality: 7.97

gemini-3-flash-preview

12 scored tests · Strong

80.08

Highest traits

question quality: 8.47
buyer relevance: 8.33
practical recommendation: 8.3
claim scepticism: 8.2
source of truth quality: 8.13

Lowest traits

ownership clarity: 7.43
evaluation awareness: 7.77
prioritisation: 7.8
missing elements: 7.83
rollout practicality: 7.9

gpt-5.4-nano

12 scored tests · Usable

78.5

Highest traits

source of truth quality: 8.2
implementation awareness: 8.17
buyer relevance: 8.07
data risk detection: 8.03
question quality: 8.03

Lowest traits

clarity: 7.33
ownership clarity: 7.5
evaluation awareness: 7.7
practicality: 7.73
rollout practicality: 7.77

Compare models

Trait profile and test-by-test matchup

Pick two models to compare average trait scores and individual test outcomes from the published calibrated results.

Per-test comparison

Sorted by the biggest absolute score difference between the selected models.
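As a rough sketch of that ordering, the snippet below compares two models on the tests where the test-case table reports both of their scores. The `ScoreDelta` type and the `compare` function are hypothetical names, and breaking ties by test ID is an assumption. The scores are the claude-opus-4.7 and minimax-m2.7 values listed as top and lowest results for six tests.

```python
from typing import NamedTuple


class ScoreDelta(NamedTuple):
    test_id: str
    score_a: float
    score_b: float

    @property
    def delta(self) -> float:
        """Signed difference: positive when model A scored higher."""
        return self.score_a - self.score_b


def compare(a: dict[str, float], b: dict[str, float]) -> list[ScoreDelta]:
    """Pair up tests both models completed, largest absolute difference first."""
    shared = a.keys() & b.keys()
    rows = [ScoreDelta(test_id, a[test_id], b[test_id]) for test_id in shared]
    # Ties on |delta| fall back to test ID so the order is deterministic (assumption).
    return sorted(rows, key=lambda row: (-abs(row.delta), row.test_id))


opus_47 = {
    "strategy_vendor_003": 85, "strategy_reality_002": 85, "strategy_reality_003": 84,
    "strategy_pilot_001": 85, "strategy_data_001": 86, "strategy_data_002": 84,
}
minimax_m27 = {
    "strategy_vendor_003": 33, "strategy_reality_002": 13, "strategy_reality_003": 9,
    "strategy_pilot_001": 62, "strategy_data_001": 61, "strategy_data_002": 34,
}

rows = compare(opus_47, minimax_m27)
for row in rows:
    print(f"{row.test_id}: {row.delta:+.0f}")
```

Sorting on the absolute value keeps the largest gaps at the top regardless of which model is ahead, while the signed delta still shows the direction.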


Frequent problems

Where models break

Incomplete output: 63
Missing required element: 25
Wrapper text: 1
Malformed output: 1

Example outputs

Winner, middle, failure

winner

claude-opus-4.7 · 86

The response demonstrates unusually deep, expert-level judgment, offering highly specific, actionable advice that perfectly addresses the constraints and realities of AI support bot implementation without relying on generic platitudes.

# Data Readiness Assessment: AI Support Bot

## Main Problems

**1. Help docs are stale and contradict current reality**
- Why it matters: A bot trained on outdated docs will confidently give wrong answers at scale. Worse than a human agent making a one-off mistake, because the bot will repeat the same wrong answer thousands of times before anyone notices.

**2. Refund policy changed twice in six months with no version control**
- Why it matters: Past tickets contain three different refund policies treated as equally valid training signal. The bot will average across contradictions or pick the most frequent (likely the oldest) answer. Refund disputes also carry direct financial and legal exposure.

**3. Agents rely on undocumented exceptions ("tribal knowledge")**
- Why it matters: This is the gap between what's written and what actually resolves tickets. If you only train on docs, the bot underperforms agents. If you train on tickets, the bot learns exceptions without understanding when they apply — it will grant exceptions to everyone.

**4. Past tickets contain PII**
- Why it matters: Names, emails, payment details, addresses leaking into model weights or retrieval indexes creates GDPR/CCPA exposure and risk of the bot regurgitating one customer's data to another.

**5. No source of truth**
- Why it matters: When the bot is wrong, no one can say what "right" was. You can't evaluate accuracy, can't update content systematically, and can't defend bot decisions to customers or auditors.

## Fix Before Building

- **Reconcile the refund policy.** One document, dated, with the current rules and a clear "as of" effective date. Tag historical tickets with which policy version was in effect so you can filter or exclude.
- **PII scrubbing pipeline for tickets.** Run NER + reg

failure

minimax-m2.7 · 9

The model response cuts off abruptly with a severe formatting glitch and fails to provide four of the five requested deliverables (vague claims, risks, 90-day plan, metrics).

Below is a **critical‑review checklist** that you can hand back to the memo’s authors (or use as a Board‑level “request for clarification”).  
It is organized into the five deliverables you asked for: **missing pieces, vague claims, risks, 90‑day plan, and metrics**.  A short **template for a revised memo** is also provided so the Board can see exactly what a credible, AI‑driven cost‑saving case should look like.

---

## 1. Missing Pieces (what the memo does not contain)

| # | Topic | Why It Matters | What to Add |
|---|-------|----------------|-------------|
| 1 | **Process Inventory & Mapping** | You can’t automate what you haven’t documented. | A “process‑as‑is” catalogue (value‑stream map) for every manual workflow targeted for AI (e.g., order entry, invoicing, customer‑onboarding, report generation). |
| 2 | **Data Owner & Governance** | AI models are only as good as the data they learn from. | Designate a **Data Owner** (or a Data Steward) for each data set used by AI. Include a one‑page data‑governance charter (quality, lineage, privacy, security). |
| 3 | **Baseline Cost & Performance Metrics** | The 20 % saving claim is meaningless without a reference point. | Capture current **cost per transaction**, **cycle time**, **error rate**, and **FTE** (full‑time‑equivalent) hours for each process. |
| 4 | **AI Use‑Case Definition** | “AI” is an umbrella – the memo must specify the technology (RPA, NLP, computer vision, predictive models, etc.). | For each process, list the **AI technique**, the **input data**, the **output/action**, and the **expected impact**. |
| 5 | **Technology Selection & Build‑vs‑Buy Analysis** | The solution will dictate cost, risk, and timeline. | Provide a short vendor‑evaluation matrix (or internal‑build effort estimate) for the top 2‑3 AI

mid performer

kimi-k2.5 · 80

The model demonstrates exceptional enterprise AI maturity, particularly with the inclusion of a pragmatic 'CEO Timeline Compromise', specific RBAC and data fragmentation concerns, and highly actionable, quantifiable metrics.

 **Readiness Assessment: 2/10 — NOT PRODUCTION READY**

Current state presents critical gaps in governance, data integrity, and validation. The prototype demonstrates technical feasibility but lacks the infrastructure required for enterprise-scale deployment. "Works in demos" indicates cherry-picked performance, not robust operation across heterogeneous document sources.

---

### Key Risks

**1. Hallucination & Accuracy Debt**  
Without a formal evaluation set, the false positive rate is unknown. Internal chatbots carry high liability stakes (HR policies, security procedures, financial guidelines). Employees treat internal AI as authoritative; a confident wrong answer about vacation policy or data handling creates legal and operational exposure.

**2. Knowledge Fragmentation**  
Google Drive + Notion + legacy PDFs = conflicting source of truth. The bot will surface outdated PDF policies over current Notion pages, or vice versa, creating institutional confusion and "shadow" processes.

**3. Orphaned System Risk**  
No owner post-launch guarantees model drift, stale documentation indices, and unpatched vulnerabilities. This becomes technical debt within 30 days.

**4. Access Control Violations**  
Likely ingesting docs without RBAC filtering. Risk of exposing restricted compensation data, termination lists, or strategic plans to unauthorized staff.

**5. Support Avalanche**  
All-staff rollout without fallback mechanisms will flood IT/Operations with "why did it say this?" tickets, negating any efficiency gains.

---

### Must Fix Before Any Rollout (Non-Negotiable)

| Category | Requirement | Timeline |
|----------|-------------|----------|
| **Governance** | Assign permanent Product Owner (not just builder) with budget and headcount for maintenance | Week 1 |
| **Valid

Methodology

How scores are produced

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.

LLM judge

A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
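The page does not publish the judge's JSON schema or the normalization formula, so the following is only a sketch under stated assumptions: the judge returns a flat JSON object mapping trait names to numbers, and the 0-100 score is the unweighted mean of the traits times 10. The names `parse_judge_output` and `normalize` are hypothetical.

```python
import json


def parse_judge_output(raw: str, expected_traits: set[str]) -> dict[str, float]:
    """Strictly parse a judge reply: a JSON object with exactly the expected traits, each 0-10."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON replies
    if not isinstance(data, dict) or set(data) != expected_traits:
        raise ValueError("judge output must contain exactly the expected traits")
    for trait, value in data.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"trait {trait!r} must be a number from 0 to 10")
    return {trait: float(value) for trait, value in data.items()}


def normalize(traits: dict[str, float]) -> float:
    """Assumed normalization: unweighted mean of the 0-10 trait scores, scaled to 0-100."""
    return round(sum(traits.values()) / len(traits) * 10, 2)


# Hypothetical judge reply using three trait names that appear in the model profiles.
raw = '{"clarity": 9, "practicality": 8, "claim scepticism": 7}'
traits = parse_judge_output(raw, {"clarity", "practicality", "claim scepticism"})
score = normalize(traits)
```

Rejecting any reply that is not exactly the expected shape is one way to read "strict": a malformed judge reply becomes an error to retry instead of a silently wrong score.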

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
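A minimal sketch of what such checks might look like follows. The function name `run_heuristics`, the flag names, the thresholds, and the example phrases are all assumptions made for illustration; the page does not list the actual rules. Safety flags are left out because nothing here indicates how they are detected.

```python
import json


def run_heuristics(
    output: str,
    required_sections: list[str],
    banned_phrases: list[str],
    min_chars: int = 200,
    expect_json: bool = False,
) -> list[str]:
    """Return a list of flags; an empty list means every deterministic check passed."""
    flags: list[str] = []
    lowered = output.lower()

    # Length check: very short outputs are flagged.
    if len(output.strip()) < min_chars:
        flags.append("too_short")

    # Banned phrases, matched case-insensitively.
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")

    # Required sections, matched as case-insensitive substrings.
    for section in required_sections:
        if section.lower() not in lowered:
            flags.append(f"missing_section:{section}")

    # Format validity, shown here only for tasks that expect JSON output.
    if expect_json:
        try:
            json.loads(output)
        except json.JSONDecodeError:
            flags.append("invalid_format")

    return flags


flags = run_heuristics(
    "## Key Risks\nAs an AI language model, I cannot assess this.",
    required_sections=["Key Risks", "Metrics"],
    banned_phrases=["as an AI language model"],
    min_chars=20,
)
```

Because these checks are plain string and parser operations, the same output always produces the same flags, which is what makes them a useful counterweight to the judge model.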

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.