business benchmark collection

AI Strategy

Benchmarks for testing whether models can evaluate AI initiatives, vendor claims, implementation risks, and production readiness.

Which models can separate useful AI strategy from hype, theatre, and fragile pilots?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.7

83.83

Lowest cost / eval

deepseek-v3.2

$0.0129

Median rank score

78.5

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
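One way to read a cost-versus-score chart like this is to look for models that no other model beats on both axes at once. The sketch below does that for a handful of rows copied from the full leaderboard on this page. The function name `pareto_frontier` and the choice of rows are illustrative assumptions, not part of the benchmark itself.

```python
# (model, average cost per task in USD, overall score), copied from the leaderboard.
ROWS = [
    ("claude-opus-4.7", 0.0570, 83.83),
    ("kimi-k2.5", 0.0151, 82.33),
    ("gpt-5.5", 0.1089, 81.75),
    ("qwen3.5-plus-02-15", 0.0141, 80.33),
    ("deepseek-v3.2", 0.0129, 77.42),
    ("gpt-5.5-pro", 0.4489, 63.0),
]


def pareto_frontier(rows):
    """Return models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher score first.
    for name, cost, score in sorted(rows, key=lambda r: (r[1], -r[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    print(pareto_frontier(ROWS))
```

On this subset, gpt-5.5 and gpt-5.5-pro drop out because a cheaper model scores higher than each of them.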

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.

Legend: the color scale runs from "Below top 10" to "#1".
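The coloring rule can be sketched as a per-benchmark ranking followed by a bucket lookup. The scores below are the Vendor Claim Review column from the heatmap table. The tie-breaking rule (tied scores share the better rank) and the two bucket names are assumptions; the page does not say how ties are handled or how many green shades it uses, and the displayed scores are rounded, so apparent ties may not be ties in the underlying data.

```python
# Vendor Claim Review scores as displayed in the heatmap table.
VENDOR_CLAIM_REVIEW = {
    "claude-opus-4.7": 83.3,
    "kimi-k2.5": 83.3,
    "claude-opus-4.8-low": 82.0,
    "claude-opus-4.8": 81.3,
    "claude-opus-4.8-high": 83.0,
    "gpt-5.5": 84.0,
    "qwen3.7-max": 82.0,
    "grok-4.20-beta": 82.7,
    "qwen3.5-plus-02-15": 81.3,
    "gemini-3.1-pro-preview": 81.7,
    "gemini-3-flash-preview": 82.7,
    "gpt-5.4-nano": 80.0,
    "gpt-5.4-mini": 79.0,
    "glm-5.1": 80.7,
    "glm-5": 81.0,
    "claude-sonnet-4.6": 81.7,
    "deepseek-v3.2": 77.3,
    "gpt-5.4": 80.7,
    "claude-opus-4.6": 84.0,
    "claude-opus-4.6-high": 83.3,
    "gemini-3.5-flash-high": 73.3,
    "gpt-5.5-pro": 72.0,
    "minimax-m2.7": 40.3,
}


def rank_within_benchmark(scores):
    """Map model -> 1-based rank, highest score first; ties share the better rank (assumed)."""
    ordered = sorted(scores.values(), reverse=True)
    return {model: ordered.index(score) + 1 for model, score in scores.items()}


def color_bucket(rank, top_n=10):
    """Hypothetical two-bucket version of the heatmap coloring."""
    return "green" if rank <= top_n else "red"


if __name__ == "__main__":
    ranks = rank_within_benchmark(VENDOR_CLAIM_REVIEW)
    for model in ("claude-opus-4.6", "gemini-3.1-pro-preview"):
        print(model, ranks[model], color_bucket(ranks[model]))
```

Ranking per benchmark rather than overall is why a model far down the overall table can still show a green cell on a single benchmark.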

Rank	Model	Scored tests	Overall	Vendor Claim Review	AI Strategy Reality Check	Pilot-to-Production Test	Data Readiness Test
1	claude-opus-4.7	12	83.8	83.3	85.0	83.7	83.3
2	kimi-k2.5	12	82.3	83.3	82.7	80.3	83.0
3	claude-opus-4.8-low	12	82.3	82.0	83.3	81.3	82.7
4	claude-opus-4.8	12	82.2	81.3	83.7	82.0	81.7
5	claude-opus-4.8-high	12	82.0	83.0	81.0	82.3	81.7
6	gpt-5.5	12	81.8	84.0	83.3	82.3	77.3
7	qwen3.7-max	12	81.2	82.0	82.7	79.7	80.3
8	grok-4.20-beta	12	80.7	82.7	81.7	79.3	79.0
9	qwen3.5-plus-02-15	12	80.3	81.3	81.3	78.7	80.0
10	gemini-3.1-pro-preview	12	80.2	81.7	80.7	78.3	80.3
11	gemini-3-flash-preview	12	80.1	82.7	79.3	78.7	79.7
12	gpt-5.4-nano	12	78.5	80.0	75.7	78.0	80.3
13	gpt-5.4-mini	12	78.3	79.0	80.0	78.0	76.3
14	glm-5.1	12	78.2	80.7	79.7	76.3	76.3
15	glm-5	12	78.2	81.0	78.7	78.7	74.7
16	claude-sonnet-4.6	12	78.1	81.7	71.3	76.7	82.7
17	deepseek-v3.2	12	77.4	77.3	78.7	76.3	77.3
18	gpt-5.4	12	77.0	80.7	71.3	77.0	79.0
19	claude-opus-4.6	12	76.6	84.0	78.3	70.7	73.3
20	claude-opus-4.6-high	12	75.8	83.3	76.3	75.3	68.0
21	gemini-3.5-flash-high	12	65.4	73.3	52.7	67.0	68.7
22	gpt-5.5-pro	12	63.0	72.0	41.0	67.7	71.3
23	minimax-m2.7	12	45.2	40.3	34.0	62.3	44.0
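The published overall scores look consistent with a plain average of the four benchmark columns. The page does not state the formula, so the check below is an observation rather than the documented method. Because the benchmark columns are displayed rounded to one decimal, the recomputed value can differ from the published one by a few hundredths.

```python
from statistics import mean

# Four benchmark scores per model, as displayed in the heatmap table.
BENCHMARK_SCORES = {
    "claude-opus-4.7": [83.3, 85.0, 83.7, 83.3],
    "kimi-k2.5": [83.3, 82.7, 80.3, 83.0],
    "minimax-m2.7": [40.3, 34.0, 62.3, 44.0],
}

# Overall scores as published in the full leaderboard.
PUBLISHED_OVERALL = {
    "claude-opus-4.7": 83.83,
    "kimi-k2.5": 82.33,
    "minimax-m2.7": 45.17,
}


def overall(scores):
    """Unweighted mean of the benchmark scores (assumed aggregation)."""
    return mean(scores)


if __name__ == "__main__":
    for model, scores in BENCHMARK_SCORES.items():
        recomputed = overall(scores)
        print(f"{model}: recomputed {recomputed:.2f}, published {PUBLISHED_OVERALL[model]}")
```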

Full leaderboard

Quality, cost, and speed

Model	Score	Rating	Tests	Avg cost / task	Avg seconds / task	Frequent problems
claude-opus-4.7	83.83	Strong	12/12	$0.0570	40.1s	-
kimi-k2.5	82.33	Strong	12/12	$0.0151	65.4s	-
claude-opus-4.8-low	82.33	Strong	12/12	$0.0559	34.9s	-
claude-opus-4.8	82.17	Strong	12/12	$0.0543	33.6s	-
claude-opus-4.8-high	82.0	Strong	12/12	$0.0558	35.1s	-
gpt-5.5	81.75	Strong	12/12	$0.1089	67.8s	-
qwen3.7-max	81.17	Strong	12/12	$0.0155	62.6s	-
grok-4.20-beta	80.67	Strong	12/12	$0.0155	20.1s	Wrapper text
qwen3.5-plus-02-15	80.33	Strong	12/12	$0.0141	54.6s	-
gemini-3.1-pro-preview	80.25	Strong	12/12	$0.0412	33.5s	Incomplete output
gemini-3-flash-preview	80.08	Strong	12/12	$0.0184	20.4s	-
gpt-5.4-nano	78.5	Usable	12/12	$0.0194	25.5s	Incomplete output
gpt-5.4-mini	78.33	Usable	12/12	$0.0220	21.5s	-
glm-5.1	78.25	Usable	12/12	$0.0141	79.7s	Incomplete output
glm-5	78.25	Usable	12/12	$0.0143	62.0s	Incomplete output
claude-sonnet-4.6	78.08	Usable	12/12	$0.0475	57.5s	Incomplete output, Missing required element
deepseek-v3.2	77.42	Usable	12/12	$0.0129	43.7s	-
gpt-5.4	77.0	Usable	12/12	$0.0518	49.8s	Incomplete output
claude-opus-4.6	76.58	Usable	12/12	$0.0763	73.4s	Incomplete output, Missing required element
claude-opus-4.6-high	75.75	Usable	12/12	$0.0759	74.6s	Incomplete output, Missing required element
gemini-3.5-flash-high	65.42	Needs editing	12/12	$0.0348	22.3s	Incomplete output, Missing required element
gpt-5.5-pro	63.0	Needs editing	12/12	$0.4489	76.0s	Incomplete output, Missing required element
minimax-m2.7	45.17	Weak	12/12	$0.0129	45.5s	Incomplete output, Missing required element, Malformed output

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

Test	Test ID	Benchmark	Avg	Max	Min	Top model	Lowest model	Frequent problems
AI vendor landing page claims	strategy_vendor_001	Vendor Claim Review	80.5	85.0	73.0	gpt-5.5 · 85	gemini-3.5-flash-high · 73	Incomplete output ×4, Missing required element ×1
AI SDR vendor	strategy_vendor_002	Vendor Claim Review	78.0	85.0	10.0	kimi-k2.5 · 85	minimax-m2.7 · 10	Incomplete output ×3, Missing required element ×1
AI document analysis vendor	strategy_vendor_003	Vendor Claim Review	79.0	85.0	33.0	claude-opus-4.7 · 85	minimax-m2.7 · 33	Incomplete output ×4, Missing required element ×1
Vague AI roadmap	strategy_reality_001	AI Strategy Reality Check	78.1	86.0	38.0	claude-opus-4.7 · 86	gpt-5.5-pro · 38	Incomplete output ×4, Missing required element ×1
Department-level AI adoption memo	strategy_reality_002	AI Strategy Reality Check	75.1	85.0	13.0	claude-opus-4.7 · 85	minimax-m2.7 · 13	Incomplete output ×8, Missing required element ×3
Board request for AI cost savings	strategy_reality_003	AI Strategy Reality Check	71.4	84.0	9.0	claude-opus-4.7 · 84	minimax-m2.7 · 9	Incomplete output ×7, Missing required element ×4, Malformed output ×1
Internal RAG pilot	strategy_pilot_001	Pilot-to-Production Test	77.2	85.0	62.0	claude-opus-4.7 · 85	minimax-m2.7 · 62	Incomplete output ×5, Missing required element ×1
AI sales email pilot	strategy_pilot_002	Pilot-to-Production Test	77.0	84.0	48.0	gemini-3.1-pro-preview · 84	gpt-5.5-pro · 48	Incomplete output ×7, Missing required element ×4
Automated invoice processing pilot	strategy_pilot_003	Pilot-to-Production Test	76.7	86.0	50.0	claude-opus-4.7 · 86	gemini-3.5-flash-high · 50	Incomplete output ×4, Missing required element ×3, Wrapper text ×1
AI support bot with messy docs	strategy_data_001	Data Readiness Test	81.6	86.0	61.0	claude-opus-4.7 · 86	minimax-m2.7 · 61	Incomplete output ×4, Missing required element ×1
Contract review with inconsistent templates	strategy_data_002	Data Readiness Test	74.7	84.0	34.0	claude-opus-4.7 · 84	minimax-m2.7 · 34	Incomplete output ×5, Missing required element ×1
Sales forecasting with poor CRM hygiene	strategy_data_003	Data Readiness Test	73.4	83.0	37.0	kimi-k2.5 · 83	minimax-m2.7 · 37	Incomplete output ×8, Missing required element ×4

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.7

12 scored tests · Strong

83.83

Highest traits

clarity: 8.63
practicality: 8.5
metric quality: 8.48
rollout practicality: 8.47
implementation awareness: 8.43

Lowest traits

ownership clarity: 7.83
prioritisation: 8.17
claim scepticism: 8.3
question quality: 8.3
data risk detection: 8.33

kimi-k2.5

12 scored tests · Strong

82.33

Highest traits

specificity: 8.43
buyer relevance: 8.4
source of truth quality: 8.33
implementation awareness: 8.33
question quality: 8.33

Lowest traits

ownership clarity: 7.8
evaluation awareness: 7.93
production risk detection: 8.07
metric quality: 8.1
prioritisation: 8.13

claude-opus-4.8-low

12 scored tests · Strong

82.33

Highest traits

implementation awareness: 8.5
source of truth quality: 8.47
clarity: 8.4
practicality: 8.4
claim scepticism: 8.3

Lowest traits

ownership clarity: 7.83
specificity: 8.1
evaluation awareness: 8.13
prioritisation: 8.17
data risk detection: 8.17

claude-opus-4.8

12 scored tests · Strong

82.17

Highest traits

clarity: 8.47
practicality: 8.37
production risk detection: 8.3
implementation awareness: 8.27
missing elements: 8.27

Lowest traits

ownership clarity: 8.0
evaluation awareness: 8.07
practical recommendation: 8.07
specificity: 8.1
data risk detection: 8.13

claude-opus-4.8-high

12 scored tests · Strong

82.0

Highest traits

implementation awareness: 8.33
production risk detection: 8.33
buyer relevance: 8.33
claim scepticism: 8.3
source of truth quality: 8.27

Lowest traits

ownership clarity: 7.93
missing elements: 8.03
specificity: 8.07
data risk detection: 8.1
metric quality: 8.12

gpt-5.5

12 scored tests · Strong

81.75

Highest traits

buyer relevance: 8.53
claim scepticism: 8.5
practical recommendation: 8.47
clarity: 8.43
practicality: 8.43

Lowest traits

specificity: 7.6
prioritisation: 7.7
data risk detection: 7.8
implementation awareness: 7.83
source of truth quality: 7.93

qwen3.7-max

12 scored tests · Strong

81.17

Highest traits

clarity: 8.47
practicality: 8.37
buyer relevance: 8.3
practical recommendation: 8.27
question quality: 8.23

Lowest traits

ownership clarity: 7.77
evaluation awareness: 7.8
production risk detection: 8.0
specificity: 8.03
prioritisation: 8.03

grok-4.20-beta

12 scored tests · Strong

80.67

Highest traits

buyer relevance: 8.4
question quality: 8.33
practical recommendation: 8.33
claim scepticism: 8.3
clarity: 8.23

Lowest traits

prioritisation: 7.83
implementation awareness: 7.83
ownership clarity: 7.83
specificity: 7.9
data risk detection: 7.93

qwen3.5-plus-02-15

12 scored tests · Strong

80.33

Highest traits

buyer relevance: 8.2
practical recommendation: 8.2
clarity: 8.17
claim scepticism: 8.17
prioritisation: 8.13

Lowest traits

ownership clarity: 7.63
evaluation awareness: 7.7
production risk detection: 7.87
implementation awareness: 7.97
specificity: 8.0

gemini-3.1-pro-preview

12 scored tests · Strong

80.25

Highest traits

question quality: 8.3
buyer relevance: 8.23
practicality: 8.2
practical recommendation: 8.17
data risk detection: 8.13

Lowest traits

ownership clarity: 7.6
evaluation awareness: 7.77
metric quality: 7.88
prioritisation: 7.97
rollout practicality: 7.97

gemini-3-flash-preview

12 scored tests · Strong

80.08

Highest traits

question quality: 8.47
buyer relevance: 8.33
practical recommendation: 8.3
claim scepticism: 8.2
source of truth quality: 8.13

Lowest traits

ownership clarity: 7.43
evaluation awareness: 7.77
prioritisation: 7.8
missing elements: 7.83
rollout practicality: 7.9

gpt-5.4-nano

12 scored tests · Usable

78.5

Highest traits

source of truth quality: 8.2
implementation awareness: 8.17
buyer relevance: 8.07
data risk detection: 8.03
question quality: 8.03

Lowest traits

clarity: 7.33
ownership clarity: 7.5
evaluation awareness: 7.7
practicality: 7.73
rollout practicality: 7.77

Frequent problems

Where models break

Incomplete output: 63
Missing required element: 25
Wrapper text: 1
Malformed output: 1

Example outputs

Winner, middle, failure

winner

claude-opus-4.7 · 86

The response demonstrates unusually deep, expert-level judgment, offering highly specific, actionable advice that perfectly addresses the constraints and realities of AI support bot implementation without relying on generic platitudes.

# Data Readiness Assessment: AI Support Bot

## Main Problems

**1. Help docs are stale and contradict current reality**
- Why it matters: A bot trained on outdated docs will confidently give wrong answers at scale. Worse than a human agent making a one-off mistake, because the bot will repeat the same wrong answer thousands of times before anyone notices.

**2. Refund policy changed twice in six months with no version control**
- Why it matters: Past tickets contain three different refund policies treated as equally valid training signal. The bot will average across contradictions or pick the most frequent (likely the oldest) answer. Refund disputes also carry direct financial and legal exposure.

**3. Agents rely on undocumented exceptions ("tribal knowledge")**
- Why it matters: This is the gap between what's written and what actually resolves tickets. If you only train on docs, the bot underperforms agents. If you train on tickets, the bot learns exceptions without understanding when they apply — it will grant exceptions to everyone.

**4. Past tickets contain PII**
- Why it matters: Names, emails, payment details, addresses leaking into model weights or retrieval indexes creates GDPR/CCPA exposure and risk of the bot regurgitating one customer's data to another.

**5. No source of truth**
- Why it matters: When the bot is wrong, no one can say what "right" was. You can't evaluate accuracy, can't update content systematically, and can't defend bot decisions to customers or auditors.

## Fix Before Building

- **Reconcile the refund policy.** One document, dated, with the current rules and a clear "as of" effective date. Tag historical tickets with which policy version was in effect so you can filter or exclude.
- **PII scrubbing pipeline for tickets.** Run NER + reg

failure

minimax-m2.7 · 9

The model response cuts off abruptly with a severe formatting glitch and fails to provide four of the five requested deliverables (vague claims, risks, 90-day plan, metrics).

Below is a **critical‑review checklist** that you can hand back to the memo’s authors (or use as a Board‑level “request for clarification”).  
It is organized into the five deliverables you asked for: **missing pieces, vague claims, risks, 90‑day plan, and metrics**.  A short **template for a revised memo** is also provided so the Board can see exactly what a credible, AI‑driven cost‑saving case should look like.

---

## 1. Missing Pieces (what the memo does not contain)

| # | Topic | Why It Matters | What to Add |
|---|-------|----------------|-------------|
| 1 | **Process Inventory & Mapping** | You can’t automate what you haven’t documented. | A “process‑as‑is” catalogue (value‑stream map) for every manual workflow targeted for AI (e.g., order entry, invoicing, customer‑onboarding, report generation). |
| 2 | **Data Owner & Governance** | AI models are only as good as the data they learn from. | Designate a **Data Owner** (or a Data Steward) for each data set used by AI. Include a one‑page data‑governance charter (quality, lineage, privacy, security). |
| 3 | **Baseline Cost & Performance Metrics** | The 20 % saving claim is meaningless without a reference point. | Capture current **cost per transaction**, **cycle time**, **error rate**, and **FTE** (full‑time‑equivalent) hours for each process. |
| 4 | **AI Use‑Case Definition** | “AI” is an umbrella – the memo must specify the technology (RPA, NLP, computer vision, predictive models, etc.). | For each process, list the **AI technique**, the **input data**, the **output/action**, and the **expected impact**. |
| 5 | **Technology Selection & Build‑vs‑Buy Analysis** | The solution will dictate cost, risk, and timeline. | Provide a short vendor‑evaluation matrix (or internal‑build effort estimate) for the top 2‑3 AI

mid performer

kimi-k2.5 · 80

The model demonstrates exceptional enterprise AI maturity, particularly with the inclusion of a pragmatic 'CEO Timeline Compromise', specific RBAC and data fragmentation concerns, and highly actionable, quantifiable metrics.

**Readiness Assessment: 2/10 — NOT PRODUCTION READY**

Current state presents critical gaps in governance, data integrity, and validation. The prototype demonstrates technical feasibility but lacks the infrastructure required for enterprise-scale deployment. "Works in demos" indicates cherry-picked performance, not robust operation across heterogeneous document sources.

---

### Key Risks

**1. Hallucination & Accuracy Debt**
Without a formal evaluation set, the false positive rate is unknown. Internal chatbots carry high liability stakes (HR policies, security procedures, financial guidelines). Employees treat internal AI as authoritative; a confident wrong answer about vacation policy or data handling creates legal and operational exposure.

**2. Knowledge Fragmentation**
Google Drive + Notion + legacy PDFs = conflicting source of truth. The bot will surface outdated PDF policies over current Notion pages, or vice versa, creating institutional confusion and "shadow" processes.

**3. Orphaned System Risk**
No owner post-launch guarantees model drift, stale documentation indices, and unpatched vulnerabilities. This becomes technical debt within 30 days.

**4. Access Control Violations**
Likely ingesting docs without RBAC filtering. Risk of exposing restricted compensation data, termination lists, or strategic plans to unauthorized staff.

**5. Support Avalanche**
All-staff rollout without fallback mechanisms will flood IT/Operations with "why did it say this?" tickets, negating any efficiency gains.

---

### Must Fix Before Any Rollout (Non-Negotiable)

| Category | Requirement | Timeline |
|----------|-------------|----------|
| **Governance** | Assign permanent Product Owner (not just builder) with budget and headcount for maintenance | Week 1 |
| **Valid

Methodology

How scores are produced

Each model output is scored by an LLM judge that must return strict JSON, supported by deterministic heuristics. The result is then normalized to a 0-100 score.

LLM judge

A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
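As a rough illustration of the judge step, the sketch below parses a judge reply as strict JSON, validates that every trait is a number from 0 to 10, and rescales the average to 0-100. The JSON shape (a flat object of trait name to score), the equal weighting, and the function names are all assumptions; the page does not publish the judge schema or the exact normalization, and the heuristic adjustments are not modeled here.

```python
import json


def parse_judge_output(raw: str) -> dict[str, float]:
    """Parse a judge reply that must be a flat JSON object of trait -> 0-10 score."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict) or not data:
        raise ValueError("expected a non-empty JSON object of trait scores")
    traits = {}
    for name, value in data.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"trait {name!r} is not a number")
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} is outside the 0-10 range")
        traits[name] = float(value)
    return traits


def normalize(traits: dict[str, float]) -> float:
    """Rescale the unweighted mean of 0-10 trait scores to 0-100 (assumed formula)."""
    return round(10 * sum(traits.values()) / len(traits), 2)


if __name__ == "__main__":
    reply = '{"clarity": 9, "practicality": 8, "claim scepticism": 8.5}'
    print(normalize(parse_judge_output(reply)))
```

Rejecting anything that is not clean JSON keeps a chatty or truncated judge reply from silently producing a score.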

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
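A minimal sketch of three of these checks (length, banned phrases, required sections) follows. The threshold, the phrase list, the section names, and the problem labels are hypothetical; the benchmark's real rules and how they feed into the score are not published on this page.

```python
def run_checks(output, required_sections, banned_phrases, min_chars=200):
    """Return a list of problem labels found by simple deterministic checks."""
    problems = []
    text = output.strip()
    lowered = text.lower()

    # Length check: very short outputs are likely truncated or empty.
    if len(text) < min_chars:
        problems.append("too short")

    # Banned phrases: case-insensitive substring match.
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            problems.append(f"banned phrase: {phrase}")

    # Required sections: each expected section name must appear somewhere.
    for section in required_sections:
        if section.lower() not in lowered:
            problems.append(f"missing section: {section}")

    return problems


if __name__ == "__main__":
    sample = "## Risks\nData is stale.\n## Metrics\nTrack deflection rate."
    print(run_checks(
        sample,
        required_sections=["Risks", "Metrics", "90-day plan"],
        banned_phrases=["as an AI language model"],
        min_chars=10,
    ))
```

Checks like these are cheap and repeatable, which makes them a useful complement to a judge model whose scores can vary between runs.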

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.