Score vs. cost
Average task cost vs overall score
Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
business benchmark collection
Benchmarks for testing whether models can evaluate AI initiatives, vendor claims, implementation risks, and production readiness.
Which models can separate useful AI strategy from hype, theatre, and fragile pilots?
At a glance
- Top model: claude-opus-4.7 (83.83)
- Lowest cost / eval: deepseek-v3.2 ($0.0129)
- Median rank score: 78.5
- Last refresh: 2026-06-02
Overall ranking
Higher is better. Scores come from completed judged runs.
Benchmark heatmap
Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.
| Rank | Model | Overall | Vendor Claim Review | AI Strategy Reality Check | Pilot-to-Production Test | Data Readiness Test |
|---|---|---|---|---|---|---|
| 1 | 12 scored tests | 83.8 | 83.3 | 85.0 | 83.7 | 83.3 |
| 2 | 12 scored tests | 82.3 | 83.3 | 82.7 | 80.3 | 83.0 |
| 3 | 12 scored tests | 82.3 | 82.0 | 83.3 | 81.3 | 82.7 |
| 4 | 12 scored tests | 82.2 | 81.3 | 83.7 | 82.0 | 81.7 |
| 5 | 12 scored tests | 82.0 | 83.0 | 81.0 | 82.3 | 81.7 |
| 6 | 12 scored tests | 81.8 | 84.0 | 83.3 | 82.3 | 77.3 |
| 7 | 12 scored tests | 81.2 | 82.0 | 82.7 | 79.7 | 80.3 |
| 8 | 12 scored tests | 80.7 | 82.7 | 81.7 | 79.3 | 79.0 |
| 9 | 12 scored tests | 80.3 | 81.3 | 81.3 | 78.7 | 80.0 |
| 10 | 12 scored tests | 80.2 | 81.7 | 80.7 | 78.3 | 80.3 |
| 11 | 12 scored tests | 80.1 | 82.7 | 79.3 | 78.7 | 79.7 |
| 12 | 12 scored tests | 78.5 | 80.0 | 75.7 | 78.0 | 80.3 |
| 13 | 12 scored tests | 78.3 | 79.0 | 80.0 | 78.0 | 76.3 |
| 14 | 12 scored tests | 78.2 | 80.7 | 79.7 | 76.3 | 76.3 |
| 15 | 12 scored tests | 78.2 | 81.0 | 78.7 | 78.7 | 74.7 |
| 16 | 12 scored tests | 78.1 | 81.7 | 71.3 | 76.7 | 82.7 |
| 17 | 12 scored tests | 77.4 | 77.3 | 78.7 | 76.3 | 77.3 |
| 18 | 12 scored tests | 77.0 | 80.7 | 71.3 | 77.0 | 79.0 |
| 19 | 12 scored tests | 76.6 | 84.0 | 78.3 | 70.7 | 73.3 |
| 20 | 12 scored tests | 75.8 | 83.3 | 76.3 | 75.3 | 68.0 |
| 21 | 12 scored tests | 65.4 | 73.3 | 52.7 | 67.0 | 68.7 |
| 22 | 12 scored tests | 63.0 | 72.0 | 41.0 | 67.7 | 71.3 |
| 23 | 12 scored tests | 45.2 | 40.3 | 34.0 | 62.3 | 44.0 |
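The coloring rule stated for the heatmap (rank within each benchmark, top ten in greens, everything else red) can be sketched roughly as follows. This is an illustration only: the function name `heatmap_buckets`, the number of green shades, and the tie-breaking order are assumptions, since the page does not say how many greens are used or how ties are handled.

```python
def heatmap_buckets(scores, top_n=10, green_shades=5):
    """Assign a color bucket to each model from its rank within ONE benchmark.

    scores: dict mapping model name -> benchmark score (higher is better).
    top_n: ranks 1..top_n are green, the rest are red (from the page text).
    green_shades: hypothetical number of green tones; not stated in the source.
    """
    # Sort models from best to worst; ties keep their input order (an assumption).
    ranked = sorted(scores, key=scores.get, reverse=True)
    buckets = {}
    for position, model in enumerate(ranked):
        rank = position + 1
        if rank > top_n:
            buckets[model] = "red"
        else:
            # Spread ranks 1..top_n evenly across the green shades; 0 is the top shade.
            shade = (rank - 1) * green_shades // top_n
            buckets[model] = f"green-{shade}"
    return buckets


# Twelve placeholder models with strictly decreasing scores.
example = {f"model-{i}": 90.0 - i for i in range(1, 13)}
colors = heatmap_buckets(example)
print(colors["model-1"], colors["model-10"], colors["model-11"])
```

Because ranks are computed per benchmark, a model can be green in one column and red in another, which matches rows in the table where a low overall rank still carries a high Vendor Claim Review score.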
Full leaderboard
| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
| | 83.83 Strong | 12/12 | $0.0570 | 40.1s | - |
| | 82.33 Strong | 12/12 | $0.0151 | 65.4s | - |
| | 82.33 Strong | 12/12 | $0.0559 | 34.9s | - |
| | 82.17 Strong | 12/12 | $0.0543 | 33.6s | - |
| | 82.0 Strong | 12/12 | $0.0558 | 35.1s | - |
| | 81.75 Strong | 12/12 | $0.1089 | 67.8s | - |
| | 81.17 Strong | 12/12 | $0.0155 | 62.6s | - |
| | 80.67 Strong | 12/12 | $0.0155 | 20.1s | Wrapper text |
| | 80.33 Strong | 12/12 | $0.0141 | 54.6s | - |
| | 80.25 Strong | 12/12 | $0.0412 | 33.5s | Incomplete output |
| | 80.08 Strong | 12/12 | $0.0184 | 20.4s | - |
| | 78.5 Usable | 12/12 | $0.0194 | 25.5s | Incomplete output |
| | 78.33 Usable | 12/12 | $0.0220 | 21.5s | - |
| | 78.25 Usable | 12/12 | $0.0141 | 79.7s | Incomplete output |
| | 78.25 Usable | 12/12 | $0.0143 | 62.0s | Incomplete output |
| | 78.08 Usable | 12/12 | $0.0475 | 57.5s | Incomplete output, Missing required element |
| | 77.42 Usable | 12/12 | $0.0129 | 43.7s | - |
| | 77.0 Usable | 12/12 | $0.0518 | 49.8s | Incomplete output |
| | 76.58 Usable | 12/12 | $0.0763 | 73.4s | Incomplete output, Missing required element |
| | 75.75 Usable | 12/12 | $0.0759 | 74.6s | Incomplete output, Missing required element |
| | 65.42 Needs editing | 12/12 | $0.0348 | 22.3s | Incomplete output, Missing required element |
| | 63.0 Needs editing | 12/12 | $0.4489 | 76.0s | Incomplete output, Missing required element |
| | 45.17 Weak | 12/12 | $0.0129 | 45.5s | Incomplete output, Missing required element, Malformed output |
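The published Score values appear consistent with an unweighted mean of the four per-benchmark scores shown in the heatmap. The page does not state the aggregation, so treat this as an inferred assumption rather than the documented method. The sketch below checks a few rows; `overall_from_benchmarks` and `TOLERANCE` are names chosen for the illustration, and the tolerance allows for the benchmark columns being rounded to one decimal.

```python
from statistics import mean

# (published overall score, [Vendor Claim Review, AI Strategy Reality Check,
#  Pilot-to-Production Test, Data Readiness Test]) copied from the tables above,
# keyed by heatmap rank.
ROWS = {
    1: (83.83, [83.3, 85.0, 83.7, 83.3]),
    2: (82.33, [83.3, 82.7, 80.3, 83.0]),
    6: (81.75, [84.0, 83.3, 82.3, 77.3]),
    23: (45.17, [40.3, 34.0, 62.3, 44.0]),
}

# Benchmark columns are rounded to 0.1, so the recomputed mean can be off by
# up to about 0.05, plus rounding of the published score itself.
TOLERANCE = 0.06


def overall_from_benchmarks(benchmark_scores):
    """Assumed aggregation: plain average of the per-benchmark scores."""
    return mean(benchmark_scores)


for rank, (published, parts) in ROWS.items():
    recomputed = overall_from_benchmarks(parts)
    within = abs(recomputed - published) <= TOLERANCE
    print(f"rank {rank}: published={published} recomputed={recomputed:.3f} ok={within}")
```

Since every benchmark has three test cases, an unweighted mean over benchmarks is equivalent to an unweighted mean over all 12 tests.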
Test cases
Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.
| Test | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|
| AI vendor landing page claims (strategy_vendor_001) | Vendor Claim Review | 80.5 | 85.0 | 73.0 | gpt-5.5 · 85 | gemini-3.5-flash-high · 73 | Incomplete output ×4, Missing required element ×1 |
| AI SDR vendor (strategy_vendor_002) | Vendor Claim Review | 78.0 | 85.0 | 10.0 | kimi-k2.5 · 85 | minimax-m2.7 · 10 | Incomplete output ×3, Missing required element ×1 |
| AI document analysis vendor (strategy_vendor_003) | Vendor Claim Review | 79.0 | 85.0 | 33.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 33 | Incomplete output ×4, Missing required element ×1 |
| Vague AI roadmap (strategy_reality_001) | AI Strategy Reality Check | 78.1 | 86.0 | 38.0 | claude-opus-4.7 · 86 | gpt-5.5-pro · 38 | Incomplete output ×4, Missing required element ×1 |
| Department-level AI adoption memo (strategy_reality_002) | AI Strategy Reality Check | 75.1 | 85.0 | 13.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 13 | Incomplete output ×8, Missing required element ×3 |
| Board request for AI cost savings (strategy_reality_003) | AI Strategy Reality Check | 71.4 | 84.0 | 9.0 | claude-opus-4.7 · 84 | minimax-m2.7 · 9 | Incomplete output ×7, Missing required element ×4, Malformed output ×1 |
| Internal RAG pilot (strategy_pilot_001) | Pilot-to-Production Test | 77.2 | 85.0 | 62.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 62 | Incomplete output ×5, Missing required element ×1 |
| AI sales email pilot (strategy_pilot_002) | Pilot-to-Production Test | 77.0 | 84.0 | 48.0 | gemini-3.1-pro-preview · 84 | gpt-5.5-pro · 48 | Incomplete output ×7, Missing required element ×4 |
| Automated invoice processing pilot (strategy_pilot_003) | Pilot-to-Production Test | 76.7 | 86.0 | 50.0 | claude-opus-4.7 · 86 | gemini-3.5-flash-high · 50 | Incomplete output ×4, Missing required element ×3, Wrapper text ×1 |
| AI support bot with messy docs (strategy_data_001) | Data Readiness Test | 81.6 | 86.0 | 61.0 | claude-opus-4.7 · 86 | minimax-m2.7 · 61 | Incomplete output ×4, Missing required element ×1 |
| Contract review with inconsistent templates (strategy_data_002) | Data Readiness Test | 74.7 | 84.0 | 34.0 | claude-opus-4.7 · 84 | minimax-m2.7 · 34 | Incomplete output ×5, Missing required element ×1 |
| Sales forecasting with poor CRM hygiene (strategy_data_003) | Data Readiness Test | 73.4 | 83.0 | 37.0 | kimi-k2.5 · 83 | minimax-m2.7 · 37 | Incomplete output ×8, Missing required element ×4 |
Example outputs
winner
The response demonstrates unusually deep, expert-level judgment, offering highly specific, actionable advice that perfectly addresses the constraints and realities of AI support bot implementation without relying on generic platitudes.
# Data Readiness Assessment: AI Support Bot
## Main Problems
**1. Help docs are stale and contradict current reality**
- Why it matters: A bot trained on outdated docs will confidently give wrong answers at scale. Worse than a human agent making a one-off mistake, because the bot will repeat the same wrong answer thousands of times before anyone notices.
**2. Refund policy changed twice in six months with no version control**
- Why it matters: Past tickets contain three different refund policies treated as equally valid training signal. The bot will average across contradictions or pick the most frequent (likely the oldest) answer. Refund disputes also carry direct financial and legal exposure.
**3. Agents rely on undocumented exceptions ("tribal knowledge")**
- Why it matters: This is the gap between what's written and what actually resolves tickets. If you only train on docs, the bot underperforms agents. If you train on tickets, the bot learns exceptions without understanding when they apply — it will grant exceptions to everyone.
**4. Past tickets contain PII**
- Why it matters: Names, emails, payment details, addresses leaking into model weights or retrieval indexes creates GDPR/CCPA exposure and risk of the bot regurgitating one customer's data to another.
**5. No source of truth**
- Why it matters: When the bot is wrong, no one can say what "right" was. You can't evaluate accuracy, can't update content systematically, and can't defend bot decisions to customers or auditors.
## Fix Before Building
- **Reconcile the refund policy.** One document, dated, with the current rules and a clear "as of" effective date. Tag historical tickets with which policy version was in effect so you can filter or exclude.
- **PII scrubbing pipeline for tickets.** Run NER + reg
failure
The model response cuts off abruptly with a severe formatting glitch and fails to provide four of the five requested deliverables (vague claims, risks, 90-day plan, metrics).
Below is a **critical‑review checklist** that you can hand back to the memo’s authors (or use as a Board‑level “request for clarification”). It is organized into the five deliverables you asked for: **missing pieces, vague claims, risks, 90‑day plan, and metrics**. A short **template for a revised memo** is also provided so the Board can see exactly what a credible, AI‑driven cost‑saving case should look like.

---

## 1. Missing Pieces (what the memo does not contain)

| # | Topic | Why It Matters | What to Add |
|---|-------|----------------|-------------|
| 1 | **Process Inventory & Mapping** | You can’t automate what you haven’t documented. | A “process‑as‑is” catalogue (value‑stream map) for every manual workflow targeted for AI (e.g., order entry, invoicing, customer‑onboarding, report generation). |
| 2 | **Data Owner & Governance** | AI models are only as good as the data they learn from. | Designate a **Data Owner** (or a Data Steward) for each data set used by AI. Include a one‑page data‑governance charter (quality, lineage, privacy, security). |
| 3 | **Baseline Cost & Performance Metrics** | The 20 % saving claim is meaningless without a reference point. | Capture current **cost per transaction**, **cycle time**, **error rate**, and **FTE** (full‑time‑equivalent) hours for each process. |
| 4 | **AI Use‑Case Definition** | “AI” is an umbrella – the memo must specify the technology (RPA, NLP, computer vision, predictive models, etc.). | For each process, list the **AI technique**, the **input data**, the **output/action**, and the **expected impact**. |
| 5 | **Technology Selection & Build‑vs‑Buy Analysis** | The solution will dictate cost, risk, and timeline. | Provide a short vendor‑evaluation matrix (or internal‑build effort estimate) for the top 2‑3 AI
mid performer
The model demonstrates exceptional enterprise AI maturity, particularly with the inclusion of a pragmatic 'CEO Timeline Compromise', specific RBAC and data fragmentation concerns, and highly actionable, quantifiable metrics.
**Readiness Assessment: 2/10 — NOT PRODUCTION READY**

Current state presents critical gaps in governance, data integrity, and validation. The prototype demonstrates technical feasibility but lacks the infrastructure required for enterprise-scale deployment. "Works in demos" indicates cherry-picked performance, not robust operation across heterogeneous document sources.

---

### Key Risks

**1. Hallucination & Accuracy Debt**
Without a formal evaluation set, the false positive rate is unknown. Internal chatbots carry high liability stakes (HR policies, security procedures, financial guidelines). Employees treat internal AI as authoritative; a confident wrong answer about vacation policy or data handling creates legal and operational exposure.

**2. Knowledge Fragmentation**
Google Drive + Notion + legacy PDFs = conflicting source of truth. The bot will surface outdated PDF policies over current Notion pages, or vice versa, creating institutional confusion and "shadow" processes.

**3. Orphaned System Risk**
No owner post-launch guarantees model drift, stale documentation indices, and unpatched vulnerabilities. This becomes technical debt within 30 days.

**4. Access Control Violations**
Likely ingesting docs without RBAC filtering. Risk of exposing restricted compensation data, termination lists, or strategic plans to unauthorized staff.

**5. Support Avalanche**
All-staff rollout without fallback mechanisms will flood IT/Operations with "why did it say this?" tickets, negating any efficiency gains.

---

### Must Fix Before Any Rollout (Non-Negotiable)

| Category | Requirement | Timeline |
|----------|-------------|----------|
| **Governance** | Assign permanent Product Owner (not just builder) with budget and headcount for maintenance | Week 1 |
| **Valid
Methodology
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
LLM judge
A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
Heuristics
Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
Calibrated ceiling
Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.
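As a rough illustration of how the pieces described above could fit together, the sketch below averages 0-10 trait scores from a judge, rescales them to 0-100, and subtracts a penalty for each heuristic flag. Everything specific here is an assumption made for the example: the function names, the trait names, the flag labels, the `min_chars` threshold, and the flat per-flag penalty are all hypothetical, and the page does not describe how heuristics actually affect the final score.

```python
from statistics import mean


def run_heuristics(output, required_sections, banned_phrases, min_chars=200):
    """Deterministic checks on one model output; returns a list of flag strings.

    Covers three of the check types the page mentions (length, required
    sections, banned phrases). Thresholds and flag labels are hypothetical.
    """
    flags = []
    if len(output) < min_chars:
        flags.append("too_short")
    for section in required_sections:
        if section not in output:
            flags.append(f"missing_section:{section}")
    lowered = output.lower()
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")
    return flags


def normalized_score(trait_scores, flags, penalty_per_flag=5.0):
    """Turn 0-10 judge trait scores into a 0-100 score.

    Assumption: traits are averaged with equal weight and each heuristic flag
    costs a flat penalty. The real weighting is not documented on the page.
    """
    if not trait_scores:
        raise ValueError("at least one trait score is required")
    for trait, value in trait_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"trait {trait!r} out of range: {value}")
    base = mean(trait_scores.values()) * 10.0
    return max(0.0, min(100.0, base - penalty_per_flag * len(flags)))


# Hypothetical judge result and output for one test case.
judge_traits = {"specificity": 8, "risk_awareness": 9, "actionability": 7}
flags = run_heuristics(
    "## Risks\nShort answer.",
    required_sections=["## Risks", "## Metrics"],
    banned_phrases=["game-changer"],
    min_chars=50,
)
print(flags, normalized_score(judge_traits, flags))
```

Keeping the heuristic checks separate from the judge score makes each flag auditable on its own, which fits the page's practice of reporting "Frequent problems" alongside scores.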