Business · 8 tasks · 94 models

Fastest AI models for Knowledge & Docs

Name: Knowledge & Docs AI model benchmark
Creator: Spring Prompt

Which models write documentation that is accurate to the real product — no invented buttons, menus, or API params — and safely sequenced?

Top models

OpenAI gpt-5.4-max

Anthropic claude-opus-4.8-low

OpenAI gpt-5.4-mini

The fastest capable model for Knowledge & Docs is gemini-3.1-flash-lite, at about 12.8s per run.

Best overall Excellent

gpt-5.4-max

Top score — excellent

98.5 score · $0.0249/run · 27.2s

Best value Excellent

gpt-5.4-mini-medium

Clears the quality bar at $0.015/run

95.5 score · $0.0147/run · 19.2s

Fastest usable ★ Usable

gemini-3.1-flash-lite

~13s per run, still strong

77.0 score · $0.0157/run · 12.8s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
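
A minimal sketch of how "high and to the left" can be read off the numbers, assuming best value means no other model is both at least as good and at least as cheap (a Pareto frontier). The six rows are copied from the ranking on this page; the `Run` type and the `pareto_frontier` helper are illustrative names, not part of Spring Prompt.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    score: float  # 0-100 benchmark score
    cost: float   # USD per run


# A handful of rows copied from the full ranking on this page.
RUNS = [
    Run("gpt-5.4-max", 98.5, 0.0249),
    Run("claude-opus-4.8-low", 98.5, 0.0341),
    Run("gpt-5.4-mini", 97.8, 0.0153),
    Run("gpt-5.4-mini-medium", 95.5, 0.0147),
    Run("gemini-3.1-flash-lite", 77.0, 0.0157),
    Run("claude-opus-4.8-max", 84.0, 0.0651),
]


def pareto_frontier(runs):
    """Keep runs that no other run beats on score and cost at once."""
    frontier = []
    for r in runs:
        dominated = any(
            o.score >= r.score
            and o.cost <= r.cost
            and (o.score > r.score or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the list reads left to right on the chart.
    return sorted(frontier, key=lambda r: r.cost)


for run in pareto_frontier(RUNS):
    print(f"{run.model}: {run.score} at ${run.cost:.4f}/run")
```

On these six rows the frontier is gpt-5.4-mini-medium, gpt-5.4-mini, and gpt-5.4-max. claude-opus-4.8-low drops out because gpt-5.4-max matches its score at a lower cost per run.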

Full ranking

Sorted by speed, fastest first. Models scoring below 70 are listed after the rest, also by speed.

#	Model	Score	Cost/run	Speed	Best for
1	gemini-3.1-flash-lite	77.0 Usable	$0.0157	12.8s	Strong drafts
2	grok-4.20-beta	71.5 Usable	$0.0166	14.1s	Needs review
3	gpt-5.4-mini	97.8 Excellent	$0.0153	14.2s	Best overall
4	gpt-5.4-low	93.0 Excellent	$0.0213	15.5s	Best overall
5	claude-haiku-4.5	73.1 Usable	$0.0184	16.0s	Needs review
6	gpt-5.5-low	96.1 Excellent	$0.0292	17.4s	Best overall
7	claude-opus-4.8-high	92.9 Excellent	$0.0297	18.1s	Best overall
8	gpt-5.4-high	97.5 Excellent	$0.0236	18.8s	Best overall
9	gpt-5.5-high	96.5 Excellent	$0.0297	18.8s	Best overall
10	gpt-5.4-mini-max	95.6 Excellent	$0.0149	19.2s	Best overall
11	gpt-5.4-mini-medium	95.5 Excellent	$0.0147	19.2s	Best overall
12	gpt-5.4	88.8 Strong	$0.0243	19.3s	Best overall
13	gemini-3-flash-preview	72.6 Usable	$0.0231	19.8s	Needs review
14	gpt-5-mini	80.9 Strong	$0.0179	20.8s	Strong drafts
15	gemini-3.5-flash-low	80.1 Strong	$0.0269	20.8s	Strong drafts
16	claude-opus-4.8-low	98.5 Excellent	$0.0341	20.9s	Best overall
17	grok-4.20-medium	76.4 Usable	$0.0159	21.4s	Strong drafts
18	mistral-medium-3.1-max	70.6 Usable	$0.0159	21.5s	Needs review
19	gemini-3.5-flash-high	80.0 Strong	$0.0290	21.7s	Strong drafts
20	gpt-5.5	96.5 Excellent	$0.0299	21.9s	Best overall
21	deepseek-v3.2-low	70.0 Usable	$0.0158	22.9s	Needs review
22	gemini-3.1-flash-lite-max	83.6 Strong	$0.0184	22.9s	Strong drafts
23	gemini-3.1-flash-lite-medium	75.9 Usable	$0.0177	23.4s	Strong drafts
24	gemini-3-flash-preview-max	79.6 Usable	$0.0156	23.4s	Strong drafts
25	gemini-3.1-pro-preview-low	78.0 Usable	$0.0296	23.7s	Strong drafts
26	claude-sonnet-4.5-low	85.0 Strong	$0.0253	24.5s	Best overall
27	claude-opus-4.8-medium	93.4 Excellent	$0.0304	24.8s	Best overall
28	claude-haiku-4.5-max	76.4 Usable	$0.0190	25.0s	Strong drafts
29	mistral-medium-3.1-medium	73.2 Usable	$0.0199	25.1s	Needs review
30	claude-sonnet-4.5	81.1 Strong	$0.0234	25.3s	Strong drafts
31	gemini-3-flash-preview-medium	71.5 Usable	$0.0205	25.4s	Needs review
32	claude-sonnet-4.5-high	79.2 Usable	$0.0268	25.7s	Strong drafts
33	gpt-5.4-medium	96.1 Excellent	$0.0216	26.0s	Best overall
34	gpt-5.5-max	96.1 Excellent	$0.0328	26.3s	Best overall
35	gpt-5.5-medium	94.1 Excellent	$0.0319	26.5s	Best overall
36	gpt-5.4-max	98.5 Excellent	$0.0249	27.2s	Best overall
37	gemini-3.1-pro-preview	85.5 Strong	$0.0331	27.6s	Best overall
38	claude-sonnet-4.5-max	84.8 Strong	$0.0225	28.2s	Strong drafts
39	kimi-k2.7-code-max	85.8 Strong	$0.0179	28.5s	Best overall
40	gemini-3.1-pro-preview-high	77.9 Usable	$0.0297	28.5s	Strong drafts
41	deepseek-v3.2-high	76.2 Usable	$0.0158	28.9s	Strong drafts
42	gemini-3.5-flash-max	72.4 Usable	$0.0297	29.1s	Needs review
43	kimi-k2.7-code-medium	91.0 Excellent	$0.0187	29.3s	Best overall
44	gemini-3.5-flash-medium	79.1 Usable	$0.0301	30.0s	Strong drafts
45	deepseek-v3.1-terminus-medium	79.6 Usable	$0.0175	30.4s	Strong drafts
46	claude-sonnet-4.6-low	76.4 Usable	$0.0317	30.4s	Strong drafts
47	deepseek-v3.2	78.2 Usable	$0.0196	31.0s	Strong drafts
48	gemini-3.1-pro-preview-medium	86.1 Strong	$0.0256	32.5s	Best overall
49	claude-sonnet-4.6-high	81.2 Strong	$0.0327	32.6s	Strong drafts
50	claude-sonnet-4.5-medium	73.2 Usable	$0.0264	32.8s	Needs review
51	gemini-3.1-pro-preview-max	87.0 Strong	$0.0285	34.9s	Best overall
52	claude-opus-4.5-high	80.0 Strong	$0.0418	35.4s	Strong drafts
53	kimi-k2.7-code	80.0 Strong	$0.0212	36.2s	Strong drafts
54	deepseek-v3.2-max	78.1 Usable	$0.0183	36.3s	Strong drafts
55	claude-sonnet-4.6-medium	82.1 Strong	$0.0308	37.0s	Strong drafts
56	claude-opus-4.6-max	71.2 Usable	$0.0412	37.4s	Needs review
57	claude-opus-4.6-low	70.6 Usable	$0.0464	38.0s	Needs review
58	claude-sonnet-4.6-max	79.8 Usable	$0.0337	38.5s	Strong drafts
59	gpt-5-mini-medium	71.9 Usable	$0.0191	38.9s	Needs review
60	claude-opus-4.6-medium	77.5 Usable	$0.0355	39.3s	Strong drafts
61	qwen3.7-max-low	81.8 Strong	$0.0229	41.4s	Strong drafts
62	kimi-k2.5	72.4 Usable	$0.0153	43.3s	Needs review
63	claude-opus-4.8-max	84.0 Strong	$0.0651	43.5s	Strong drafts
64	qwen3.7-max	79.1 Usable	$0.0250	47.1s	Strong drafts
65	qwen3.7-max-high	78.1 Usable	$0.0262	47.4s	Strong drafts
66	qwen3.5-plus-02-15	76.5 Usable	$0.0185	48.7s	Strong drafts
67	qwen3.7-max-medium	78.0 Usable	$0.0215	49.8s	Strong drafts
68	qwen3.7-max-max	83.1 Strong	$0.0213	49.9s	Strong drafts
69	kimi-k2.5-medium	78.2 Usable	$0.0200	54.7s	Strong drafts
70	glm-5-medium	85.6 Strong	$0.0180	56.0s	Best overall
71	qwen3.5-plus-02-15-medium	79.4 Usable	$0.0201	60.7s	Strong drafts
72	qwen3.5-plus-02-15-max	79.0 Usable	$0.0208	61.7s	Strong drafts
73	glm-5-max	73.5 Usable	$0.0200	63.5s	Needs review
74	glm-5	80.9 Strong	$0.0191	66.1s	Strong drafts
75	kimi-k2.5-max	74.6 Usable	$0.0240	66.6s	Needs review
76	gpt-5-mini-max	78.9 Usable	$0.0254	67.7s	Strong drafts
77	mistral-medium-3.1	66.9 Needs editing	$0.0202	17.5s	Needs review
78	grok-4.20	60.9 Needs editing	$0.0237	18.2s	Needs review
79	claude-haiku-4.5-medium	69.5 Needs editing	$0.0214	24.8s	Needs review
80	grok-4.20-beta-max	61.6 Needs editing	$0.0245	25.4s	Needs review
81	grok-4.20-beta-medium	66.8 Needs editing	$0.0227	26.5s	Needs review
82	grok-4.20-max	65.8 Needs editing	$0.0229	26.6s	Needs review
83	deepseek-v3.1-terminus	63.8 Needs editing	$0.0210	26.9s	Needs review
84	claude-opus-4.5-low	57.4 Weak	$0.0398	29.2s	Needs review
85	claude-opus-4.5	67.9 Needs editing	$0.0357	29.7s	Needs review
86	minimax-m2.7	52.1 Weak	$0.0196	30.9s	Needs review
87	deepseek-v3.2-medium	64.1 Needs editing	$0.0206	35.1s	Needs review
88	claude-opus-4.6-high	68.0 Needs editing	$0.0399	35.9s	Needs review
89	claude-opus-4.5-medium	66.2 Needs editing	$0.0392	37.5s	Needs review
90	claude-opus-4.5-max	66.9 Needs editing	$0.0424	37.8s	Needs review
91	deepseek-v3.1-terminus-max	64.9 Needs editing	$0.0208	38.8s	Needs review
92	claude-opus-4.6	63.2 Needs editing	$0.0438	39.1s	Needs review
93	minimax-m2.7-medium	38.9 Failed	$0.0259	61.0s	Needs review
94	minimax-m2.7-max	51.4 Weak	$0.0263	77.5s	Needs review
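
The tier labels in the Score column appear to follow fixed score bands. The sketch below assumes cut-offs at 90, 80, 70, 60, and 50; these are inferred from the rows above (for example 91.0 is Excellent, 88.8 is Strong, 70.0 is Usable, 69.5 is Needs editing) and are not published thresholds. The `tier` function name is illustrative.

```python
# Assumed cut-offs, inferred from the ranking table rather than documented.
ASSUMED_BANDS = [
    (90.0, "Excellent"),
    (80.0, "Strong"),
    (70.0, "Usable"),
    (60.0, "Needs editing"),
    (50.0, "Weak"),
]


def tier(score: float) -> str:
    """Map a 0-100 score to a tier label using the assumed bands."""
    for cutoff, label in ASSUMED_BANDS:
        if score >= cutoff:
            return label
    return "Failed"


# Spot checks against rows in the table.
for model, score in [
    ("gpt-5.4-max", 98.5),
    ("gpt-5.4", 88.8),
    ("deepseek-v3.2-low", 70.0),
    ("claude-haiku-4.5-medium", 69.5),
    ("minimax-m2.7-medium", 38.9),
]:
    print(model, score, tier(score))
```

The table does not pin down the exact boundaries: any cut-off between 88.8 and 91.0 would reproduce the Excellent/Strong split seen here, and the same holds for the lower bands.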

What separates the top models

Procedure / Runbook

Difficulty: medium

Tests a numbered procedure with prerequisites, one action per step, and a warning before any destructive action.

Leader: claude-opus-4.8-medium

Help-Center Article

Difficulty: medium

Tests a task-focused, scannable support article that answers the user's question and includes troubleshooting.

Leader: qwen3.7-max-max

Grounded Docs (No Fabrication)

Difficulty: hard

Tests writing docs from limited product facts without inventing UI, steps, or behavior — including flagging an unsupported request.

Leader: qwen3.7-max-max

Right Doc Type

Difficulty: medium

Tests choosing and writing the correct Diátaxis doc type for a need without mixing types.

Leader: kimi-k2.7-code-max

See the difference

Real model outputs and how our judge scored them.

Top score · qwen3.5-plus-02-15-medium · Explanation, not steps

100/100

“The model perfectly executed the requested Diátaxis document type (Explanation). It focused entirely on the conceptual reasons and trade-offs for the system's behavior without slipping into a how-to or reference format. The language is clear, professional, and well-structured.”

# Conceptual Overview: Duplicate Invoice Payment Holds in Northwind

## Purpose of the Behavior
In the Northwind system, the automatic hold placed on an invoice when a duplicate is flagged is not an error state; it is a deliberate financial control mechanism. This behavior is designed to interrupt the payment workflow before funds are disbursed, requiring human validation to proceed.

## The Rationale: Why the System Holds Payment
The primary driver for this logic is **risk mitigation**. In accounts payable operations, the cost of recovering an overpayment significantly exceeds the cost of delaying a legitimate payment. The system enforces this hold for three core reasons:

1. **Prevention of Double Disbursement**
The most immediate risk is paying the same liability twice. This often occurs when a vendor resends an invoice due to perceived non-payment, or when internal teams upload the same document separately. The hold ensures that the general ledger is not impacted by duplicate expenses.

2. **Internal Control Compliance**
Many organizations must adhere to audit standards (such as SOX or ISO) that require segregation of duties and verification steps. An automated payment on a flagged duplicate bypasses verification. By forcing a hold, the system mandates a review event, creating an audit trail that demonstrates due diligence.

3. **Cash Flow Preservation**
Even if a duplicate payment is eventually recoverable, the temporary loss of cash impacts liquidity. The hold protects working capital by ensuring funds leave the account only once per unique liability.

## The Trade-Offs
Implementing strict duplicate detection involves balancing security against operational efficiency. Understanding these trade-offs helps in configuring tolerance levels and managing

Mid-pack · gpt-5.5-low · SOP: handle a customer data-deletion request (Cedar & Sage)

98/100

“The model perfectly follows the instructions, providing a clear, sequential SOP with prerequisites and roles up front. It correctly places a prominent warning before the irreversible deletion step and ensures identity verification happens first. It avoids fabricating any specific UI elements, keeping the system references appropriately generic based on the provided facts.”

# SOP: Processing Customer Account-Deletion Requests

## Purpose
To ensure Cedar & Sage support agents process customer account-deletion requests securely, consistently, and with clear written confirmation before irreversible deletion.

## Owner / Role
**Owner:** Customer Support Team
**Responsible Role:** Cedar & Sage Support Agent
**Escalation:** Support Lead or Privacy/Compliance contact, if identity cannot be verified or the request is unclear.

## Prerequisites
Before processing an account-deletion request, the support agent must have:

1. An active customer support ticket or written request from the customer.
2. Access to Cedar & Sage’s customer support and account management systems.
3. The customer’s account information available for verification.
4. Authority to process account-deletion requests or approval from the appropriate internal owner.
5. A standard written confirmation template for account deletion.

---

## Procedure

### 1. Receive the Account-Deletion Request
1.1. Review the customer’s request to confirm they are asking to delete their Cedar & Sage account.
1.2. Ensure the request is documented in the support ticketing system.
1.3. If the request is unclear, ask the customer to clarify before proceeding.

### 2. Verify the Customer’s Identity
2.1. Confirm that the request came from the email address associated with the Cedar & Sage account.
2.2. Verify any required account details according to Cedar & Sage’s identity-verification process.
2.3. If identity cannot be verified, do **not** proceed with deletion.
2.4. Inform the customer that Cedar & Sage must verify their identity before processing the request.

### 3. Confirm the Deletion Request in Writing
3.1. Send the customer a written confirmation message stating that account deletio

Lowest score · minimax-m2.7-medium · Explanation, not steps

0/100

“The model completely failed the prompt. Instead of writing the requested explanation document, it output a stream-of-consciousness planning script that quickly devolved into a severe, infinite generation loop repeating the same sentence structure hundreds of times. It scores a 0 across all traits due to this catastrophic failure.”

The user says: "A user need: 'I don't need steps — I want to understand WHY Northwind holds an invoice from payment when a duplicate is flagged, and what the trade-offs are.' Write the RIGHT doc type (an explanation), and do not turn it into a step-by-step how-to."

Thus they want an explanation document about the Northwind database, specifically why the system holds an invoice from payment when a duplicate is flagged, and what the trade-offs are. They want the explanation, not a step-by-step. So we need to write an explanation that covers the rationale, underlying architecture, logic, business rules, implications, trade-offs, pros and cons, and perhaps some context about duplicate detection in the Northwind dataset. Also consider the context: "Northwind holds an invoice from payment when a duplicate is flagged" maybe referencing the Northwind database example used in Microsoft Access or SQL Server training. Possibly a scenario: In the Northwind sample database, the Invoices table includes a duplicate detection mechanism that holds an invoice from payment when a duplicate is flagged. The user wants to understand why that happens and the trade-offs.

Thus we need to explain the concept: Invoices are held because they are flagged as duplicates, perhaps to avoid double payment, maintain data integrity, ensure audit trail, avoid overpayment, and allow manual review. Trade-offs: delays in payment processing, extra administrative overhead, potential for false positives, impact on cash flow, possible need for human intervention, complexity in workflow, etc. Also discuss alternatives: automatic payment vs. hold, etc.

We need to produce a high-quality explanation that addresses the "why" and the "trade-offs". No step-by-step instructions. Just an explanation. Possibly include s

Where models still fail

The most common problems we flagged across all models.

246 invented behavior · 240 fabricated ui · 31 fabricated api or param · 9 feature focused · 6 wall of text · 6 no destructive warning · 3 multiple actions per step · 2 fabrication

Frequently asked

What is the best AI model for knowledge & docs?

In our benchmarks, gpt-5.4-max ranks first for knowledge & docs with a score of 98.5 (Excellent) across 8 test cases, tied on score with claude-opus-4.8-low.

What is the cheapest good model for knowledge & docs?

gpt-5.4-mini-medium is the best value: it clears our quality bar for knowledge & docs at $0.015 per run.

Which model is fastest for knowledge & docs?

gemini-3.1-flash-lite is the fastest model that still clears the Usable tier for knowledge & docs: about 12.8s per run with a score of 77.0.

How we test

Each model output is scored by an LLM judge constrained to strict JSON output, supported by deterministic heuristics; the result is then normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 768 model runs across 4 benchmarks · last tested 2026-06-30
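
A rough sketch of what such a pipeline could look like, under loud assumptions: the trait names, the 0-5 scale, the 20-point penalty, and every function name below are hypothetical, since this page does not publish the judge's schema or weights. It shows the three stages named above: strict parsing of the judge's JSON, one deterministic heuristic (a warning must appear before a destructive action, one of the failure flags listed on this page), and normalization to 0-100.

```python
import json

# Hypothetical trait names and scale; the real judge schema is not published.
TRAITS = ("accuracy", "sequencing", "doc_type")
MAX_PER_TRAIT = 5


def parse_judge(raw: str) -> dict:
    """Strictly parse judge output: valid JSON, exact keys, ints in range."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or set(data) != set(TRAITS):
        raise ValueError("judge output must contain exactly the expected traits")
    for name in TRAITS:
        value = data[name]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= MAX_PER_TRAIT:
            raise ValueError(f"trait {name!r} must be an integer from 0 to {MAX_PER_TRAIT}")
    return data


def has_warning_before_delete(text: str) -> bool:
    """Deterministic heuristic: 'warning' must precede the first 'delete'."""
    lower = text.lower()
    delete_at = lower.find("delete")
    if delete_at == -1:
        return True  # nothing destructive mentioned, nothing to warn about
    warning_at = lower.find("warning")
    return warning_at != -1 and warning_at < delete_at


def final_score(raw_judge: str, output_text: str) -> float:
    """Combine judge traits and the heuristic, then clamp to 0-100."""
    scores = parse_judge(raw_judge)
    normalized = 100 * sum(scores.values()) / (MAX_PER_TRAIT * len(TRAITS))
    if not has_warning_before_delete(output_text):
        normalized -= 20  # assumed penalty, chosen only for illustration
    return max(0.0, min(100.0, normalized))
```

Rejecting anything that is not exactly the expected JSON keeps a malformed judge reply from silently turning into a score, which is one plausible reading of "strict" here.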

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s