Business · 8 tasks · 94 models
Fastest AI models for Knowledge & Docs
Which models write documentation that is accurate to the real product — no invented buttons, menus, or API params — and safely sequenced?
The fastest capable model for Knowledge & Docs is gemini-3.1-flash-lite, at about 12.8s per run.
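- Best overall: gpt-5.4-max. Top score, excellent.
- Best value: gpt-5.4-mini-medium. Clears the quality bar at $0.015/run.
- Fastest: gemini-3.1-flash-lite. ~13s per run, still strong.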
Quality vs. cost
Every model placed by what it delivers and what it costs. The best value sits high and to the left.
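One way to read "high and to the left" is as a Pareto frontier: a model is on it when no cheaper model scores higher. The sketch below is an illustration under that assumption, not the chart's actual logic. The `Row` type and `pareto_frontier` helper are hypothetical names, and the six rows are copied from the ranking on this page.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # 0-100 quality score
    cost: float   # USD per run


# A handful of rows copied from the ranking on this page.
ROWS = [
    Row("gpt-5.4-mini-medium", 95.5, 0.0147),
    Row("gpt-5.4-mini-max", 95.6, 0.0149),
    Row("gpt-5.4-mini", 97.8, 0.0153),
    Row("gemini-3.1-flash-lite", 77.0, 0.0157),
    Row("gpt-5.4-max", 98.5, 0.0249),
    Row("claude-opus-4.8-low", 98.5, 0.0341),
]


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Keep rows that no cheaper row beats or matches on score."""
    frontier: list[Row] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, try the higher score first.
    for row in sorted(rows, key=lambda r: (r.cost, -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


for row in pareto_frontier(ROWS):
    print(f"{row.model}: {row.score} at ${row.cost:.4f}")
```

In this sample, gemini-3.1-flash-lite and claude-opus-4.8-low drop out because a cheaper model already scores at least as high. Speed is not part of this comparison.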
Full ranking
| # | Model | Score (0-100) and rating | Cost/run | Speed (per run) | Best for |
|---|---|---|---|---|---|
| 1 | gemini-3.1-flash-lite | 77.0 Usable | $0.0157 | 12.8s | Strong drafts |
| 2 | grok-4.20-beta | 71.5 Usable | $0.0166 | 14.1s | Needs review |
| 3 | gpt-5.4-mini | 97.8 Excellent | $0.0153 | 14.2s | Best overall |
| 4 | gpt-5.4-low | 93.0 Excellent | $0.0213 | 15.5s | Best overall |
| 5 | claude-haiku-4.5 | 73.1 Usable | $0.0184 | 16.0s | Needs review |
| 6 | gpt-5.5-low | 96.1 Excellent | $0.0292 | 17.4s | Best overall |
| 7 | claude-opus-4.8-high | 92.9 Excellent | $0.0297 | 18.1s | Best overall |
| 8 | gpt-5.4-high | 97.5 Excellent | $0.0236 | 18.8s | Best overall |
| 9 | gpt-5.5-high | 96.5 Excellent | $0.0297 | 18.8s | Best overall |
| 10 | gpt-5.4-mini-max | 95.6 Excellent | $0.0149 | 19.2s | Best overall |
| 11 | gpt-5.4-mini-medium | 95.5 Excellent | $0.0147 | 19.2s | Best overall |
| 12 | gpt-5.4 | 88.8 Strong | $0.0243 | 19.3s | Best overall |
| 13 | gemini-3-flash-preview | 72.6 Usable | $0.0231 | 19.8s | Needs review |
| 14 | gpt-5-mini | 80.9 Strong | $0.0179 | 20.8s | Strong drafts |
| 15 | gemini-3.5-flash-low | 80.1 Strong | $0.0269 | 20.8s | Strong drafts |
| 16 | claude-opus-4.8-low | 98.5 Excellent | $0.0341 | 20.9s | Best overall |
| 17 | grok-4.20-medium | 76.4 Usable | $0.0159 | 21.4s | Strong drafts |
| 18 | mistral-medium-3.1-max | 70.6 Usable | $0.0159 | 21.5s | Needs review |
| 19 | gemini-3.5-flash-high | 80.0 Strong | $0.0290 | 21.7s | Strong drafts |
| 20 | gpt-5.5 | 96.5 Excellent | $0.0299 | 21.9s | Best overall |
| 21 | deepseek-v3.2-low | 70.0 Usable | $0.0158 | 22.9s | Needs review |
| 22 | gemini-3.1-flash-lite-max | 83.6 Strong | $0.0184 | 22.9s | Strong drafts |
| 23 | gemini-3.1-flash-lite-medium | 75.9 Usable | $0.0177 | 23.4s | Strong drafts |
| 24 | gemini-3-flash-preview-max | 79.6 Usable | $0.0156 | 23.4s | Strong drafts |
| 25 | gemini-3.1-pro-preview-low | 78.0 Usable | $0.0296 | 23.7s | Strong drafts |
| 26 | claude-sonnet-4.5-low | 85.0 Strong | $0.0253 | 24.5s | Best overall |
| 27 | claude-opus-4.8-medium | 93.4 Excellent | $0.0304 | 24.8s | Best overall |
| 28 | claude-haiku-4.5-max | 76.4 Usable | $0.0190 | 25.0s | Strong drafts |
| 29 | mistral-medium-3.1-medium | 73.2 Usable | $0.0199 | 25.1s | Needs review |
| 30 | claude-sonnet-4.5 | 81.1 Strong | $0.0234 | 25.3s | Strong drafts |
| 31 | gemini-3-flash-preview-medium | 71.5 Usable | $0.0205 | 25.4s | Needs review |
| 32 | claude-sonnet-4.5-high | 79.2 Usable | $0.0268 | 25.7s | Strong drafts |
| 33 | gpt-5.4-medium | 96.1 Excellent | $0.0216 | 26.0s | Best overall |
| 34 | gpt-5.5-max | 96.1 Excellent | $0.0328 | 26.3s | Best overall |
| 35 | gpt-5.5-medium | 94.1 Excellent | $0.0319 | 26.5s | Best overall |
| 36 | gpt-5.4-max | 98.5 Excellent | $0.0249 | 27.2s | Best overall |
| 37 | gemini-3.1-pro-preview | 85.5 Strong | $0.0331 | 27.6s | Best overall |
| 38 | claude-sonnet-4.5-max | 84.8 Strong | $0.0225 | 28.2s | Strong drafts |
| 39 | kimi-k2.7-code-max | 85.8 Strong | $0.0179 | 28.5s | Best overall |
| 40 | gemini-3.1-pro-preview-high | 77.9 Usable | $0.0297 | 28.5s | Strong drafts |
| 41 | deepseek-v3.2-high | 76.2 Usable | $0.0158 | 28.9s | Strong drafts |
| 42 | gemini-3.5-flash-max | 72.4 Usable | $0.0297 | 29.1s | Needs review |
| 43 | kimi-k2.7-code-medium | 91.0 Excellent | $0.0187 | 29.3s | Best overall |
| 44 | gemini-3.5-flash-medium | 79.1 Usable | $0.0301 | 30.0s | Strong drafts |
| 45 | deepseek-v3.1-terminus-medium | 79.6 Usable | $0.0175 | 30.4s | Strong drafts |
| 46 | claude-sonnet-4.6-low | 76.4 Usable | $0.0317 | 30.4s | Strong drafts |
| 47 | deepseek-v3.2 | 78.2 Usable | $0.0196 | 31.0s | Strong drafts |
| 48 | gemini-3.1-pro-preview-medium | 86.1 Strong | $0.0256 | 32.5s | Best overall |
| 49 | claude-sonnet-4.6-high | 81.2 Strong | $0.0327 | 32.6s | Strong drafts |
| 50 | claude-sonnet-4.5-medium | 73.2 Usable | $0.0264 | 32.8s | Needs review |
| 51 | gemini-3.1-pro-preview-max | 87.0 Strong | $0.0285 | 34.9s | Best overall |
| 52 | claude-opus-4.5-high | 80.0 Strong | $0.0418 | 35.4s | Strong drafts |
| 53 | kimi-k2.7-code | 80.0 Strong | $0.0212 | 36.2s | Strong drafts |
| 54 | deepseek-v3.2-max | 78.1 Usable | $0.0183 | 36.3s | Strong drafts |
| 55 | claude-sonnet-4.6-medium | 82.1 Strong | $0.0308 | 37.0s | Strong drafts |
| 56 | claude-opus-4.6-max | 71.2 Usable | $0.0412 | 37.4s | Needs review |
| 57 | claude-opus-4.6-low | 70.6 Usable | $0.0464 | 38.0s | Needs review |
| 58 | claude-sonnet-4.6-max | 79.8 Usable | $0.0337 | 38.5s | Strong drafts |
| 59 | gpt-5-mini-medium | 71.9 Usable | $0.0191 | 38.9s | Needs review |
| 60 | claude-opus-4.6-medium | 77.5 Usable | $0.0355 | 39.3s | Strong drafts |
| 61 | qwen3.7-max-low | 81.8 Strong | $0.0229 | 41.4s | Strong drafts |
| 62 | kimi-k2.5 | 72.4 Usable | $0.0153 | 43.3s | Needs review |
| 63 | claude-opus-4.8-max | 84.0 Strong | $0.0651 | 43.5s | Strong drafts |
| 64 | qwen3.7-max | 79.1 Usable | $0.0250 | 47.1s | Strong drafts |
| 65 | qwen3.7-max-high | 78.1 Usable | $0.0262 | 47.4s | Strong drafts |
| 66 | qwen3.5-plus-02-15 | 76.5 Usable | $0.0185 | 48.7s | Strong drafts |
| 67 | qwen3.7-max-medium | 78.0 Usable | $0.0215 | 49.8s | Strong drafts |
| 68 | qwen3.7-max-max | 83.1 Strong | $0.0213 | 49.9s | Strong drafts |
| 69 | kimi-k2.5-medium | 78.2 Usable | $0.0200 | 54.7s | Strong drafts |
| 70 | glm-5-medium | 85.6 Strong | $0.0180 | 56.0s | Best overall |
| 71 | qwen3.5-plus-02-15-medium | 79.4 Usable | $0.0201 | 60.7s | Strong drafts |
| 72 | qwen3.5-plus-02-15-max | 79.0 Usable | $0.0208 | 61.7s | Strong drafts |
| 73 | glm-5-max | 73.5 Usable | $0.0200 | 63.5s | Needs review |
| 74 | glm-5 | 80.9 Strong | $0.0191 | 66.1s | Strong drafts |
| 75 | kimi-k2.5-max | 74.6 Usable | $0.0240 | 66.6s | Needs review |
| 76 | gpt-5-mini-max | 78.9 Usable | $0.0254 | 67.7s | Strong drafts |
| 77 | mistral-medium-3.1 | 66.9 Needs editing | $0.0202 | 17.5s | Needs review |
| 78 | grok-4.20 | 60.9 Needs editing | $0.0237 | 18.2s | Needs review |
| 79 | claude-haiku-4.5-medium | 69.5 Needs editing | $0.0214 | 24.8s | Needs review |
| 80 | grok-4.20-beta-max | 61.6 Needs editing | $0.0245 | 25.4s | Needs review |
| 81 | grok-4.20-beta-medium | 66.8 Needs editing | $0.0227 | 26.5s | Needs review |
| 82 | grok-4.20-max | 65.8 Needs editing | $0.0229 | 26.6s | Needs review |
| 83 | deepseek-v3.1-terminus | 63.8 Needs editing | $0.0210 | 26.9s | Needs review |
| 84 | claude-opus-4.5-low | 57.4 Weak | $0.0398 | 29.2s | Needs review |
| 85 | claude-opus-4.5 | 67.9 Needs editing | $0.0357 | 29.7s | Needs review |
| 86 | minimax-m2.7 | 52.1 Weak | $0.0196 | 30.9s | Needs review |
| 87 | deepseek-v3.2-medium | 64.1 Needs editing | $0.0206 | 35.1s | Needs review |
| 88 | claude-opus-4.6-high | 68.0 Needs editing | $0.0399 | 35.9s | Needs review |
| 89 | claude-opus-4.5-medium | 66.2 Needs editing | $0.0392 | 37.5s | Needs review |
| 90 | claude-opus-4.5-max | 66.9 Needs editing | $0.0424 | 37.8s | Needs review |
| 91 | deepseek-v3.1-terminus-max | 64.9 Needs editing | $0.0208 | 38.8s | Needs review |
| 92 | claude-opus-4.6 | 63.2 Needs editing | $0.0438 | 39.1s | Needs review |
| 93 | minimax-m2.7-medium | 38.9 Failed | $0.0259 | 61.0s | Needs review |
| 94 | minimax-m2.7-max | 51.4 Weak | $0.0263 | 77.5s | Needs review |
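The table does not state its cutoffs. The rows shown are consistent with ten-point bands for the rating label and with 85 and 75 as the cutoffs for the "Best for" column, and they appear to be ordered by speed within two groups (scores of 70 and above first, then the rest). The sketch below encodes those bands as assumptions inferred from the rows, not as the site's published rules. No row falls between 38.9 and 51.4, so the Weak/Failed boundary at 50 in particular is a guess.

```python
def rating(score: float) -> str:
    """Map a 0-100 score to a rating label (bands inferred from the table, assumed)."""
    bands = [
        (90, "Excellent"),
        (80, "Strong"),
        (70, "Usable"),
        (60, "Needs editing"),
        (50, "Weak"),  # assumed boundary; the table has no scores between 38.9 and 51.4
    ]
    for floor, label in bands:
        if score >= floor:
            return label
    return "Failed"


def best_for(score: float) -> str:
    """Map a 0-100 score to the 'Best for' label (cutoffs inferred, assumed)."""
    if score >= 85:
        return "Best overall"
    if score >= 75:
        return "Strong drafts"
    return "Needs review"


print(rating(77.0), "/", best_for(77.0))
```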
What separates the top models
Procedure / Runbook
medium · Tests a numbered procedure with prerequisites, one action per step, and a warning before any destructive action.
Leader: claude-opus-4.8-medium
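A deterministic check for the "warning before any destructive action" requirement could look roughly like this. It is a minimal sketch, not the benchmark's actual heuristic: the keyword lists, the step pattern, and the function name `warning_precedes_destructive_step` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword lists; a real check would need a broader vocabulary.
DESTRUCTIVE = re.compile(r"\b(delet|eras|purg)\w*", re.IGNORECASE)
WARNING = re.compile(r"\b(warning|caution)\b", re.IGNORECASE)
STEP = re.compile(r"^\s*\d+[.)]\s+")  # lines such as "3. Do something"


def warning_precedes_destructive_step(doc: str) -> bool:
    """Return True when a warning appears at or before the first destructive step.

    Documents with no destructive numbered step pass trivially.
    """
    warned = False
    for line in doc.splitlines():
        if WARNING.search(line):
            warned = True
        if STEP.match(line) and DESTRUCTIVE.search(line):
            return warned
    return True


good = (
    "Prerequisites: admin access\n"
    "1. Open the account.\n"
    "Warning: this cannot be undone.\n"
    "2. Delete the account."
)
bad = "1. Delete the account.\nWarning: this cannot be undone."
print(warning_precedes_destructive_step(good), warning_precedes_destructive_step(bad))
```

A keyword check like this only catches ordering problems; whether the warning is prominent or accurate still needs a judge or a human reviewer.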
Help-Center Article
medium · Tests a task-focused, scannable support article that answers the user's question and includes troubleshooting.
Leader: qwen3.7-max-max
Grounded Docs (No Fabrication)
hard · Tests writing docs from limited product facts without inventing UI, steps, or behavior, including flagging an unsupported request.
Leader: qwen3.7-max-max
Right Doc Type
medium · Tests choosing and writing the correct Diátaxis doc type for a need without mixing types.
Leader: kimi-k2.7-code-max
See the difference
Real model outputs and how our judge scored them.
Top score
qwen3.5-plus-02-15-medium
Explanation, not steps
100/100
“The model perfectly executed the requested Diátaxis document type (Explanation). It focused entirely on the conceptual reasons and trade-offs for the system's behavior without slipping into a how-to or reference format. The language is clear, professional, and well-structured.”
# Conceptual Overview: Duplicate Invoice Payment Holds in Northwind
## Purpose of the Behavior
In the Northwind system, the automatic hold placed on an invoice when a duplicate is flagged is not an error state; it is a deliberate financial control mechanism. This behavior is designed to interrupt the payment workflow before funds are disbursed, requiring human validation to proceed.
## The Rationale: Why the System Holds Payment
The primary driver for this logic is **risk mitigation**. In accounts payable operations, the cost of recovering an overpayment significantly exceeds the cost of delaying a legitimate payment. The system enforces this hold for three core reasons:
1. **Prevention of Double Disbursement**
The most immediate risk is paying the same liability twice. This often occurs when a vendor resends an invoice due to perceived non-payment, or when internal teams upload the same document separately. The hold ensures that the general ledger is not impacted by duplicate expenses.
2. **Internal Control Compliance**
Many organizations must adhere to audit standards (such as SOX or ISO) that require segregation of duties and verification steps. An automated payment on a flagged duplicate bypasses verification. By forcing a hold, the system mandates a review event, creating an audit trail that demonstrates due diligence.
3. **Cash Flow Preservation**
Even if a duplicate payment is eventually recoverable, the temporary loss of cash impacts liquidity. The hold protects working capital by ensuring funds leave the account only once per unique liability.
## The Trade-Offs
Implementing strict duplicate detection involves balancing security against operational efficiency. Understanding these trade-offs helps in configuring tolerance levels and managing
Mid-pack
gpt-5.5-low
SOP: handle a customer data-deletion request (Cedar & Sage)
98/100
“The model perfectly follows the instructions, providing a clear, sequential SOP with prerequisites and roles up front. It correctly places a prominent warning before the irreversible deletion step and ensures identity verification happens first. It avoids fabricating any specific UI elements, keeping the system references appropriately generic based on the provided facts.”
# SOP: Processing Customer Account-Deletion Requests
## Purpose
To ensure Cedar & Sage support agents process customer account-deletion requests securely, consistently, and with clear written confirmation before irreversible deletion.
## Owner / Role
**Owner:** Customer Support Team
**Responsible Role:** Cedar & Sage Support Agent
**Escalation:** Support Lead or Privacy/Compliance contact, if identity cannot be verified or the request is unclear.
## Prerequisites
Before processing an account-deletion request, the support agent must have:
1. An active customer support ticket or written request from the customer.
2. Access to Cedar & Sage’s customer support and account management systems.
3. The customer’s account information available for verification.
4. Authority to process account-deletion requests or approval from the appropriate internal owner.
5. A standard written confirmation template for account deletion.
---
## Procedure
### 1. Receive the Account-Deletion Request
1.1. Review the customer’s request to confirm they are asking to delete their Cedar & Sage account.
1.2. Ensure the request is documented in the support ticketing system.
1.3. If the request is unclear, ask the customer to clarify before proceeding.
### 2. Verify the Customer’s Identity
2.1. Confirm that the request came from the email address associated with the Cedar & Sage account.
2.2. Verify any required account details according to Cedar & Sage’s identity-verification process.
2.3. If identity cannot be verified, do **not** proceed with deletion.
2.4. Inform the customer that Cedar & Sage must verify their identity before processing the request.
### 3. Confirm the Deletion Request in Writing
3.1. Send the customer a written confirmation message stating that account deletio
Lowest score
minimax-m2.7-medium
Explanation, not steps
0/100
“The model completely failed the prompt. Instead of writing the requested explanation document, it output a stream-of-consciousness planning script that quickly devolved into a severe, infinite generation loop repeating the same sentence structure hundreds of times. It scores a 0 across all traits due to this catastrophic failure.”
The user says: "A user need: 'I don't need steps — I want to understand WHY Northwind holds an invoice from payment when a duplicate is flagged, and what the trade-offs are.' Write the RIGHT doc type (an explanation), and do not turn it into a step-by-step how-to." Thus they want an explanation document about the Northwind database, specifically why the system holds an invoice from payment when a duplicate is flagged, and what the trade-offs are. They want the explanation, not a step-by-step. So we need to write an explanation that covers the rationale, underlying architecture, logic, business rules, implications, trade-offs, pros and cons, and perhaps some context about duplicate detection in the Northwind dataset. Also consider the context: "Northwind holds an invoice from payment when a duplicate is flagged" maybe referencing the Northwind database example used in Microsoft Access or SQL Server training. Possibly a scenario: In the Northwind sample database, the Invoices table includes a duplicate detection mechanism that holds an invoice from payment when a duplicate is flagged. The user wants to understand why that happens and the trade-offs. Thus we need to explain the concept: Invoices are held because they are flagged as duplicates, perhaps to avoid double payment, maintain data integrity, ensure audit trail, avoid overpayment, and allow manual review. Trade-offs: delays in payment processing, extra administrative overhead, potential for false positives, impact on cash flow, possible need for human intervention, complexity in workflow, etc. Also discuss alternatives: automatic payment vs. hold, etc. We need to produce a high-quality explanation that addresses the "why" and the "trade-offs". No step-by-step instructions. Just an explanation. Possibly include s
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for knowledge & docs?
In our benchmarks, gpt-5.4-max ranks first for knowledge & docs with a score of 98.5 (excellent) across 8 test cases, tied on score with claude-opus-4.8-low.
What is the cheapest good model for knowledge & docs?
gpt-5.4-mini-medium is the best value: it clears our quality bar for knowledge & docs at $0.015 per run.
Which model is fastest for knowledge & docs?
gemini-3.1-flash-lite is the fastest model that still performs well for knowledge & docs.
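The three answers above amount to the same filter-then-minimize query over the ranking. The sketch below reproduces them on a few rows copied from the table. The page does not state where its quality bar sits, so the `min_score=70.0` default is an assumption, and `pick` is a hypothetical helper name.

```python
# A few rows copied from the ranking; cost in USD per run, speed in seconds per run.
ROWS = [
    {"model": "gemini-3.1-flash-lite", "score": 77.0, "cost": 0.0157, "speed": 12.8},
    {"model": "grok-4.20-beta", "score": 71.5, "cost": 0.0166, "speed": 14.1},
    {"model": "gpt-5.4-mini", "score": 97.8, "cost": 0.0153, "speed": 14.2},
    {"model": "mistral-medium-3.1", "score": 66.9, "cost": 0.0202, "speed": 17.5},
    {"model": "gpt-5.4-mini-medium", "score": 95.5, "cost": 0.0147, "speed": 19.2},
    {"model": "gpt-5.4-max", "score": 98.5, "cost": 0.0249, "speed": 27.2},
]


def pick(rows, key, min_score=70.0):
    """Return the model with the lowest `key` among rows scoring at least `min_score`.

    The 70-point bar is an assumed threshold, not one the page publishes.
    """
    eligible = [row for row in rows if row["score"] >= min_score]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row[key])["model"]


fastest = pick(ROWS, "speed")
cheapest = pick(ROWS, "cost")
top_scoring = max(ROWS, key=lambda row: row["score"])["model"]
print(fastest, cheapest, top_scoring)
```

Raising `min_score` changes the answers: with a bar of 90, for example, the fastest eligible model in this sample becomes gpt-5.4-mini.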
How we test
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 768 model runs across 4 benchmarks · last tested 2026-06-30
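As a rough illustration of that pipeline, the sketch below parses a judge reply strictly, applies one deterministic heuristic, and normalizes the result to 0-100. Everything specific in it is an assumption: the per-trait 0-5 scale, the trait names, the repetition heuristic and its threshold, and the rule that a flagged output scores zero. The actual judge schema and heuristics are not described on this page.

```python
import json
from collections import Counter

TRAIT_MAX = 5  # hypothetical per-trait scale of 0..5


def parse_judge(raw: str, traits: list[str]) -> dict[str, float]:
    """Strictly parse a judge reply: a JSON object with exactly the expected traits."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(data, dict) or set(data) != set(traits):
        raise ValueError("judge reply must contain exactly the expected traits")
    for name, value in data.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= TRAIT_MAX:
            raise ValueError(f"trait {name!r} is missing a numeric score in range")
    return data


def looks_like_loop(text: str, min_repeats: int = 20) -> bool:
    """Deterministic heuristic: flag empty output or one sentence repeated many times."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return True
    return Counter(sentences).most_common(1)[0][1] >= min_repeats


def normalized_score(raw_judge: str, output_text: str, traits: list[str]) -> float:
    """Combine the heuristic and the judge's trait scores into a 0-100 score."""
    if looks_like_loop(output_text):
        return 0.0  # assumed policy: a degenerate output scores zero
    scores = parse_judge(raw_judge, traits)
    return round(100 * sum(scores.values()) / (TRAIT_MAX * len(traits)), 1)


traits = ["accuracy", "structure"]  # hypothetical trait names
print(normalized_score('{"accuracy": 5, "structure": 4}', "A short doc. It has two sentences.", traits))
```

Rejecting any reply that is not exactly the expected JSON keeps a chatty or malformed judge response from being silently scored.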
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.