Business · 8 tasks · 44 models
Smartest AI models for Research & Competitive Analysis
Which models research and analyse without fabricating sources, inventing competitor facts, or hand-waving a market size?
The highest-quality model for Research & Competitive Analysis is claude-opus-4.8-low (excellent).
- Top score: claude-opus-4.8-low (excellent)
- Best value: deepseek-v3.2-low clears the quality bar at $0.016/run
- Fastest: grok-4.20 at ~19s per run, still strong
Quality vs. cost
Every model is placed by what it delivers (score) and what it costs (cost per run). The best value sits high and to the left: high score, low cost.
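One way to read "high and to the left" is as a Pareto frontier: a model is worth a look if no other model is both at least as good and at least as cheap. The sketch below is illustrative only. It uses an arbitrary subset of rows copied from the ranking table, and the `Model` type and `pareto_frontier` function are hypothetical names, not part of how this page is generated.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # 0-100 quality score from the ranking table
    cost: float   # USD per run from the ranking table


# Arbitrary subset of rows from the ranking table (not the full list).
MODELS = [
    Model("claude-opus-4.8-low", 99.8, 0.0454),
    Model("gpt-5.5-low", 98.0, 0.0668),
    Model("gpt-5.4-low", 93.2, 0.0398),
    Model("gpt-5.4-mini", 90.6, 0.0262),
    Model("deepseek-v3.1-terminus", 80.1, 0.0231),
    Model("deepseek-v3.2-low", 73.1, 0.0161),
    Model("mistral-medium-3.1", 54.0, 0.0282),
]


def pareto_frontier(models):
    """Return the models that no other model beats on both score and cost."""
    frontier = []
    for m in models:
        dominated = any(
            o.score >= m.score
            and o.cost <= m.cost
            and (o.score > m.score or o.cost < m.cost)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    # Cheapest first, so the list reads left to right on the chart.
    return sorted(frontier, key=lambda m: m.cost)


if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: score {m.score}, ${m.cost:.4f}/run")
```

On this subset, gpt-5.5-low and mistral-medium-3.1 drop out because another listed model scores higher at a lower cost. Running it on the full table may give a different frontier.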
Full ranking
| # | Model | Score | Cost/run | Speed | Best for | AA Intelligence |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4.8-low | 99.8 Excellent | $0.0454 | 26.4s | Best overall | 55.7 |
| 2 | claude-opus-4.8-high | 99.2 Excellent | $0.0536 | 31.3s | Best overall | 55.7 |
| 3 | gpt-5.5-low | 98.0 Excellent | $0.0668 | 45.0s | Best overall | 54.8 |
| 4 | gpt-5.4-low | 93.2 Excellent | $0.0398 | 28.0s | Best overall | 51.4 |
| 5 | kimi-k2.7-code | 91.0 Excellent | $0.0268 | 106.0s | Best overall | 41.9 |
| 6 | gpt-5.4-mini | 90.6 Excellent | $0.0262 | 23.1s | Best overall | 40 |
| 7 | gpt-5.4 | 90.5 Excellent | $0.0422 | 30.4s | Best overall | 51.4 |
| 8 | claude-opus-4.6 | 87.8 Strong | $0.0657 | 61.9s | Best overall | — |
| 9 | gpt-5.5-high | 87.0 Strong | $0.0974 | 62.6s | Best overall | 54.8 |
| 10 | gpt-5.5 | 84.2 Strong | $0.0626 | 41.5s | Strong drafts | 54.8 |
| 11 | claude-opus-4.5 | 82.6 Strong | $0.0575 | 45.1s | Strong drafts | — |
| 12 | claude-opus-4.6-low | 81.5 Strong | $0.0781 | 70.7s | Strong drafts | — |
| 13 | claude-sonnet-4.6-low | 80.5 Strong | $0.0555 | 60.8s | Strong drafts | 47.2 |
| 14 | claude-sonnet-4.6-high | 80.2 Strong | $0.0592 | 67.5s | Strong drafts | 47.2 |
| 15 | deepseek-v3.1-terminus | 80.1 Strong | $0.0231 | 33.5s | Strong drafts | — |
| 16 | claude-opus-4.5-low | 77.8 Usable | $0.0630 | 47.7s | Strong drafts | — |
| 17 | glm-5 | 75.1 Usable | $0.0272 | 95.0s | Strong drafts | — |
| 18 | claude-sonnet-4.5-low | 74.4 Usable | $0.0381 | 41.0s | Needs review | — |
| 19 | claude-opus-4.6-high | 73.8 Usable | $0.0807 | 74.0s | Needs review | — |
| 20 | gemini-3.1-pro-preview-low | 73.6 Usable | $0.0337 | 26.6s | Needs review | 46.5 |
| 21 | deepseek-v3.2-low | 73.1 Usable | $0.0161 | 1029.5s | Needs review | — |
| 22 | gemini-3.1-pro-preview-high | 73.1 Usable | $0.0410 | 36.4s | Needs review | 46.5 |
| 23 | gemini-3.5-flash-low | 73.0 Usable | $0.0389 | 25.9s | Needs review | 50.2 |
| 24 | grok-4.20 | 72.6 Usable | $0.0221 | 18.6s | Needs review | — |
| 25 | claude-opus-4.5-high | 72.6 Usable | $0.0766 | 61.1s | Needs review | — |
| 26 | kimi-k2.5 | 72.1 Usable | $0.0310 | 113.1s | Needs review | — |
| 27 | gemini-3.1-pro-preview | 71.2 Usable | $0.0430 | 34.8s | Needs review | 46.5 |
| 28 | claude-sonnet-4.5 | 71.1 Usable | $0.0398 | 41.4s | Needs review | — |
| 29 | qwen3.7-max-low | 71.0 Usable | $0.0305 | 70.8s | Needs review | 46 |
| 30 | deepseek-v3.2 | 70.1 Usable | $0.0240 | 43.0s | Needs review | — |
| 31 | gemini-3.5-flash-high | 69.5 Needs editing | $0.0395 | 27.2s | Needs review | 50.2 |
| 32 | qwen3.7-max-high | 69.4 Needs editing | $0.0270 | 70.6s | Needs review | 46 |
| 33 | qwen3.5-plus-02-15 | 68.1 Needs editing | $0.0242 | 77.9s | Needs review | — |
| 34 | claude-sonnet-4.5-high | 66.2 Needs editing | $0.0413 | 45.6s | Needs review | — |
| 35 | deepseek-v3.2-high | 65.6 Needs editing | $0.0165 | 46.5s | Needs review | — |
| 36 | gemini-3-flash-preview | 64.9 Needs editing | $0.0287 | 26.3s | Needs review | — |
| 37 | gpt-5-mini | 64.6 Needs editing | $0.0189 | 33.2s | Needs review | — |
| 38 | grok-4.20-beta | 64.6 Needs editing | $0.0306 | 24.6s | Needs review | — |
| 39 | qwen3.7-max | 64.5 Needs editing | $0.0352 | 89.9s | Needs review | 46 |
| 40 | minimax-m2.7 | 64.1 Needs editing | $0.0253 | 77.0s | Needs review | 38.1 |
| 41 | gpt-5.4-high | 62.5 Needs editing | $0.0699 | 47.2s | Needs review | 51.4 |
| 42 | claude-haiku-4.5 | 60.9 Needs editing | $0.0275 | 28.9s | Needs review | 29.6 |
| 43 | gemini-3.1-flash-lite | 59.0 Weak | $0.0186 | 14.4s | Needs review | — |
| 44 | mistral-medium-3.1 | 54.0 Weak | $0.0282 | 30.2s | Needs review | — |
“AA Intelligence” is a third-party benchmark shown for context; it is independent of our tests. Sources: Artificial Analysis (artificialanalysis.ai) and Design Arena (www.designarena.ai), both via OpenRouter (openrouter.ai/rankings).
What separates the top models
Grounded Synthesis
Difficulty: hard. Tests synthesizing provided source excerpts with accurate attribution, flagging conflicts and abstaining on gaps instead of inventing.
Leader: claude-haiku-4.5
Competitive Teardown
Difficulty: hard. Tests a fair, structured competitor comparison that separates fact from inference and invents no competitor facts.
Leader: claude-opus-4.8-low
Market Sizing
Difficulty: hard. Tests a defensible bottom-up TAM/SAM/SOM estimate with explicit assumptions and stated uncertainty, not a hand-wavy top-down number.
Leader: claude-opus-4.8-high
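For readers unfamiliar with the term, a bottom-up estimate multiplies countable units by a price and then narrows the result step by step, with each assumption written down. The sketch below shows that arithmetic with a low and a high case to express uncertainty. Every number and name in it (`bottom_up`, the account counts, prices and shares) is hypothetical and is not taken from the benchmark's test cases.

```python
def bottom_up(accounts, annual_price, serviceable_share, obtainable_share):
    """Bottom-up market estimate from explicit, named assumptions.

    TAM = all potential accounts x annual price per account
    SAM = the share of TAM the product can actually serve
    SOM = the share of SAM that is realistically winnable
    """
    for share in (serviceable_share, obtainable_share):
        if not 0.0 <= share <= 1.0:
            raise ValueError("shares must be between 0 and 1")
    tam = accounts * annual_price
    sam = tam * serviceable_share
    som = sam * obtainable_share
    return {"TAM": tam, "SAM": sam, "SOM": som}


# HYPOTHETICAL inputs: a pessimistic and an optimistic case, so the
# estimate is reported as a range rather than a single number.
LOW = bottom_up(accounts=15_000, annual_price=4_000,
                serviceable_share=0.30, obtainable_share=0.03)
HIGH = bottom_up(accounts=25_000, annual_price=6_000,
                 serviceable_share=0.50, obtainable_share=0.08)

if __name__ == "__main__":
    for key in ("TAM", "SAM", "SOM"):
        print(f"{key}: ${LOW[key]:,.0f} to ${HIGH[key]:,.0f}")
```

Reporting the range rather than a single midpoint keeps the uncertainty visible, which is the behaviour this task rewards.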
SWOT & Strategy
mediumTests specific, evidence-grounded, actionable strategic analysis rather than generic boilerplate.
Leader: claude-sonnet-4.6-low
See the difference
Real model outputs and how our judge scored them.
Top score
claude-sonnet-4.6-low
Five Forces with evidence (Tradewinds)
100/100
“The model perfectly followed the instructions. It used only the provided facts to conduct a rigorous, specific, and actionable Five Forces analysis. It clearly separated the known facts from its own strategic inferences, invented no figures, and provided a highly useful synthesis.”
# Competitive Forces Facing Tradewinds

## 1. Threat of New Entrants — **Moderate to High**

The pay-per-shift model has low capital requirements and no long-term contracts, which means the structural barrier to launching a rival platform is not especially high. Anyone with a developer and a payment processor can replicate the basic mechanic.

However, the vetting process creates a genuine, if imitable, barrier. Building a pool of vetted workers takes time and operational investment — it is not something a new entrant can fake overnight. The 88% fill rate is the real moat here: that number only exists because there is already a critical mass of vetted workers available when a venue posts a shift. A new entrant starts at zero supply and therefore cannot promise anything close to that reliability, which matters enormously given how fragile venue trust is. One no-show can end a customer relationship, so venues will not experiment cheaply with an unknown platform. That reputational asymmetry gives Tradewinds a runway advantage, but it is not permanent — a well-funded entrant could buy their way into a worker supply relatively quickly.

**Net position:** The vetting and fill rate create a meaningful head-start, but the low structural cost of entry keeps this threat alive.

---

## 2. Threat of Substitutes — **High**

This is arguably the most immediate pressure Tradewinds faces. The cheaper, unvetted apps represent a direct substitute that competes on the most visible dimension of the transaction: price per shift. For a venue operator managing thin margins — typical in hospitality — the temptation to save money on the labour platform fee is real.

The substitution risk is asymmetric. When a shift goes well on an unvetted app, the venue pockets the saving and feels validated.
Mid-pack
glm-5
Abstain on a gap (Northwind)
100/100
“The model perfectly followed the instructions. It abstained from inventing information for the gaps in the sources, correctly cited the provided texts, and proactively flagged a conflict between Source A and Source C regarding the total number of customers.”
on the provided sources, neither Northwind's annual recurring revenue nor its enterprise-vs-SMB customer split can be determined.

* **Annual Recurring Revenue:** The sources do not provide a specific revenue figure. Source B explicitly states that the founder "won't share exact revenue."
* **Enterprise-vs-SMB Customer Split:** The sources do not provide a breakdown of customers by business size. Source A notes there are "38 paying brokerage customers" (while Source C claims "over 100"), but neither source categorizes these as enterprise or SMB. Source B only mentions seeing "strong word of mouth among mid-market brokers."
Lowest score
gpt-5.5
Bottom-up TAM (Northwind)
0/100
“The model returned an empty response.”
Where models still fail
The most common problems we flagged across all models.
Frequently asked
What is the best AI model for research & competitive analysis?
In our benchmarks, claude-opus-4.8-low ranks first for research & competitive analysis, with a score of 99.8 (Excellent) across 8 test cases.
What is the cheapest good model for research & competitive analysis?
deepseek-v3.2-low is the best value: it clears our quality bar for research & competitive analysis at $0.016 per run.
Which model is fastest for research & competitive analysis?
grok-4.20 is the fastest model that still performs well for research & competitive analysis, at 18.6s per run with a score of 72.6 (Usable).
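The three answers above can be reproduced from the ranking table by filtering on a quality bar and then taking the best score, the lowest cost and the lowest latency. The sketch below is a minimal illustration. The page does not state the bar, so `QUALITY_BAR = 70.0` is an assumption (it matches where the "Usable" label starts in the table), and `pick` is a hypothetical helper name.

```python
# (model, score, cost in USD per run, seconds per run), copied from the table.
ROWS = [
    ("claude-opus-4.8-low", 99.8, 0.0454, 26.4),
    ("grok-4.20", 72.6, 0.0221, 18.6),
    ("deepseek-v3.2-low", 73.1, 0.0161, 1029.5),
    ("deepseek-v3.2-high", 65.6, 0.0165, 46.5),
    ("gemini-3.1-flash-lite", 59.0, 0.0186, 14.4),
]

# ASSUMPTION: the page does not publish its quality bar; 70 is a guess
# based on where the "Usable" label begins in the ranking table.
QUALITY_BAR = 70.0


def pick(rows, bar=QUALITY_BAR):
    """Return the top, cheapest and fastest models that clear the bar."""
    eligible = [r for r in rows if r[1] >= bar]
    if not eligible:
        raise ValueError("no model clears the quality bar")
    return {
        "top": max(eligible, key=lambda r: r[1])[0],
        "cheapest": min(eligible, key=lambda r: r[2])[0],
        "fastest": min(eligible, key=lambda r: r[3])[0],
    }


if __name__ == "__main__":
    print(pick(ROWS))
```

Note that the filter matters: gemini-3.1-flash-lite is faster than grok-4.20 in the table, but it falls below the assumed bar and is excluded.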
How we test
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 352 model runs across 4 benchmarks · last tested 2026-06-30
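As a rough illustration of that pipeline, the sketch below parses a judge verdict as strict JSON, blends it with deterministic pass/fail checks, and scales the result to 0-100. The actual judge schema, heuristics and weighting are not published on this page, so everything here is an assumption: the `score` field, the 0-10 judge scale, the 20% heuristic weight, the empty-output rule and the function name `score_output` are all hypothetical.

```python
import json


def score_output(output, judge_json, checks, judge_max=10.0, heuristic_weight=0.2):
    """Blend a JSON judge verdict with deterministic checks into a 0-100 score.

    ASSUMPTIONS (not from the source): the judge returns {"score": <0..judge_max>},
    heuristics are named pass/fail checks, and the blend is a weighted average.
    """
    # Deterministic heuristic: an empty response scores zero outright.
    if not output.strip():
        return 0.0

    # Strict parsing: malformed JSON or a missing "score" key raises.
    verdict = json.loads(judge_json)
    raw = float(verdict["score"])
    if not 0.0 <= raw <= judge_max:
        raise ValueError(f"judge score {raw} outside 0..{judge_max}")

    judge_part = raw / judge_max
    # Fraction of deterministic checks that passed (1.0 if none are defined).
    heuristic_part = sum(checks.values()) / len(checks) if checks else 1.0

    blended = (1 - heuristic_weight) * judge_part + heuristic_weight * heuristic_part
    return round(100 * blended, 1)


if __name__ == "__main__":
    verdict = '{"score": 5, "rationale": "partly grounded"}'
    print(score_output("some answer", verdict, {"cites_sources": True, "no_new_figures": False}))
```

Keeping the JSON parse strict means a malformed verdict fails loudly instead of silently becoming a low score.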
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.