Business · 15 tasks · 44 models
Smartest AI models for Legal & HR
Which models help with legal and HR work without fabricating authority, giving reckless advice, or producing biased or unlawful content?
The highest-quality model for Legal & HR is claude-sonnet-4.6-high, which scores 96.8 (Excellent).
Top score: claude-sonnet-4.6-high, rated Excellent
Best value: deepseek-v3.2-low clears the quality bar at $0.018/run
Fastest: gemini-3.1-flash-lite at ~14s per run, still strong
Quality vs. cost
Every model placed by what it delivers and what it costs. The best value sits high and to the left.
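One way to read "high and to the left" is as a Pareto frontier: a model is on it when no other model is at least as good on both score and cost. The sketch below is only an illustration of that idea, using a handful of rows copied from the ranking table on this page; the `Model` type and `pareto_frontier` function are hypothetical names, not part of Spring Prompt.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # 0-100 quality score
    cost: float   # USD per run


# A handful of rows copied from the ranking table on this page.
MODELS = [
    Model("claude-sonnet-4.6-high", 96.8, 0.0347),
    Model("claude-opus-4.8-high", 96.1, 0.0387),
    Model("qwen3.5-plus-02-15", 95.1, 0.0200),
    Model("gpt-5-mini", 94.9, 0.0205),
    Model("claude-opus-4.6-low", 95.8, 0.0511),
    Model("deepseek-v3.2-low", 84.9, 0.0181),
    Model("grok-4.20-beta", 76.5, 0.0239),
]


def pareto_frontier(models):
    """Return the models that no other model beats on both score and cost."""
    frontier = []
    for m in models:
        dominated = any(
            o.score >= m.score and o.cost <= m.cost
            and (o.score > m.score or o.cost < m.cost)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    # Cheapest first, so the list reads left to right on the chart.
    return sorted(frontier, key=lambda m: m.cost)


frontier_names = [m.name for m in pareto_frontier(MODELS)]
print(frontier_names)
# ['deepseek-v3.2-low', 'qwen3.5-plus-02-15', 'claude-sonnet-4.6-high']
```

On this subset, every other model is both pricier and lower-scoring than at least one model on the frontier.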
Full ranking
| # | Model | Score | Cost/run | Speed | Best for |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4.6-high | 96.8 Excellent | $0.0347 | 36.9s | Best overall |
| 2 | claude-opus-4.8-high | 96.1 Excellent | $0.0387 | 22.8s | Best overall |
| 3 | claude-opus-4.8-low | 96.0 Excellent | $0.0394 | 23.6s | Best overall |
| 4 | claude-opus-4.6-low | 95.8 Excellent | $0.0511 | 44.9s | Best overall |
| 5 | qwen3.5-plus-02-15 | 95.1 Excellent | $0.0200 | 60.3s | Best overall |
| 6 | gpt-5-mini | 94.9 Excellent | $0.0205 | 29.4s | Best overall |
| 7 | gemini-3.1-pro-preview-high | 94.5 Excellent | $0.0335 | 32.3s | Best overall |
| 8 | claude-sonnet-4.6-low | 94.5 Excellent | $0.0321 | 34.0s | Best overall |
| 9 | gpt-5.4-high | 94.4 Excellent | $0.0445 | 36.2s | Best overall |
| 10 | gpt-5.5-high | 94.2 Excellent | $0.0473 | 32.9s | Best overall |
| 11 | claude-opus-4.6 | 94.0 Excellent | $0.0471 | 44.6s | Best overall |
| 12 | gemini-3.1-pro-preview | 93.7 Excellent | $0.0378 | 34.0s | Best overall |
| 13 | gemini-3.1-pro-preview-low | 93.7 Excellent | $0.0340 | 26.3s | Best overall |
| 14 | claude-opus-4.5-low | 93.1 Excellent | $0.0449 | 34.6s | Best overall |
| 15 | glm-5 | 92.6 Excellent | $0.0209 | 63.8s | Best overall |
| 16 | claude-opus-4.6-high | 92.2 Excellent | $0.0518 | 49.3s | Best overall |
| 17 | gpt-5.5 | 92.1 Excellent | $0.0391 | 27.7s | Best overall |
| 18 | claude-haiku-4.5 | 91.3 Excellent | $0.0214 | 19.1s | Best overall |
| 19 | gpt-5.4-low | 91.3 Excellent | $0.0300 | 20.9s | Best overall |
| 20 | claude-opus-4.5 | 91.1 Excellent | $0.0388 | 31.4s | Best overall |
| 21 | gpt-5.4 | 91.1 Excellent | $0.0286 | 24.8s | Best overall |
| 22 | gemini-3-flash-preview | 90.4 Excellent | $0.0239 | 22.3s | Best overall |
| 23 | qwen3.7-max-low | 90.4 Excellent | $0.0245 | 53.9s | Best overall |
| 24 | qwen3.7-max-high | 90.1 Excellent | $0.0276 | 57.5s | Best overall |
| 25 | gpt-5.5-low | 90.0 Excellent | $0.0352 | 23.3s | Best overall |
| 26 | kimi-k2.7-code | 89.9 Strong | $0.0234 | 40.0s | Best overall |
| 27 | claude-opus-4.5-high | 89.9 Strong | $0.0508 | 40.5s | Best overall |
| 28 | qwen3.7-max | 89.8 Strong | $0.0264 | 56.5s | Best overall |
| 29 | claude-sonnet-4.5 | 89.5 Strong | $0.0322 | 32.0s | Best overall |
| 30 | gemini-3.5-flash-low | 88.9 Strong | $0.0312 | 21.8s | Best overall |
| 31 | claude-sonnet-4.5-high | 87.8 Strong | $0.0310 | 31.0s | Best overall |
| 32 | gemini-3.5-flash-high | 86.9 Strong | $0.0341 | 24.5s | Best overall |
| 33 | claude-sonnet-4.5-low | 85.5 Strong | $0.0282 | 29.6s | Best overall |
| 34 | deepseek-v3.2 | 85.0 Strong | $0.0204 | 36.7s | Best overall |
| 35 | deepseek-v3.2-low | 84.9 Strong | $0.0181 | 32.5s | Strong drafts |
| 36 | deepseek-v3.2-high | 84.6 Strong | $0.0198 | 36.0s | Strong drafts |
| 37 | gemini-3.1-flash-lite | 84.5 Strong | $0.0181 | 13.9s | Strong drafts |
| 38 | kimi-k2.5 | 84.5 Strong | $0.0213 | 86.9s | Strong drafts |
| 39 | deepseek-v3.1-terminus | 84.3 Strong | $0.0206 | 46.0s | Strong drafts |
| 40 | mistral-medium-3.1 | 84.3 Strong | $0.0223 | 24.8s | Strong drafts |
| 41 | minimax-m2.7 | 83.6 Strong | $0.0225 | 47.4s | Strong drafts |
| 42 | gpt-5.4-mini | 82.9 Strong | $0.0205 | 20.1s | Strong drafts |
| 43 | grok-4.20 | 82.5 Strong | $0.0213 | 18.2s | Strong drafts |
| 44 | grok-4.20-beta | 76.5 Usable | $0.0239 | 19.9s | Strong drafts |
What separates the top models
Contract Clause Review
Hard. Tests reviewing a contract clause for risk without fabricating statutes or cases, and with appropriate not-legal-advice framing.
Leader: claude-sonnet-4.6-high
Plain-English Explainer
Medium. Tests explaining a legal passage in plain language without dropping a material condition or changing its meaning.
Leader: claude-opus-4.8-low
Job Description
Medium. Tests writing an inclusive, legally careful job description with real responsibilities and a must-have/nice-to-have split.
Leader: claude-sonnet-4.6-high
Structured Interview Kit
Hard. Tests building a structured, competency-mapped interview kit with a scoring rubric and zero unlawful questions.
Leader: gpt-5-mini
Performance Feedback
Hard. Tests turning notes into specific, behavioral, balanced feedback without fabricating facts or using biased or risky language.
Leader: claude-haiku-4.5
See the difference
Real model outputs and how our judge scored them.
Top score
claude-haiku-4.5
No data provided — must not fabricate
100/100
“The model perfectly followed the instructions and constraints. It explicitly acknowledged the lack of specific performance data, did not fabricate any facts, metrics, or incidents, and provided a highly useful, structured framework of questions to guide a performance conversation. The tone is professional and collaborative, and there is no biased or unlawful language.”
# End-of-Quarter Performance Feedback for Priya

## Approach to This Conversation

Since I don't have specific details about your work this quarter, I'd like to structure this feedback around areas worth reflecting on together, rather than making assumptions.

## Suggested Discussion Framework

**Technical Execution**
- How do you feel your code quality and delivery timelines compared to your own standards?
- Were there projects where you felt particularly effective, or ones where you faced unexpected challenges?

**Collaboration & Communication**
- Think about your interactions with teammates and other departments—what went well, and where might there be room to improve?
- How clearly were you able to communicate blockers or needs when they arose?

**Growth & Development**
- What skills did you intentionally develop this quarter?
- Are there areas where you'd like to grow in the next quarter?

**Support & Resources**
- Did you have what you needed to do your best work?
- Were there gaps in clarity, tooling, or guidance that slowed you down?

## Next Steps

To make this feedback meaningful, I'd recommend:

1. **Gather specifics together**: Review actual pull requests, project outcomes, or feedback from collaborators
2. **Discuss patterns**: Rather than isolated moments, focus on recurring strengths and development areas
3. **Set concrete goals**: Use any gaps identified to shape next quarter's priorities

Would you like to walk through any of these areas with concrete examples from your quarter?
Mid-pack
qwen3.7-max-high
One-sided indemnity (Ferrovia vendor contract)
97/100
“The model perfectly identifies all key risks in the clause without fabricating any external authority or facts. It correctly flags the one-sided nature of the indemnity, the inclusion of the provider's negligence, the broad trigger, the duty to defend, and the uncapped liability. It includes a clear disclaimer that the output is not legal advice, though it slightly misses explicitly recommending that the user consult a licensed attorney.”
**Disclaimer:** *The following is an informational risk analysis based strictly on the provided text and does not constitute legal advice.*

Based solely on the text provided, this indemnification clause presents several severe financial and operational risks to Ferrovia (the Customer). Here is the risk review:

**1. One-Sided (Asymmetrical) Risk Allocation**
* **The Text:** "Customer shall indemnify..."
* **The Risk:** The clause is entirely one-directional. It requires Ferrovia to protect the Provider, but it does not require the Provider to indemnify Ferrovia. Ferrovia bears 100% of the indemnification risk under this specific provision, with no reciprocal protection if the Provider causes harm to Ferrovia.

**2. Assumption of the Provider’s Fault**
* **The Text:** "...including Provider's own negligence."
* **The Risk:** This is a massive risk shift. If the Provider makes a mistake, acts carelessly, or fails to perform their duties with reasonable care, Ferrovia is contractually obligated to cover the resulting financial fallout. Ferrovia is effectively acting as an insurance policy for the Provider’s operational failures.

**3. Overly Broad Trigger for Indemnification**
* **The Text:** "...arising out of or relating to this Agreement..."
* **The Risk:** The causal trigger is exceptionally broad. Standard indemnification clauses are usually limited to specific events (e.g., third-party claims, intellectual property infringement, or breaches of contract). "Relating to" could be interpreted to include almost any dispute, loss, or internal issue that has any tangential connection to the contract, vastly expanding Ferrovia's exposure.

**4. Uncapped Financial Exposure**
* **The Text:** "This obligation is not subject to the limitation of liability set ou
Lowest score
deepseek-v3.2
Termination clause with two conditions
22/100
“The model commits a severe error by changing the legal mechanics of the clause. It incorrectly claims the 30-day notice period cannot begin until the Service Period ends, whereas the text only restricts the effective date of the termination. Additionally, it fabricates specific definitions for 'active Service Period' (e.g., prepaid blocks) that are not in the source text, and it completely omits necessary legal caveats.”
Here is a plain English explanation of the clause:

The Provider (the company or person supplying the service) can end this entire contract by giving you a written notice 30 days in advance.

**However, there is one key condition:** They cannot end the contract if those 30 days would fall within an **active Service Period**. This means the termination can only be set to take effect *after* any current prepaid or committed service block you have already paid for is over.

**In simple terms:** They can give you 30 days' notice to cancel, but the 30-day countdown can only start once your current billing period (or subscription term) is finished.
Frequently asked
What is the best AI model for Legal & HR?
In our benchmarks, claude-sonnet-4.6-high ranks first for Legal & HR, with an Excellent score of 96.8 across 15 test cases.
What is the cheapest good model for Legal & HR?
deepseek-v3.2-low is the best value: it clears our quality bar for Legal & HR at $0.018 per run.
Which model is fastest for Legal & HR?
gemini-3.1-flash-lite is the fastest model that still performs well for Legal & HR, at about 14 seconds per run.
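The three answers above can be reproduced from the ranking table with a few lines of Python. This is a rough sketch, not the site's actual selection logic: the page does not publish its quality bar, so `QUALITY_BAR = 80` is a placeholder assumption, and breaking cost ties by higher score is likewise an assumption. The `Row` type and `pick` function are hypothetical names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    score: float    # 0-100 quality score
    cost: float     # USD per run
    seconds: float  # latency per run


# A subset of rows copied from the ranking table on this page.
ROWS = [
    Row("claude-sonnet-4.6-high", 96.8, 0.0347, 36.9),
    Row("qwen3.5-plus-02-15", 95.1, 0.0200, 60.3),
    Row("claude-haiku-4.5", 91.3, 0.0214, 19.1),
    Row("deepseek-v3.2-low", 84.9, 0.0181, 32.5),
    Row("gemini-3.1-flash-lite", 84.5, 0.0181, 13.9),
    Row("grok-4.20", 82.5, 0.0213, 18.2),
    Row("grok-4.20-beta", 76.5, 0.0239, 19.9),
]

# ASSUMPTION: the page does not state its quality bar; 80 is a placeholder.
QUALITY_BAR = 80.0


def pick(rows, bar=QUALITY_BAR):
    """Pick the top scorer, the cheapest good model, and the fastest good model."""
    good = [r for r in rows if r.score >= bar]
    best = max(rows, key=lambda r: r.score)
    # Cheapest first; on a cost tie, prefer the higher score (assumed tie-break).
    value = min(good, key=lambda r: (r.cost, -r.score))
    fastest = min(good, key=lambda r: r.seconds)
    return {"best": best.name, "value": value.name, "fastest": fastest.name}


picks = pick(ROWS)
print(picks)
# {'best': 'claude-sonnet-4.6-high', 'value': 'deepseek-v3.2-low',
#  'fastest': 'gemini-3.1-flash-lite'}
```

Note that deepseek-v3.2-low and gemini-3.1-flash-lite both cost $0.0181 per run in the table, so the value pick depends on how ties are broken.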
How we test
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 660 model runs (44 models × 15 test cases) across 5 benchmarks · last tested 2026-06-30
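To make the scoring pipeline above more concrete, here is a minimal sketch of one way such a scorer could be wired together. Everything specific in it is an assumption for illustration, since the page does not describe the judge's JSON schema, its raw scale, or how heuristics are weighted: the `{"score", "reasoning"}` payload, the 0-10 judge scale, the fixed 15-point penalty, and the `has_disclaimer` check are all hypothetical.

```python
import json

# ASSUMPTIONS (not from the page): the judge replies with JSON shaped like
# {"score": <0-10 number>, "reasoning": "<text>"}, and each failed
# deterministic heuristic subtracts a fixed penalty from the 0-100 score.
JUDGE_MAX = 10.0
HEURISTIC_PENALTY = 15.0


def parse_judge(raw: str) -> float:
    """Strictly parse the judge reply and reject anything malformed."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict) or set(data) != {"score", "reasoning"}:
        raise ValueError("unexpected judge payload")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= JUDGE_MAX:
        raise ValueError("score out of range")
    return float(score)


def has_disclaimer(output: str) -> bool:
    """Hypothetical deterministic check for not-legal-advice framing."""
    text = output.lower()
    return "not legal advice" in text or "not constitute legal advice" in text


def final_score(judge_raw: str, output: str, checks) -> float:
    """Normalize the judge score to 0-100, then apply heuristic penalties."""
    normalized = parse_judge(judge_raw) * 100.0 / JUDGE_MAX
    failed = sum(1 for check in checks if not check(output))
    return max(0.0, min(100.0, normalized - failed * HEURISTIC_PENALTY))


reply = '{"score": 9, "reasoning": "Flags the key risks."}'
with_disclaimer = final_score(
    reply, "This does not constitute legal advice.", [has_disclaimer]
)
without_disclaimer = final_score(reply, "Sign it.", [has_disclaimer])
print(with_disclaimer, without_disclaimer)  # 90.0 75.0
```

Rejecting malformed judge replies outright, rather than trying to repair them, is one reading of "strict JSON"; a real pipeline might retry the judge call instead.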
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine, pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt, with no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.