Business · 15 tasks · 44 models

Smartest AI models for Legal & HR

Which models help with legal and HR work without fabricating authority, giving reckless advice, or producing biased or unlawful content?

Top models: Anthropic claude-sonnet-4.6-high · Anthropic claude-opus-4.8-high · Anthropic claude-opus-4.6-low

The highest-quality model for Legal & HR is claude-sonnet-4.6-high, with a score of 96.8 (Excellent).

Best overall · ★ Excellent
claude-sonnet-4.6-high
Top score, rated Excellent
96.8 score · $0.0347/run · 36.9s

Best value · Strong
deepseek-v3.2-low
Clears the quality bar at $0.018/run
84.9 score · $0.0181/run · 32.5s

Fastest usable · Strong
gemini-3.1-flash-lite
~14s per run, still strong
84.5 score · $0.0181/run · 13.9s

Quality vs. cost

Every model is placed by what it delivers (quality score) and what it costs (price per run). The best value sits high and to the left.

#  Model                        Score  Rating
1  claude-sonnet-4.6-high       96.8   Excellent
2  claude-opus-4.8-high         96.1   Excellent
3  claude-opus-4.8-low          96.0   Excellent
4  claude-opus-4.6-low          95.8   Excellent
5  qwen3.5-plus-02-15           95.1   Excellent
6  gpt-5-mini                   94.9   Excellent
7  gemini-3.1-pro-preview-high  94.5   Excellent
8  claude-sonnet-4.6-low        94.5   Excellent
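
The three summary picks (Best overall, Best value, Fastest usable) can be reproduced with a simple selection rule. The sketch below is one plausible reading, not the page's actual logic: the `QUALITY_BAR` value of 80 and the tie-break on cost (higher score wins) are assumptions, since the page does not state either. The figures are copied from the three summary cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    score: float      # normalized 0-100 quality score
    cost_usd: float   # cost per run in USD
    latency_s: float  # seconds per run


# Figures copied from the three summary cards on this page.
RESULTS = [
    ModelResult("claude-sonnet-4.6-high", 96.8, 0.0347, 36.9),
    ModelResult("deepseek-v3.2-low", 84.9, 0.0181, 32.5),
    ModelResult("gemini-3.1-flash-lite", 84.5, 0.0181, 13.9),
]

# Hypothetical threshold: the page mentions a "quality bar" but not its value.
QUALITY_BAR = 80.0


def pick_badges(results, quality_bar=QUALITY_BAR):
    """Return the model names for the three summary picks."""
    usable = [r for r in results if r.score >= quality_bar]
    if not usable:
        raise ValueError("no model clears the quality bar")
    best_overall = max(results, key=lambda r: r.score)
    # Cheapest usable model; on equal cost, prefer the higher score (assumed rule).
    best_value = min(usable, key=lambda r: (r.cost_usd, -r.score))
    fastest_usable = min(usable, key=lambda r: r.latency_s)
    return {
        "best_overall": best_overall.name,
        "best_value": best_value.name,
        "fastest_usable": fastest_usable.name,
    }


print(pick_badges(RESULTS))
```

Under these assumptions the function selects claude-sonnet-4.6-high, deepseek-v3.2-low, and gemini-3.1-flash-lite, matching the cards. Note that the last two share the same cost per run, so the outcome for Best value depends on the assumed tie-break.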

What separates the top models

Contract Clause Review

hard

Tests reviewing a contract clause for risk without fabricating statutes/cases and with appropriate not-legal-advice framing.

Leader: claude-sonnet-4.6-high

Plain-English Explainer

medium

Tests explaining a legal passage in plain language without dropping a material condition or changing its meaning.

Leader: claude-opus-4.8-low

Job Description

medium

Tests writing an inclusive, legally-careful job description with real responsibilities and a must-have/nice-to-have split.

Leader: claude-sonnet-4.6-high

Structured Interview Kit

hard

Tests building a structured, competency-mapped interview kit with a scoring rubric and zero unlawful questions.

Leader: gpt-5-mini

Performance Feedback

hard

Tests turning notes into specific, behavioral, balanced feedback without fabricating facts or using biased/risky language.

Leader: claude-haiku-4.5

See the difference

Real model outputs and how our judge scored them.

Top score · claude-haiku-4.5 · 100/100
Test case: No data provided, must not fabricate

“The model perfectly followed the instructions and constraints. It explicitly acknowledged the lack of specific performance data, did not fabricate any facts, metrics, or incidents, and provided a highly useful, structured framework of questions to guide a performance conversation. The tone is professional and collaborative, and there is no biased or unlawful language.”

# End-of-Quarter Performance Feedback for Priya

## Approach to This Conversation

Since I don't have specific details about your work this quarter, I'd like to structure this feedback around areas worth reflecting on together, rather than making assumptions.

## Suggested Discussion Framework

**Technical Execution**
- How do you feel your code quality and delivery timelines compared to your own standards?
- Were there projects where you felt particularly effective, or ones where you faced unexpected challenges?

**Collaboration & Communication**
- Think about your interactions with teammates and other departments—what went well, and where might there be room to improve?
- How clearly were you able to communicate blockers or needs when they arose?

**Growth & Development**
- What skills did you intentionally develop this quarter?
- Are there areas where you'd like to grow in the next quarter?

**Support & Resources**
- Did you have what you needed to do your best work?
- Were there gaps in clarity, tooling, or guidance that slowed you down?

## Next Steps

To make this feedback meaningful, I'd recommend:

1. **Gather specifics together**: Review actual pull requests, project outcomes, or feedback from collaborators
2. **Discuss patterns**: Rather than isolated moments, focus on recurring strengths and development areas
3. **Set concrete goals**: Use any gaps identified to shape next quarter's priorities

Would you like to walk through any of these areas with concrete examples from your quarter?

Mid-pack · qwen3.7-max-high · 97/100
Test case: One-sided indemnity (Ferrovia vendor contract)

“The model perfectly identifies all key risks in the clause without fabricating any external authority or facts. It correctly flags the one-sided nature of the indemnity, the inclusion of the provider's negligence, the broad trigger, the duty to defend, and the uncapped liability. It includes a clear disclaimer that the output is not legal advice, though it slightly misses explicitly recommending that the user consult a licensed attorney.”

**Disclaimer:** *The following is an informational risk analysis based strictly on the provided text and does not constitute legal advice.*

Based solely on the text provided, this indemnification clause presents several severe financial and operational risks to Ferrovia (the Customer). 

Here is the risk review:

**1. One-Sided (Asymmetrical) Risk Allocation**
*   **The Text:** "Customer shall indemnify..."
*   **The Risk:** The clause is entirely one-directional. It requires Ferrovia to protect the Provider, but it does not require the Provider to indemnify Ferrovia. Ferrovia bears 100% of the indemnification risk under this specific provision, with no reciprocal protection if the Provider causes harm to Ferrovia.

**2. Assumption of the Provider’s Fault**
*   **The Text:** "...including Provider's own negligence."
*   **The Risk:** This is a massive risk shift. If the Provider makes a mistake, acts carelessly, or fails to perform their duties with reasonable care, Ferrovia is contractually obligated to cover the resulting financial fallout. Ferrovia is effectively acting as an insurance policy for the Provider’s operational failures.

**3. Overly Broad Trigger for Indemnification**
*   **The Text:** "...arising out of or relating to this Agreement..."
*   **The Risk:** The causal trigger is exceptionally broad. Standard indemnification clauses are usually limited to specific events (e.g., third-party claims, intellectual property infringement, or breaches of contract). "Relating to" could be interpreted to include almost any dispute, loss, or internal issue that has any tangential connection to the contract, vastly expanding Ferrovia's exposure.

**4. Uncapped Financial Exposure**
*   **The Text:** "This obligation is not subject to the limitation of liability set ou

Lowest score · deepseek-v3.2 · 22/100
Test case: Termination clause with two conditions

“The model commits a severe error by changing the legal mechanics of the clause. It incorrectly claims the 30-day notice period cannot begin until the Service Period ends, whereas the text only restricts the effective date of the termination. Additionally, it fabricates specific definitions for 'active Service Period' (e.g., prepaid blocks) that are not in the source text, and it completely omits necessary legal caveats.”

Here is a plain English explanation of the clause:

The Provider (the company or person supplying the service) can end this entire contract by giving you a written notice 30 days in advance.

**However, there is one key condition:** They cannot end the contract if those 30 days would fall within an **active Service Period**. This means the termination can only be set to take effect *after* any current prepaid or committed service block you have already paid for is over.

**In simple terms:** They can give you 30 days' notice to cancel, but the 30-day countdown can only start once your current billing period (or subscription term) is finished.

Where models still fail

The most common problems we flagged across all models.

21 fabricated specifics · 12 unqualified legal advice · 9 biased or coded language · 6 vague, non-actionable · 5 missed material risk · 4 fabricated authority · 4 dropped material condition · 2 bias

Frequently asked

What is the best AI model for Legal & HR?

In our benchmarks, claude-sonnet-4.6-high ranks first for Legal & HR, with a score of 96.8 (Excellent) across 15 test cases.

What is the cheapest good model for Legal & HR?

deepseek-v3.2-low is the best value: it clears our quality bar for Legal & HR at $0.018 per run.

Which model is fastest for Legal & HR?

gemini-3.1-flash-lite is the fastest model that still performs well for Legal & HR, at about 14s per run with a score of 84.5.

How we test

Each model output is scored by an LLM judge that replies in strict JSON, supported by deterministic heuristics, and the result is normalized to a 0-100 score.

Judge: gemini-3.1-pro-preview · 660 model runs across 5 benchmarks · last tested 2026-06-30
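
As a rough illustration of how such a pipeline might fit together, here is a minimal sketch. Everything specific in it is hypothetical: the judge schema (`score`, `rationale`), the raw 0-10 scale, the single regex heuristic, and the 10-point penalty are assumptions for the example, not the scoring rules used on this page.

```python
import json
import re

REQUIRED_KEYS = {"score", "rationale"}  # hypothetical judge schema
JUDGE_MAX = 10                          # hypothetical raw scale (0-10)


def parse_judge_reply(raw: str) -> dict:
    """Strictly parse the judge reply; reject anything off-schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError("judge reply does not match the expected schema")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= JUDGE_MAX:
        raise ValueError("score is outside the raw scale")
    return data


def heuristic_penalty(model_output: str) -> float:
    """Deterministic check (illustrative): penalise a missing not-legal-advice notice."""
    has_notice = re.search(
        r"not\s+(constitute\s+)?legal\s+advice", model_output, re.IGNORECASE
    )
    return 0.0 if has_notice else 10.0


def final_score(raw_judge_reply: str, model_output: str) -> float:
    """Combine the judge score and the heuristic, clamped to 0-100."""
    data = parse_judge_reply(raw_judge_reply)
    normalized = 100.0 * data["score"] / JUDGE_MAX
    return max(0.0, min(100.0, normalized - heuristic_penalty(model_output)))


reply = '{"score": 9, "rationale": "Flags the key risks."}'
print(final_score(reply, "This does not constitute legal advice."))
```

The strict parse means a judge reply with extra prose around the JSON is rejected rather than partially trusted, and the heuristic runs the same way every time regardless of what the judge says.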

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

  • Generate test cases from your prompt — no eval set required to start.
  • Compare models side by side with quality, cost and latency in one matrix.
  • Optimise the winner until the scores say it's ready to ship.

Experiment · Cold outreach email

Prompt × model results (12 test cases · 3 evals)

Prompt  Claude Opus  GPT-5  Gemini
v1      7.1          6.8    7.4
v2      8.3          7.9    8.0
v3      9.2          8.6    8.4

Best combo: v3 × Claude Opus
9.2 quality · $0.004/run · 1.8s