Business · 15 tasks · 44 models

Smartest AI models for Legal & HR

Name: Legal & HR AI model benchmark
Creator: Spring Prompt

Which models help with legal and HR work without fabricating authority, giving reckless advice, or producing biased or unlawful content?

Top models

claude-sonnet-4.6-high (Anthropic)

claude-opus-4.8-high (Anthropic)

claude-opus-4.6-low (Anthropic)

The highest-quality model for Legal & HR is claude-sonnet-4.6-high, with a score of 96.8 (Excellent).

Best overall ★ Excellent

claude-sonnet-4.6-high

Top score — excellent

96.8 score $0.0347/run 36.9s

Best value Strong

deepseek-v3.2-low

Clears the quality bar at $0.018/run

84.9 score $0.0181/run 32.5s

Fastest usable Strong

gemini-3.1-flash-lite

~14s per run, still strong

84.5 score $0.0181/run 13.9s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
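
One way to read "high and to the left" is as a Pareto frontier: a model sits on it when no other model is at least as good and at least as cheap, and strictly better on one of the two. The Python sketch below illustrates that reading with a handful of rows copied from the ranking on this page. The names `Row`, `dominates`, and `pareto_frontier` are hypothetical, and this is not necessarily how the chart itself is drawn.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # benchmark score, 0-100
    cost: float   # USD per run


# A few rows copied from the full ranking (deliberately not the whole table).
ROWS = [
    Row("claude-sonnet-4.6-high", 96.8, 0.0347),
    Row("claude-opus-4.6-low", 95.8, 0.0511),
    Row("qwen3.5-plus-02-15", 95.1, 0.0200),
    Row("gpt-5-mini", 94.9, 0.0205),
    Row("deepseek-v3.2-low", 84.9, 0.0181),
    Row("grok-4.20-beta", 76.5, 0.0239),
]


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is no worse than `b` on both axes and better on at least one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Return the rows that no other row dominates, cheapest first."""
    frontier = [r for r in rows if not any(dominates(o, r) for o in rows)]
    return sorted(frontier, key=lambda r: r.cost)


for row in pareto_frontier(ROWS):
    print(f"{row.model}: {row.score} at ${row.cost:.4f}/run")
```

On this subset the frontier is deepseek-v3.2-low, qwen3.5-plus-02-15, and claude-sonnet-4.6-high; every other row is beaten on both score and cost by one of those three.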

Full ranking

#	Model	Score	Cost/run	Speed	Best for
1	claude-sonnet-4.6-high	96.8 Excellent	$0.0347	36.9s	Best overall
2	claude-opus-4.8-high	96.1 Excellent	$0.0387	22.8s	Best overall
3	claude-opus-4.8-low	96.0 Excellent	$0.0394	23.6s	Best overall
4	claude-opus-4.6-low	95.8 Excellent	$0.0511	44.9s	Best overall
5	qwen3.5-plus-02-15	95.1 Excellent	$0.0200	60.3s	Best overall
6	gpt-5-mini	94.9 Excellent	$0.0205	29.4s	Best overall
7	gemini-3.1-pro-preview-high	94.5 Excellent	$0.0335	32.3s	Best overall
8	claude-sonnet-4.6-low	94.5 Excellent	$0.0321	34.0s	Best overall
9	gpt-5.4-high	94.4 Excellent	$0.0445	36.2s	Best overall
10	gpt-5.5-high	94.2 Excellent	$0.0473	32.9s	Best overall
11	claude-opus-4.6	94.0 Excellent	$0.0471	44.6s	Best overall
12	gemini-3.1-pro-preview	93.7 Excellent	$0.0378	34.0s	Best overall
13	gemini-3.1-pro-preview-low	93.7 Excellent	$0.0340	26.3s	Best overall
14	claude-opus-4.5-low	93.1 Excellent	$0.0449	34.6s	Best overall
15	glm-5	92.6 Excellent	$0.0209	63.8s	Best overall
16	claude-opus-4.6-high	92.2 Excellent	$0.0518	49.3s	Best overall
17	gpt-5.5	92.1 Excellent	$0.0391	27.7s	Best overall
18	claude-haiku-4.5	91.3 Excellent	$0.0214	19.1s	Best overall
19	gpt-5.4-low	91.3 Excellent	$0.0300	20.9s	Best overall
20	claude-opus-4.5	91.1 Excellent	$0.0388	31.4s	Best overall
21	gpt-5.4	91.1 Excellent	$0.0286	24.8s	Best overall
22	gemini-3-flash-preview	90.4 Excellent	$0.0239	22.3s	Best overall
23	qwen3.7-max-low	90.4 Excellent	$0.0245	53.9s	Best overall
24	qwen3.7-max-high	90.1 Excellent	$0.0276	57.5s	Best overall
25	gpt-5.5-low	90.0 Excellent	$0.0352	23.3s	Best overall
26	kimi-k2.7-code	89.9 Strong	$0.0234	40.0s	Best overall
27	claude-opus-4.5-high	89.9 Strong	$0.0508	40.5s	Best overall
28	qwen3.7-max	89.8 Strong	$0.0264	56.5s	Best overall
29	claude-sonnet-4.5	89.5 Strong	$0.0322	32.0s	Best overall
30	gemini-3.5-flash-low	88.9 Strong	$0.0312	21.8s	Best overall
31	claude-sonnet-4.5-high	87.8 Strong	$0.0310	31.0s	Best overall
32	gemini-3.5-flash-high	86.9 Strong	$0.0341	24.5s	Best overall
33	claude-sonnet-4.5-low	85.5 Strong	$0.0282	29.6s	Best overall
34	deepseek-v3.2	85.0 Strong	$0.0204	36.7s	Best overall
35	deepseek-v3.2-low	84.9 Strong	$0.0181	32.5s	Strong drafts
36	deepseek-v3.2-high	84.6 Strong	$0.0198	36.0s	Strong drafts
37	gemini-3.1-flash-lite	84.5 Strong	$0.0181	13.9s	Strong drafts
38	kimi-k2.5	84.5 Strong	$0.0213	86.9s	Strong drafts
39	deepseek-v3.1-terminus	84.3 Strong	$0.0206	46.0s	Strong drafts
40	mistral-medium-3.1	84.3 Strong	$0.0223	24.8s	Strong drafts
41	minimax-m2.7	83.6 Strong	$0.0225	47.4s	Strong drafts
42	gpt-5.4-mini	82.9 Strong	$0.0205	20.1s	Strong drafts
43	grok-4.20	82.5 Strong	$0.0213	18.2s	Strong drafts
44	grok-4.20-beta	76.5 Usable	$0.0239	19.9s	Strong drafts
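
The three headline picks (best overall, best value, fastest usable) can be reproduced from the table with simple filters. The Python sketch below assumes a quality bar of 80, a hypothetical threshold since the page does not state the actual bar, and breaks cost ties in favour of the higher score, which is also an assumption. Only a few rows from the table are included.

```python
QUALITY_BAR = 80.0  # hypothetical threshold; the page does not publish its bar

# (model, score, cost in USD per run, seconds per run), copied from the table.
MODELS = [
    ("claude-sonnet-4.6-high", 96.8, 0.0347, 36.9),
    ("claude-opus-4.8-high", 96.1, 0.0387, 22.8),
    ("deepseek-v3.2-low", 84.9, 0.0181, 32.5),
    ("gemini-3.1-flash-lite", 84.5, 0.0181, 13.9),
    ("grok-4.20", 82.5, 0.0213, 18.2),
    ("grok-4.20-beta", 76.5, 0.0239, 19.9),
]


def headline_picks(models, bar=QUALITY_BAR):
    """Pick best overall, best value, and fastest among models at or above `bar`."""
    usable = [m for m in models if m[1] >= bar]
    best_overall = max(usable, key=lambda m: m[1])
    # Lowest cost wins; on a cost tie, the higher score wins (assumed rule).
    best_value = min(usable, key=lambda m: (m[2], -m[1]))
    fastest = min(usable, key=lambda m: m[3])
    return {
        "best_overall": best_overall[0],
        "best_value": best_value[0],
        "fastest_usable": fastest[0],
    }


print(headline_picks(MODELS))
```

With these rows the sketch selects the same three models the page highlights. deepseek-v3.2-low and gemini-3.1-flash-lite tie at $0.0181 per run, so the tie-break on score decides the best-value pick.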

What separates the top models

Contract Clause Review

hard

Tests reviewing a contract clause for risk without fabricating statutes/cases and with appropriate not-legal-advice framing.

Leader: claude-sonnet-4.6-high

Plain-English Explainer

medium

Tests explaining a legal passage in plain language without dropping a material condition or changing its meaning.

Leader: claude-opus-4.8-low

Job Description

medium

Tests writing an inclusive, legally-careful job description with real responsibilities and a must-have/nice-to-have split.

Leader: claude-sonnet-4.6-high

Structured Interview Kit

hard

Tests building a structured, competency-mapped interview kit with a scoring rubric and zero unlawful questions.

Leader: gpt-5-mini

Performance Feedback

hard

Tests turning notes into specific, behavioral, balanced feedback without fabricating facts or using biased/risky language.

Leader: claude-haiku-4.5

See the difference

Real model outputs and how our judge scored them.

Top score claude-haiku-4.5 No data provided — must not fabricate

100/100

“The model perfectly followed the instructions and constraints. It explicitly acknowledged the lack of specific performance data, did not fabricate any facts, metrics, or incidents, and provided a highly useful, structured framework of questions to guide a performance conversation. The tone is professional and collaborative, and there is no biased or unlawful language.”

# End-of-Quarter Performance Feedback for Priya

## Approach to This Conversation

Since I don't have specific details about your work this quarter, I'd like to structure this feedback around areas worth reflecting on together, rather than making assumptions.

## Suggested Discussion Framework

**Technical Execution**
- How do you feel your code quality and delivery timelines compared to your own standards?
- Were there projects where you felt particularly effective, or ones where you faced unexpected challenges?

**Collaboration & Communication**
- Think about your interactions with teammates and other departments—what went well, and where might there be room to improve?
- How clearly were you able to communicate blockers or needs when they arose?

**Growth & Development**
- What skills did you intentionally develop this quarter?
- Are there areas where you'd like to grow in the next quarter?

**Support & Resources**
- Did you have what you needed to do your best work?
- Were there gaps in clarity, tooling, or guidance that slowed you down?

## Next Steps

To make this feedback meaningful, I'd recommend:

1. **Gather specifics together**: Review actual pull requests, project outcomes, or feedback from collaborators
2. **Discuss patterns**: Rather than isolated moments, focus on recurring strengths and development areas
3. **Set concrete goals**: Use any gaps identified to shape next quarter's priorities

Would you like to walk through any of these areas with concrete examples from your quarter?

Mid-pack qwen3.7-max-high One-sided indemnity (Ferrovia vendor contract)

97/100

“The model perfectly identifies all key risks in the clause without fabricating any external authority or facts. It correctly flags the one-sided nature of the indemnity, the inclusion of the provider's negligence, the broad trigger, the duty to defend, and the uncapped liability. It includes a clear disclaimer that the output is not legal advice, though it slightly misses explicitly recommending that the user consult a licensed attorney.”

**Disclaimer:** *The following is an informational risk analysis based strictly on the provided text and does not constitute legal advice.*

Based solely on the text provided, this indemnification clause presents several severe financial and operational risks to Ferrovia (the Customer).

Here is the risk review:

**1. One-Sided (Asymmetrical) Risk Allocation**
* **The Text:** "Customer shall indemnify..."
* **The Risk:** The clause is entirely one-directional. It requires Ferrovia to protect the Provider, but it does not require the Provider to indemnify Ferrovia. Ferrovia bears 100% of the indemnification risk under this specific provision, with no reciprocal protection if the Provider causes harm to Ferrovia.

**2. Assumption of the Provider’s Fault**
* **The Text:** "...including Provider's own negligence."
* **The Risk:** This is a massive risk shift. If the Provider makes a mistake, acts carelessly, or fails to perform their duties with reasonable care, Ferrovia is contractually obligated to cover the resulting financial fallout. Ferrovia is effectively acting as an insurance policy for the Provider’s operational failures.

**3. Overly Broad Trigger for Indemnification**
* **The Text:** "...arising out of or relating to this Agreement..."
* **The Risk:** The causal trigger is exceptionally broad. Standard indemnification clauses are usually limited to specific events (e.g., third-party claims, intellectual property infringement, or breaches of contract). "Relating to" could be interpreted to include almost any dispute, loss, or internal issue that has any tangential connection to the contract, vastly expanding Ferrovia's exposure.

**4. Uncapped Financial Exposure**
* **The Text:** "This obligation is not subject to the limitation of liability set ou

Lowest score deepseek-v3.2 Termination clause with two conditions

22/100

“The model commits a severe error by changing the legal mechanics of the clause. It incorrectly claims the 30-day notice period cannot begin until the Service Period ends, whereas the text only restricts the effective date of the termination. Additionally, it fabricates specific definitions for 'active Service Period' (e.g., prepaid blocks) that are not in the source text, and it completely omits necessary legal caveats.”

Here is a plain English explanation of the clause:

The Provider (the company or person supplying the service) can end this entire contract by giving you a written notice 30 days in advance.

**However, there is one key condition:** They cannot end the contract if those 30 days would fall within an **active Service Period**. This means the termination can only be set to take effect *after* any current prepaid or committed service block you have already paid for is over.

**In simple terms:** They can give you 30 days' notice to cancel, but the 30-day countdown can only start once your current billing period (or subscription term) is finished.

Where models still fail

The most common problems we flagged across all models.

21 fabricated specifics
12 unqualified legal advice
9 biased or coded language
6 vague, non-actionable
5 missed material risk
4 fabricated authority
4 dropped material condition
2 bias

Frequently asked

What is the best AI model for Legal & HR?

In our benchmarks, claude-sonnet-4.6-high ranks first for Legal & HR, scoring 96.8 (Excellent) across 15 test cases.

What is the cheapest good model for Legal & HR?

deepseek-v3.2-low is the best value: it clears our quality bar for Legal & HR at $0.018 per run.

Which model is fastest for Legal & HR?

gemini-3.1-flash-lite is the fastest model that still performs well for Legal & HR, at about 14s per run with a score of 84.5.

How we test

Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
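
As a rough illustration of that pipeline, the Python sketch below parses a judge reply as strict JSON, applies one deterministic heuristic, and normalizes to 0-100. Everything specific here is an assumption made for the example and not Spring Prompt's actual implementation: the reply schema (`score` on a 0-10 scale plus `reason`), the citation regex, and the 15-point penalty.

```python
import json
import re

JUDGE_MAX = 10.0             # assumed raw judge scale (0-10)
PENALTY_PER_CITATION = 15.0  # assumed penalty, in points on the 0-100 scale

# Hypothetical heuristic: statute-style citations such as "42 U.S.C. § 1983".
CITATION_RE = re.compile(r"\b\d+\s+U\.S\.C\.\s+§\s*\d+")


def parse_judge(raw: str) -> float:
    """Strictly parse the judge reply: exactly two keys and a numeric 0-10 score."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict) or set(data) != {"score", "reason"}:
        raise ValueError("judge reply must contain exactly 'score' and 'reason'")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    if not 0 <= score <= JUDGE_MAX:
        raise ValueError("'score' out of range")
    return float(score)


def heuristic_penalty(output: str, source: str) -> float:
    """Penalize citations that appear in the output but not in the source text."""
    invented = [c for c in CITATION_RE.findall(output) if c not in source]
    return PENALTY_PER_CITATION * len(invented)


def final_score(judge_raw: str, output: str, source: str) -> float:
    """Normalize the judge score to 0-100, subtract penalties, and clamp."""
    normalized = 100.0 * parse_judge(judge_raw) / JUDGE_MAX
    return max(0.0, min(100.0, normalized - heuristic_penalty(output, source)))


source = "Customer shall indemnify Provider."
reply = '{"score": 9, "reason": "Mostly accurate."}'
print(final_score(reply, "The clause is one-sided.", source))
print(final_score(reply, "See 42 U.S.C. § 1983 for details.", source))
```

The strict parse rejects replies with missing or extra keys, so a malformed judge response fails loudly and does not silently become a score.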

Judge: gemini-3.1-pro-preview · 660 model runs across 5 benchmarks · last tested 2026-06-30

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.
