personal benchmark collection

Chef / Home Cooking

Benchmarks for testing whether models can help real people cook better meals with realistic constraints, substitutions, timings, and safety.

Which models can give practical cooking help that works in a real kitchen?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.7

82.67

Lowest cost / eval

glm-5

$0.0122

Median rank score

74.83

Last refresh

2026-06-02

Score vs. cost

Average task cost vs overall score

Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are split across greens, anything below the top ten is red.
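The coloring rule can be sketched in a few lines of Python. This is only an illustration: the number of green shades (`GREEN_SHADES`), the bucket labels, and the tie handling are assumptions, since the page states only that the top ten are split across greens and everything else is red.

```python
GREEN_SHADES = 5  # assumption: the page does not say how many greens it uses


def rank_models(scores: dict[str, float]) -> dict[str, int]:
    """Rank models within one benchmark, 1 = best.

    Ties keep their input order because Python's sort is stable.
    """
    ordered = sorted(scores, key=lambda model: scores[model], reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def cell_color(rank: int, top_n: int = 10, shades: int = GREEN_SHADES) -> str:
    """Map a within-benchmark rank to a color bucket label."""
    if rank > top_n:
        return "red"
    # Spread ranks 1..top_n evenly over the green shades (ceiling division);
    # "green-1" is the bucket that contains rank #1.
    shade = (rank * shades + top_n - 1) // top_n
    return f"green-{shade}"
```

With these assumed defaults, ranks 1 and 2 share the first green bucket, rank 10 lands in the last green bucket, and rank 11 onward is red.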

Legend: Below top 10 (red) to #1 (green)

| Rank | Model | Scored tests | Overall | Practical Recipe Test | Dinner Rescue Test | Meal Timing Test | Substitution Test |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4.7 | 12 | 82.7 | 83.7 | 83.0 | 79.3 | 84.7 |
| 2 | gpt-5.5 | 12 | 82.0 | 83.7 | 83.0 | 80.7 | 80.7 |
| 3 | claude-opus-4.8-high | 12 | 81.8 | 84.0 | 84.3 | 75.3 | 83.3 |
| 4 | gpt-5.4 | 12 | 79.5 | 82.3 | 82.3 | 71.3 | 82.0 |
| 5 | claude-opus-4.6-high | 12 | 79.5 | 80.7 | 81.3 | 72.3 | 83.7 |
| 6 | claude-opus-4.8-low | 12 | 79.2 | 82.7 | 83.3 | 68.0 | 83.0 |
| 7 | claude-opus-4.8 | 12 | 79.2 | 84.0 | 83.0 | 66.0 | 84.0 |
| 8 | claude-opus-4.6 | 12 | 78.2 | 82.0 | 84.0 | 65.0 | 82.0 |
| 9 | grok-4.20-beta | 12 | 76.5 | 83.7 | 78.3 | 62.7 | 81.3 |
| 10 | qwen3.7-max | 12 | 76.1 | 82.3 | 84.0 | 55.7 | 82.3 |
| 11 | qwen3.5-plus-02-15 | 12 | 75.7 | 82.7 | 80.7 | 57.7 | 81.7 |
| 12 | glm-5.1 | 12 | 74.8 | 79.3 | 84.7 | 52.0 | 83.3 |
| 13 | gemini-3-flash-preview | 12 | 74.7 | 72.7 | 82.7 | 60.7 | 82.7 |
| 14 | gpt-5.5-pro | 12 | 73.8 | 84.0 | 82.3 | 61.0 | 68.0 |
| 15 | claude-sonnet-4.6 | 12 | 73.5 | 82.3 | 82.0 | 45.0 | 84.7 |
| 16 | deepseek-v3.2 | 12 | 72.5 | 81.3 | 69.3 | 60.7 | 78.7 |
| 17 | gemini-3.1-pro-preview | 12 | 72.1 | 83.3 | 83.0 | 40.0 | 82.0 |
| 18 | gpt-5.4-mini | 12 | 72.0 | 74.3 | 80.0 | 53.3 | 80.3 |
| 19 | kimi-k2.5 | 12 | 67.2 | 83.7 | 61.3 | 40.7 | 83.3 |
| 20 | gpt-5.4-nano | 12 | 63.6 | 68.3 | 68.7 | 38.7 | 78.7 |
| 21 | gemini-3.5-flash-high | 12 | 60.5 | 63.0 | 82.7 | 20.0 | 76.3 |
| 22 | glm-5 | 12 | 51.0 | 58.0 | 48.7 | 24.3 | 73.0 |
| 23 | minimax-m2.7 | 12 | 49.4 | 60.7 | 48.7 | 16.7 | 71.7 |
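The Overall column appears to be the plain average of the four benchmark columns. The Python check below uses the claude-opus-4.7 row as a worked example. It is an observation from the published numbers, not a documented formula, and because the benchmark columns are themselves rounded to one decimal, the result can differ slightly from the published overall score.

```python
from statistics import mean

# Benchmark columns for claude-opus-4.7, copied from the heatmap row.
benchmark_scores = {
    "Practical Recipe Test": 83.7,
    "Dinner Rescue Test": 83.0,
    "Meal Timing Test": 79.3,
    "Substitution Test": 84.7,
}

# Unweighted mean across the four benchmarks (assumed, not confirmed by the page).
overall = mean(benchmark_scores.values())
print(f"overall = {overall:.2f}")
```

The published overall score for this model is 82.67, which agrees with the mean of the four columns to within rounding.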

Full leaderboard

Quality, cost, and speed

| Model | Score | Rating | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|---|
| claude-opus-4.7 | 82.67 | Strong | 12/12 | $0.0456 | 31.6s | - |
| gpt-5.5 | 82.0 | Strong | 12/12 | $0.0667 | 51.0s | - |
| claude-opus-4.8-high | 81.75 | Strong | 12/12 | $0.0450 | 28.6s | - |
| gpt-5.4 | 79.5 | Usable | 12/12 | $0.0308 | 29.2s | - |
| claude-opus-4.6-high | 79.5 | Usable | 12/12 | $0.0522 | 45.7s | Incomplete output, Missing required element |
| claude-opus-4.8 | 79.25 | Usable | 12/12 | $0.0438 | 27.7s | - |
| claude-opus-4.8-low | 79.25 | Usable | 12/12 | $0.0452 | 28.6s | Wrapper text, Unsafe or misleading |
| claude-opus-4.6 | 78.25 | Usable | 12/12 | $0.0518 | 44.8s | Incomplete output, Missing required element |
| grok-4.20-beta | 76.5 | Usable | 12/12 | $0.0170 | 19.4s | Unsupported invention |
| qwen3.7-max | 76.08 | Usable | 12/12 | $0.0160 | 66.9s | - |
| qwen3.5-plus-02-15 | 75.67 | Usable | 12/12 | $0.0150 | 65.1s | - |
| glm-5.1 | 74.83 | Usable | 12/12 | $0.0153 | 60.2s | Incomplete output, Missing required element |
| gemini-3-flash-preview | 74.67 | Usable | 12/12 | $0.0193 | 20.1s | Unsafe or misleading |
| gpt-5.5-pro | 73.83 | Usable | 12/12 | $0.6926 | 113.4s | Incomplete output |
| claude-sonnet-4.6 | 73.5 | Usable | 12/12 | $0.0335 | 37.1s | Incomplete output, Missing required element |
| deepseek-v3.2 | 72.5 | Usable | 12/12 | $0.0160 | 40.8s | - |
| gemini-3.1-pro-preview | 72.08 | Usable | 12/12 | $0.0385 | 29.2s | Incomplete output, Missing required element |
| gpt-5.4-mini | 72.0 | Usable | 12/12 | $0.0203 | 19.5s | - |
| kimi-k2.5 | 67.25 | Needs editing | 12/12 | $0.0145 | 88.9s | Incomplete output, Missing required element |
| gpt-5.4-nano | 63.58 | Needs editing | 12/12 | $0.0204 | 21.5s | - |
| gemini-3.5-flash-high | 60.5 | Needs editing | 12/12 | $0.0310 | 20.1s | Incomplete output, Missing required element |
| glm-5 | 51.0 | Weak | 12/12 | $0.0122 | 55.0s | Incomplete output, Missing required element, Malformed output, Unsupported invention |
| minimax-m2.7 | 49.42 | Weak | 12/12 | $0.0140 | 47.6s | Incomplete output, Missing required element, Unsupported invention |
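Two of the "At a glance" figures can be recomputed from this leaderboard. The Python sketch below copies each model's score and average cost per task from the rows above. Treating the median rank score as the median of the 23 overall scores is an assumption that happens to match the published value.

```python
from statistics import median

# model -> (overall score, average cost per task in USD), copied from the leaderboard.
leaderboard = {
    "claude-opus-4.7": (82.67, 0.0456),
    "gpt-5.5": (82.0, 0.0667),
    "claude-opus-4.8-high": (81.75, 0.0450),
    "gpt-5.4": (79.5, 0.0308),
    "claude-opus-4.6-high": (79.5, 0.0522),
    "claude-opus-4.8": (79.25, 0.0438),
    "claude-opus-4.8-low": (79.25, 0.0452),
    "claude-opus-4.6": (78.25, 0.0518),
    "grok-4.20-beta": (76.5, 0.0170),
    "qwen3.7-max": (76.08, 0.0160),
    "qwen3.5-plus-02-15": (75.67, 0.0150),
    "glm-5.1": (74.83, 0.0153),
    "gemini-3-flash-preview": (74.67, 0.0193),
    "gpt-5.5-pro": (73.83, 0.6926),
    "claude-sonnet-4.6": (73.5, 0.0335),
    "deepseek-v3.2": (72.5, 0.0160),
    "gemini-3.1-pro-preview": (72.08, 0.0385),
    "gpt-5.4-mini": (72.0, 0.0203),
    "kimi-k2.5": (67.25, 0.0145),
    "gpt-5.4-nano": (63.58, 0.0204),
    "gemini-3.5-flash-high": (60.5, 0.0310),
    "glm-5": (51.0, 0.0122),
    "minimax-m2.7": (49.42, 0.0140),
}

# With 23 entries the median is the 12th-highest score.
median_score = median(score for score, _cost in leaderboard.values())

# Model with the lowest average cost per task.
cheapest = min(leaderboard, key=lambda model: leaderboard[model][1])

print(median_score, cheapest)
```

Both values line up with the summary cards: a median of 74.83 (glm-5.1's score) and glm-5 as the lowest-cost model at $0.0122.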

Test cases

Where the scores come from

Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.

| Test | ID | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|---|
| Serrano ham scrambled eggs | chef_recipe_001 | Practical Recipe Test | 80.3 | 86.0 | 52.0 | gpt-5.5 · 86 | gemini-3-flash-preview · 52 | Unsafe or misleading ×1, Incomplete output ×1, Missing required element ×1 |
| Low-carb/high-carb shared dinner | chef_recipe_002 | Practical Recipe Test | 74.4 | 84.0 | 21.0 | claude-opus-4.7 · 84 | minimax-m2.7 · 21 | Incomplete output ×3, Unsupported invention ×1 |
| Quick chicken thighs with limited ingredients | chef_recipe_003 | Practical Recipe Test | 80.5 | 84.0 | 48.0 | kimi-k2.5 · 84 | gpt-5.4-nano · 48 | - |
| Watery tomato salsa | chef_rescue_001 | Dinner Rescue Test | 78.7 | 85.0 | 35.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 35 | Incomplete output ×2, Missing required element ×1, Unsupported invention ×1 |
| Over-salted soup | chef_rescue_002 | Dinner Rescue Test | 75.7 | 85.0 | 40.0 | claude-opus-4.8-high · 85 | kimi-k2.5 · 40 | Missing required element ×3, Incomplete output ×3 |
| Dry chicken breast | chef_rescue_003 | Dinner Rescue Test | 78.0 | 85.0 | 20.0 | glm-5.1 · 85 | glm-5 · 20 | Incomplete output ×2, Missing required element ×1 |
| Steak, wings, potatoes, broccoli dinner | chef_timing_001 | Meal Timing Test | 54.5 | 82.0 | 9.0 | claude-opus-4.8-high · 82 | gemini-3.5-flash-high · 9 | Incomplete output ×6, Missing required element ×2, Unsupported invention ×1 |
| Thai dinner with satay, stir fry, rice | chef_timing_002 | Meal Timing Test | 50.5 | 81.0 | 10.0 | gpt-5.5 · 81 | minimax-m2.7 · 10 | Incomplete output ×9, Missing required element ×5, Unsafe or misleading ×1 |
| Roast chicken dinner with two dietary versions | chef_timing_003 | Meal Timing Test | 60.2 | 81.0 | 20.0 | gpt-5.4 · 81 | minimax-m2.7 · 20 | Incomplete output ×10, Missing required element ×5, Malformed output ×1 |
| Thai stir fry substitutions | chef_sub_001 | Substitution Test | 78.9 | 85.0 | 56.0 | claude-sonnet-4.6 · 85 | glm-5 · 56 | Incomplete output ×2, Missing required element ×1 |
| Pasta sauce without cream | chef_sub_002 | Substitution Test | 83.7 | 86.0 | 78.0 | claude-opus-4.7 · 86 | gpt-5.4-nano · 78 | - |
| Curry without coconut milk | chef_sub_003 | Substitution Test | 78.9 | 85.0 | 38.0 | claude-opus-4.8-high · 85 | gpt-5.5-pro · 38 | Incomplete output ×2 |

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.7

12 scored tests · Strong

82.67

Highest traits

honesty 8.5
timing clarity 8.5
constraint handling 8.5
adaptability 8.47
flavour judgement 8.43

Lowest traits

food quality 7.67
safety 7.75
timing accuracy 7.9
practical sequencing 8.03
quantities 8.17

gpt-5.5

12 scored tests · Strong

82.0

Highest traits

constraint handling 8.6
timing clarity 8.4
practicality 8.37
clarity 8.27
instruction clarity 8.24

Lowest traits

food quality 7.83
timing accuracy 7.87
quantities 7.93
safety 8.08
honesty 8.13

claude-opus-4.8-high

12 scored tests · Strong

81.75

Highest traits

timing clarity 8.5
constraint handling 8.5
practicality 8.47
flavour judgement 8.4
adaptability 8.38

Lowest traits

practical sequencing 7.3
timing accuracy 7.4
food quality 7.43
safety 7.87
clarity 8.13

gpt-5.4

12 scored tests · Usable

79.5

Highest traits

constraint handling 8.47
timing clarity 8.33
practicality 8.28
flavour judgement 8.24
adaptability 8.2

Lowest traits

food quality 6.67
timing accuracy 6.77
practical sequencing 7.3
safety 7.53
clarity 8.0

claude-opus-4.6-high

12 scored tests · Usable

79.5

Highest traits

honesty 8.6
constraint handling 8.43
timing clarity 8.4
flavour judgement 8.33
instruction clarity 8.26

Lowest traits

clarity 6.83
timing accuracy 7.17
practical sequencing 7.23
food quality 7.33
safety 7.75

claude-opus-4.8-low

12 scored tests · Usable

79.25

Highest traits

timing clarity 8.47
flavour judgement 8.34
instruction clarity 8.34
adaptability 8.33
honesty 8.3

Lowest traits

timing accuracy 6.67
practical sequencing 6.67
food quality 7.17
safety 7.2
quantities 8.03

claude-opus-4.8

12 scored tests · Usable

79.25

Highest traits

timing clarity 8.53
constraint handling 8.5
honesty 8.47
practicality 8.4
instruction clarity 8.34

Lowest traits

timing accuracy 5.67
food quality 6.27
practical sequencing 6.87
safety 7.62
clarity 8.0

claude-opus-4.6

12 scored tests · Usable

78.25

Highest traits

timing clarity 8.57
constraint handling 8.43
honesty 8.3
adaptability 8.3
flavour judgement 8.27

Lowest traits

timing accuracy 6.1
clarity 6.5
practical sequencing 6.5
food quality 6.93
safety 7.83

grok-4.20-beta

12 scored tests · Usable

76.5

Highest traits

constraint handling 8.5
timing clarity 8.47
instruction clarity 8.19
honesty 8.13
quantities 8.1

Lowest traits

food quality 5.5
practical sequencing 5.5
timing accuracy 6.5
clarity 7.33
safety 7.67

qwen3.7-max

12 scored tests · Usable

76.08

Highest traits

flavour judgement 8.36
constraint handling 8.33
practicality 8.32
timing clarity 8.3
adaptability 8.28

Lowest traits

timing accuracy 4.67
practical sequencing 5.0
food quality 5.33
safety 7.57
clarity 7.67

qwen3.5-plus-02-15

12 scored tests · Usable

75.67

Highest traits

constraint handling 8.47
timing clarity 8.3
instruction clarity 8.21
practicality 8.18
honesty 8.17

Lowest traits

food quality 5.0
practical sequencing 5.17
timing accuracy 5.5
clarity 7.17
safety 7.67

glm-5.1

12 scored tests · Usable

74.83

Highest traits

adaptability 8.38
constraint handling 8.33
flavour judgement 8.3
timing clarity 8.2
honesty 8.17

Lowest traits

clarity 4.0
timing accuracy 4.83
practical sequencing 5.5
food quality 5.83
safety 7.13

Frequent problems

Where models break

Incomplete output 40
Missing required element 19
Unsupported invention 3
Unsafe or misleading 2
Wrapper text 1
Malformed output 1
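These totals can be cross-checked against the per-test flag counts listed under "Test cases". The Python sketch below sums those per-test counts. Five of the six categories reproduce exactly. "Wrapper text" does not appear in any per-test row, presumably because each row lists only its most frequent problems, so it is absent from this tally.

```python
from collections import Counter

# Flag counts per test, copied from the "Test cases" rows.
# Tests with no listed problems (chef_recipe_003, chef_sub_002) are omitted.
per_test_flags = {
    "chef_recipe_001": {"Unsafe or misleading": 1, "Incomplete output": 1,
                        "Missing required element": 1},
    "chef_recipe_002": {"Incomplete output": 3, "Unsupported invention": 1},
    "chef_rescue_001": {"Incomplete output": 2, "Missing required element": 1,
                        "Unsupported invention": 1},
    "chef_rescue_002": {"Missing required element": 3, "Incomplete output": 3},
    "chef_rescue_003": {"Incomplete output": 2, "Missing required element": 1},
    "chef_timing_001": {"Incomplete output": 6, "Missing required element": 2,
                        "Unsupported invention": 1},
    "chef_timing_002": {"Incomplete output": 9, "Missing required element": 5,
                        "Unsafe or misleading": 1},
    "chef_timing_003": {"Incomplete output": 10, "Missing required element": 5,
                        "Malformed output": 1},
    "chef_sub_001": {"Incomplete output": 2, "Missing required element": 1},
    "chef_sub_003": {"Incomplete output": 2},
}

# Counter.update adds counts when given a mapping.
totals = Counter()
for flags in per_test_flags.values():
    totals.update(flags)

for problem, count in totals.most_common():
    print(problem, count)
```

The three Meal Timing tests alone account for 25 of the 40 "Incomplete output" flags, which matches that benchmark's much lower average scores.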

Example outputs

Winner, middle, failure

winner

claude-opus-4.7 · 86

The model perfectly executes the substitution, providing expert-level techniques to prevent the yogurt from splitting, precise quantities for two, and highly insightful, realistic flavour tradeoffs.

# Creamy Lemon-Pepper Pasta with Peas (No Cream)

Greek yogurt stands in for cream, but it **splits (curdles) if it gets too hot or hits direct heat**. The trick: temper it, pull the pan off the burner, and use residual heat plus starchy pasta water to bind everything.

## Ingredients (serves 2)
- 180–200 g pasta (spaghetti, linguine, or fusilli)
- 150 g full-fat Greek yogurt (≥10% fat — low-fat splits more easily)
- 40 g parmesan, finely grated, plus extra to serve
- 20 g butter
- 1 tbsp olive oil
- 2 garlic cloves, thinly sliced or grated
- 100 g peas (frozen are fine)
- Zest of ½ lemon + 1–2 tsp juice
- ½ tsp freshly cracked black pepper
- Salt

## Method

1. **Temper the yogurt.** In a bowl, whisk the yogurt with the grated parmesan, lemon zest, and a pinch of salt. Let it sit at room temperature while you cook the pasta — cold yogurt going into a hot pan is the #1 cause of splitting.

2. **Cook the pasta** in well-salted water until just shy of al dente. **Reserve at least 250 ml of pasta water** before draining. In the last 2 minutes, drop the peas into the same pot.

3. **Build the base.** In a wide pan over medium-low heat, melt the butter with the olive oil. Add the garlic and black pepper; cook 1–2 minutes until fragrant but not browned. **Turn off the heat.**

4. **Temper again.** Whisk 3–4 tablespoons of hot pasta water into the yogurt bowl, one spoonful at a time, until smooth and warm to the touch.

5. **Combine off the heat.** Add the drained pasta and peas to the pan (still off the burner). Pour in the tempered yogurt and toss vigorously, adding splashes of pasta water until the sauce coats the noodles glossily.

6. **Finish.** Stir in lemon juice to taste. Only return to the lowest heat for 20–30 seconds if needed to warm through — never simmer. Top wit

failure

gemini-3.5-flash-high · 9

The model response cuts off abruptly at the beginning of the output, failing to provide the requested timeline, sequencing, and plating plan.

Here is your highly coordinated, minute-by-minute cooking timeline to get all five dishes on the table, hot and perfectly cooked, in exactly 90 minutes.

### **The Strategy (Reverse Sear & Oven Swap)**
Because a 1.6kg bone-in ribeye is massive, we will use the **reverse sear** method. This allows us to bake the steak at a low temperature, rest it while the oven cranks up to high heat to bake the potato tart, and then do a quick, hard sear on the hob right before serving. 

---

### **Target Steak Temperatures (Meat Thermometer)**
*   **For Medium-Rare (Recommended):** Pull from oven at **46°C (115°F)**. Target after resting and searing: **

mid performer

gemini-3-flash-preview · 81

The model demonstrates exceptional culinary intuition, particularly in combining dried basil and fresh mint to approximate holy basil, and provides highly accurate, expert-level flavour tradeoff analysis.

This adaptation moves away from the peppery, anise-heavy profile of authentic Pad Kra Pao and leans into a **"Zesty Mint & Garlic Stir-fry."** By combining fresh mint with dried basil, you can approximate the complexity of holy basil, while the lime and fish sauce provide the necessary "funk" and brightness.

### 1. The Substitution Approach
*   **The "Basil" Hack:** Holy basil is peppery and medicinal. Since you only have dried basil (which is earthy) and fresh mint (which is cool/sweet), we will use **both**. The dried basil provides the background herbal note, while the fresh mint provides the "green" punch.
*   **The Sauce Body:** Oyster sauce provides thickness and umami. To mimic this, we will increase the **sugar and soy sauce** ratio to create a syrupy glaze, using **fish sauce** for the deep savory depth.
*   **The Acid:** White wine vinegar is sharper than rice wine vinegar. We will use a small amount of vinegar during cooking and finish with **fresh lime juice** to soften the edge.

---

### 2. Ingredients (For 2 People)

**The Protein & Aromatics:**
*   **Chicken:** 300g–400g (Ideally thigh, finely chopped or minced).
*   **Garlic:** 5-6 cloves (crushed and roughly chopped).
*   **Chillies:** 2-4 bird’s eye chillies (adjust to heat preference, finely chopped).
*   **Oil:** 2 tbsp (neutral oil).

**The "Mock" Kra Pao Sauce:**
*   **Soy Sauce:** 2 tbsp.
*   **Fish Sauce:** 1 tbsp.
*   **Sugar:** 1 tbsp (brown sugar is best, white is fine).
*   **White Wine Vinegar:** 1 tsp.
*   **Dried Basil:** 1 tsp (whisked into the liquid to rehydrate).

**The Finish:**
*   **Fresh Mint:** 1 cup (packed) whole leaves.
*   **Lime:** 1/2 lime (squeezed at the very end).

---

### 3. Cooking Method

1.  **Prep the Sauce:** In a small bowl, mix the soy sauce, fish sauce, sugar,

Methodology

How scores are produced

Each model output is scored by an LLM judge that must return strict JSON, supported by deterministic heuristics; the result is then normalized to a 0-100 score.

LLM judge

A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
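One way such a judge reply could be validated and turned into a 0-100 score is sketched below in Python. Everything specific here is an assumption for illustration: the `"traits"` key, the function names, and the normalization (unweighted trait mean times ten). The page does not publish the judge's JSON schema or its calibration formula.

```python
import json


def parse_judge_reply(raw: str, expected_traits: set[str]) -> dict[str, float]:
    """Strictly parse a judge reply (hypothetical schema: {"traits": {name: score}})."""
    # json.loads raises json.JSONDecodeError on wrapper text or malformed JSON.
    data = json.loads(raw)
    traits = data["traits"]
    if set(traits) != expected_traits:
        raise ValueError(f"unexpected trait set: {sorted(traits)}")
    for name, value in traits.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} must be a number from 0 to 10")
    return {name: float(value) for name, value in traits.items()}


def to_percent(traits: dict[str, float]) -> float:
    """Assumed normalization: unweighted mean of 0-10 traits, scaled to 0-100."""
    return 10 * sum(traits.values()) / len(traits)
```

For example, a reply of `{"traits": {"safety": 8, "honesty": 9}}` parsed with `expected_traits={"safety", "honesty"}` gives a score of 85.0 under this assumed normalization.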

Heuristics

Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
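A minimal Python sketch of what deterministic checks of this kind might look like follows. The function name, flag labels, and thresholds are hypothetical; the page does not list its actual rules, and safety flags are left out because their criteria are not described.

```python
import json


def run_heuristics(
    output: str,
    *,
    min_chars: int,
    banned_phrases: list[str],
    required_sections: list[str],
    expect_json: bool = False,
) -> list[str]:
    """Return a list of flag labels for one model output (empty list = no flags)."""
    flags: list[str] = []
    lowered = output.lower()

    # Length check: very short outputs are usually truncated or incomplete.
    if len(output.strip()) < min_chars:
        flags.append("too_short")

    # Banned phrases, matched case-insensitively as plain substrings.
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")

    # Required sections, also matched as case-insensitive substrings.
    for section in required_sections:
        if section.lower() not in lowered:
            flags.append(f"missing_section:{section}")

    # Format validity: only checked when the task asks for JSON output.
    if expect_json:
        try:
            json.loads(output)
        except json.JSONDecodeError:
            flags.append("invalid_format")

    return flags
```

Because the checks are plain string operations, they give the same result on every run, which is what lets them back up a judge model whose scores may vary.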

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.