personal benchmark collection

Chef / Home Cooking

Benchmarks for testing whether models can help real people cook better meals with realistic constraints, substitutions, timings, and safety.

Which models can give practical cooking help that works in a real kitchen?

4 benchmarks · 12 tests · 276 completed runs · 20 base models

At a glance

Top model

claude-opus-4.7

82.67

Lowest cost / eval

glm-5

$0.0122

Median rank score

74.83

Last refresh

2026-06-02
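The summary figures above can be recomputed from the full leaderboard further down. The sketch below copies each model's score and average cost per task from that table and derives the top model, the cheapest model, and the median score. The function name `at_a_glance` and the tuple layout are illustrative assumptions, not the dashboard's actual code.

```python
from statistics import median

# (model, overall score, average cost per task in USD), copied from the full leaderboard.
LEADERBOARD = [
    ("claude-opus-4.7", 82.67, 0.0456),
    ("gpt-5.5", 82.0, 0.0667),
    ("claude-opus-4.8-high", 81.75, 0.0450),
    ("gpt-5.4", 79.5, 0.0308),
    ("claude-opus-4.6-high", 79.5, 0.0522),
    ("claude-opus-4.8", 79.25, 0.0438),
    ("claude-opus-4.8-low", 79.25, 0.0452),
    ("claude-opus-4.6", 78.25, 0.0518),
    ("grok-4.20-beta", 76.5, 0.0170),
    ("qwen3.7-max", 76.08, 0.0160),
    ("qwen3.5-plus-02-15", 75.67, 0.0150),
    ("glm-5.1", 74.83, 0.0153),
    ("gemini-3-flash-preview", 74.67, 0.0193),
    ("gpt-5.5-pro", 73.83, 0.6926),
    ("claude-sonnet-4.6", 73.5, 0.0335),
    ("deepseek-v3.2", 72.5, 0.0160),
    ("gemini-3.1-pro-preview", 72.08, 0.0385),
    ("gpt-5.4-mini", 72.0, 0.0203),
    ("kimi-k2.5", 67.25, 0.0145),
    ("gpt-5.4-nano", 63.58, 0.0204),
    ("gemini-3.5-flash-high", 60.5, 0.0310),
    ("glm-5", 51.0, 0.0122),
    ("minimax-m2.7", 49.42, 0.0140),
]


def at_a_glance(rows):
    """Hypothetical helper: derive the headline figures from leaderboard rows."""
    top = max(rows, key=lambda r: r[1])       # highest overall score
    cheapest = min(rows, key=lambda r: r[2])  # lowest average cost per task
    return {
        "top_model": top[0],
        "top_score": top[1],
        "lowest_cost_model": cheapest[0],
        "lowest_cost": cheapest[2],
        "median_score": median(r[1] for r in rows),
    }


summary = at_a_glance(LEADERBOARD)
print(summary)
```

With 23 rows the median is the 12th-ranked score, which is why the median rank score matches the glm-5.1 row.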

Score vs. cost

Average task cost vs overall score

Each dot is one model. The X axis is the average cost per benchmark task, including model and judge cost; the Y axis is the average calibrated score.

Overall ranking

Top models by average score

Higher is better. Scores come from completed judged runs.

Benchmark heatmap

Model performance by benchmark

Cells are colored by rank within each benchmark: the top ten are shaded in greens, and anything below the top ten is red.

Color legend: from below top 10 (red) to #1 (green)
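A minimal sketch of the rank-based coloring, assuming each of the top ten ranks gets its own green shade and ties are broken by input order. The bucket names (`green-1` and so on) and the function `rank_colors` are hypothetical; the page does not say how it maps ranks to shades.

```python
def rank_colors(scores, top_n=10):
    """Map each model to a color bucket by its rank within one benchmark.

    scores: dict of model name -> benchmark score (higher is better).
    Ties keep their input order, which is an assumption.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    colors = {}
    for rank, model in enumerate(ordered, start=1):
        colors[model] = f"green-{rank}" if rank <= top_n else "red"
    return colors


# Three Meal Timing Test scores from the table, with top_n lowered to 2
# so the example shows both a green and a red bucket.
example = rank_colors(
    {"gpt-5.5": 80.7, "claude-opus-4.7": 79.3, "claude-opus-4.8-high": 75.3},
    top_n=2,
)
print(example)
```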

Rank	Model	Scored tests	Overall	Practical Recipe Test	Dinner Rescue Test	Meal Timing Test	Substitution Test
1	claude-opus-4.7	12	82.7	83.7	83.0	79.3	84.7
2	gpt-5.5	12	82.0	83.7	83.0	80.7	80.7
3	claude-opus-4.8-high	12	81.8	84.0	84.3	75.3	83.3
4	gpt-5.4	12	79.5	82.3	82.3	71.3	82.0
5	claude-opus-4.6-high	12	79.5	80.7	81.3	72.3	83.7
6	claude-opus-4.8-low	12	79.2	82.7	83.3	68.0	83.0
7	claude-opus-4.8	12	79.2	84.0	83.0	66.0	84.0
8	claude-opus-4.6	12	78.2	82.0	84.0	65.0	82.0
9	grok-4.20-beta	12	76.5	83.7	78.3	62.7	81.3
10	qwen3.7-max	12	76.1	82.3	84.0	55.7	82.3
11	qwen3.5-plus-02-15	12	75.7	82.7	80.7	57.7	81.7
12	glm-5.1	12	74.8	79.3	84.7	52.0	83.3
13	gemini-3-flash-preview	12	74.7	72.7	82.7	60.7	82.7
14	gpt-5.5-pro	12	73.8	84.0	82.3	61.0	68.0
15	claude-sonnet-4.6	12	73.5	82.3	82.0	45.0	84.7
16	deepseek-v3.2	12	72.5	81.3	69.3	60.7	78.7
17	gemini-3.1-pro-preview	12	72.1	83.3	83.0	40.0	82.0
18	gpt-5.4-mini	12	72.0	74.3	80.0	53.3	80.3
19	kimi-k2.5	12	67.2	83.7	61.3	40.7	83.3
20	gpt-5.4-nano	12	63.6	68.3	68.7	38.7	78.7
21	gemini-3.5-flash-high	12	60.5	63.0	82.7	20.0	76.3
22	glm-5	12	51.0	58.0	48.7	24.3	73.0
23	minimax-m2.7	12	49.4	60.7	48.7	16.7	71.7
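The Overall column appears to be the unweighted mean of the four benchmark columns. The page does not state this, but the arithmetic agrees with the rows in the table to within rounding. A small check for the top row, with `overall_score` as an illustrative name:

```python
from statistics import mean


def overall_score(benchmark_scores):
    """Unweighted mean across benchmarks (an assumption about how Overall is built)."""
    return mean(benchmark_scores.values())


opus_47 = {
    "Practical Recipe Test": 83.7,
    "Dinner Rescue Test": 83.0,
    "Meal Timing Test": 79.3,
    "Substitution Test": 84.7,
}
overall = overall_score(opus_47)
print(round(overall, 1))
```

Because every benchmark has three tests, averaging the four benchmark scores is equivalent to averaging all 12 test scores, apart from rounding in the displayed values.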

Full leaderboard

Quality, cost, and speed

Model	Score	Rating	Tests	Avg cost / task	Avg seconds / task	Frequent problems
claude-opus-4.7	82.67	Strong	12/12	$0.0456	31.6s	-
gpt-5.5	82.0	Strong	12/12	$0.0667	51.0s	-
claude-opus-4.8-high	81.75	Strong	12/12	$0.0450	28.6s	-
gpt-5.4	79.5	Usable	12/12	$0.0308	29.2s	-
claude-opus-4.6-high	79.5	Usable	12/12	$0.0522	45.7s	Incomplete output, Missing required element
claude-opus-4.8	79.25	Usable	12/12	$0.0438	27.7s	-
claude-opus-4.8-low	79.25	Usable	12/12	$0.0452	28.6s	Wrapper text, Unsafe or misleading
claude-opus-4.6	78.25	Usable	12/12	$0.0518	44.8s	Incomplete output, Missing required element
grok-4.20-beta	76.5	Usable	12/12	$0.0170	19.4s	Unsupported invention
qwen3.7-max	76.08	Usable	12/12	$0.0160	66.9s	-
qwen3.5-plus-02-15	75.67	Usable	12/12	$0.0150	65.1s	-
glm-5.1	74.83	Usable	12/12	$0.0153	60.2s	Incomplete output, Missing required element
gemini-3-flash-preview	74.67	Usable	12/12	$0.0193	20.1s	Unsafe or misleading
gpt-5.5-pro	73.83	Usable	12/12	$0.6926	113.4s	Incomplete output
claude-sonnet-4.6	73.5	Usable	12/12	$0.0335	37.1s	Incomplete output, Missing required element
deepseek-v3.2	72.5	Usable	12/12	$0.0160	40.8s	-
gemini-3.1-pro-preview	72.08	Usable	12/12	$0.0385	29.2s	Incomplete output, Missing required element
gpt-5.4-mini	72.0	Usable	12/12	$0.0203	19.5s	-
kimi-k2.5	67.25	Needs editing	12/12	$0.0145	88.9s	Incomplete output, Missing required element
gpt-5.4-nano	63.58	Needs editing	12/12	$0.0204	21.5s	-
gemini-3.5-flash-high	60.5	Needs editing	12/12	$0.0310	20.1s	Incomplete output, Missing required element
glm-5	51.0	Weak	12/12	$0.0122	55.0s	Incomplete output, Missing required element, Malformed output, Unsupported invention
minimax-m2.7	49.42	Weak	12/12	$0.0140	47.6s	Incomplete output, Missing required element, Unsupported invention
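The rating labels in the leaderboard are consistent with simple score bands. The cut points below (80, 70, 60) are an assumption: they reproduce every row in the table, but the page does not publish its actual thresholds, and `rating_label` is a hypothetical name.

```python
def rating_label(score):
    """Map a 0-100 score to a rating label using assumed cut points."""
    if score >= 80:
        return "Strong"
    if score >= 70:
        return "Usable"
    if score >= 60:
        return "Needs editing"
    return "Weak"


# Boundary rows taken from the leaderboard.
for value in (81.75, 79.5, 72.0, 67.25, 60.5, 51.0):
    print(value, rating_label(value))
```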

Test cases

Where the scores come from

Each row is one prompt, with its score distribution, top and lowest performers, and the problems judges flagged most often.

Test	Test ID	Benchmark	Avg	Max	Min	Top model	Lowest model	Frequent problems
Serrano ham scrambled eggs	chef_recipe_001	Practical Recipe Test	80.3	86.0	52.0	gpt-5.5 · 86	gemini-3-flash-preview · 52	Unsafe or misleading ×1, Incomplete output ×1, Missing required element ×1
Low-carb/high-carb shared dinner	chef_recipe_002	Practical Recipe Test	74.4	84.0	21.0	claude-opus-4.7 · 84	minimax-m2.7 · 21	Incomplete output ×3, Unsupported invention ×1
Quick chicken thighs with limited ingredients	chef_recipe_003	Practical Recipe Test	80.5	84.0	48.0	kimi-k2.5 · 84	gpt-5.4-nano · 48	-
Watery tomato salsa	chef_rescue_001	Dinner Rescue Test	78.7	85.0	35.0	claude-opus-4.7 · 85	minimax-m2.7 · 35	Incomplete output ×2, Missing required element ×1, Unsupported invention ×1
Over-salted soup	chef_rescue_002	Dinner Rescue Test	75.7	85.0	40.0	claude-opus-4.8-high · 85	kimi-k2.5 · 40	Missing required element ×3, Incomplete output ×3
Dry chicken breast	chef_rescue_003	Dinner Rescue Test	78.0	85.0	20.0	glm-5.1 · 85	glm-5 · 20	Incomplete output ×2, Missing required element ×1
Steak, wings, potatoes, broccoli dinner	chef_timing_001	Meal Timing Test	54.5	82.0	9.0	claude-opus-4.8-high · 82	gemini-3.5-flash-high · 9	Incomplete output ×6, Missing required element ×2, Unsupported invention ×1
Thai dinner with satay, stir fry, rice	chef_timing_002	Meal Timing Test	50.5	81.0	10.0	gpt-5.5 · 81	minimax-m2.7 · 10	Incomplete output ×9, Missing required element ×5, Unsafe or misleading ×1
Roast chicken dinner with two dietary versions	chef_timing_003	Meal Timing Test	60.2	81.0	20.0	gpt-5.4 · 81	minimax-m2.7 · 20	Incomplete output ×10, Missing required element ×5, Malformed output ×1
Thai stir fry substitutions	chef_sub_001	Substitution Test	78.9	85.0	56.0	claude-sonnet-4.6 · 85	glm-5 · 56	Incomplete output ×2, Missing required element ×1
Pasta sauce without cream	chef_sub_002	Substitution Test	83.7	86.0	78.0	claude-opus-4.7 · 86	gpt-5.4-nano · 78	-
Curry without coconut milk	chef_sub_003	Substitution Test	78.9	85.0	38.0	claude-opus-4.8-high · 85	gpt-5.5-pro · 38	Incomplete output ×2

Model profiles

Strengths, weaknesses, and tradeoffs

claude-opus-4.7

12 scored tests · Strong

82.67

Highest traits

honesty 8.5

timing clarity 8.5

constraint handling 8.5

adaptability 8.47

flavour judgement 8.43

Lowest traits

food quality 7.67

safety 7.75

timing accuracy 7.9

practical sequencing 8.03

quantities 8.17

gpt-5.5

12 scored tests · Strong

82.0

Highest traits

constraint handling 8.6

timing clarity 8.4

practicality 8.37

clarity 8.27

instruction clarity 8.24

Lowest traits

food quality 7.83

timing accuracy 7.87

quantities 7.93

safety 8.08

honesty 8.13

claude-opus-4.8-high

12 scored tests · Strong

81.75

Highest traits

timing clarity 8.5

constraint handling 8.5

practicality 8.47

flavour judgement 8.4

adaptability 8.38

Lowest traits

practical sequencing 7.3

timing accuracy 7.4

food quality 7.43

safety 7.87

clarity 8.13

gpt-5.4

12 scored tests · Usable

79.5

Highest traits

constraint handling 8.47

timing clarity 8.33

practicality 8.28

flavour judgement 8.24

adaptability 8.2

Lowest traits

food quality 6.67

timing accuracy 6.77

practical sequencing 7.3

safety 7.53

clarity 8.0

claude-opus-4.6-high

12 scored tests · Usable

79.5

Highest traits

honesty 8.6

constraint handling 8.43

timing clarity 8.4

flavour judgement 8.33

instruction clarity 8.26

Lowest traits

clarity 6.83

timing accuracy 7.17

practical sequencing 7.23

food quality 7.33

safety 7.75

claude-opus-4.8-low

12 scored tests · Usable

79.25

Highest traits

timing clarity 8.47

flavour judgement 8.34

instruction clarity 8.34

adaptability 8.33

honesty 8.3

Lowest traits

timing accuracy 6.67

practical sequencing 6.67

food quality 7.17

safety 7.2

quantities 8.03

claude-opus-4.8

12 scored tests · Usable

79.25

Highest traits

timing clarity 8.53

constraint handling 8.5

honesty 8.47

practicality 8.4

instruction clarity 8.34

Lowest traits

timing accuracy 5.67

food quality 6.27

practical sequencing 6.87

safety 7.62

clarity 8.0

claude-opus-4.6

12 scored tests · Usable

78.25

Highest traits

timing clarity 8.57

constraint handling 8.43

honesty 8.3

adaptability 8.3

flavour judgement 8.27

Lowest traits

timing accuracy 6.1

clarity 6.5

practical sequencing 6.5

food quality 6.93

safety 7.83

grok-4.20-beta

12 scored tests · Usable

76.5

Highest traits

constraint handling 8.5

timing clarity 8.47

instruction clarity 8.19

honesty 8.13

quantities 8.1

Lowest traits

food quality 5.5

practical sequencing 5.5

timing accuracy 6.5

clarity 7.33

safety 7.67

qwen3.7-max

12 scored tests · Usable

76.08

Highest traits

flavour judgement 8.36

constraint handling 8.33

practicality 8.32

timing clarity 8.3

adaptability 8.28

Lowest traits

timing accuracy 4.67

practical sequencing 5.0

food quality 5.33

safety 7.57

clarity 7.67

qwen3.5-plus-02-15

12 scored tests · Usable

75.67

Highest traits

constraint handling 8.47

timing clarity 8.3

instruction clarity 8.21

practicality 8.18

honesty 8.17

Lowest traits

food quality 5.0

practical sequencing 5.17

timing accuracy 5.5

clarity 7.17

safety 7.67

glm-5.1

12 scored tests · Usable

74.83

Highest traits

adaptability 8.38

constraint handling 8.33

flavour judgement 8.3

timing clarity 8.2

honesty 8.17

Lowest traits

clarity 4.0

timing accuracy 4.83

practical sequencing 5.5

food quality 5.83

safety 7.13

Frequent problems

Where models break

Incomplete output 40

Missing required element 19

Unsupported invention 3

Unsafe or misleading 2

Wrapper text 1

Malformed output 1
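These totals can be cross-checked against the per-test flags in the test cases table. The sketch below sums the ×N counts listed there; the dictionary layout is illustrative. Summing reproduces every total except "Wrapper text", which does not appear in any per-test row, presumably because each row lists only its most frequent problems.

```python
from collections import Counter

# Flag counts per test, copied from the "Frequent problems" column of the test cases table.
PER_TEST_FLAGS = {
    "chef_recipe_001": {"Unsafe or misleading": 1, "Incomplete output": 1, "Missing required element": 1},
    "chef_recipe_002": {"Incomplete output": 3, "Unsupported invention": 1},
    "chef_recipe_003": {},
    "chef_rescue_001": {"Incomplete output": 2, "Missing required element": 1, "Unsupported invention": 1},
    "chef_rescue_002": {"Missing required element": 3, "Incomplete output": 3},
    "chef_rescue_003": {"Incomplete output": 2, "Missing required element": 1},
    "chef_timing_001": {"Incomplete output": 6, "Missing required element": 2, "Unsupported invention": 1},
    "chef_timing_002": {"Incomplete output": 9, "Missing required element": 5, "Unsafe or misleading": 1},
    "chef_timing_003": {"Incomplete output": 10, "Missing required element": 5, "Malformed output": 1},
    "chef_sub_001": {"Incomplete output": 2, "Missing required element": 1},
    "chef_sub_002": {},
    "chef_sub_003": {"Incomplete output": 2},
}

totals = Counter()
for flags in PER_TEST_FLAGS.values():
    totals.update(flags)  # Counter.update adds counts when given a mapping

for problem, count in totals.most_common():
    print(problem, count)
```

The three Meal Timing tests account for 25 of the 40 "Incomplete output" flags, which lines up with that benchmark having the lowest averages.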

Example outputs

Winner, middle, failure

winner

claude-opus-4.7 · 86

The model perfectly executes the substitution, providing expert-level techniques to prevent the yogurt from splitting, precise quantities for two, and highly insightful, realistic flavour tradeoffs.

# Creamy Lemon-Pepper Pasta with Peas (No Cream)

Greek yogurt stands in for cream, but it **splits (curdles) if it gets too hot or hits direct heat**. The trick: temper it, pull the pan off the burner, and use residual heat plus starchy pasta water to bind everything.

## Ingredients (serves 2)
- 180–200 g pasta (spaghetti, linguine, or fusilli)
- 150 g full-fat Greek yogurt (≥10% fat — low-fat splits more easily)
- 40 g parmesan, finely grated, plus extra to serve
- 20 g butter
- 1 tbsp olive oil
- 2 garlic cloves, thinly sliced or grated
- 100 g peas (frozen are fine)
- Zest of ½ lemon + 1–2 tsp juice
- ½ tsp freshly cracked black pepper
- Salt

## Method

1. **Temper the yogurt.** In a bowl, whisk the yogurt with the grated parmesan, lemon zest, and a pinch of salt. Let it sit at room temperature while you cook the pasta — cold yogurt going into a hot pan is the #1 cause of splitting.

2. **Cook the pasta** in well-salted water until just shy of al dente. **Reserve at least 250 ml of pasta water** before draining. In the last 2 minutes, drop the peas into the same pot.

3. **Build the base.** In a wide pan over medium-low heat, melt the butter with the olive oil. Add the garlic and black pepper; cook 1–2 minutes until fragrant but not browned. **Turn off the heat.**

4. **Temper again.** Whisk 3–4 tablespoons of hot pasta water into the yogurt bowl, one spoonful at a time, until smooth and warm to the touch.

5. **Combine off the heat.** Add the drained pasta and peas to the pan (still off the burner). Pour in the tempered yogurt and toss vigorously, adding splashes of pasta water until the sauce coats the noodles glossily.

6. **Finish.** Stir in lemon juice to taste. Only return to the lowest heat for 20–30 seconds if needed to warm through — never simmer. Top wit

failure

gemini-3.5-flash-high · 9

The model response cuts off abruptly at the beginning of the output, failing to provide the requested timeline, sequencing, and plating plan.

Here is your highly coordinated, minute-by-minute cooking timeline to get all five dishes on the table, hot and perfectly cooked, in exactly 90 minutes.

### **The Strategy (Reverse Sear & Oven Swap)**
Because a 1.6kg bone-in ribeye is massive, we will use the **reverse sear** method. This allows us to bake the steak at a low temperature, rest it while the oven cranks up to high heat to bake the potato tart, and then do a quick, hard sear on the hob right before serving. 

---

### **Target Steak Temperatures (Meat Thermometer)**
*   **For Medium-Rare (Recommended):** Pull from oven at **46°C (115°F)**. Target after resting and searing: **

mid performer

gemini-3-flash-preview · 81

The model demonstrates exceptional culinary intuition, particularly in combining dried basil and fresh mint to approximate holy basil, and provides highly accurate, expert-level flavour tradeoff analysis.

This adaptation moves away from the peppery, anise-heavy profile of authentic Pad Kra Pao and leans into a **"Zesty Mint & Garlic Stir-fry."** By combining fresh mint with dried basil, you can approximate the complexity of holy basil, while the lime and fish sauce provide the necessary "funk" and brightness.

### 1. The Substitution Approach
*   **The "Basil" Hack:** Holy basil is peppery and medicinal. Since you only have dried basil (which is earthy) and fresh mint (which is cool/sweet), we will use **both**. The dried basil provides the background herbal note, while the fresh mint provides the "green" punch.
*   **The Sauce Body:** Oyster sauce provides thickness and umami. To mimic this, we will increase the **sugar and soy sauce** ratio to create a syrupy glaze, using **fish sauce** for the deep savory depth.
*   **The Acid:** White wine vinegar is sharper than rice wine vinegar. We will use a small amount of vinegar during cooking and finish with **fresh lime juice** to soften the edge.

---

### 2. Ingredients (For 2 People)

**The Protein & Aromatics:**
*   **Chicken:** 300g–400g (Ideally thigh, finely chopped or minced).
*   **Garlic:** 5-6 cloves (crushed and roughly chopped).
*   **Chillies:** 2-4 bird’s eye chillies (adjust to heat preference, finely chopped).
*   **Oil:** 2 tbsp (neutral oil).

**The "Mock" Kra Pao Sauce:**
*   **Soy Sauce:** 2 tbsp.
*   **Fish Sauce:** 1 tbsp.
*   **Sugar:** 1 tbsp (brown sugar is best, white is fine).
*   **White Wine Vinegar:** 1 tsp.
*   **Dried Basil:** 1 tsp (whisked into the liquid to rehydrate).

**The Finish:**
*   **Fresh Mint:** 1 cup (packed) whole leaves.
*   **Lime:** 1/2 lime (squeezed at the very end).

---

### 3. Cooking Method

1.  **Prep the Sauce:** In a small bowl, mix the soy sauce, fish sauce, sugar,

Methodology

How scores are produced

Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, and the result is normalized to a 0-100 score.
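The page does not give the normalization formula. As a rough sketch only, one plausible reading is: average the 0-10 trait scores, scale by 10, subtract any heuristic penalty, and clamp to the 0-100 range. The function `normalize_score` and its `penalty` parameter are hypothetical.

```python
def normalize_score(trait_scores, penalty=0.0):
    """Turn 0-10 trait scores into a 0-100 score (assumed formula, not the published one).

    trait_scores: dict of trait name -> judge score in [0, 10].
    penalty: points deducted by deterministic heuristics (assumed mechanism).
    """
    if not trait_scores:
        raise ValueError("at least one trait score is required")
    raw = 10.0 * sum(trait_scores.values()) / len(trait_scores)
    return max(0.0, min(100.0, raw - penalty))


print(normalize_score({"honesty": 8.0, "safety": 7.0}))              # 75.0
print(normalize_score({"honesty": 8.0, "safety": 7.0}, penalty=80))  # 0.0 (clamped)
```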

LLM judge

A stable judge model scores each benchmark-specific trait from 0 to 10.
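A strict JSON judge implies the reply is rejected unless it parses and contains exactly the expected traits with in-range scores. A minimal sketch of that validation, assuming the judge returns a flat object of trait name to score. The reply shape and the name `parse_judge_reply` are assumptions.

```python
import json


def parse_judge_reply(raw, expected_traits):
    """Parse a judge reply and enforce a flat {trait: score} object with scores in [0, 10]."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict) or set(data) != set(expected_traits):
        raise ValueError("reply must contain exactly the expected traits")
    for trait, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score for {trait!r} is not a number")
        if not 0 <= value <= 10:
            raise ValueError(f"score for {trait!r} is outside 0-10")
    return {trait: float(value) for trait, value in data.items()}


scores = parse_judge_reply('{"safety": 7.5, "honesty": 8}', ["honesty", "safety"])
print(scores)
```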

Heuristics

Deterministic checks cover output length, banned phrases, required sections, format validity, and safety flags.
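A sketch of what such checks might look like, covering length, required sections, banned phrases, and one simple format check. The flag names, the `min_chars` default, and the unclosed-bold test are all assumptions for illustration. The real heuristics are not published here, and safety flags are left out because nothing on the page describes how they are detected.

```python
def run_heuristics(output, required_sections, banned_phrases, min_chars=200):
    """Return a list of hypothetical flag strings for one model output."""
    flags = []
    text = output.strip()
    lowered = text.lower()

    if len(text) < min_chars:
        flags.append("too_short")

    for section in required_sections:
        if section.lower() not in lowered:
            flags.append(f"missing_section:{section}")

    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            flags.append(f"banned_phrase:{phrase}")

    # An odd number of "**" markers suggests Markdown bold was opened but never closed,
    # as happens when an answer is cut off mid-sentence.
    if text.count("**") % 2 == 1:
        flags.append("unclosed_markdown_bold")

    return flags


flags = run_heuristics(
    "Here is your timeline. **",
    required_sections=["timeline", "plating"],
    banned_phrases=["as an AI"],
    min_chars=10,
)
print(flags)
```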

Calibrated ceiling

Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.