personal benchmark collection
Benchmarks for testing whether models can help real people cook better meals with realistic constraints, substitutions, timings, and safety.
Which models can give practical cooking help that works in a real kitchen?
At a glance
Top model
claude-opus-4.7
82.67
Lowest cost / eval
glm-5
$0.0122
Median rank score
74.83
Last refresh
2026-06-02
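The median figure can be reproduced from the Score column of the full leaderboard further down. A minimal Python sketch, assuming "Median rank score" means the ordinary median of the 23 overall scores (the variable names are illustrative):

```python
from statistics import median

# Overall scores copied from the Score column of the full leaderboard (23 models).
scores = [
    82.67, 82.0, 81.75, 79.5, 79.5, 79.25, 79.25, 78.25,
    76.5, 76.08, 75.67, 74.83, 74.67, 73.83, 73.5, 72.5,
    72.08, 72.0, 67.25, 63.58, 60.5, 51.0, 49.42,
]

# With an odd number of entries the median is the middle value (rank 12 of 23).
median_score = median(scores)
print(median_score)  # 74.83
```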
Score vs. cost
Each dot is one model. X axis is average cost per benchmark task, including model and judge cost; Y axis is average calibrated score.
Overall ranking
Higher is better. Scores come from completed judged runs.
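The page does not say how the overall score is built. The published figures are at least consistent with a plain mean of the four benchmark columns in the heatmap table, which the sketch below checks for the top three rows. This is an observation from the numbers, not a documented formula, and `overall_from_benchmarks` is a hypothetical helper.

```python
def overall_from_benchmarks(benchmark_scores):
    """Plain mean of per-benchmark scores (assumed aggregation, not documented)."""
    return sum(benchmark_scores) / len(benchmark_scores)

# (published overall, [recipe, rescue, timing, substitution]) for ranks 1-3.
rows = [
    (82.67, [83.7, 83.0, 79.3, 84.7]),
    (82.0, [83.7, 83.0, 80.7, 80.7]),
    (81.75, [84.0, 84.3, 75.3, 83.3]),
]

diffs = []
for published, parts in rows:
    estimate = overall_from_benchmarks(parts)
    # Benchmark columns are rounded to one decimal, so expect small differences.
    diffs.append(abs(published - estimate))
    print(f"published={published:.2f} estimate={estimate:.2f}")
```

Differences of a few hundredths are what rounding the benchmark columns to one decimal would produce.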
Benchmark heatmap
Cells are colored by rank within each benchmark: the top ten are split across shades of green, and anything below the top ten is red.
| Rank | Model | Overall | Practical Recipe Test | Dinner Rescue Test | Meal Timing Test | Substitution Test |
|---|---|---|---|---|---|---|
| 1 | 12 scored tests | 82.7 | 83.7 | 83.0 | 79.3 | 84.7 |
| 2 | 12 scored tests | 82.0 | 83.7 | 83.0 | 80.7 | 80.7 |
| 3 | 12 scored tests | 81.8 | 84.0 | 84.3 | 75.3 | 83.3 |
| 4 | 12 scored tests | 79.5 | 82.3 | 82.3 | 71.3 | 82.0 |
| 5 | 12 scored tests | 79.5 | 80.7 | 81.3 | 72.3 | 83.7 |
| 6 | 12 scored tests | 79.2 | 82.7 | 83.3 | 68.0 | 83.0 |
| 7 | 12 scored tests | 79.2 | 84.0 | 83.0 | 66.0 | 84.0 |
| 8 | 12 scored tests | 78.2 | 82.0 | 84.0 | 65.0 | 82.0 |
| 9 | 12 scored tests | 76.5 | 83.7 | 78.3 | 62.7 | 81.3 |
| 10 | 12 scored tests | 76.1 | 82.3 | 84.0 | 55.7 | 82.3 |
| 11 | 12 scored tests | 75.7 | 82.7 | 80.7 | 57.7 | 81.7 |
| 12 | 12 scored tests | 74.8 | 79.3 | 84.7 | 52.0 | 83.3 |
| 13 | 12 scored tests | 74.7 | 72.7 | 82.7 | 60.7 | 82.7 |
| 14 | 12 scored tests | 73.8 | 84.0 | 82.3 | 61.0 | 68.0 |
| 15 | 12 scored tests | 73.5 | 82.3 | 82.0 | 45.0 | 84.7 |
| 16 | 12 scored tests | 72.5 | 81.3 | 69.3 | 60.7 | 78.7 |
| 17 | 12 scored tests | 72.1 | 83.3 | 83.0 | 40.0 | 82.0 |
| 18 | 12 scored tests | 72.0 | 74.3 | 80.0 | 53.3 | 80.3 |
| 19 | 12 scored tests | 67.2 | 83.7 | 61.3 | 40.7 | 83.3 |
| 20 | 12 scored tests | 63.6 | 68.3 | 68.7 | 38.7 | 78.7 |
| 21 | 12 scored tests | 60.5 | 63.0 | 82.7 | 20.0 | 76.3 |
| 22 | 12 scored tests | 51.0 | 58.0 | 48.7 | 24.3 | 73.0 |
| 23 | 12 scored tests | 49.4 | 60.7 | 48.7 | 16.7 | 71.7 |
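The coloring rule for the heatmap can be sketched as follows. The number of green shades and their names are assumptions (the page only says the top ten are split across greens), `rank_colors` is a hypothetical helper, and ties are broken by input order here. For brevity the example ranks only the first twelve Meal Timing values rather than all 23.

```python
def rank_colors(scores, green_shades=("dark green", "green", "light green")):
    """Map each score to a color label based on its rank within one benchmark."""
    # Rank 1 is the highest score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    colors = [None] * len(scores)
    for rank, idx in enumerate(order, start=1):
        if rank > 10:
            colors[idx] = "red"  # anything below the top ten
        else:
            # Spread ranks 1-10 across the available green shades.
            shade = (rank - 1) * len(green_shades) // 10
            colors[idx] = green_shades[shade]
    return colors

# Meal Timing Test column for ranks 1-12 of the overall table.
meal_timing = [79.3, 80.7, 75.3, 71.3, 72.3, 68.0, 66.0, 65.0, 62.7, 55.7, 57.7, 52.0]
colors = rank_colors(meal_timing)
print(colors)
```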
Full leaderboard
| Model | Score | Tests | Avg cost / task | Avg seconds / task | Frequent problems |
|---|---|---|---|---|---|
| | 82.67 Strong | 12/12 | $0.0456 | 31.6s | - |
| | 82.0 Strong | 12/12 | $0.0667 | 51.0s | - |
| | 81.75 Strong | 12/12 | $0.0450 | 28.6s | - |
| | 79.5 Usable | 12/12 | $0.0308 | 29.2s | - |
| | 79.5 Usable | 12/12 | $0.0522 | 45.7s | Incomplete output, Missing required element |
| | 79.25 Usable | 12/12 | $0.0438 | 27.7s | - |
| | 79.25 Usable | 12/12 | $0.0452 | 28.6s | Wrapper text, Unsafe or misleading |
| | 78.25 Usable | 12/12 | $0.0518 | 44.8s | Incomplete output, Missing required element |
| | 76.5 Usable | 12/12 | $0.0170 | 19.4s | Unsupported invention |
| | 76.08 Usable | 12/12 | $0.0160 | 66.9s | - |
| | 75.67 Usable | 12/12 | $0.0150 | 65.1s | - |
| | 74.83 Usable | 12/12 | $0.0153 | 60.2s | Incomplete output, Missing required element |
| | 74.67 Usable | 12/12 | $0.0193 | 20.1s | Unsafe or misleading |
| | 73.83 Usable | 12/12 | $0.6926 | 113.4s | Incomplete output |
| | 73.5 Usable | 12/12 | $0.0335 | 37.1s | Incomplete output, Missing required element |
| | 72.5 Usable | 12/12 | $0.0160 | 40.8s | - |
| | 72.08 Usable | 12/12 | $0.0385 | 29.2s | Incomplete output, Missing required element |
| | 72.0 Usable | 12/12 | $0.0203 | 19.5s | - |
| | 67.25 Needs editing | 12/12 | $0.0145 | 88.9s | Incomplete output, Missing required element |
| | 63.58 Needs editing | 12/12 | $0.0204 | 21.5s | - |
| | 60.5 Needs editing | 12/12 | $0.0310 | 20.1s | Incomplete output, Missing required element |
| | 51.0 Weak | 12/12 | $0.0122 | 55.0s | Incomplete output, Missing required element, Malformed output, Unsupported invention |
| | 49.42 Weak | 12/12 | $0.0140 | 47.6s | Incomplete output, Missing required element, Unsupported invention |
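The labels next to each score (Strong, Usable, Needs editing, Weak) are not defined anywhere on the page. The sketch below shows one set of cut-offs that happens to reproduce every label in the table; the thresholds 80, 70, and 60 are guesses fitted to the rows shown, not published values, and `tier` is a hypothetical helper.

```python
def tier(score):
    """Label a 0-100 score; the cut-offs are assumptions that fit the table above."""
    if score >= 80:
        return "Strong"
    if score >= 70:
        return "Usable"
    if score >= 60:
        return "Needs editing"
    return "Weak"

for s in (82.67, 79.5, 72.0, 67.25, 60.5, 51.0):
    print(s, tier(s))
```

Other thresholds would fit equally well, for example any Strong cut-off between 79.5 and 81.75.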
Test cases
Each row is one prompt, with score distributions, top and low performers, and the most frequent problems judges flagged.
| Test | Benchmark | Avg | Max | Min | Top model | Lowest model | Frequent problems |
|---|---|---|---|---|---|---|---|
| Serrano ham scrambled eggs · chef_recipe_001 | Practical Recipe Test | 80.3 | 86.0 | 52.0 | gpt-5.5 · 86 | gemini-3-flash-preview · 52 | Unsafe or misleading ×1, Incomplete output ×1, Missing required element ×1 |
| Low-carb/high-carb shared dinner · chef_recipe_002 | Practical Recipe Test | 74.4 | 84.0 | 21.0 | claude-opus-4.7 · 84 | minimax-m2.7 · 21 | Incomplete output ×3, Unsupported invention ×1 |
| Quick chicken thighs with limited ingredients · chef_recipe_003 | Practical Recipe Test | 80.5 | 84.0 | 48.0 | kimi-k2.5 · 84 | gpt-5.4-nano · 48 | - |
| Watery tomato salsa · chef_rescue_001 | Dinner Rescue Test | 78.7 | 85.0 | 35.0 | claude-opus-4.7 · 85 | minimax-m2.7 · 35 | Incomplete output ×2, Missing required element ×1, Unsupported invention ×1 |
| Over-salted soup · chef_rescue_002 | Dinner Rescue Test | 75.7 | 85.0 | 40.0 | claude-opus-4.8-high · 85 | kimi-k2.5 · 40 | Missing required element ×3, Incomplete output ×3 |
| Dry chicken breast · chef_rescue_003 | Dinner Rescue Test | 78.0 | 85.0 | 20.0 | glm-5.1 · 85 | glm-5 · 20 | Incomplete output ×2, Missing required element ×1 |
| Steak, wings, potatoes, broccoli dinner · chef_timing_001 | Meal Timing Test | 54.5 | 82.0 | 9.0 | claude-opus-4.8-high · 82 | gemini-3.5-flash-high · 9 | Incomplete output ×6, Missing required element ×2, Unsupported invention ×1 |
| Thai dinner with satay, stir fry, rice · chef_timing_002 | Meal Timing Test | 50.5 | 81.0 | 10.0 | gpt-5.5 · 81 | minimax-m2.7 · 10 | Incomplete output ×9, Missing required element ×5, Unsafe or misleading ×1 |
| Roast chicken dinner with two dietary versions · chef_timing_003 | Meal Timing Test | 60.2 | 81.0 | 20.0 | gpt-5.4 · 81 | minimax-m2.7 · 20 | Incomplete output ×10, Missing required element ×5, Malformed output ×1 |
| Thai stir fry substitutions · chef_sub_001 | Substitution Test | 78.9 | 85.0 | 56.0 | claude-sonnet-4.6 · 85 | glm-5 · 56 | Incomplete output ×2, Missing required element ×1 |
| Pasta sauce without cream · chef_sub_002 | Substitution Test | 83.7 | 86.0 | 78.0 | claude-opus-4.7 · 86 | gpt-5.4-nano · 78 | - |
| Curry without coconut milk · chef_sub_003 | Substitution Test | 78.9 | 85.0 | 38.0 | claude-opus-4.8-high · 85 | gpt-5.5-pro · 38 | Incomplete output ×2 |
Example outputs
winner
The model perfectly executes the substitution, providing expert-level techniques to prevent the yogurt from splitting, precise quantities for two, and highly insightful, realistic flavour tradeoffs.
# Creamy Lemon-Pepper Pasta with Peas (No Cream)

Greek yogurt stands in for cream, but it **splits (curdles) if it gets too hot or hits direct heat**. The trick: temper it, pull the pan off the burner, and use residual heat plus starchy pasta water to bind everything.

## Ingredients (serves 2)

- 180–200 g pasta (spaghetti, linguine, or fusilli)
- 150 g full-fat Greek yogurt (≥10% fat — low-fat splits more easily)
- 40 g parmesan, finely grated, plus extra to serve
- 20 g butter
- 1 tbsp olive oil
- 2 garlic cloves, thinly sliced or grated
- 100 g peas (frozen are fine)
- Zest of ½ lemon + 1–2 tsp juice
- ½ tsp freshly cracked black pepper
- Salt

## Method

1. **Temper the yogurt.** In a bowl, whisk the yogurt with the grated parmesan, lemon zest, and a pinch of salt. Let it sit at room temperature while you cook the pasta — cold yogurt going into a hot pan is the #1 cause of splitting.
2. **Cook the pasta** in well-salted water until just shy of al dente. **Reserve at least 250 ml of pasta water** before draining. In the last 2 minutes, drop the peas into the same pot.
3. **Build the base.** In a wide pan over medium-low heat, melt the butter with the olive oil. Add the garlic and black pepper; cook 1–2 minutes until fragrant but not browned. **Turn off the heat.**
4. **Temper again.** Whisk 3–4 tablespoons of hot pasta water into the yogurt bowl, one spoonful at a time, until smooth and warm to the touch.
5. **Combine off the heat.** Add the drained pasta and peas to the pan (still off the burner). Pour in the tempered yogurt and toss vigorously, adding splashes of pasta water until the sauce coats the noodles glossily.
6. **Finish.** Stir in lemon juice to taste. Only return to the lowest heat for 20–30 seconds if needed to warm through — never simmer. Top wit
failure
The model response cuts off abruptly at the beginning of the output, failing to provide the requested timeline, sequencing, and plating plan.
Here is your highly coordinated, minute-by-minute cooking timeline to get all five dishes on the table, hot and perfectly cooked, in exactly 90 minutes.

### **The Strategy (Reverse Sear & Oven Swap)**

Because a 1.6kg bone-in ribeye is massive, we will use the **reverse sear** method. This allows us to bake the steak at a low temperature, rest it while the oven cranks up to high heat to bake the potato tart, and then do a quick, hard sear on the hob right before serving.

---

### **Target Steak Temperatures (Meat Thermometer)**

* **For Medium-Rare (Recommended):** Pull from oven at **46°C (115°F)**. Target after resting and searing: **
mid performer
The model demonstrates exceptional culinary intuition, particularly in combining dried basil and fresh mint to approximate holy basil, and provides highly accurate, expert-level flavour tradeoff analysis.
This adaptation moves away from the peppery, anise-heavy profile of authentic Pad Kra Pao and leans into a **"Zesty Mint & Garlic Stir-fry."** By combining fresh mint with dried basil, you can approximate the complexity of holy basil, while the lime and fish sauce provide the necessary "funk" and brightness.

### 1. The Substitution Approach

* **The "Basil" Hack:** Holy basil is peppery and medicinal. Since you only have dried basil (which is earthy) and fresh mint (which is cool/sweet), we will use **both**. The dried basil provides the background herbal note, while the fresh mint provides the "green" punch.
* **The Sauce Body:** Oyster sauce provides thickness and umami. To mimic this, we will increase the **sugar and soy sauce** ratio to create a syrupy glaze, using **fish sauce** for the deep savory depth.
* **The Acid:** White wine vinegar is sharper than rice wine vinegar. We will use a small amount of vinegar during cooking and finish with **fresh lime juice** to soften the edge.

---

### 2. Ingredients (For 2 People)

**The Protein & Aromatics:**

* **Chicken:** 300g–400g (Ideally thigh, finely chopped or minced).
* **Garlic:** 5-6 cloves (crushed and roughly chopped).
* **Chillies:** 2-4 bird’s eye chillies (adjust to heat preference, finely chopped).
* **Oil:** 2 tbsp (neutral oil).

**The "Mock" Kra Pao Sauce:**

* **Soy Sauce:** 2 tbsp.
* **Fish Sauce:** 1 tbsp.
* **Sugar:** 1 tbsp (brown sugar is best, white is fine).
* **White Wine Vinegar:** 1 tsp.
* **Dried Basil:** 1 tsp (whisked into the liquid to rehydrate).

**The Finish:**

* **Fresh Mint:** 1 cup (packed) whole leaves.
* **Lime:** 1/2 lime (squeezed at the very end).

---

### 3. Cooking Method

1. **Prep the Sauce:** In a small bowl, mix the soy sauce, fish sauce, sugar,
Methodology
Each model output is scored by a strict JSON LLM judge, supported by deterministic heuristics, then normalized to a 0-100 score.
LLM judge
A stable judge model scores each trait from 0 to 10 using benchmark-specific traits.
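One way such a judge reply could be parsed and normalized, as a minimal sketch. The JSON shape (`{"traits": {...}}`), the trait names, and the unweighted mean times ten are all assumptions; the page states only that traits are scored from 0 to 10 and that the final score is on a 0-100 scale.

```python
import json

def normalize_judge_reply(raw_reply):
    """Parse a strict-JSON judge reply and return a 0-100 score."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on non-JSON replies
    traits = data["traits"]
    if not traits:
        raise ValueError("judge reply contains no trait scores")
    for name, value in traits.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"trait {name!r} is not a number")
        if not 0 <= value <= 10:
            raise ValueError(f"trait {name!r} is outside 0-10: {value}")
    # Unweighted mean of the 0-10 trait scores, rescaled to 0-100.
    return 10 * sum(traits.values()) / len(traits)

reply = '{"traits": {"practicality": 9, "safety": 8, "timing": 7}}'
print(normalize_judge_reply(reply))  # 80.0
```

Rejecting malformed or out-of-range replies outright, rather than clamping them, is one reading of "strict"; a real pipeline might retry the judge instead.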
Heuristics
Deterministic checks catch length, banned phrases, required sections, format validity, and safety flags.
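A rough sketch of what checks of this kind might look like. The function name, parameters, thresholds, and the mapping from each check to a problem label are illustrative assumptions (the labels are borrowed from the "Frequent problems" column), and safety flags are left out because they would need domain-specific rules the page does not describe.

```python
import json

def run_heuristics(output, required_sections, banned_phrases,
                   min_chars=200, expect_json=False):
    """Return a list of problem labels found by simple deterministic checks."""
    problems = []
    lowered = output.lower()
    # Length check: very short outputs are treated as cut off.
    if len(output) < min_chars:
        problems.append("Incomplete output")
    # Required sections: every expected heading or keyword must appear.
    if any(section.lower() not in lowered for section in required_sections):
        problems.append("Missing required element")
    # Banned phrases: boilerplate the benchmark does not want in answers.
    if any(phrase.lower() in lowered for phrase in banned_phrases):
        problems.append("Wrapper text")
    # Format validity: only checked when the task asks for JSON output.
    if expect_json:
        try:
            json.loads(output)
        except json.JSONDecodeError:
            problems.append("Malformed output")
    return problems

sample = "## Ingredients\n- 2 eggs\n"
found = run_heuristics(sample, ["ingredients", "method"], ["as an ai"], min_chars=50)
print(found)  # ['Incomplete output', 'Missing required element']
```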
Calibrated ceiling
Rubrics are intentionally strict, leaving room above today's best scores for future model improvements.