Personal · 20 tasks · 44 models
Cheapest AI models for Creative & Comedy
Which models are actually creative and funny — not just fluent and generic?
The cheapest capable model for Creative & Comedy is deepseek-v3.2, at $0.05 per 1,000 runs, and it still clears our quality bar.
Highest-ranked
Clears the quality bar at $0.05 per 1,000 runs
~1s per run, still strong
Quality vs. cost
Every model placed by what it delivers and what it costs. The best value sits high and to the left.
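As a rough illustration of how a "cheapest model that clears the bar" pick could be made, here is a minimal sketch. The rows are copied from the full ranking on this page; the `QUALITY_BAR` value, the `Row` type, and both helper functions are assumptions for illustration, since the page does not state its threshold or its selection code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    elo: int
    cost_per_run: float  # USD per single run


# A few rows copied from the ranking table. deepseek-v3.2 uses the headline
# figure ($0.05 per 1,000 runs) because the table shows it as $0.0000.
ROWS = [
    Row("deepseek-v3.2", 1000, 0.05 / 1000),
    Row("gemini-3.1-flash-lite", 1000, 0.0002),
    Row("gpt-5.4-mini", 1006, 0.0006),
    Row("grok-4.20-beta", 990, 0.0011),
    Row("claude-opus-4.5-high", 1000, 0.0819),
]

QUALITY_BAR = 1000  # hypothetical threshold; the page does not publish its bar


def cheapest_above_bar(rows, bar):
    """Return the lowest-cost row whose Elo is at or above the bar, or None."""
    eligible = [r for r in rows if r.elo >= bar]
    return min(eligible, key=lambda r: r.cost_per_run) if eligible else None


def per_thousand(cost_per_run):
    """Convert a per-run cost into a cost per 1,000 runs."""
    return cost_per_run * 1000


pick = cheapest_above_bar(ROWS, QUALITY_BAR)
print(pick.model, f"${per_thousand(pick.cost_per_run):.2f} per 1k runs")
```

Quoting cost per 1,000 runs keeps very cheap models readable; at four decimal places per run they would otherwise all display as zero.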
Full ranking
| # | Model | Elo | Cost/run (USD) | Speed | Best for |
|---|---|---|---|---|---|
| 1 | deepseek-v3.2 | 1000 | $0.0000 | 6.5s | Best overall |
| 2 | deepseek-v3.2-high | 1000 | $0.0000 | 7.6s | Best overall |
| 3 | deepseek-v3.2-low | 1000 | $0.0001 | 7.3s | Best overall |
| 4 | deepseek-v3.1-terminus | 1000 | $0.0002 | 8.6s | Best overall |
| 5 | gemini-3.1-flash-lite | 1000 | $0.0002 | 1.5s | Best overall |
| 6 | gpt-5.4-mini | 1006 | $0.0006 | 4.2s | Best overall |
| 7 | claude-haiku-4.5 | 1000 | $0.0008 | 2.7s | Best overall |
| 8 | claude-sonnet-4.5 | 1000 | $0.0024 | 6.8s | Best overall |
| 9 | gpt-5-mini | 1000 | $0.0026 | 14.9s | Best overall |
| 10 | claude-opus-4.5 | 1000 | $0.0038 | 6.5s | Best overall |
| 11 | claude-opus-4.6 | 1000 | $0.0041 | 8.0s | Best overall |
| 12 | glm-5 | 1000 | $0.0046 | 60.8s | Best overall |
| 13 | claude-opus-4.8-low | 1000 | $0.0053 | 5.2s | Best overall |
| 14 | claude-opus-4.8-high | 1000 | $0.0070 | 7.1s | Best overall |
| 15 | gemini-3-flash-preview | 1000 | $0.0089 | 16.3s | Best overall |
| 16 | claude-sonnet-4.5-low | 1000 | $0.0100 | 16.0s | Best overall |
| 17 | gemini-3.5-flash-low | 1000 | $0.0106 | 9.6s | Best overall |
| 18 | claude-sonnet-4.6-low | 1006 | $0.0116 | 15.7s | Best overall |
| 19 | gpt-5.5 | 1000 | $0.0118 | 11.9s | Best overall |
| 20 | claude-sonnet-4.6-high | 1000 | $0.0155 | 19.7s | Best overall |
| 21 | gemini-3.1-pro-preview-low | 1006 | $0.0176 | 15.7s | Best overall |
| 22 | claude-opus-4.6-high | 1006 | $0.0214 | 20.3s | Best overall |
| 23 | mistral-medium-3.1 | 999 | $0.0003 | 4.5s | Best overall |
| 24 | grok-4.20 | 999 | $0.0005 | 1.7s | Best overall |
| 25 | grok-4.20-beta | 990 | $0.0011 | 2.1s | Best overall |
| 26 | gpt-5.4 | 999 | $0.0019 | 4.8s | Best overall |
| 27 | minimax-m2.7 | 999 | $0.0021 | 30.0s | Best overall |
| 28 | gpt-5.4-low | 999 | $0.0025 | 3.4s | Best overall |
| 29 | kimi-k2.5 | 999 | $0.0047 | 84.9s | Best overall |
| 30 | kimi-k2.7-code | 999 | $0.0062 | 45.0s | Best overall |
| 31 | gpt-5.5-low | 999 | $0.0083 | 6.6s | Best overall |
| 32 | qwen3.5-plus-02-15 | 999 | $0.0084 | 98.5s | Best overall |
| 33 | qwen3.7-max | 999 | $0.0122 | 60.2s | Best overall |
| 34 | gpt-5.4-high | 999 | $0.0133 | 11.2s | Best overall |
| 35 | qwen3.7-max-high | 999 | $0.0137 | 72.8s | Best overall |
| 36 | qwen3.7-max-low | 999 | $0.0148 | 73.7s | Best overall |
| 37 | gpt-5.5-high | 999 | $0.0151 | 10.7s | Best overall |
| 38 | claude-sonnet-4.5-high | 1000 | $0.0171 | 24.9s | Best overall |
| 39 | gemini-3.1-pro-preview-high | 1000 | $0.0179 | 19.0s | Best overall |
| 40 | gemini-3.5-flash-high | 1000 | $0.0182 | 13.4s | Best overall |
| 41 | claude-opus-4.6-low | 1000 | $0.0202 | 18.8s | Best overall |
| 42 | gemini-3.1-pro-preview | 1000 | $0.0282 | 23.8s | Best overall |
| 43 | claude-opus-4.5-low | 1000 | $0.0407 | 30.1s | Best overall |
| 44 | claude-opus-4.5-high | 1000 | $0.0819 | 54.7s | Best overall |
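The Elo column comes from a pairwise tournament, according to the example cards on this page. For readers unfamiliar with the mechanics, the sketch below shows the textbook Elo update for one head-to-head comparison. The K-factor of 16 and the function names are assumptions; the page does not say which parameters or which Elo variant it uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 16.0):
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A's output was preferred, 0.0 if B's was, 0.5 for a tie.
    k is an assumed K-factor, not a value taken from the page.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models that both start at 1000; the judge prefers the first one.
print(update(1000, 1000, 1.0))  # (1008.0, 992.0)
```

Each update moves the two ratings by equal and opposite amounts, so the total rating in the pool stays constant.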
What separates the top models
Funny on Command
Hard. Tests whether the model is genuinely funny (specific, surprising, economical), not dad jokes or explained punchlines.
Leader: gemini-3.1-pro-preview-low
Constrained Short Fiction
Hard. Tests micro-fiction under hard constraints: concrete detail, a distinct voice, and an earned (not gimmicky) ending.
Leader: gemini-3.1-pro-preview-low
Naming & Branding
Medium. Tests memorable, evocative names that avoid the -ify/-ly/buzzword-mashup clichés AI over-produces.
Leader: gemini-3.1-pro-preview-low
Character Voice
Medium. Tests writing in a distinct, sustained voice rather than regressing to the neutral default narrator.
Leader: gemini-3.1-pro-preview-low
See the difference
Real model outputs and how our judge scored them.
Top score
gemini-3.1-pro-preview-low
Two voices: same bad news
1006
“Ranked by pairwise Elo tournament.”
**Terse Ex-Military COO** Be advised, the launch is a no-go, so stand down and hold your present positions until command establishes a revised operational timeline.

**Relentlessly Upbeat Camp Counselor** Okey-dokey superstars, we get to keep the excitement bubbling a teensy bit longer because our big launch adventure is taking a fun little detour to make sure everything is absolutely, positively spectacular!
Mid-pack
gemini-3.1-pro-preview
Observational: airport boarding groups
1000
“Ranked by pairwise Elo tournament.”
"Boarding Group 1" is a beautiful piece of corporate poetry that translates directly to "eighth." I admire the tactical discipline of Group 7 forming a human barricade around the scanner while the pilots are still in line at Auntie Anne's. "Main Cabin Preferred" is a $60 title that grants you the exclusive privilege of watching the Diamond Medallion members fill the last overhead bin with a single trench coat.
Lowest score
grok-4.20-beta
Roast generic SaaS landing pages
990
“Ranked by pairwise Elo tournament.”
- “Supercharge your workflow with AI-powered synergy.”
- “Revolutionize how your team crushes KPIs in the cloud.”
- “Unlock seamless efficiency at enterprise scale—starting at $49/user.”
Frequently asked
What is the best AI model for creative & comedy?
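In our benchmarks, gpt-5.4-mini ranks first for creative & comedy across 20 test cases. It shares the top Elo of 1006 with claude-sonnet-4.6-low, gemini-3.1-pro-preview-low, and claude-opus-4.6-high, and is the cheapest of the four at $0.0006 per run.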
What is the cheapest good model for creative & comedy?
deepseek-v3.2 is the best value: it clears our quality bar for creative & comedy at $0.05 per 1,000 runs.
Which model is fastest for creative & comedy?
gemini-3.1-flash-lite is the fastest model that still performs well for creative & comedy.
How we test
Each model output is scored by an LLM judge that must reply in strict JSON, supported by deterministic heuristics, and the result is then normalized to a 0-100 score.
Judge: gemini-3.1-pro-preview · 880 model runs (44 models × 20 tasks) across 4 benchmarks · last tested 2026-06-30
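To make the scoring pipeline more concrete, here is a minimal sketch of what "strict JSON judge, deterministic heuristics, then a 0-100 score" could look like. Everything in it is an assumption for illustration: the `{"score": ...}` reply shape, the 0-10 judge scale, the length-based heuristic, and the rule that a failed heuristic zeroes the score are not described on this page.

```python
import json


def parse_judge_reply(raw: str) -> float:
    """Parse a judge reply that must be exactly a JSON object like {"score": 7.5}.

    Raises ValueError for non-JSON text, extra or missing keys, or a
    non-numeric score (json.JSONDecodeError is a ValueError subclass).
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"score"}:
        raise ValueError("expected a JSON object with exactly one key: 'score'")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    return float(score)


def passes_heuristics(output: str, max_chars: int = 2000) -> bool:
    """Hypothetical deterministic check: output is non-empty and not too long."""
    text = output.strip()
    return 0 < len(text) <= max_chars


def normalize(score: float, lo: float = 0.0, hi: float = 10.0) -> float:
    """Clamp a raw judge score to [lo, hi] and rescale it to 0-100."""
    clamped = min(max(score, lo), hi)
    return 100.0 * (clamped - lo) / (hi - lo)


def score_output(output: str, judge_reply: str) -> float:
    """Combine the heuristic gate and the judge score into one 0-100 value."""
    if not passes_heuristics(output):
        return 0.0
    return normalize(parse_judge_reply(judge_reply))


print(score_output("A short, specific joke.", '{"score": 7.5}'))  # 75.0
```

Rejecting any reply that is not exactly the expected JSON shape keeps a chatty or malformed judge response from silently turning into a score.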
This page is Spring Prompt, running
We just did this for every model. Do it for your prompt.
The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.
- Generate test cases from your prompt — no eval set required to start.
- Compare models side by side with quality, cost and latency in one matrix.
- Optimise the winner until the scores say it's ready to ship.