Personal · 20 tasks · 44 models

Cheapest AI models for Creative & Comedy

Name: Creative & Comedy AI model benchmark
Creator: Spring Prompt

Which models are actually creative and funny — not just fluent and generic?

Top models

OpenAI: gpt-5.4-mini

Anthropic: claude-sonnet-4.6-low

Google: gemini-3.1-pro-preview-low

The cheapest capable model for Creative & Comedy is deepseek-v3.2, at $0.05 per 1,000 runs, and it still clears our quality bar.

Best overall Elo

gpt-5.4-mini

Highest-ranked

$0.0006/run · 4.2s

Best value ★ Elo

deepseek-v3.2

Clears the quality bar at $0.05 per 1,000 runs

$0.0000/run · 6.5s

Fastest usable Elo

gemini-3.1-flash-lite

~1.5s per run, still strong

$0.0002/run · 1.5s

Quality vs. cost

Every model placed by what it delivers and what it costs. The best value sits high and to the left.
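One way to read "high and to the left" is as a Pareto frontier: a model is worth considering if no other model is at least as good and at least as cheap while being strictly better on one of the two. The sketch below illustrates that idea with a few rows copied from the ranking table on this page. The `Model` type, the function names, and the quality bar of 1000 Elo are assumptions made for illustration; the page does not state how its own bar is defined.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    elo: int
    cost_per_run: float  # USD, as displayed in the ranking table


# A handful of rows copied from the ranking table (not the full list).
MODELS = [
    Model("deepseek-v3.2", 1000, 0.0000),
    Model("gemini-3.1-flash-lite", 1000, 0.0002),
    Model("mistral-medium-3.1", 999, 0.0003),
    Model("gpt-5.4-mini", 1006, 0.0006),
    Model("grok-4.20-beta", 990, 0.0011),
    Model("claude-sonnet-4.6-low", 1006, 0.0116),
]


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no other model beats on both Elo and cost."""
    frontier = []
    for m in models:
        dominated = any(
            o.elo >= m.elo
            and o.cost_per_run <= m.cost_per_run
            and (o.elo > m.elo or o.cost_per_run < m.cost_per_run)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost_per_run)


def cheapest_above(models: list[Model], quality_bar: int) -> Optional[Model]:
    """Cheapest model whose Elo meets the bar, or None if none qualifies."""
    eligible = [m for m in models if m.elo >= quality_bar]
    return min(eligible, key=lambda m: m.cost_per_run, default=None)


# Hypothetical quality bar; the real threshold is not given on this page.
QUALITY_BAR = 1000

frontier_names = [m.name for m in pareto_frontier(MODELS)]
best_value = cheapest_above(MODELS, QUALITY_BAR)
```

On this subset the frontier contains deepseek-v3.2 and gpt-5.4-mini, which lines up with the "Best value" and "Best overall" cards above. Because costs are rounded to four decimals in the table, ties near $0.0000 should be treated with some caution.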

Full ranking


#	Model	Elo	Cost/run	Speed	Best for
1	deepseek-v3.2	1000 Elo	$0.0000	6.5s	Best overall
2	deepseek-v3.2-high	1000 Elo	$0.0000	7.6s	Best overall
3	deepseek-v3.2-low	1000 Elo	$0.0001	7.3s	Best overall
4	deepseek-v3.1-terminus	1000 Elo	$0.0002	8.6s	Best overall
5	gemini-3.1-flash-lite	1000 Elo	$0.0002	1.5s	Best overall
6	gpt-5.4-mini	1006 Elo	$0.0006	4.2s	Best overall
7	claude-haiku-4.5	1000 Elo	$0.0008	2.7s	Best overall
8	claude-sonnet-4.5	1000 Elo	$0.0024	6.8s	Best overall
9	gpt-5-mini	1000 Elo	$0.0026	14.9s	Best overall
10	claude-opus-4.5	1000 Elo	$0.0038	6.5s	Best overall
11	claude-opus-4.6	1000 Elo	$0.0041	8.0s	Best overall
12	glm-5	1000 Elo	$0.0046	60.8s	Best overall
13	claude-opus-4.8-low	1000 Elo	$0.0053	5.2s	Best overall
14	claude-opus-4.8-high	1000 Elo	$0.0070	7.1s	Best overall
15	gemini-3-flash-preview	1000 Elo	$0.0089	16.3s	Best overall
16	claude-sonnet-4.5-low	1000 Elo	$0.0100	16.0s	Best overall
17	gemini-3.5-flash-low	1000 Elo	$0.0106	9.6s	Best overall
18	claude-sonnet-4.6-low	1006 Elo	$0.0116	15.7s	Best overall
19	gpt-5.5	1000 Elo	$0.0118	11.9s	Best overall
20	claude-sonnet-4.6-high	1000 Elo	$0.0155	19.7s	Best overall
21	gemini-3.1-pro-preview-low	1006 Elo	$0.0176	15.7s	Best overall
22	claude-opus-4.6-high	1006 Elo	$0.0214	20.3s	Best overall
23	mistral-medium-3.1	999 Elo	$0.0003	4.5s	Best overall
24	grok-4.20	999 Elo	$0.0005	1.7s	Best overall
25	grok-4.20-beta	990 Elo	$0.0011	2.1s	Best overall
26	gpt-5.4	999 Elo	$0.0019	4.8s	Best overall
27	minimax-m2.7	999 Elo	$0.0021	30.0s	Best overall
28	gpt-5.4-low	999 Elo	$0.0025	3.4s	Best overall
29	kimi-k2.5	999 Elo	$0.0047	84.9s	Best overall
30	kimi-k2.7-code	999 Elo	$0.0062	45.0s	Best overall
31	gpt-5.5-low	999 Elo	$0.0083	6.6s	Best overall
32	qwen3.5-plus-02-15	999 Elo	$0.0084	98.5s	Best overall
33	qwen3.7-max	999 Elo	$0.0122	60.2s	Best overall
34	gpt-5.4-high	999 Elo	$0.0133	11.2s	Best overall
35	qwen3.7-max-high	999 Elo	$0.0137	72.8s	Best overall
36	qwen3.7-max-low	999 Elo	$0.0148	73.7s	Best overall
37	gpt-5.5-high	999 Elo	$0.0151	10.7s	Best overall
38	claude-sonnet-4.5-high	1000 Elo	$0.0171	24.9s	Best overall
39	gemini-3.1-pro-preview-high	1000 Elo	$0.0179	19.0s	Best overall
40	gemini-3.5-flash-high	1000 Elo	$0.0182	13.4s	Best overall
41	claude-opus-4.6-low	1000 Elo	$0.0202	18.8s	Best overall
42	gemini-3.1-pro-preview	1000 Elo	$0.0282	23.8s	Best overall
43	claude-opus-4.5-low	1000 Elo	$0.0407	30.1s	Best overall
44	claude-opus-4.5-high	1000 Elo	$0.0819	54.7s	Best overall

What separates the top models

Funny on Command

hard

Tests whether the model is genuinely funny — specific, surprising, economical — not dad-jokes or explained punchlines.

Leader: gemini-3.1-pro-preview-low

Constrained Short Fiction

hard

Tests micro-fiction under hard constraints: concrete detail, a distinct voice, and an earned (not gimmicky) ending.

Leader: gemini-3.1-pro-preview-low

Naming & Branding

medium

Tests memorable, evocative names that avoid the -ify/-ly/buzzword-mashup clichés AI over-produces.

Leader: gemini-3.1-pro-preview-low

Character Voice

medium

Tests writing in a distinct, sustained voice rather than regressing to the neutral default narrator.

Leader: gemini-3.1-pro-preview-low

See the difference

Real model outputs and how our judge scored them.

Top score · gemini-3.1-pro-preview-low · Two voices: same bad news

1006

“Ranked by pairwise Elo tournament.”

**Terse Ex-Military COO**
Be advised, the launch is a no-go, so stand down and hold your present positions until command establishes a revised operational timeline. 

**Relentlessly Upbeat Camp Counselor**
Okey-dokey superstars, we get to keep the excitement bubbling a teensy bit longer because our big launch adventure is taking a fun little detour to make sure everything is absolutely, positively spectacular!

Mid-pack · gemini-3.1-pro-preview · Observational: airport boarding groups

1000

“Ranked by pairwise Elo tournament.”

"Boarding Group 1" is a beautiful piece of corporate poetry that translates directly to "eighth."

I admire the tactical discipline of Group 7 forming a human barricade around the scanner while the pilots are still in line at Auntie Anne's.

"Main Cabin Preferred" is a $60 title that grants you the exclusive privilege of watching the Diamond Medallion members fill the last overhead bin with a single trench coat.

Lowest score · grok-4.20-beta · Roast generic SaaS landing pages

990

“Ranked by pairwise Elo tournament.”

- “Supercharge your workflow with AI-powered synergy.”  
- “Revolutionize how your team crushes KPIs in the cloud.”  
- “Unlock seamless efficiency at enterprise scale—starting at $49/user.”

Frequently asked

What is the best AI model for creative & comedy?

In our benchmarks, gpt-5.4-mini ranks first for creative & comedy across 20 test cases.

What is the cheapest good model for creative & comedy?

deepseek-v3.2 is the best value: it clears our quality bar for creative & comedy at $0.05 per 1,000 runs.

Which model is fastest for creative & comedy?

gemini-3.1-flash-lite is the fastest model that still performs well for creative & comedy.

How we test

Each model output is scored by an LLM judge that must return strict JSON, supported by deterministic heuristics. Scores are then normalized to a 0-100 scale.

Judge: gemini-3.1-pro-preview · 880 model runs across 4 benchmarks · last tested 2026-06-30
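The sample outputs above are labelled "Ranked by pairwise Elo tournament." For readers unfamiliar with Elo, here is a minimal sketch of the standard update rule applied to head-to-head comparisons. It is not a description of Spring Prompt's actual pipeline: the starting rating of 1000, the K-factor of 32, and the function names are all assumptions, and how judge scores are turned into pairwise wins is not stated on this page.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Apply one result. score_a is 1.0 for an A win, 0.5 draw, 0.0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool is conserved.
    return rating_a + delta, rating_b - delta


def run_tournament(models: list[str],
                   comparisons: list[tuple[str, str, float]],
                   start: float = 1000.0,
                   k: float = 32.0) -> dict[str, float]:
    """Replay (model_a, model_b, score_a) comparisons in order."""
    ratings = {m: start for m in models}
    for a, b, score_a in comparisons:
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a, k)
    return ratings


# Hypothetical example: model "a" wins one comparison against model "b".
demo = run_tournament(["a", "b"], [("a", "b", 1.0)])
```

With equal starting ratings, a single win moves the winner up by half of K and the loser down by the same amount. Ratings that stay clustered near the starting value, as in the table above, usually indicate few decisive comparisons or many draws.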

This page is Spring Prompt, running

We just did this for every model. Do it for your prompt.

The rankings above come from running real tasks through real models and scoring every output. Spring Prompt is that same engine — pointed at your prompt, your test cases, and your definition of good.

Generate test cases from your prompt — no eval set required to start.
Compare models side by side with quality, cost and latency in one matrix.
Optimise the winner until the scores say it's ready to ship.


Experiment · Cold outreach email

Prompt × model results

12 test cases · 3 evals

Prompt	Claude Opus	GPT-5	Gemini
v1	7.1	6.8	7.4
v2	8.3	7.9	8.0
v3	9.2 ★	8.6	8.4

Best combo: v3 × Claude Opus

9.2 quality · $0.004/run · 1.8s
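Picking the "best combo" from a prompt × model matrix amounts to taking the maximum over every cell. The sketch below does that with the nine scores from the example experiment. Only the `v3` label is confirmed by the "Best combo" line; the `v1` and `v2` row labels, the `SCORES` layout, and the `best_combo` helper are assumptions made for illustration.

```python
# Quality scores from the example experiment, keyed by prompt version
# and then by model. Row labels "v1" and "v2" are assumed.
SCORES = {
    "v1": {"Claude Opus": 7.1, "GPT-5": 6.8, "Gemini": 7.4},
    "v2": {"Claude Opus": 8.3, "GPT-5": 7.9, "Gemini": 8.0},
    "v3": {"Claude Opus": 9.2, "GPT-5": 8.6, "Gemini": 8.4},
}


def best_combo(scores: dict[str, dict[str, float]]) -> tuple[str, str, float]:
    """Return (prompt, model, score) for the highest-scoring cell."""
    cells = (
        (prompt, model, score)
        for prompt, row in scores.items()
        for model, score in row.items()
    )
    return max(cells, key=lambda cell: cell[2])


winner = best_combo(SCORES)
```

Here `winner` is the v3 prompt paired with Claude Opus at 9.2, matching the starred cell. A real comparison would likely weigh cost and latency alongside quality rather than quality alone.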