BulletBench: we made 23 AI models play speed chess, and the clock was the judge

Ellis Crosby

Published July 03, 2026

This post describes BulletBench as it stood at launch, on 3 July 2026, with 23 models and roughly 2,100 games on the board. The benchmark is live and keeps running: ratings shift as games accumulate, and new models are added as they release. For the current standings, always check the live leaderboard.

Every AI leaderboard answers the same question: how smart is this model? Almost none of them ask the question that decides what actually ships inside real products: how smart is it per second?

If you are building agents, you make that trade-off constantly. A router classifying incoming requests has a latency budget of a second or two, not a minute. A preflight check that decides whether a task needs the expensive model has to be cheaper and faster than just calling the expensive model. Bulk pipelines that label, extract or triage millions of items live and die on per-item decision quality at per-item speed. In all of these jobs, a brilliant answer that arrives late is just a wrong answer with better manners.

So we built BulletBench: a benchmark where speed is not a footnote next to the score. It is the reason you lose.

BulletBench: AI models play speed chess against a chess computer, with every second of API latency deducted from their game clock.

The clock is the judge

The setup is simple enough to explain in one sentence: AI models play speed chess against a calibrated chess computer, on a real clock, and every second a model spends producing its move - including any hidden reasoning - is deducted from its time. When the clock hits zero, the game is lost. It does not matter how good the position was.

We run four formats. Bullet gives a model 60 seconds of total thinking for a whole game. Lightning gives it 10 seconds plus one second back per move, which sounds generous until you realise it means sustaining roughly one-second answers forever. Blitz (3 minutes) and Rapid (10 minutes) sit above them as context, to show what a bigger time budget buys.
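The clock rule is easy to sketch in code. The following is a minimal illustration, not the BulletBench harness itself: the `TimeControl` and `GameClock` names are hypothetical, and the zero increment for Blitz and Rapid is an assumption, since the post only states an increment for Lightning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeControl:
    base_seconds: float       # thinking time available at the start of the game
    increment_seconds: float  # time added back after each completed move


# Formats named in the post. Bullet is described as 60 seconds for the whole
# game; the zero increment for Blitz and Rapid is an assumption.
BULLET = TimeControl(60.0, 0.0)
LIGHTNING = TimeControl(10.0, 1.0)
BLITZ = TimeControl(180.0, 0.0)
RAPID = TimeControl(600.0, 0.0)


class GameClock:
    """Hypothetical clock that charges API latency against a model's time."""

    def __init__(self, control: TimeControl) -> None:
        self.control = control
        self.remaining = control.base_seconds

    def charge_move(self, latency_seconds: float) -> bool:
        """Deduct one move's latency. Returns False if the flag fell."""
        self.remaining -= latency_seconds
        if self.remaining <= 0:
            self.remaining = 0.0
            return False  # out of time: the game is lost regardless of position
        self.remaining += self.control.increment_seconds
        return True


# Usage: a model answering in 2 seconds per move at Lightning nets -1 second
# per move, so it completes 8 moves and flags on the 9th.
clock = GameClock(LIGHTNING)
moves_completed = 0
while clock.charge_move(2.0):
    moves_completed += 1
print(moves_completed)
```

Under these assumptions, a model that answers in exactly one second at Lightning never gains or loses time, which is the "one-second answers forever" constraint in practice.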

Each model climbs an opponent ladder that ranges from a random-move player (rating anchor 400) up to full-strength Stockfish (2800). Win and you face a stronger engine; lose and you drop down. From roughly 2,100 games we fit each model a chess rating per format, with proper confidence intervals. Alongside the rating we record what a router actually cares about: median response time, how often the model died on the clock, and what a game costs in API spend.
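To make the ladder and the rating fit concrete, here is a rough sketch. The post does not say which rating model BulletBench uses, so the standard Elo logistic curve and the maximum-likelihood fit below are assumptions, as are the function names and the rule that a draw keeps a model on the same rung. Confidence intervals are left out.

```python
import math


def next_rung(index: int, score: float, n_rungs: int) -> int:
    """Move up after a win, down after a loss; staying put on a draw is an assumption."""
    if score == 1.0:
        return min(index + 1, n_rungs - 1)
    if score == 0.0:
        return max(index - 1, 0)
    return index


def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo logistic expectation of the score against one opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def fit_rating(games: list[tuple[float, float]],
               lo: float = 0.0, hi: float = 3200.0) -> float:
    """Maximum-likelihood rating from (opponent_rating, score) pairs.

    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss. Under the
    logistic model the likelihood peaks where total expected score equals
    total actual score, and expected score rises with rating, so bisection
    finds the peak. A model with only wins or only losses has no finite
    estimate; the result then sits at the hi or lo bound.
    """
    total = sum(score for _, score in games)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, opp) for opp, _ in games) < total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Usage: three wins and one loss against the 400-rated random-move anchor.
games = [(400.0, 1.0), (400.0, 1.0), (400.0, 1.0), (400.0, 0.0)]
print(round(fit_rating(games)))
```

A 75% score against a 400-rated opponent corresponds to a gap of 400·log10(3), or about 191 points, under the Elo curve, which is why even a clearly-better-than-random model can still sit below 600.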

A few fairness rules matter. Legal moves are listed in the prompt, because we want to measure decision quality, not chess notation trivia. Provider infrastructure hiccups - hung requests, rate limits - pause the clock rather than counting against the model. And every model is told its remaining time each move, so a model that paces itself is rewarded. Some genuinely do.
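One way to implement the "infrastructure hiccups pause the clock" rule is to bill only the attempt that actually returns a move. This is a sketch under assumptions: `ProviderError`, `timed_move` and the retry limit are hypothetical names and choices, not taken from the BulletBench harness.

```python
import time
from typing import Callable


class ProviderError(Exception):
    """Hypothetical marker for infrastructure failures (hung requests, rate limits)."""


def timed_move(request_move: Callable[[], str],
               max_attempts: int = 3,
               timer: Callable[[], float] = time.monotonic) -> tuple[str, float]:
    """Return (move, billable_seconds), billing only the successful attempt."""
    for _ in range(max_attempts):
        start = timer()
        try:
            move = request_move()
        except ProviderError:
            # Failed attempt: its elapsed time is discarded, so the model's
            # clock is effectively paused while the provider misbehaves.
            continue
        return move, timer() - start
    raise ProviderError(f"no successful response in {max_attempts} attempts")
```

The returned `billable_seconds` is what would be deducted from the game clock. Passing the timer in as a parameter is a design choice that makes the rule testable without real network calls.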

What the first full run says

Bullet ratings (60 seconds of thinking for a whole game) across all 23 models. Whiskers show the 95% confidence interval. Google blue owns the top of the board.

The headline chart is a wall of Google blue. Gemini 3.5 Flash tops bullet at 900. Gemini 3.1 Flash Lite is right behind at 863, and it is the all-rounder of the field: the only model that stays above 800 at every fast format, at around $0.01 per game, with zero time losses at the lightning time control. Gemini 3 Flash, a generation old, still posts 800 at bullet and the best blitz rating on the board.

Below the blue wall, the field tells three different stories.

Fast and disciplined, but limited. Qwen 3.5 Flash is the metronome of the benchmark: 0.4 seconds per move, and not a single game lost on time in over a hundred games. Its chess plateaus in the mid-500s, but if your job punishes latency variance above all else, that profile is worth knowing about.

Fast but empty. A cluster of small models - Ministral 3B, Nova Micro, Nemotron Nano, Liquid's LFM-2 - answer in under a second and play barely better than a random-move opponent. Sub-second latency has become table stakes. The differentiator is what you get per second, and for some models the honest answer is almost nothing.

Smart but late. Every large reasoning model lost essentially all of its fast games on time, and not because it played badly: in several games the model was clearly winning on the board when its flag fell. GPT-5.5 at medium reasoning takes around ten seconds per move and lost 20 of 24 blitz games on the clock. Kimi K2.5 averaged 57 seconds per move even at rapid. Claude Fable 5, the most expensive player in the field at over a dollar per rapid game, paces itself well enough to flag less often than GPT-5.5, but the chess it buys with those seconds is modest.

The fast board: bullet and lightning ratings side by side with response time, effective output speed, time-loss rate and cost per game.

The single most instructive row in the whole benchmark is Gemini 3.1 Pro. At lightning and bullet it scores effectively zero: it cannot physically play. At blitz it manages 683 while still losing half its games to the clock. At rapid, with ten minutes to think, it posts 1164 - the strongest chess anyone played in the entire dataset. The best brain in the field is unreachable in under three minutes. That is the exact trade-off this benchmark exists to measure, drawn in one row of a table.

The full clock spectrum. Reading a row left to right shows what every extra second of time budget buys. Gemini 3.1 Pro goes from unplayable to the strongest player in the dataset.

Are Gemini models just better at chess?

Partly, and it is worth being straight about it. Google models do not just win the fast formats - Gemini 3.5 Flash and 3.1 Pro also top rapid, where speed barely matters. That pattern says the Gemini family has genuinely stronger chess to begin with, presumably from training data and post-training choices, and no amount of speed normalisation changes that.

But the fast formats add something the rapid column cannot fake. Watch what happens to each model's rating as the clock shrinks: the Gemini Flash models degrade gently, while nearly everyone else falls off a cliff. Holding on to your ability when the time budget collapses is a different skill from having the ability in the first place, and it is the one BulletBench is built to expose. We also see behaviour that has nothing to do with chess knowledge: several models visibly compress their thinking when told the clock is short - Gemini 3.1 Pro nearly halves its per-move time between rapid and lightning - which is exactly the kind of budget-awareness you want in an agent component.

Is chess a valid proxy for routing and bulk decisions?

Honest answer: it measures one axis extremely well, and we would not use it alone.

What it gets right is the shape of the problem: sequential decisions under a hard deadline, with real consequences for both bad answers and slow answers, scored against ground truth that cannot be argued with. The latency we measure is real wall-clock API latency, including the provider's serving infrastructure - which sounds like a confound until you remember that a router experiences exactly that number. Chess also gives us calibrated opponents for free, which is how we can put a defensible rating with confidence intervals on every model instead of a vibe.

What it does not measure is general intelligence. Chess knowledge is part of the score, and models that happened to absorb more chess will look better than their general fast-thinking ability deserves - see the Gemini question above. A model could also be a poor chess player and a fine request classifier. That is why BulletBench sits alongside our task benchmarks rather than replacing them: this page tells you who can think under a deadline, the others tell you who can think about your problem.

Every featured game is replayable, move by move, with the model's think time and remaining clock shown per move.

It is also, frankly, the most watchable benchmark we have ever run. Every featured game on the page can be replayed move by move, with the model's thinking time and remaining clock shown as you step through. Watching a frontier model burn its last ten seconds on a move in a winning position is a better explanation of the speed-intelligence trade-off than any paragraph we could write.

This one stays live

BulletBench is not a one-off study. The harness, the opponent ladder and the rating system are built to run continuously, and we intend to treat every notable model release the same way: it gets a clock, it gets a ladder, and we find out within a day whether it can actually play bullet. The leaderboard, the confidence intervals and the auto-generated findings on the page update with every run.

On the roadmap: streaming time-to-first-token measurement, a puzzle-sprint companion mode with pre-rated positions for cheaper and even better-calibrated ratings, the same model entered at multiple reasoning settings so the speed dial itself becomes visible, and time-of-day replication so provider infrastructure variance is measured rather than assumed.

Explore the full leaderboard, sort it by whatever your latency budget cares about, and step through the games at springprompt.com/evals/bullet-chess. And if you want your own prompts and workloads benchmarked across models with this level of rigour, join the Spring Prompt waitlist.
