
Google Gemini 3 Review: The Benchmarks Actually Match the Hype 🤯

Ellis Crosby
3 min read

So, on Tuesday Google launched Gemini 3. The hype was massive leading up to this, and honestly? It is justified.

It is really, really good.

Trying to explain how good it is without getting bogged down in technical jargon is difficult, but the general consensus is pretty clear. Even Sam Altman tweeted his congratulations last night, calling it a "great model." When the head of the competition is being that gracious, you know something big just happened.

If you watched the GPT 5.1 launch last week, you might have felt it was a bit underwhelming. We were comparing it to Gemini 2.5 Flash and Pro, and the margins were thin. Gemini 3 feels like a completely different generation of model.

Let’s look at the benchmarks. I know, usually these charts are cherry-picked marketing fluff, but in this case, the gap between Gemini 3 and the rest of the field is staggering.

The "Everything" Exam (Humanities Last Exam)

First up is the oddly named "Humanity's Last Exam." Think of this as a broad test you would give a human. It covers math, physics, biology, and social sciences. It is a great way to see if an AI is well-rounded or just good at code.

Before this launch, Gemini 2.5 Pro was doing decent work here. GPT 5.1, which launched last week, scored a 26.5.

Gemini 3 Pro scored 37.5.

Nothing else is even in the thirties. It is a huge improvement in general intelligence.

The Business Owner Test (Vending Bench 2)

This is my absolute favorite benchmark from the launch because it is so practical.

Vending Bench 2 basically gives the AI control over a vending machine business. It has to handle pricing, deliveries, negotiating with vendors, and dealing with refund requests over a simulated timeframe.

The problem with older models is that they get "tired" or lose the plot after a while. They forget which vendors are scammers and which are reliable.

Gemini 3 didn't just survive the test; it thrived. The data shows it was a "persistent negotiator." It didn't give up after the first email. It kept pushing for lower prices from suppliers to maximize profit. It also identified the "friendly" vendors and routed more money to them, showing a level of emotional intelligence and long-term memory that we haven't really seen before.

It literally made the most money in the simulation.
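
To make that concrete, here is a toy sketch of what a Vending-Bench-style harness amounts to. Every name and rule below is an illustrative assumption, not the actual benchmark code; the real thing runs over a long simulated horizon and feeds the model emails, invoices, and refund requests:

    # Toy sketch of a Vending-Bench-style simulation loop.
    # All names and rules here are illustrative, not the real benchmark.
    import random
    from dataclasses import dataclass

    @dataclass
    class State:
        cash: float = 500.0
        stock: int = 0
        price: float = 2.00
        day: int = 0

    def environment_step(state, action):
        """Apply the agent's action for the day, then simulate sales."""
        units = action.get("restock", 0)
        if units:
            state.cash -= units * action.get("unit_cost", 1.00)  # pay the vendor
            state.stock += units
        state.price = action.get("price", state.price)
        sold = min(state.stock, random.randint(0, 20))  # crude demand model
        state.stock -= sold
        state.cash += sold * state.price
        return f"day {state.day}: sold {sold}, cash {state.cash:.2f}"

    def run(agent, days=365):
        state, history = State(), []  # full history = long-horizon memory pressure
        for _ in range(days):
            state.day += 1
            action = agent(state, history)  # in the benchmark, the LLM decides here
            history.append(environment_step(state, action))
        return state.cash  # the headline metric: final profit

    # Trivial baseline agent: restock when empty, never negotiates or adapts.
    print(run(lambda s, h: {"restock": 50} if s.stock == 0 else {}))

The reason older models "get tired" is visible in that history list: by day 200 it is enormous, and the model has to keep acting coherently on top of all of it.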

Visual Logic (ARC-AGI 2)

You are going to see the ARC-AGI graph a lot this week. This benchmark uses visual puzzles that are easy for humans but usually impossible for AI: the model has to look at a pattern of colored squares and figure out the hidden rule.

Historically, LLMs are terrible at this.

Gemini 3 with "Deep Think" enabled is hitting 45% accuracy. The previous best models were stuck around 18-20%. This is a massive leap in visual reasoning.
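
If you have never seen an ARC-style task, here is roughly what one looks like as raw data. The grids are a made-up toy example (real tasks are far trickier): the model gets a few input/output pairs, has to infer the hidden rule, and then applies it to a fresh input:

    # A made-up ARC-style task: grids of color codes (0 = black, 2 = red, ...).
    # The hidden rule in this toy example is "mirror the grid horizontally".
    train_pairs = [
        {"input":  [[2, 0, 0],
                    [2, 0, 0]],
         "output": [[0, 0, 2],
                    [0, 0, 2]]},
        {"input":  [[0, 3, 0],
                    [3, 0, 0]],
         "output": [[0, 3, 0],
                    [0, 0, 3]]},
    ]
    test_input = [[1, 1, 0],
                  [0, 0, 1]]

    # A solver must infer the rule from train_pairs alone, then apply it.
    def inferred_rule(grid):          # we "know" the rule here; the model doesn't
        return [row[::-1] for row in grid]

    assert all(inferred_rule(p["input"]) == p["output"] for p in train_pairs)
    print(inferred_rule(test_input))  # [[0, 1, 1], [1, 0, 0]]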

What This Means for You

Benchmarks are cool, but the real-world tests are wild. We are already seeing people build interactive apps in a single prompt.

  • One user built a 3D Lego editor in one shot.
  • Another created a fully animated simulation of a power plant.
  • The SVG generation for icons is finally perfect (no more weird, misaligned lines).

And if you look at the ScreenSpot-Pro benchmark, Gemini 3 is scoring 72.7 on understanding computer screens. The closest competitor is at 36.2. This means we are getting very close to AI agents that can actually take over your laptop to book flights or navigate complex websites, rather than just talking about it.
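
Conceptually, ScreenSpot-style tasks boil down to a simple contract, sketched below with made-up coordinates: given a screenshot and an instruction, the model predicts a click point, and it scores a hit only if that point lands inside the ground-truth element's bounding box:

    # Hypothetical ScreenSpot-style scoring; all data below is made up.
    def point_in_box(point, box):
        """box = (x1, y1, x2, y2); point = (x, y). Standard hit-test."""
        x, y = point
        x1, y1, x2, y2 = box
        return x1 <= x <= x2 and y1 <= y <= y2

    examples = [
        # (instruction, ground-truth bounding box of the target UI element)
        ("Click the 'Book flight' button", (840, 512, 1020, 560)),
        ("Open the date picker",           (300, 220, 360, 260)),
    ]
    # Pretend model predictions (screenshot -> (x, y) click coordinates).
    predictions = [(930, 535), (500, 240)]

    hits = sum(point_in_box(p, box) for p, (_, box) in zip(predictions, examples))
    print(f"accuracy: {hits / len(examples):.1%}")  # 50.0% in this toy case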

The Price of Performance

There is one catch. Gemini 3 Pro is about 20% more expensive than the previous 2.5 Pro model.

It is not an automatic "swap and forget" decision. For some tasks, it might be overkill. For others, specifically complex visual reasoning or long-context coding, it is absolutely worth the extra cost.
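
The math is worth running for your own workload. Here is a back-of-the-envelope sketch; the per-token price is a placeholder (check the current rate card before relying on it), and only the roughly 20% relative gap comes from this post:

    # Back-of-the-envelope cost comparison. PRICES ARE PLACEHOLDERS, not the
    # real rate card; the point is the ~20% relative difference noted above.
    old_price_per_mtok = 10.00                      # assumed 2.5 Pro $/1M tokens
    new_price_per_mtok = old_price_per_mtok * 1.20  # ~20% more, per this post

    monthly_tokens_m = 500  # e.g. 500M tokens/month through your pipeline
    delta = (new_price_per_mtok - old_price_per_mtok) * monthly_tokens_m
    print(f"extra cost at this volume: ${delta:,.2f}/month")  # $1,000.00 here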

Need a Second Opinion?

If you are wondering if your current prompts are ready for Gemini 3, or if the price hike is worth it for your specific use case, we can help.

At Spring Prompt, we are currently offering prompt audits. We will take your existing prompts, run them through our own evaluation set against Gemini 3, 2.5, and GPT 5.1, and tell you exactly which model gives you the best bang for your buck.
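
If you are curious what that kind of audit looks like structurally, here is a minimal skeleton. The model callers and the scorer are stand-ins, not our actual evaluation set; you would wire in real SDK clients and task-specific metrics before trusting the numbers:

    # Minimal cross-model prompt evaluation skeleton. call_model and score
    # are stand-ins; replace them with real API clients and real metrics.
    def call_model(model_name: str, prompt: str) -> str:
        """Stand-in for a real API call (e.g. via the provider's SDK)."""
        return f"[{model_name} response to: {prompt[:30]}...]"

    def score(response: str, expected: str) -> float:
        """Stand-in scorer: exact match. Real audits use per-task metrics."""
        return 1.0 if response.strip() == expected.strip() else 0.0

    eval_set = [  # (prompt, expected answer) pairs from your own workload
        ("Summarize this refund policy in one sentence: ...", "..."),
        ("Extract the invoice total from: ...", "..."),
    ]
    models = ["gemini-3-pro", "gemini-2.5-pro", "gpt-5.1"]  # names illustrative

    for model in models:
        avg = sum(score(call_model(model, p), e) for p, e in eval_set) / len(eval_set)
        print(f"{model}: {avg:.0%}")  # weigh accuracy against each model's price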

Book a no-commitment 30-minute consultation call here

Ellis Crosby

