
Google Gemini 3 Review: The Benchmarks Actually Match the Hype 🤯

Ellis Crosby
3 min read

So, on Tuesday Google launched Gemini 3. The hype was massive leading up to this, and honestly? It is justified.

It is really, really good.

Trying to explain how good it is without getting bogged down in technical jargon is difficult, but the general consensus is pretty clear. Even Sam Altman tweeted his congratulations last night, calling it a "great model." When the head of the competition is being that gracious, you know something big just happened.

If you watched the GPT 5.1 launch last week, you might have felt it was a bit underwhelming. We were comparing it to Gemini 2.5 Flash and Pro, and the margins were thin. Gemini 3 feels like a completely different generation of model.

Let’s look at the benchmarks. I know, usually these charts are cherry-picked marketing fluff, but in this case, the gap between Gemini 3 and the rest of the field is staggering.

The "Everything" Exam (Humanities Last Exam)

First up is the oddly named "Humanity's Last Exam." Think of this as a broad test you would give a human. It covers math, physics, biology, and social sciences. It is a great way to see if an AI is well-rounded or just good at code.

Before this launch, Gemini 2.5 Pro was doing decent work here. GPT 5.1, which launched last week, scored a 26.5.

Gemini 3 Pro scored 37.5.

Nothing else is even in the thirties. It is a huge improvement in general intelligence.

The Business Owner Test (Vending Bench 2)

This is my absolute favorite benchmark from the launch because it is so practical.

Vending Bench 2 basically gives the AI control over a vending machine business. It has to handle pricing, deliveries, negotiating with vendors, and dealing with refund requests over a simulated timeframe.

The problem with older models is that they get "tired" or lose the plot after a while. They forget which vendors are scammers and which are reliable.

Gemini 3 didn't just survive the test; it thrived. The data shows it was a "persistent negotiator." It didn't give up after the first email. It kept pushing for lower prices from suppliers to maximize profit. It also identified the "friendly" vendors and routed more money to them, showing a level of emotional intelligence and long-term memory that we haven't really seen before.

It literally made the most money in the simulation.
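
To make that concrete, here is a toy sketch of what a Vending-Bench-style harness amounts to. Every name and rule below is an illustrative assumption, not the actual benchmark code; the real thing runs over a long simulated horizon and feeds the model emails, invoices, and refund requests:

    # Toy sketch of a Vending-Bench-style simulation loop.
    # All names and rules here are illustrative, not the real benchmark.
    import random
    from dataclasses import dataclass

    @dataclass
    class State:
        cash: float = 500.0
        stock: int = 0
        price: float = 2.00
        day: int = 0

    def environment_step(state, action):
        """Apply the agent's action for the day, then simulate sales."""
        units = action.get("restock", 0)
        if units:
            state.cash -= units * action.get("unit_cost", 1.00)  # pay the vendor
            state.stock += units
        state.price = action.get("price", state.price)
        sold = min(state.stock, random.randint(0, 20))  # crude demand model
        state.stock -= sold
        state.cash += sold * state.price
        return f"day {state.day}: sold {sold}, cash {state.cash:.2f}"

    def run(agent, days=365):
        state, history = State(), []  # full history = long-horizon memory pressure
        for _ in range(days):
            state.day += 1
            action = agent(state, history)  # in the benchmark, the LLM decides here
            history.append(environment_step(state, action))
        return state.cash  # the headline metric: final profit

    # Trivial baseline agent: restock when empty, never negotiates or adapts.
    print(run(lambda s, h: {"restock": 50} if s.stock == 0 else {}))

The reason older models "get tired" is visible in that history list: by day 200 it is enormous, and the model has to keep acting coherently on top of all of it.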

Visual Logic (ARC-AGI 2)

You are going to see the ARC-AGI graph a lot this week. This benchmark uses visual puzzles that are easy for humans but usually impossible for AI: the model has to look at a pattern of colored squares and figure out the hidden rule.

Historically, LLMs are terrible at this.

Gemini 3 with "Deep Think" enabled is hitting 45% accuracy. The previous best models were stuck around 18-20%. This is a massive leap in visual reasoning.
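
If you have never seen an ARC-style task, here is roughly what one looks like as raw data. The grids are a made-up toy example (real tasks are far trickier): the model gets a few input/output pairs, has to infer the hidden rule, and then applies it to a fresh input:

    # A made-up ARC-style task: grids of color codes (0 = black, 2 = red, ...).
    # The hidden rule in this toy example is "mirror the grid horizontally".
    train_pairs = [
        {"input":  [[2, 0, 0],
                    [2, 0, 0]],
         "output": [[0, 0, 2],
                    [0, 0, 2]]},
        {"input":  [[0, 3, 0],
                    [3, 0, 0]],
         "output": [[0, 3, 0],
                    [0, 0, 3]]},
    ]
    test_input = [[1, 1, 0],
                  [0, 0, 1]]

    # A solver must infer the rule from train_pairs alone, then apply it.
    def inferred_rule(grid):          # we "know" the rule here; the model doesn't
        return [row[::-1] for row in grid]

    assert all(inferred_rule(p["input"]) == p["output"] for p in train_pairs)
    print(inferred_rule(test_input))  # [[0, 1, 1], [1, 0, 0]]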

What This Means for You

Benchmarks are cool, but the real-world tests are wild. We are already seeing people build interactive apps in a single prompt.

  • One user built a 3D Lego editor in one shot.
  • Another created a fully animated simulation of a power plant.
  • The SVG generation for icons is finally perfect (no more weird, misaligned lines).

And if you look at the ScreenSpot-Pro benchmark, Gemini 3 is scoring 72.7 on understanding computer screens. The closest competitor is at 36.2. This means we are getting very close to AI agents that can actually take over your laptop to book flights or navigate complex websites, rather than just talking about it.
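
Conceptually, ScreenSpot-style tasks boil down to a simple contract, sketched below with made-up coordinates: given a screenshot and an instruction, the model predicts a click point, and it scores a hit only if that point lands inside the ground-truth element's bounding box:

    # Hypothetical ScreenSpot-style scoring; all data below is made up.
    def point_in_box(point, box):
        """box = (x1, y1, x2, y2); point = (x, y). Standard hit-test."""
        x, y = point
        x1, y1, x2, y2 = box
        return x1 <= x <= x2 and y1 <= y <= y2

    examples = [
        # (instruction, ground-truth bounding box of the target UI element)
        ("Click the 'Book flight' button", (840, 512, 1020, 560)),
        ("Open the date picker",           (300, 220, 360, 260)),
    ]
    # Pretend model predictions (screenshot -> (x, y) click coordinates).
    predictions = [(930, 535), (500, 240)]

    hits = sum(point_in_box(p, box) for p, (_, box) in zip(predictions, examples))
    print(f"accuracy: {hits / len(examples):.1%}")  # 50.0% in this toy case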

The Price of Performance

There is one catch. Gemini 3 Pro is about 20% more expensive than the previous 2.5 Pro model.

It is not an automatic "swap and forget" decision. For some tasks, it might be overkill. For others, specifically complex visual reasoning or long-context coding, it is absolutely worth the extra cost.
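
The math is worth running for your own workload. Here is a back-of-the-envelope sketch; the per-token price is a placeholder (check the current rate card before relying on it), and only the roughly 20% relative gap comes from this post:

    # Back-of-the-envelope cost comparison. PRICES ARE PLACEHOLDERS, not the
    # real rate card; the point is the ~20% relative difference noted above.
    old_price_per_mtok = 10.00                      # assumed 2.5 Pro $/1M tokens
    new_price_per_mtok = old_price_per_mtok * 1.20  # ~20% more, per this post

    monthly_tokens_m = 500  # e.g. 500M tokens/month through your pipeline
    delta = (new_price_per_mtok - old_price_per_mtok) * monthly_tokens_m
    print(f"extra cost at this volume: ${delta:,.2f}/month")  # $1,000.00 here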

Need a Second Opinion?

If you are wondering if your current prompts are ready for Gemini 3, or if the price hike is worth it for your specific use case, we can help.

At Spring Prompt, we are currently offering prompt audits. We will take your existing prompts, run them through our own evaluation set against Gemini 3, 2.5, and GPT 5.1, and tell you exactly which model gives you the best bang for your buck.
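
If you are curious what that kind of audit looks like structurally, here is a minimal skeleton. The model callers and the scorer are stand-ins, not our actual evaluation set; you would wire in real SDK clients and task-specific metrics before trusting the numbers:

    # Minimal cross-model prompt evaluation skeleton. call_model and score
    # are stand-ins; replace them with real API clients and real metrics.
    def call_model(model_name: str, prompt: str) -> str:
        """Stand-in for a real API call (e.g. via the provider's SDK)."""
        return f"[{model_name} response to: {prompt[:30]}...]"

    def score(response: str, expected: str) -> float:
        """Stand-in scorer: exact match. Real audits use per-task metrics."""
        return 1.0 if response.strip() == expected.strip() else 0.0

    eval_set = [  # (prompt, expected answer) pairs from your own workload
        ("Summarize this refund policy in one sentence: ...", "..."),
        ("Extract the invoice total from: ...", "..."),
    ]
    models = ["gemini-3-pro", "gemini-2.5-pro", "gpt-5.1"]  # names illustrative

    for model in models:
        avg = sum(score(call_model(model, p), e) for p, e in eval_set) / len(eval_set)
        print(f"{model}: {avg:.0%}")  # weigh accuracy against each model's price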

Book a no-commitment 30-minute consultation call here

Ellis Crosby

