Gemini Embedding 2 Just Launched - So We Benchmarked It

Ellis Crosby
8 min read

Google launched gemini-embedding-2-preview on March 10, 2026 as its first multimodal embedding model, with one shared embedding space for text, images, video, audio, and PDFs. Google specifically positions it for cross-modal semantic search, document retrieval, and recommendation-style similarity tasks.

That made it a pretty obvious model to test on two things we care about a lot at Spring Prompt:

  1. RAG over mixed-media documents
  2. Search flows where users combine text and images

So instead of posting a launch-day hot take, we ran a benchmark.

We compared three setups:

  • Gemini Embedding 2 multimodal
  • Gemini Embedding 1 text baseline
  • CLIP multimodal

And we tested them on two different retrieval problems:

  • NVIDIA's latest earnings materials (press release and PDF presentation) for chart, slide, and document retrieval
  • WANDS sofa search for text, image, and hybrid ecommerce retrieval

The datasets

1. NVIDIA earnings corpus

For the document side, we built a small financial-document retrieval corpus around recent NVIDIA earnings materials.

That corpus included things like:

  • press release chunks
  • earnings slides
  • charts
  • table-adjacent content
  • finance footnotes and outlook material

This made it a good test bed for chart-aware RAG and document retrieval over mixed media, because the useful information wasn’t only in plain paragraph text. Some of it lived in charts, slide layouts, short labels, footnotes, and financial tables. The report specifically notes successful retrieval on items like the revenue chart, GAAP / non-GAAP P&L slides, Q1 FY27 outlook material, reconciliation/footnote content, and press-release chunks containing the core quarterly numbers.

This is exactly the kind of corpus that tends to break simplistic “text-only” retrieval setups.

2. WANDS sofa subset

For ecommerce, we used a WANDS sofa subset with:

  • product titles
  • product text / descriptions
  • product images
  • hybrid text+image queries

The key thing here is that this subset contains a lot of near-duplicate sofas and sectionals. That makes it a useful test for whether a model can handle fine-grained visual similarity, vague shopping intent, and image-led product discovery. It also makes the benchmark more realistic, because real ecommerce catalogs are messy: similar names, similar photos, similar materials, similar shapes.

So in short:

  • NVIDIA tested whether the model can retrieve the right chart, slide, or finance context from mixed document data
  • WANDS tested whether the model can improve product search, especially when text and images both matter

The benchmark setup

We used five balanced query sets with 20 searches each:

  • NVIDIA text: 20 text queries against the NVIDIA earnings corpus
  • NVIDIA image: 20 image queries against the NVIDIA earnings corpus
  • WANDS text: 20 product-text queries against the indexed WANDS sofa subset
  • WANDS image: 20 product-image queries against the indexed WANDS sofa subset
  • WANDS hybrid: 20 text+image queries against the indexed WANDS sofa subset

We scored each system with:

  • Recall@1: how often the correct result was ranked first
  • Recall@3 / Recall@5: how often the correct result appeared somewhere in the top 3 or top 5
  • MRR: Mean Reciprocal Rank, which rewards getting the right result near the top instead of buried further down the page

That matters because retrieval quality is not just about “did it eventually find the right thing?” It’s about whether the right result shows up high enough in the ranking to feel useful.
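To make the scoring concrete, here is a minimal sketch of how Recall@k and MRR are computed over ranked result lists (toy data; only the rank of the correct item matters):

```python
def recall_at_k(ranked_ids, correct_id, k):
    """1.0 if the correct item appears in the top-k results, else 0.0."""
    return 1.0 if correct_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, correct_id):
    """1/rank of the correct item, or 0.0 if it was not retrieved."""
    if correct_id in ranked_ids:
        return 1.0 / (ranked_ids.index(correct_id) + 1)
    return 0.0

def score_query_set(results):
    """results: list of (ranked_ids, correct_id) pairs, one per query."""
    n = len(results)
    return {
        "R@1": sum(recall_at_k(r, c, 1) for r, c in results) / n,
        "R@3": sum(recall_at_k(r, c, 3) for r, c in results) / n,
        "R@5": sum(recall_at_k(r, c, 5) for r, c in results) / n,
        "MRR": sum(reciprocal_rank(r, c) for r, c in results) / n,
    }

# Two toy queries: one hit at rank 1, one hit at rank 3.
demo = [(["a", "b", "c"], "a"), (["x", "y", "z"], "z")]
print(score_query_set(demo))  # R@1 = 0.5, MRR = (1 + 1/3) / 2
```

This is why a system can look fine on Recall@5 but poor on MRR: late hits still count for recall, but MRR penalizes them.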

One important detail from the report: for WANDS image queries, the Gemini Embedding 1 baseline is effectively unsupported, so it returns no result there. And for WANDS text and hybrid queries, matching is done at the product level, so the text baseline still gets credit if it retrieves the text record for the right product.
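Product-level matching can be sketched like this (record and field names here are hypothetical, not the report's actual schema): collapse the ranked record list to products before scoring, so a text-only retriever gets credit for finding the right product via its text record.

```python
# Hypothetical record -> product mapping: text and image records for the
# same product share a product id, so retrieving either one counts.
record_to_product = {
    "sofa_123_text": "sofa_123",
    "sofa_123_image": "sofa_123",
    "sofa_456_text": "sofa_456",
}

def product_level_ranks(ranked_record_ids):
    """Collapse a ranked list of records to the first rank of each product."""
    first_rank = {}
    for rank, rec in enumerate(ranked_record_ids, start=1):
        pid = record_to_product[rec]
        first_rank.setdefault(pid, rank)
    return first_rank

ranks = product_level_ranks(["sofa_456_text", "sofa_123_image", "sofa_123_text"])
print(ranks["sofa_123"])  # 2: first record for that product, whatever the modality
```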


The headline result

Gemini Embedding 2 was the best all-round system.

That does not mean it won every category. It didn’t.

But it was the strongest default overall across:

  • document text retrieval
  • slide/chart image retrieval
  • hybrid text+image product search

Honestly, that is a more useful result than “new model wins everything”.


What this means for RAG that can understand charts and image data

This was one of the clearest wins in the benchmark.

NVIDIA text queries

System                          R@1    R@3    R@5    MRR
Gemini Embedding 2 multimodal   0.90   0.95   0.95   0.9167
Gemini Embedding 1 baseline     0.85   0.95   0.95   0.8917
CLIP multimodal                 0.45   0.70   0.85   0.6100

NVIDIA image queries

System                          R@1    R@3    R@5    MRR
Gemini Embedding 2 multimodal   0.90   1.00   1.00   0.9500
Gemini Embedding 1 baseline     0.00   0.00   0.00   0.0000
CLIP multimodal                 0.90   1.00   1.00   0.9417

On the document side, the story is pretty simple:

  • Gemini Embedding 2 was excellent on financial-document text retrieval
  • Gemini Embedding 2 and CLIP were both excellent on slide/chart image lookup
  • the text-only baseline could not do image-only retrieval at all

That is a strong signal for anyone building:

  • analytics RAG
  • investor-deck retrieval
  • search over dashboards or PDFs
  • support/search systems that need to understand charts, slides, and screenshots

Example: revenue chart retrieval

One simple example was this query:

record quarterly revenue of 68.1 billion up 73 percent year over year

Gemini Embedding 2 hit the right press-release chunk at rank 1.
The Gemini Embedding 1 baseline’s first hit was at rank 3.
CLIP missed in the top 5.

That’s a nice example of where strong document semantics matter more than just page-level visual similarity.

NVIDIA revenue chart

Example: crop robustness on slide retrieval

We also tested a cropped image from the Data Center slide.

  • Gemini Embedding 2: hit at rank 1
  • CLIP: hit at rank 3
  • Gemini Embedding 1 baseline: no result

That is exactly the kind of retrieval case I care about most in practice. People do not always type clean queries. Sometimes they upload a screenshot, crop a chart out of a deck, or paste in part of a slide.

Query crop:

Cropped Data Center query image

Correct target slide:

NVIDIA Data Center slide

So the claim I’d make here is:

Gemini Embedding 2 looks like a very strong fit for chart-aware and slide-aware RAG.

The report is also careful on an important point: this benchmark measures retrieval quality, not final free-form answer quality. So for high-stakes finance or analytics workflows, the safe production pattern is still:

  1. retrieve the right chart / table / source section
  2. extract or answer from that source
  3. cite the underlying artifact

That’s the right way to think about it: great at finding the right evidence, not automatically a replacement for grounded numeric QA.
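That three-step pattern can be sketched as a tiny pipeline (toy embeddings and a stand-in extraction function; everything here is illustrative, not the report's implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: (artifact_id, embedding) pairs. In practice these would be
# multimodal embeddings of press-release chunks, slides, and charts.
index = [
    ("press_release_chunk_3", [0.9, 0.1, 0.0]),
    ("revenue_chart_slide", [0.1, 0.9, 0.0]),
]

def retrieve_then_answer(query_vec, extract_fn):
    """Step 1: retrieve the best-matching artifact.
    Step 2: extract or answer from that artifact only.
    Step 3: cite the underlying artifact alongside the answer."""
    best_id, _ = max(index, key=lambda item: cosine(query_vec, item[1]))
    return {"answer": extract_fn(best_id), "source": best_id}

result = retrieve_then_answer([0.0, 1.0, 0.0],
                              lambda doc_id: f"extracted from {doc_id}")
print(result["source"])  # revenue_chart_slide
```

The key design choice is that the answer step only sees the retrieved artifact, and the artifact id travels with the answer so it can be cited.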


What this means for ecommerce product search

The ecommerce side was more nuanced, which honestly made it more interesting.

WANDS text queries

System                          R@1    R@3    R@5    MRR
Gemini Embedding 2 multimodal   0.15   0.30   0.45   0.2517
Gemini Embedding 1 baseline     0.20   0.40   0.45   0.2875
CLIP multimodal                 0.05   0.10   0.25   0.1017

WANDS image queries

System                          R@1    R@3    R@5    MRR
Gemini Embedding 2 multimodal   0.25   0.55   0.80   0.4408
Gemini Embedding 1 baseline     0.00   0.00   0.00   0.0000
CLIP multimodal                 0.40   0.65   0.80   0.5433

WANDS hybrid queries

System                          R@1    R@3    R@5    MRR
Gemini Embedding 2 multimodal   0.40   0.60   0.75   0.5158
Gemini Embedding 1 baseline     0.05   0.05   0.15   0.0700
CLIP multimodal                 0.25   0.60   0.70   0.4250

A few things jump out here.

1. Plain text product search is not where the magic is

On text-only sofa search, the Gemini Embedding 1 baseline was actually a bit better than Gemini Embedding 2, and both beat CLIP. So multimodal embeddings are not automatically a huge win for classic text-only catalog retrieval.

2. CLIP still leads on pure image product search

On pure product-image retrieval, CLIP was the strongest system overall:

  • CLIP: R@1 = 0.40
  • Gemini Embedding 2: R@1 = 0.25

Both reached Recall@5 = 0.80, but CLIP was more precise at rank 1.

3. Hybrid search is where Gemini Embedding 2 really stands out

This was the clearest commercial win.

On WANDS hybrid queries:

  • Gemini Embedding 2: R@1 = 0.40, R@5 = 0.75
  • CLIP: R@1 = 0.25, R@5 = 0.70
  • Gemini Embedding 1 baseline: much weaker

That matters because hybrid search is much closer to how real users want to shop:

  • “find something like this photo”
  • “same style, but in leather”
  • “I want this sectional, but smaller”

Example: hybrid query where Gemini wins

One of the clearest showcase cases used the text:

sectional sofa with ottoman

...plus an image of the mendoza 103.5'' wide right hand facing sofa & chaise with ottoman.

  • Gemini Embedding 2: hit at rank 1
  • CLIP: hit at rank 2
  • Gemini Embedding 1 baseline: missed the top 5

That is exactly the kind of query where the text is broad, the image is specific, and both need to matter.

Mendoza right-hand sectional

Example: image-only case where CLIP wins

We also had a nice counterexample where CLIP was clearly better.

For the image of janna 73'' wide reversible sofa and chaise with ottoman:

  • CLIP: hit at rank 1
  • Gemini Embedding 2: hit at rank 5
  • Gemini Embedding 1 baseline: no result

Janna reversible sectional

That’s a useful reminder that the story here is not:

“Gemini replaces everything.”

It’s more like:

  • best single default system overall: Gemini Embedding 2
  • best pure text baseline: still a strong text embedder
  • best image-led product baseline: CLIP
  • most interesting category for Gemini: hybrid retrieval

What I’d take away from this

Gemini Embedding 2 is a strong default if you want one multimodal retriever

If you need one retrieval layer that can handle documents, slides, charts, screenshots, product images, and mixed queries, Gemini Embedding 2 looks like a very strong starting point.

Text baselines still matter

On plain text search, especially product search, multimodal embeddings are not automatically a dramatic upgrade.

CLIP is still a serious baseline

Especially on image-heavy product retrieval, CLIP remains very relevant.

Hybrid retrieval is where things get really interesting

The most commercially useful result in the whole benchmark was not text search. It was the text+image setting, where Gemini Embedding 2 was best overall on WANDS hybrid queries.

That’s where multimodal retrieval stops being a neat demo and starts feeling like a better product experience.


Final thoughts

This is exactly why evals matter.

The useful conclusion from this benchmark was not “the new model wins everything.”
It was understanding:

  • where it wins
  • where it doesn’t
  • and what that means for the retrieval stack you actually want to build

If you’re working on:

  • chart-aware RAG
  • document retrieval over decks and PDFs
  • search-by-image
  • multimodal ecommerce discovery
  • embedding benchmarks

...then Gemini Embedding 2 is definitely worth testing.

And if your use case is strongly image-led, I would still keep CLIP in the benchmark rather than assuming the newest model automatically replaces it.


Closing

At Spring Prompt, this is exactly the kind of work we care about: structured evals, retrieval benchmarks, and figuring out where systems genuinely improve versus where they just look good in a launch demo.

We’ll be doing more experiments like this, and we’re also building more around the tooling side of LLM evaluation and benchmarking. Follow along if that’s your world too.
