Google launched gemini-embedding-2-preview on March 10, 2026, as its first multimodal embedding model, with one shared embedding space for text, images, video, audio, and PDFs. Google specifically positions it for cross-modal semantic search, document retrieval, and recommendation-style similarity tasks.
That made it a pretty obvious model to test on two things we care about a lot at Spring Prompt:
- RAG over mixed-media documents
- Search flows where users combine text and images
So instead of posting a launch-day hot take, we ran a benchmark.
We compared three setups:
- Gemini Embedding 2 multimodal
- Gemini Embedding 1 text baseline
- CLIP multimodal
And we tested them on two different retrieval problems:
- NVIDIA's latest earnings materials (press release and PDF presentation) for chart / slide / document retrieval
- WANDS sofa search for text, image, and hybrid ecommerce retrieval
The datasets
1. NVIDIA earnings corpus
For the document side, we built a small financial-document retrieval corpus around recent NVIDIA earnings materials.
That corpus included things like:
- press release chunks
- earnings slides
- charts
- table-adjacent content
- finance footnotes and outlook material
This made it a good test bed for chart-aware RAG and document retrieval over mixed media, because the useful information wasn’t only in plain paragraph text. Some of it lived in charts, slide layouts, short labels, footnotes, and financial tables. The report specifically notes successful retrieval on items like the revenue chart, GAAP / non-GAAP P&L slides, Q1 FY27 outlook material, reconciliation/footnote content, and press-release chunks containing the core quarterly numbers.
This is exactly the kind of corpus that tends to break simplistic “text only” retrieval setups.
2. WANDS sofa subset
For ecommerce, we used a WANDS sofa subset with:
- product titles
- product text / descriptions
- product images
- hybrid text+image queries
The key thing here is that this subset contains a lot of near-duplicate sofas and sectionals. That makes it a useful test for whether a model can handle fine-grained visual similarity, vague shopping intent, and image-led product discovery. It also makes the benchmark more realistic, because real ecommerce catalogs are messy: similar names, similar photos, similar materials, similar shapes.
So in short:
- NVIDIA tested whether the model can retrieve the right chart, slide, or finance context from mixed document data
- WANDS tested whether the model can improve product search, especially when text and images both matter
The benchmark setup
We used five balanced query sets with 20 searches each:
- NVIDIA text: 20 text queries against the NVIDIA earnings corpus
- NVIDIA image: 20 image queries against the NVIDIA earnings corpus
- WANDS text: 20 product-text queries against the indexed WANDS sofa subset
- WANDS image: 20 product-image queries against the indexed WANDS sofa subset
- WANDS hybrid: 20 text+image queries against the indexed WANDS sofa subset
We scored each system with:
- Recall@1: how often the correct result was ranked first
- Recall@3 / Recall@5: how often the correct result appeared somewhere in the top 3 or top 5
- MRR: Mean Reciprocal Rank, which rewards ranking the right result near the top instead of burying it further down the list
That matters because retrieval quality is not just about “did it eventually find the right thing?” It’s about whether it found it fast enough to feel useful.
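For concreteness, here is how these metrics are typically computed. This is a generic sketch, not the benchmark's actual scoring script; the function and variable names are ours.

```python
def recall_at_k(ranked_ids, correct_id, k):
    """1.0 if the correct item appears in the top-k results, else 0.0."""
    return float(correct_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, correct_id):
    """1/rank of the first correct hit, or 0.0 if it never appears."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == correct_id:
            return 1.0 / rank
    return 0.0

def score_query_set(results, gold):
    """Average Recall@1/3/5 and MRR over a query set.

    results: one ranked list of result ids per query
    gold:    the correct id for each query
    """
    n = len(results)
    return {
        "R@1": sum(recall_at_k(r, g, 1) for r, g in zip(results, gold)) / n,
        "R@3": sum(recall_at_k(r, g, 3) for r, g in zip(results, gold)) / n,
        "R@5": sum(recall_at_k(r, g, 5) for r, g in zip(results, gold)) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in zip(results, gold)) / n,
    }
```

With 20 queries per set, each query moves Recall@k in steps of 0.05, which is why the tables below land on values like 0.45 and 0.90.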
One important detail from the report: for WANDS image queries, the text-only Gemini Embedding 1 baseline cannot embed the query image, so it returns no results there. And for WANDS text and hybrid queries, matching is done at the product level, so the text baseline still gets credit if it retrieves the text record for the right product.
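Product-level matching can be implemented by collapsing record-level rankings before scoring: a text record and an image record of the same product both count as the same hit. A minimal sketch (the lookup table `record_to_product` is hypothetical):

```python
def product_level_ranking(ranked_record_ids, record_to_product):
    """Collapse a record-level ranking to a product-level ranking.

    Keeps the first occurrence of each product, preserving rank order,
    so any record of the right product earns the retrieval credit.
    """
    seen, products = set(), []
    for rec in ranked_record_ids:
        pid = record_to_product[rec]
        if pid not in seen:
            seen.add(pid)
            products.append(pid)
    return products
```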
The headline result
Gemini Embedding 2 was the best all-round system.
That does not mean it won every category. It didn’t.
But it was the strongest default overall across:
- document text retrieval
- slide/chart image retrieval
- hybrid text+image product search
Honestly, that is a more useful result than “new model wins everything”.
What this means for RAG that can understand charts and image data
This was one of the clearest wins in the benchmark.
NVIDIA text search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.90 | 0.95 | 0.95 | 0.9167 |
| Gemini Embedding 1 baseline | 0.85 | 0.95 | 0.95 | 0.8917 |
| CLIP multimodal | 0.45 | 0.70 | 0.85 | 0.6100 |
NVIDIA image search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.90 | 1.00 | 1.00 | 0.9500 |
| Gemini Embedding 1 baseline | 0.00 | 0.00 | 0.00 | 0.0000 |
| CLIP multimodal | 0.90 | 1.00 | 1.00 | 0.9417 |
On the document side, the story is pretty simple:
- Gemini Embedding 2 was excellent on financial-document text retrieval
- Gemini Embedding 2 and CLIP were both excellent on slide/chart image lookup
- the text-only baseline could not do image-only retrieval at all
That is a strong signal for anyone building:
- analytics RAG
- investor-deck retrieval
- search over dashboards or PDFs
- support/search systems that need to understand charts, slides, and screenshots
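The mechanics behind all of these use cases are the same: embed every artifact (press-release chunk, slide image, chart) into the shared space once, then rank by cosine similarity against the query embedding. A minimal sketch, assuming embeddings already exist as vectors:

```python
import numpy as np

def top_k(query_vec, index_vecs, index_ids, k=5):
    """Rank indexed items by cosine similarity to a query embedding.

    query_vec:  (d,) query embedding (text or image, same shared space)
    index_vecs: (n, d) matrix of item embeddings
    index_ids:  n item identifiers, aligned with index_vecs rows
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per item
    order = np.argsort(-sims)[:k]     # best-first
    return [(index_ids[i], float(sims[i])) for i in order]
```

The one-shared-space property is what lets a text query land on a chart image, or an image query land on a text chunk, with no extra machinery.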
Example: revenue chart retrieval
One simple example was this query:
record quarterly revenue of 68.1 billion up 73 percent year over year
Gemini Embedding 2 hit the right press-release chunk at rank 1.
The Gemini Embedding 1 baseline’s first hit was at rank 3.
CLIP missed in the top 5.
That’s a nice example of where strong document semantics matter more than just page-level visual similarity.

Example: crop robustness on slide retrieval
We also tested a cropped image from the Data Center slide.
- Gemini Embedding 2: hit at rank 1
- CLIP: hit at rank 3
- Gemini Embedding 1 baseline: no result
That is exactly the kind of retrieval case I care about most in practice. People do not always type clean queries. Sometimes they upload a screenshot, crop a chart out of a deck, or paste in part of a slide.
(Figures omitted: the cropped query image and the correct target slide.)
So the claim I’d make here is:
Gemini Embedding 2 looks like a very strong fit for chart-aware and slide-aware RAG.
The report is also careful on an important point: this benchmark measures retrieval quality, not final free-form answer quality. So for high-stakes finance or analytics workflows, the safe production pattern is still:
- retrieve the right chart / table / source section
- extract or answer from that source
- cite the underlying artifact
That’s the right way to think about it: great at finding the right evidence, not automatically a replacement for grounded numeric QA.
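That retrieve-extract-cite pattern is easy to enforce in code. A minimal sketch, where `retriever` and `reader` are placeholders for whatever retrieval layer and answering model you use:

```python
def answer_with_citation(query, retriever, reader):
    """Retrieve-then-read: answer only from a retrieved artifact, and cite it.

    retriever(query) -> ranked list of (artifact_id, text) candidates
    reader(question, context) -> answer string grounded in the context
    """
    candidates = retriever(query)
    if not candidates:
        # Refuse rather than hallucinate when retrieval comes back empty.
        return {"answer": None, "source": None}
    artifact_id, context = candidates[0]
    return {"answer": reader(query, context), "source": artifact_id}
```

The point is structural: numbers in the answer always trace back to a named artifact, which is the property high-stakes finance workflows actually need.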
How this can improve ecommerce search
The ecommerce side was more nuanced, which honestly made it more interesting.
WANDS text search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.15 | 0.30 | 0.45 | 0.2517 |
| Gemini Embedding 1 baseline | 0.20 | 0.40 | 0.45 | 0.2875 |
| CLIP multimodal | 0.05 | 0.10 | 0.25 | 0.1017 |
WANDS image search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.25 | 0.55 | 0.80 | 0.4408 |
| Gemini Embedding 1 baseline | 0.00 | 0.00 | 0.00 | 0.0000 |
| CLIP multimodal | 0.40 | 0.65 | 0.80 | 0.5433 |
WANDS hybrid search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.40 | 0.60 | 0.75 | 0.5158 |
| Gemini Embedding 1 baseline | 0.05 | 0.05 | 0.15 | 0.0700 |
| CLIP multimodal | 0.25 | 0.60 | 0.70 | 0.4250 |
A few things jump out here.
1. Plain text product search is not where the magic is
On text-only sofa search, the Gemini Embedding 1 baseline was actually a bit better than Gemini Embedding 2, and both beat CLIP. So multimodal embeddings are not automatically a huge win for classic text-only catalog retrieval.
2. CLIP still matters on image-led product search
On pure product-image retrieval, CLIP was the strongest system overall:
- CLIP: R@1 = 0.40
- Gemini Embedding 2: R@1 = 0.25
Both reached Recall@5 = 0.80, but CLIP was more precise at rank 1.
3. Hybrid search is where Gemini Embedding 2 really stands out
This was the clearest commercial win.
On WANDS hybrid queries:
- Gemini Embedding 2: R@1 = 0.40, R@5 = 0.75
- CLIP: R@1 = 0.25, R@5 = 0.70
- Gemini Embedding 1 baseline: much weaker
That matters because hybrid search is much closer to how real users want to shop:
- “find something like this photo”
- “same style, but in leather”
- “I want this sectional, but smaller”
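When the retriever embeds each modality separately, one common way to serve these queries is to fuse the normalized text and image embeddings into a single query vector. This is a generic technique, not a claim about how Gemini Embedding 2 handles hybrid input internally; `alpha` is a tuning knob we made up for illustration:

```python
import numpy as np

def fuse_hybrid_query(text_vec, image_vec, alpha=0.5):
    """Blend normalized text and image query embeddings into one query vector.

    alpha near 1.0 lets the text intent dominate ("in leather");
    alpha near 0.0 lets the reference photo dominate ("like this one").
    """
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    fused = alpha * t + (1 - alpha) * v
    return fused / np.linalg.norm(fused)
```

The fused vector then goes through the same nearest-neighbor search as any single-modality query.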
Example: hybrid query where Gemini wins
One of the clearest showcase cases used the text:
sectional sofa with ottoman
...plus an image of the mendoza 103.5'' wide right hand facing sofa & chaise with ottoman.
- Gemini Embedding 2: hit at rank 1
- CLIP: hit at rank 2
- Gemini Embedding 1 baseline: missed the top 5
That is exactly the kind of query where the text is broad, the image is specific, and both need to matter.

Example: image-only case where CLIP wins
We also had a nice counterexample where CLIP was clearly better.
For the image of janna 73'' wide reversible sofa and chaise with ottoman:
- CLIP: hit at rank 1
- Gemini Embedding 2: hit at rank 5
- Gemini Embedding 1 baseline: no result

That’s a useful reminder that the story here is not:
“Gemini replaces everything.”
It’s more like:
- best single default system overall: Gemini Embedding 2
- best on plain text product search: the dedicated text embedding baseline
- best image-led product baseline: CLIP
- most interesting category for Gemini: hybrid retrieval
What I’d take away from this
Gemini Embedding 2 is a strong default if you want one multimodal retriever
If you need one retrieval layer that can handle documents, slides, charts, screenshots, product images, and mixed queries, Gemini Embedding 2 looks like a very strong starting point.
Text baselines still matter
On plain text search, especially product search, multimodal embeddings are not automatically a dramatic upgrade.
CLIP is still a serious baseline
Especially on image-heavy product retrieval, CLIP remains very relevant.
Hybrid retrieval is where things get really interesting
The most commercially useful result in the whole benchmark was not text search. It was the text+image setting, where Gemini Embedding 2 was best overall on WANDS hybrid queries.
That’s where multimodal retrieval stops being a neat demo and starts feeling like a better product experience.
Final thoughts
This is exactly why evals matter.
The useful conclusion from this benchmark was not “the new model wins everything.”
It was understanding:
- where it wins
- where it doesn’t
- and what that means for the retrieval stack you actually want to build
If you’re working on:
- chart-aware RAG
- document retrieval over decks and PDFs
- search-by-image
- multimodal ecommerce discovery
- embedding benchmarks
...then Gemini Embedding 2 is definitely worth testing.
And if your use case is strongly image-led, I would still keep CLIP in the benchmark rather than assuming the newest model automatically replaces it.
Closing
At Spring Prompt, this is exactly the kind of work we care about: structured evals, retrieval benchmarks, and figuring out where systems genuinely improve versus where they just look good in a launch demo.
We’ll be doing more experiments like this, and we’re also building more around the tooling side of LLM evaluation and benchmarking. Follow along if that’s your world too.