Google launched gemini-embedding-2-preview on March 10, 2026, as its first multimodal embedding model, with one shared embedding space for text, images, video, audio, and PDFs. Google specifically positions it for cross-modal semantic search, document retrieval, and recommendation-style similarity tasks.
That made it a pretty obvious model to test on two things we care about a lot at Spring Prompt:
- RAG over mixed-media documents
- Search flows where users combine text and images
So instead of posting a launch-day hot take, we ran a benchmark.
We compared three setups:
- Gemini Embedding 2 multimodal
- Gemini Embedding 1 text baseline
- CLIP multimodal
And we tested them on two different retrieval problems:
- NVIDIA's latest earnings materials (press release and PDF presentation) for chart / slide / document retrieval
- WANDS sofa search for text, image, and hybrid ecommerce retrieval
The datasets
1. NVIDIA earnings corpus
For the document side, we built a small financial-document retrieval corpus around recent NVIDIA earnings materials.
That corpus included things like:
- press release chunks
- earnings slides
- charts
- table-adjacent content
- finance footnotes and outlook material
This made it a good test bed for chart-aware RAG and document retrieval over mixed media, because the useful information wasn’t only in plain paragraph text. Some of it lived in charts, slide layouts, short labels, footnotes, and financial tables. The report specifically notes successful retrieval on items like the revenue chart, GAAP / non-GAAP P&L slides, Q1 FY27 outlook material, reconciliation/footnote content, and press-release chunks containing the core quarterly numbers.
This is exactly the kind of corpus that tends to break simplistic “text only” retrieval setups.
2. WANDS sofa subset
For ecommerce, we used a WANDS sofa subset with:
- product titles
- product text / descriptions
- product images
- hybrid text+image queries
The key thing here is that this subset contains a lot of near-duplicate sofas and sectionals. That makes it a useful test for whether a model can handle fine-grained visual similarity, vague shopping intent, and image-led product discovery. It also makes the benchmark more realistic, because real ecommerce catalogs are messy: similar names, similar photos, similar materials, similar shapes.
So in short:
- NVIDIA tested whether the model can retrieve the right chart, slide, or finance context from mixed document data
- WANDS tested whether the model can improve product search, especially when text and images both matter
The benchmark setup
We used five balanced query sets with 20 searches each:
- NVIDIA text: 20 text queries against the NVIDIA earnings corpus
- NVIDIA image: 20 image queries against the NVIDIA earnings corpus
- WANDS text: 20 product-text queries against the indexed WANDS sofa subset
- WANDS image: 20 product-image queries against the indexed WANDS sofa subset
- WANDS hybrid: 20 text+image queries against the indexed WANDS sofa subset
We scored each system with:
- Recall@1: how often the correct result was ranked first
- Recall@3 / Recall@5: how often the correct result appeared somewhere in the top 3 or top 5
- MRR: Mean Reciprocal Rank, which rewards ranking the right result near the top instead of burying it further down the list
That matters because retrieval quality is not just about “did it eventually find the right thing?” It’s about whether it found it fast enough to feel useful.
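For concreteness, here is how these metrics are typically computed. This is a generic sketch, not the benchmark's actual scoring script; the function and variable names are ours.

```python
def recall_at_k(ranked_ids, correct_id, k):
    """1.0 if the correct item appears in the top-k results, else 0.0."""
    return float(correct_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, correct_id):
    """1/rank of the first correct hit, or 0.0 if it never appears."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == correct_id:
            return 1.0 / rank
    return 0.0

def score_query_set(results, gold):
    """Average Recall@1/3/5 and MRR over a query set.

    results: one ranked list of result ids per query
    gold:    the correct id for each query
    """
    n = len(results)
    return {
        "R@1": sum(recall_at_k(r, g, 1) for r, g in zip(results, gold)) / n,
        "R@3": sum(recall_at_k(r, g, 3) for r, g in zip(results, gold)) / n,
        "R@5": sum(recall_at_k(r, g, 5) for r, g in zip(results, gold)) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in zip(results, gold)) / n,
    }
```

With 20 queries per set, each query moves Recall@k in steps of 0.05, which is why the tables below land on values like 0.45 and 0.90.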
One important detail from the report: for WANDS image queries, the text-only Gemini Embedding 1 baseline cannot embed the query image, so it returns no results there. And for WANDS text and hybrid queries, matching is done at the product level, so the text baseline still gets credit if it retrieves the text record for the right product.
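Product-level matching can be implemented by collapsing record-level rankings before scoring: a text record and an image record of the same product both count as the same hit. A minimal sketch (the lookup table `record_to_product` is hypothetical):

```python
def product_level_ranking(ranked_record_ids, record_to_product):
    """Collapse a record-level ranking to a product-level ranking.

    Keeps the first occurrence of each product, preserving rank order,
    so any record of the right product earns the retrieval credit.
    """
    seen, products = set(), []
    for rec in ranked_record_ids:
        pid = record_to_product[rec]
        if pid not in seen:
            seen.add(pid)
            products.append(pid)
    return products
```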
The headline result
Gemini Embedding 2 was the best all-round system.
That does not mean it won every category. It didn’t.
But it was the strongest default overall across:
- document text retrieval
- slide/chart image retrieval
- hybrid text+image product search
Honestly, that is a more useful result than “new model wins everything”.
What this means for RAG that can understand charts and image data
This was one of the clearest wins in the benchmark.
NVIDIA text search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.90 | 0.95 | 0.95 | 0.9167 |
| Gemini Embedding 1 baseline | 0.85 | 0.95 | 0.95 | 0.8917 |
| CLIP multimodal | 0.45 | 0.70 | 0.85 | 0.6100 |
NVIDIA image search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.90 | 1.00 | 1.00 | 0.9500 |
| Gemini Embedding 1 baseline | 0.00 | 0.00 | 0.00 | 0.0000 |
| CLIP multimodal | 0.90 | 1.00 | 1.00 | 0.9417 |
On the document side, the story is pretty simple:
- Gemini Embedding 2 was excellent on financial-document text retrieval
- Gemini Embedding 2 and CLIP were both excellent on slide/chart image lookup
- the text-only baseline could not do image-only retrieval at all
That is a strong signal for anyone building:
- analytics RAG
- investor-deck retrieval
- search over dashboards or PDFs
- support/search systems that need to understand charts, slides, and screenshots
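The mechanics behind all of these use cases are the same: embed every artifact (press-release chunk, slide image, chart) into the shared space once, then rank by cosine similarity against the query embedding. A minimal sketch, assuming embeddings already exist as vectors:

```python
import numpy as np

def top_k(query_vec, index_vecs, index_ids, k=5):
    """Rank indexed items by cosine similarity to a query embedding.

    query_vec:  (d,) query embedding (text or image, same shared space)
    index_vecs: (n, d) matrix of item embeddings
    index_ids:  n item identifiers, aligned with index_vecs rows
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per item
    order = np.argsort(-sims)[:k]     # best-first
    return [(index_ids[i], float(sims[i])) for i in order]
```

The one-shared-space property is what lets a text query land on a chart image, or an image query land on a text chunk, with no extra machinery.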
Example: revenue chart retrieval
One simple example was this query:
record quarterly revenue of 68.1 billion up 73 percent year over year
Gemini Embedding 2 hit the right press-release chunk at rank 1.
The Gemini Embedding 1 baseline’s first hit was at rank 3.
CLIP missed in the top 5.
That’s a nice example of where strong document semantics matter more than just page-level visual similarity.

Example: crop robustness on slide retrieval
We also tested a cropped image from the Data Center slide.
- Gemini Embedding 2: hit at rank 1
- CLIP: hit at rank 3
- Gemini Embedding 1 baseline: no result
That is exactly the kind of retrieval case I care about most in practice. People do not always type clean queries. Sometimes they upload a screenshot, crop a chart out of a deck, or paste in part of a slide.
(Figures omitted: the cropped query image and the correct target slide.)
So the claim I’d make here is:
Gemini Embedding 2 looks like a very strong fit for chart-aware and slide-aware RAG.
The report is also careful on an important point: this benchmark measures retrieval quality, not final free-form answer quality. So for high-stakes finance or analytics workflows, the safe production pattern is still:
- retrieve the right chart / table / source section
- extract or answer from that source
- cite the underlying artifact
That’s the right way to think about it: great at finding the right evidence, not automatically a replacement for grounded numeric QA.
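That retrieve-extract-cite pattern is easy to enforce in code. A minimal sketch, where `retriever` and `reader` are placeholders for whatever retrieval layer and answering model you use:

```python
def answer_with_citation(query, retriever, reader):
    """Retrieve-then-read: answer only from a retrieved artifact, and cite it.

    retriever(query) -> ranked list of (artifact_id, text) candidates
    reader(question, context) -> answer string grounded in the context
    """
    candidates = retriever(query)
    if not candidates:
        # Refuse rather than hallucinate when retrieval comes back empty.
        return {"answer": None, "source": None}
    artifact_id, context = candidates[0]
    return {"answer": reader(query, context), "source": artifact_id}
```

The point is structural: numbers in the answer always trace back to a named artifact, which is the property high-stakes finance workflows actually need.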
How this can improve ecommerce search
The ecommerce side was more nuanced, which honestly made it more interesting.
WANDS text search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.15 | 0.30 | 0.45 | 0.2517 |
| Gemini Embedding 1 baseline | 0.20 | 0.40 | 0.45 | 0.2875 |
| CLIP multimodal | 0.05 | 0.10 | 0.25 | 0.1017 |
WANDS image search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.25 | 0.55 | 0.80 | 0.4408 |
| Gemini Embedding 1 baseline | 0.00 | 0.00 | 0.00 | 0.0000 |
| CLIP multimodal | 0.40 | 0.65 | 0.80 | 0.5433 |
WANDS hybrid search
| System | R@1 | R@3 | R@5 | MRR |
|---|---|---|---|---|
| Gemini Embedding 2 multimodal | 0.40 | 0.60 | 0.75 | 0.5158 |
| Gemini Embedding 1 baseline | 0.05 | 0.05 | 0.15 | 0.0700 |
| CLIP multimodal | 0.25 | 0.60 | 0.70 | 0.4250 |
A few things jump out here.
1. Plain text product search is not where the magic is
On text-only sofa search, the Gemini Embedding 1 baseline was actually a bit better than Gemini Embedding 2, and both beat CLIP. So multimodal embeddings are not automatically a huge win for classic text-only catalog retrieval.
2. CLIP still matters on image-led product search
On pure product-image retrieval, CLIP was the strongest system overall:
- CLIP: R@1 = 0.40
- Gemini Embedding 2: R@1 = 0.25
Both reached Recall@5 = 0.80, but CLIP was more precise at rank 1.
3. Hybrid search is where Gemini Embedding 2 really stands out
This was the clearest commercial win.
On WANDS hybrid queries:
- Gemini Embedding 2: R@1 = 0.40, R@5 = 0.75
- CLIP: R@1 = 0.25, R@5 = 0.70
- Gemini Embedding 1 baseline: much weaker
That matters because hybrid search is much closer to how real users want to shop:
- “find something like this photo”
- “same style, but in leather”
- “I want this sectional, but smaller”
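When the retriever embeds each modality separately, one common way to serve these queries is to fuse the normalized text and image embeddings into a single query vector. This is a generic technique, not a claim about how Gemini Embedding 2 handles hybrid input internally; `alpha` is a tuning knob we made up for illustration:

```python
import numpy as np

def fuse_hybrid_query(text_vec, image_vec, alpha=0.5):
    """Blend normalized text and image query embeddings into one query vector.

    alpha near 1.0 lets the text intent dominate ("in leather");
    alpha near 0.0 lets the reference photo dominate ("like this one").
    """
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    fused = alpha * t + (1 - alpha) * v
    return fused / np.linalg.norm(fused)
```

The fused vector then goes through the same nearest-neighbor search as any single-modality query.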
Example: hybrid query where Gemini wins
One of the clearest showcase cases used the text:
sectional sofa with ottoman
...plus an image of the mendoza 103.5'' wide right hand facing sofa & chaise with ottoman.
- Gemini Embedding 2: hit at rank 1
- CLIP: hit at rank 2
- Gemini Embedding 1 baseline: missed the top 5
That is exactly the kind of query where the text is broad, the image is specific, and both need to matter.

Example: image-only case where CLIP wins
We also had a nice counterexample where CLIP was clearly better.
For the image of janna 73'' wide reversible sofa and chaise with ottoman:
- CLIP: hit at rank 1
- Gemini Embedding 2: hit at rank 5
- Gemini Embedding 1 baseline: no result

That’s a useful reminder that the story here is not:
“Gemini replaces everything.”
It’s more like:
- best single default system overall: Gemini Embedding 2
- best on plain text product search: the dedicated text embedding baseline
- best image-led product baseline: CLIP
- most interesting category for Gemini: hybrid retrieval
What I’d take away from this
Gemini Embedding 2 is a strong default if you want one multimodal retriever
If you need one retrieval layer that can handle documents, slides, charts, screenshots, product images, and mixed queries, Gemini Embedding 2 looks like a very strong starting point.
Text baselines still matter
On plain text search, especially product search, multimodal embeddings are not automatically a dramatic upgrade.
CLIP is still a serious baseline
Especially on image-heavy product retrieval, CLIP remains very relevant.
Hybrid retrieval is where things get really interesting
The most commercially useful result in the whole benchmark was not text search. It was the text+image setting, where Gemini Embedding 2 was best overall on WANDS hybrid queries.
That’s where multimodal retrieval stops being a neat demo and starts feeling like a better product experience.
Final thoughts
This is exactly why evals matter.
The useful conclusion from this benchmark was not “the new model wins everything.”
It was understanding:
- where it wins
- where it doesn’t
- and what that means for the retrieval stack you actually want to build
If you’re working on:
- chart-aware RAG
- document retrieval over decks and PDFs
- search-by-image
- multimodal ecommerce discovery
- embedding benchmarks
...then Gemini Embedding 2 is definitely worth testing.
And if your use case is strongly image-led, I would still keep CLIP in the benchmark rather than assuming the newest model automatically replaces it.
Closing
At Spring Prompt, this is exactly the kind of work we care about: structured evals, retrieval benchmarks, and figuring out where systems genuinely improve versus where they just look good in a launch demo.
We’ll be doing more experiments like this, and we’re also building more around the tooling side of LLM evaluation and benchmarking. Follow along if that’s your world too.