In 2013 we witnessed an evolutionary change in the NLP field thanks to the introduction of word embeddings which, combined with deep learning architectures, achieved human-level performance in many NLP tasks. With the introduction of the attention mechanism in 2017 the results improved further and, as a result, embeddings are quickly becoming the de facto standard for solving many NLP problems. In this presentation, you will learn how to generate and use embeddings for search purposes, with comparison metrics against more traditional relevance-based search engines. Moreover, I will provide some initial results from a paper currently under review that offers insight into hyperparameter tuning during the generation of embeddings.
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new with the old. Giancarlo Crocetti (Boehringer Ingelheim, USA)
5. Why Embeddings Search
1. Embeddings are good at identifying semantic and syntactic structures
2. A match can occur even without any match in query keywords
   • “what are the side effects of ibuprofen?”
   • “NSAIDs can cause a range of adverse reactions, especially …”
3. A search is nothing more than a dot product between two vectors
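Point 3 can be made concrete with a few lines of Python. The vectors below are toy 3-dimensional values standing in for real query and document embeddings; cosine similarity is simply the dot product of length-normalized vectors.

```python
import math

def dot(u, v):
    """Dot product between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy vectors standing in for real query/document embeddings.
query_vec = [0.2, 0.9, 0.1]   # e.g. "side effects of ibuprofen?"
doc_vec   = [0.3, 0.8, 0.2]   # e.g. "NSAIDs can cause adverse reactions ..."

score = cosine(query_vec, doc_vec)   # close to 1.0: a strong match
```

Note how the query and the document share no keywords, yet the vectors can still score high, which is exactly the point of embedding-based search.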
6. From Words to Documents
Given a document 𝑑ᵢ = {𝑤ᵢ₁, 𝑤ᵢ₂, …, 𝑤ᵢₘ}, a possible document embedding for 𝑑ᵢ is:

𝑣(𝑑ᵢ) = Σⱼ₌₁ᵐ 𝑣(𝑤ᵢⱼ)  (sum-pooling), or

𝑣(𝑑ᵢ) = (1/𝑚) Σⱼ₌₁ᵐ 𝑣(𝑤ᵢⱼ)  (mean-pooling).

Note: if 𝑤ᵢⱼ ∉ V, set 𝑣(𝑤ᵢⱼ) = 𝑣('unk'), or use subword embeddings like fastText.
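Mean-pooling with an 'unk' fallback can be sketched in a few lines. The word vectors below are toy 2-dimensional values; in practice they would come from a trained model such as Word2Vec or fastText.

```python
def mean_pool(doc_tokens, vectors, unk="unk"):
    """Document embedding as the mean of its word vectors.
    Out-of-vocabulary tokens fall back to the 'unk' vector."""
    rows = [vectors.get(t, vectors[unk]) for t in doc_tokens]
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

# Toy word vectors; real ones would come from Word2Vec, fastText, etc.
vectors = {
    "ibuprofen": [1.0, 0.0],
    "nsaid":     [0.8, 0.2],
    "unk":       [0.0, 0.0],
}
doc = ["ibuprofen", "nsaid", "zzz"]   # "zzz" is out of vocabulary
emb = mean_pool(doc, vectors)         # mean of the three mapped vectors
```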
7. Smooth Inverse Frequency (SIF) Pooling Method
S. Arora, Y. Liang, and T. Ma, “A Simple but tough-to-beat baseline for sentence embeddings.” ICLR, 2017
SIF weights each word vector by 𝑎/(𝑎 + 𝑝(𝑤)), where 𝑎 is a hyperparameter and 𝑝(𝑤) the word frequency, then averages the weighted vectors.
𝑢𝑢⊺ = projection onto the first singular vector of the Singular Value Decomposition: the strongest component, i.e., the information common to all vectors. Removing it strips the common pattern.
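A minimal sketch of SIF pooling, assuming NumPy is available. The sentences, word vectors, and frequency counts are toy values; the two steps are the ones in the formula above: frequency-weighted averaging, then removal of the 𝑢𝑢⊺ component.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_freq, a=1e-3):
    """SIF pooling (Arora et al., 2017): average word vectors weighted by
    a / (a + p(w)), then remove the first-singular-vector component."""
    total = sum(word_freq.values())
    embs = []
    for sent in sentences:
        rows = [vectors[w] * (a / (a + word_freq[w] / total)) for w in sent]
        embs.append(np.mean(rows, axis=0))
    X = np.stack(embs)
    # First right singular vector u: the direction common to all sentences.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - X @ np.outer(u, u)   # remove the uu^T projection from each row

# Toy inputs standing in for real embeddings and corpus frequencies.
vectors = {"drug": np.array([1.0, 0.2, 0.0]),
           "dose": np.array([0.9, 0.1, 0.1]),
           "pain": np.array([0.2, 1.0, 0.3])}
word_freq = {"drug": 50, "dose": 30, "pain": 20}
sents = [["drug", "dose"], ["pain", "drug"], ["dose", "pain"]]
sif = sif_embeddings(sents, vectors, word_freq)
```

After the projection is removed, all sentence embeddings lie in a subspace one dimension smaller, which is what "removing the common pattern" means geometrically.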
9. Different Search Cases
• Classic Search: keyword search
  • Use token-based embeddings
  • Most similar sentence in a document
• Document Search: human-like questions
  • Sentence or paragraph embeddings
  • Best document match based on the question
• Media Search: images, sounds, etc.
  • Object embeddings
  • Best match based on media characteristics
10. What do you Need?
1. A good document embedding
2. Select your measure of similarity
3. Generate the vector for your query
4. Find the most similar items to the query vector
Faiss (Facebook AI)
Non-Metric Space Library (NMSLIB)
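The four steps can be sketched end to end in plain Python; at scale, a library such as Faiss or NMSLIB would replace the brute-force loop in step 4. All vectors here are toy values.

```python
import math

def cosine(u, v):
    """Step 2: cosine similarity as the measure of similarity."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den

# Step 1: precomputed (toy) document embeddings.
doc_embeddings = {
    "doc_ibuprofen": [0.9, 0.1, 0.2],
    "doc_weather":   [0.1, 0.9, 0.1],
    "doc_nsaids":    [0.8, 0.2, 0.3],
}

# Step 3: the query vector (in practice, the same model embeds the query).
query = [0.85, 0.15, 0.25]

# Step 4: brute-force nearest neighbors; Faiss/NMSLIB do this at scale.
ranked = sorted(doc_embeddings,
                key=lambda d: cosine(query, doc_embeddings[d]),
                reverse=True)
```

The two pharmacology documents rank ahead of the unrelated one, even though the ranking never looks at keywords.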
11. Generating
Good Quality
Embeddings
1. Meditate on your real needs: KISS
2. Use your data: make it domain-specific
3. Refrain from jumping to the latest
technologies
• Might be better, but they are slower
• They require transfer learning, and you
can forget about one-shot learning
• You might not need context awareness with
domain specific data
4. Simpler embeddings are easier to
generate
12. Testing your
Embeddings
1. You must formally test your model
2. Different embeddings for different tasks
3. Extrinsic evaluations are good but …
4. Hyperparameter tuning is the key*
5. Hyperparameter tuning is task-oriented*
O. Levy, Y. Goldberg, and I. Dagan, “Improving Distributional Similarity with Lessons Learned from Word Embeddings”. MIT Press, 2015
A. Gladkova, A. Drozd, and S. Matsuoka, “Analogy-based detection of morphological and semantic relations”. Association for Computational Linguistics, 2016
14. Word2Vec
• The higher the dimension, the better the performance
• Skip-gram performs better than CBOW
• These are very general statements:
  • They do not provide any insight on other hyperparameters
  • How does the corpus size influence the quality?
  • What about specific tasks, like classification?
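The skip-gram vs. CBOW distinction is really a difference in how training examples are built from the same window. A minimal sketch of the two pairing schemes (the sentence and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts each context word separately."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: the whole context window jointly predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        pairs.append((ctx, center))
    return pairs

sentence = ["nsaids", "cause", "adverse", "reactions"]
sg = skipgram_pairs(sentence)   # one (center, context) pair per context word
cb = cbow_pairs(sentence)       # one (context-list, center) example per position
```

Skip-gram generates many more (and sparser) training examples per sentence, one reason it tends to do better on rare words.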
15. Intrinsic Testing
Syntactic Relationships:
• Apple and apples (regular plurals)
• Go and went (verb conjugations)
• Angry and angrier (comparative adjectives)
• Healthy and healthiest (superlative adjectives)
• Many more …
Semantic Relationships:
• DaVinci and Italy (name-nationality)
• Boy and girl (male-female)
• Car and automobile (synonym)
• Many more …
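Intrinsic analogy tests are usually run as vector arithmetic: solve a : b :: c : ? by finding the word closest to 𝑣(b) − 𝑣(a) + 𝑣(c). The toy vectors below are constructed so the analogies hold by design; a real test would use trained embeddings and a large benchmark set.

```python
import math

# Toy vectors built so the analogies hold by construction.
vecs = {
    "boy":   [1.0, 0.0,  1.0],
    "girl":  [1.0, 0.0, -1.0],
    "man":   [0.0, 1.0,  1.0],
    "woman": [0.0, 1.0, -1.0],
    "car":   [0.5, 0.5,  3.0],   # unrelated distractor word
}

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Solve a : b :: c : ? via v(b) - v(a) + v(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))
```

A model's intrinsic score is simply the fraction of such analogies it gets right across the syntactic and semantic categories listed above.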
17. Findings
• Use skip-gram instead of CBOW
• High dimensionality is not critical (a surprising finding)
  • Start with 128 or 256
• The smaller the corpus, the smaller the embedding (e.g., use 64)
• The larger the embedding size, the larger the window
• Corpus size matters
  • 5,000,000 documents or more
  • With larger corpus sizes use a window size of 10 or more
19. Conclusions
• Look at your use case: is this an embedding-search case?
• Start simple: e.g., Word2Vec
• Use your own data, or transfer-learn using your data
• Create your baseline before employing more complex models
• Hyperparameter tuning is more important than fancy models