Embedding Vs Relevancy Search
Comparing the Old with the New
1. Embedding Search
2. Document Level Embeddings
3. Selecting or Creating Embeddings
4. Results and Conclusions
Keyword Search
Characteristics
• Considerable document pre-processing
  • Document to text
  • Text to tokens
  • Named entities
  • Structure analysis
• Based on efficient keyword or phrase match
• Basic relevancy model improved with boosting (term or field)
• It works well with a dedicated team
Embedding Search
Characteristics
• Lighter document pre-processing
  • Document to text
  • Text to (simple) tokens
• Inefficient and less flexible
• Similarity based on context
• Retrieval based on neighbor search
Why Embeddings Search
1. Embeddings are good at identifying semantic and syntactic structures
2. A match can occur even when no query keyword appears in the document
“what are the side effects of ibuprofen?”
“NSAIDs can cause a range of adverse reactions, especially …”
3. A search is nothing more than a dot product between two vectors
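As a minimal sketch of the third point, the snippet below scores a query against two passages with a dot product. The three-dimensional vectors are made up for illustration (real ones would come from a trained embedding model), and normalizing to unit length so that the dot product equals cosine similarity is an assumption, not something the slides prescribe.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings; the values are invented for this example.
query_vec = normalize([0.9, 0.1, 0.3])      # "what are the side effects of ibuprofen?"
passage_vec = normalize([0.8, 0.2, 0.4])    # "NSAIDs can cause a range of adverse reactions ..."
unrelated_vec = normalize([0.0, 1.0, 0.0])  # an off-topic passage

score_related = dot(query_vec, passage_vec)
score_unrelated = dot(query_vec, unrelated_vec)
```

With these toy values the NSAID passage scores higher than the off-topic one even though it shares no keyword with the query.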
From Words to Documents
Given a document $d_i = \{w_{i1}, w_{i2}, \dots, w_{im}\}$
A possible document embedding for $d_i$ is:
$v(d_i) = \sum_{j=1}^{m} v(w_{ij})$ or
$v(d_i) = \frac{1}{m} \sum_{j=1}^{m} v(w_{ij})$ (mean-pooling).
Note: if $w_{ij} \notin V$, then $v(w_{ij}) = v(\text{'unk'})$, or use embeddings like fastText
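The mean-pooling variant, including the fallback to the 'unk' vector for out-of-vocabulary words, could be sketched as follows. The function name `mean_pool` and the two-dimensional toy vocabulary are hypothetical and serve only as illustration.

```python
def mean_pool(tokens, vectors, unk_token="unk"):
    """Average the word vectors of a document; out-of-vocabulary words use the 'unk' vector."""
    if not tokens:
        raise ValueError("cannot embed an empty document")
    dim = len(vectors[unk_token])
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token, vectors[unk_token])
        for k in range(dim):
            total[k] += vec[k]
    return [x / len(tokens) for x in total]

# Hypothetical 2-dimensional vocabulary, for illustration only.
toy_vectors = {"pain": [1.0, 0.0], "relief": [0.0, 1.0], "unk": [0.5, 0.5]}

# "ibuprofen" is not in the toy vocabulary, so it contributes the 'unk' vector.
doc_vec = mean_pool(["pain", "relief", "ibuprofen"], toy_vectors)
```

Dropping the final division gives the plain sum variant.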
Smooth Inverse Frequency (SIF) Pooling Method
S. Arora, Y. Liang, and T. Ma, “A Simple but tough-to-beat baseline for sentence embeddings.” ICLR, 2017
[Formula annotations: hyperparameter; word frequency]
$u$ = first singular vector: the strongest component of the Singular Value Decomposition
$uu^\top$ = projection onto the information common to all vectors: subtracting it removes the common pattern
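In the cited paper, each word vector is weighted by $a / (a + p(w))$, where $a$ is the hyperparameter and $p(w)$ the word frequency; the weighted vectors are averaged, and the projection onto the first singular vector $u$ is then subtracted from every document vector. The sketch below follows those two steps in plain Python. The function name, the toy vocabulary, the probabilities, and the use of power iteration in place of a full SVD are all assumptions made for illustration; documents are assumed to be non-empty and fully in vocabulary.

```python
import math

def sif_embeddings(docs, vectors, word_prob, a=1e-3, iters=100):
    """SIF pooling sketch: frequency-weighted average, then remove the common component."""
    dim = len(next(iter(vectors.values())))
    pooled = []
    for tokens in docs:
        vec = [0.0] * dim
        for w in tokens:
            weight = a / (a + word_prob[w])  # rare words get larger weights
            for k in range(dim):
                vec[k] += weight * vectors[w][k]
        pooled.append([x / len(tokens) for x in vec])

    # Power iteration for u, the first right singular vector of the pooled matrix
    # (used here instead of a full SVD to keep the sketch dependency-free).
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[k] * u[k] for k in range(dim)) for r in pooled]
        nxt = [sum(p * r[k] for p, r in zip(proj, pooled)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]

    # Apply (I - u u^T) to each document vector.
    result = []
    for r in pooled:
        p = sum(r[k] * u[k] for k in range(dim))
        result.append([r[k] - p * u[k] for k in range(dim)])
    return result, u

# Hypothetical vocabulary and unigram probabilities, for illustration only.
toy_vectors = {"the": [1.0, 1.0], "drug": [1.0, 0.0], "pain": [0.0, 1.0]}
toy_prob = {"the": 0.5, "drug": 0.001, "pain": 0.001}
sif_vecs, common = sif_embeddings([["the", "drug"], ["the", "pain"]], toy_vectors, toy_prob)
```

In this toy setup the frequent word "the" receives a weight close to zero, and the component shared by both documents is removed, leaving vectors that point in opposite directions.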
Document Encoding – Doc2Vec
Different Search Cases
• Classic Search: Keyword Search
  • Use token-based embeddings
  • Most similar sentence in a document
• Document Search: human-like questions
  • Sentence or paragraph embeddings
  • Best document match based on the question
• Media Search: images, sounds, etc.
  • Object embeddings
  • Best match based on media characteristics
What do you Need?
1. A good document embedding
2. Select your measure of similarity
3. Generate the vector for your query
4. Find the most similar items to the query vector
Libraries for similarity search: Faiss (Facebook AI), Non-Metric Space Library
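The four steps can be sketched with an exhaustive scan, which is enough for small collections; dedicated similarity-search libraries replace the scan with an index. Cosine similarity is only one possible choice for step 2, and the document identifiers, vectors, and function names below are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two vectors (one possible choice for step 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Step 1: hypothetical document embeddings (normally produced by a trained model).
doc_vecs = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9], "doc_c": [0.7, 0.3]}
# Step 3: hypothetical query vector, produced by the same model as the documents.
query_vec = [1.0, 0.0]
# Step 4: find the most similar items to the query vector.
hits = top_k(query_vec, doc_vecs, k=2)
```

The query must be embedded with the same model as the documents; otherwise the similarity scores are not comparable.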
Generating Good Quality Embeddings
1. Meditate on your real needs: KISS
2. Use your data: make it domain-specific
3. Refrain from using the latest technologies
  • They might be better, but they are slower
  • You must transfer learn, and forget about one-shot learning
  • You might not need context awareness with domain-specific data
4. Simpler embeddings are easier to generate
Testing your Embeddings
1. You must formally test your model
2. Different embeddings for different tasks
3. Extrinsic evaluations are good but …
4. Hyperparameter tuning is the key*
5. Hyperparameter tuning is task-oriented*
O. Levy, Y. Goldberg, and I. Dagan, “Improving Distributional Similarity with Lessons Learned from Word Embeddings”. MIT Press, 2015
A. Gladkova, A. Drozd, and S. Matsuoka, “Analogy-based detection of morphological and semantic relations”. Association for Computational Linguistics, 2016
The Wild-West of Embeddings
The fight over the 1%
Word2Vec
• The higher the dimension, the better the performance.
• Skip-gram is better than CBOW
• These are very general statements:
  • They do not provide any insight into other hyperparameters
  • How does the corpus size influence the quality?
  • What about specific tasks, like classification?
Intrinsic Testing
Syntactic Relationships:
• Apple and apples (regular plurals)
• Go and went (verb conjugations)
• Angry and angrier (comparative adjectives)
• Healthy and healthiest (superlative adjectives)
• Many more …
Semantic Relationships:
• DaVinci and Italy (name-nationality)
• Boy and girl (male-female)
• Car and automobile (synonym)
• Many more …
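One common way to run such an intrinsic test is the analogy task: given "apple is to apples as car is to ?", look for the word whose vector is closest to v(apples) - v(apple) + v(car). The sketch below assumes that vector-offset formulation; the function names and the toy three-dimensional vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to v(b) - v(a) + v(c), excluding the three input words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical vectors: the third dimension encodes "plural" in this toy space.
toy_vectors = {
    "apple": [1.0, 0.0, 0.0], "apples": [1.0, 0.0, 1.0],
    "car": [0.0, 1.0, 0.0], "cars": [0.0, 1.0, 1.0],
    "boy": [0.5, 0.5, 0.0],
}
answer = solve_analogy("apple", "apples", "car", toy_vectors)
```

Accuracy over a list of such questions, grouped by relationship type, gives one score per syntactic or semantic category.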
Extrinsic Testing
Classification: Class A, Class B, Class C, Class D
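An extrinsic test measures how well the embeddings serve a downstream task. As a rough sketch, the snippet below trains a nearest-centroid classifier on labeled document embeddings and reports accuracy on held-out examples; comparing this number across embedding configurations is the extrinsic evaluation. The classifier choice, the function names, and the toy data are assumptions for illustration, not the setup used in the talk.

```python
def centroid(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

def nearest_centroid_accuracy(train, held_out):
    """Fit one centroid per class on train; return accuracy on held_out.

    Both arguments are lists of (embedding, label) pairs.
    """
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}

    def predict(vec):
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda lab: sum((x - c) ** 2 for x, c in zip(vec, centroids[lab])),
        )

    correct = sum(1 for vec, label in held_out if predict(vec) == label)
    return correct / len(held_out)

# Hypothetical labeled embeddings, for illustration only.
train = [([1.0, 0.1], "Class A"), ([0.9, 0.0], "Class A"),
         ([0.0, 1.0], "Class B"), ([0.1, 0.9], "Class B")]
held_out = [([0.8, 0.2], "Class A"), ([0.2, 0.7], "Class B")]
accuracy = nearest_centroid_accuracy(train, held_out)
```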
Findings
• Use Skip-gram instead of CBOW
• A high-dimensional space is not critical (surprising finding)
  • Start with 128 or 256
  • The smaller the corpus, the smaller the embedding (e.g., use 64)
  • The larger the embedding size, the larger the window
• Corpus size matters
  • 5,000,000 documents or more
  • With larger corpus sizes use a window size of 10 or more
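These findings could be captured as two hyperparameter presets. The keyword names follow the gensim 4.x `Word2Vec` constructor (`sg`, `vector_size`, `window`); the presets, the helper `pick_params`, the small-corpus window of 5, and the use of 5,000,000 documents as a switching threshold are one possible reading of the findings, not values confirmed by the talk.

```python
# Keyword names follow gensim's Word2Vec constructor (gensim 4.x naming).
# The presets are an illustrative reading of the findings, not measured values.
LARGE_CORPUS_PARAMS = {
    "sg": 1,             # 1 selects Skip-gram, 0 selects CBOW
    "vector_size": 128,  # suggested starting point (128 or 256)
    "window": 10,        # larger corpora: window of 10 or more
}
SMALL_CORPUS_PARAMS = {
    "sg": 1,
    "vector_size": 64,   # smaller corpus, smaller embedding
    "window": 5,         # assumption: no window is given for small corpora
}

def pick_params(num_documents, large_corpus_threshold=5_000_000):
    """Choose a preset by corpus size (the threshold is an assumption)."""
    if num_documents >= large_corpus_threshold:
        return dict(LARGE_CORPUS_PARAMS)
    return dict(SMALL_CORPUS_PARAMS)

params = pick_params(8_000_000)
# With gensim installed, these could be passed as Word2Vec(sentences=corpus, **params).
```

Treat the presets as a starting point for task-specific tuning rather than as final settings.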
Embedding Search
[Pipeline diagram: sentence/paragraph-level tokenization; Word2Vec or Doc2Vec; query; vector representation; similar documents]
Conclusions
• Look at your use case: is this an embedding search case?
• Start simple, e.g., Word2Vec
• Use your own data or transfer learn using your data
• Create your baseline before employing more complex models
• Hyperparameter tuning is more important than fancy models
Giancarlo Crocetti
St. John’s University
crocettg@stjohns.edu

AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new with the old. Giancarlo Crocetti (Boehringer Ingelheim, USA)

 
horny (9316020077 ) Goa Call Girls Service by VIP Call Girls in Goa
horny (9316020077 ) Goa  Call Girls Service by VIP Call Girls in Goahorny (9316020077 ) Goa  Call Girls Service by VIP Call Girls in Goa
horny (9316020077 ) Goa Call Girls Service by VIP Call Girls in Goa
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 

AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new with the old Giancarlo Crocetti (Boehringer Ingelheim, USA)

  • 1. Embedding Vs Relevancy Search: Comparing the Old with the New
  • 2. (1) Embedding Search; (2) Document Level Embeddings; (3) Selecting or Creating Embeddings; (4) Results and Conclusions
  • 3. Keyword Search Characteristics: considerable document pre-processing (document to text, text to tokens, named entities, structure analysis); based on efficient keyword or phrase match; basic relevancy model improved with boosting (term or field); works well with a dedicated team.
  • 4. Embedding Search Characteristics: lighter document pre-processing (document to text, text to (simple) tokens); inefficient and less flexible; similarity based on context; retrieval based on neighbor search.
  • 5. Why Embeddings Search: (1) Embeddings are good at identifying semantic and syntactic structures. (2) A match can occur even when none of the query keywords match: the query “what are the side effects of ibuprofen?” can retrieve “NSAIDs can cause a range of adverse reactions, especially …”. (3) A search is nothing more than a dot product between two vectors.
  • 6. From Words to Documents: given a document $d_i = \{w_{i1}, w_{i2}, \dots, w_{im}\}$, a possible document embedding for $d_i$ is $v(d_i) = \sum_{j=1}^{m} v(w_{ij})$ or $v(d_i) = \frac{1}{m} \sum_{j=1}^{m} v(w_{ij})$ (mean-pooling). Note: if $w_{ij} \notin V$, then $v(w_{ij}) = v(\text{'unk'})$, or use embeddings like fastText.
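  • 7. Smooth Inverse Frequency (SIF) Pooling Method (S. Arora, Y. Liang, and T. Ma, “A Simple but Tough-to-Beat Baseline for Sentence Embeddings,” ICLR, 2017): each word vector $v(w_{ij})$ is weighted by $\frac{a}{a + p(w_{ij})}$ before averaging, where $a$ is a hyperparameter and $p(w_{ij})$ is the word frequency. $u$ is the singular vector of the strongest component of the Singular Value Decomposition; $u u^{\top}$ captures the information common to all vectors, and subtracting it removes the common pattern.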
  • 9. Different Search Cases. Classic Search (keyword search): use token-based embeddings; most similar sentence in a document. Document Search (human-like questions): sentence or paragraph embeddings; best document match based on the question. Media Search (image, sounds, etc.): object embeddings; best match based on media characteristics.
  • 10. What do you Need? (1) A good document embedding. (2) Select your measure of similarity. (3) Generate the vector for your query. (4) Find the most similar items to the query vector, e.g. with Faiss (Facebook AI) or the Non-Metric Space Library.
  • 11. Generating Good Quality Embeddings: (1) Meditate on your real needs: KISS. (2) Use your data: make it domain-specific. (3) Refrain from using the latest technologies: they might be better, but they are slower; you must transfer learn, and forget about one-shot learning; you might not need context awareness with domain-specific data. (4) Simpler embeddings are easier to generate.
  • 12. Testing your Embeddings: (1) You must formally test your model. (2) Different embeddings for different tasks. (3) Extrinsic evaluations are good but … (4) Hyperparameter tuning is the key*. (5) Hyperparameter tuning is task oriented*. *O. Levy, Y. Goldberg, and I. Dagan, “Improving Distributional Similarity with Lessons Learned from Word Embeddings,” MIT Press, 2015; A. Gladkova, A. Drozd, and S. Matsuoka, “Analogy-based detection of morphological and semantic relations,” Association for Computational Linguistics, 2016.
  • 13. The Wild-West of Embeddings: the fight over the 1%
  • 14. Word2Vec: the higher the dimension, the better the performance; Skip-gram is better than CBOW. These are very general statements: they do not provide any insight into other hyperparameters. How does the corpus size influence the quality? What about specific tasks, like classification?
  • 15. Intrinsic Testing. Syntactic relationships: apple and apples (regular plurals); go and went (verb conjugations); angry and angrier (comparative adjectives); healthy and healthiest (superlative adjectives); many more … Semantic relationships: DaVinci and Italy (name-nationality); boy and girl (male-female); car and automobile (synonym); many more …
  • 17. Findings: use Skip-gram instead of CBOW; a high-dimensional space is not critical (surprising finding), so start with 128 or 256; the smaller the corpus, the smaller the embedding (e.g., use 64); the larger the embedding size, the larger the window; corpus size matters (5,000,000 documents or more); with larger corpus sizes use a window size of 10 or more.
  • 18. Embedding Search (diagram labels): sentence/paragraph level tokenization; Word2Vec; Doc2Vec; query vector representation; similar documents.
  • 19. Conclusions: look at your use case (is this a Search Embedding case?); start simple, e.g., Word2Vec; use your own data or transfer learn using your data; create your baseline before employing more complex models; hyperparameter tuning is more important than fancy models.
  • 20. Giancarlo Crocetti, St. John’s University, crocettg@stjohns.edu
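
The pooling ideas on slides 6 and 7 can be sketched in plain Python as follows. This is a minimal illustration, not the presenter's code: the toy vectors, the helper names (`word_probabilities`, `mean_pool`, `sif_pool`) and the value of `a` are assumptions made for the example, and the common-component removal step of SIF (subtracting the projection onto `u`) is left out for brevity.

```python
from collections import Counter


def word_probabilities(corpus):
    """Estimate p(w) as the relative frequency of each token in the corpus."""
    counts = Counter(tok for doc in corpus for tok in doc)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def _weighted_average(tokens, vectors, weight_of, unk):
    """Sum weight(w) * v(w) over the tokens and divide by the token count."""
    if not tokens:
        raise ValueError("cannot embed an empty document")
    total = [0.0] * len(vectors[unk])
    for tok in tokens:
        vec = vectors.get(tok, vectors[unk])  # out-of-vocabulary -> v('unk')
        w = weight_of(tok)
        total = [t + w * x for t, x in zip(total, vec)]
    return [t / len(tokens) for t in total]


def mean_pool(tokens, vectors, unk="unk"):
    """Mean-pooling: every word vector gets the same weight."""
    return _weighted_average(tokens, vectors, lambda tok: 1.0, unk)


def sif_pool(tokens, vectors, word_prob, a=1e-3, unk="unk"):
    """SIF weighting: frequent words are down-weighted by a / (a + p(w))."""
    return _weighted_average(
        tokens, vectors, lambda tok: a / (a + word_prob.get(tok, 0.0)), unk
    )


# Toy data (hypothetical 2-dimensional vectors, for illustration only).
vectors = {"ibuprofen": [1.0, 0.0], "the": [0.0, 1.0], "unk": [0.0, 0.0]}
corpus = [["the", "ibuprofen"], ["the", "the"]]
probs = word_probabilities(corpus)  # {"the": 0.75, "ibuprofen": 0.25}

print(mean_pool(["the", "ibuprofen"], vectors))                # [0.5, 0.5]
print(sif_pool(["the", "ibuprofen"], vectors, probs, a=0.25))  # [0.25, 0.125]
```

In the toy output the frequent word "the" contributes less to the SIF vector than to the mean-pooled one, which is the effect the weighting is meant to achieve.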
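
Slide 5 describes a search as a dot product between two vectors, and slide 10 lists the steps: embed the documents, pick a similarity measure, embed the query, and find the most similar items. A brute-force version of the last step might look like the sketch below. The document ids, the 2-dimensional vectors and the function names are hypothetical; at scale, the linear scan would typically be replaced by a neighbor-search library such as those named on slide 10.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by the dot product of unit-length vectors,
    which is the cosine similarity, and return the k best."""
    q = normalize(query_vec)
    scored = [(dot(q, normalize(d)), doc_id) for doc_id, d in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical document embeddings and query vector.
docs = {
    "nsaid_safety": [0.9, 0.1],
    "cooking": [0.0, 1.0],
    "dosage": [0.6, 0.6],
}
query = [1.0, 0.0]

for doc_id, score in top_k(query, docs, k=2):
    print(doc_id, round(score, 3))
```

Normalizing both sides makes the dot product independent of vector length; whether that is the right similarity measure is the choice slide 10 asks you to make for your own use case.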