SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Search Powered by Deep Learning
- From Content Similarity to Semantic Similarity
2
John Kuriakose
Principal Architect
Big Data & Analytics - IIP
Ashish Vishwas
Kaduskar
Architect
Big Data & Analytics - IIP
Debanjan Mahata
Senior Research Associate
Big Data & Analytics - IIP
Speakers
Infosys Information Platform (IIP)
3
4
Agenda
What are Word and
Doc Embeddings ?
How do they enrich
Search Applications ?
How to build them for a
search application ?
How to integrate them in a
search application ?
Searching across Research Articles
5
Ingestion
Enrichment
Content Selection
recurrent neural networks
• Data Preprocessing
• Indexing
• Training different word
embedding models
• Query Expansion
• Similar Article
Recommendation
• Keyword Extraction
Humans Vs Machines
6
Use the right
words/phrases
Capture the
widest range of
topics
Set the right
“window” of
context
1147001 Research Abstracts
Building the Models
7
Preprocessing Training Prediction Evaluation
Preprocessing
8
Sentence splitting
Phrase Tokenization
Removal of strings containing only numeric characters
Removal of functional words like ‘accordingly’, ‘although’, etc
Removal of named entities of types “DATE”, “TIME”, “PERCENT”, “MONEY”, “QUANTITY”, “ORDINAL”, “CARDINAL”
Word Embeddings
9
Source: http://sebastianruder.com/word-embeddings-1/index.html Mikolov et al, (2013a).
Semantic Similarity between Words
10
Kusner et al (2015)
Word2Vec
11
CBOW Skipgram
Mikolov et al, (2013a).
Word2Vec Training
12
Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
13
Word2Vec Training
Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
14
Word2Vec Training
Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
15
Word2Vec Training
Source:
http://mccormickml.com/2016/04/19/word2vec-
tutorial-the-skip-gram-model/
Fasttext
16
• Very Similar to Word2vec
• Primary Difference
– Takes into account the internal structure of words while learning word
representations
– The resultant word vector is a combination of vectors for its constituent character
ngrams
• Extremely fast training
• Very good for morphologically rich languages
• Takes into account both “Semantic as well as Syntactic Similarity”
• Very good performances in Syntactic Similarity tasks
Doc2Vec
17
Mikolov et al. (2014)
Training Parameters
18
• Skipgram
• Negative Sampling
• Dimensions = 1000
• Context Window Size = 5
• Learning Rate = 0.025
• Trained on unigrams and
phrases
• Skipgram
• Negative Sampling
• Dimensions = 1000
• Max Length of Char Ngrams = 6
• Min Length of Char Ngrams = 3
• Learning Rate = 0.05
• Trained on unigrams and
phrases
FasttextWord2vec
• Distributed Memory
• Dimensions = 1000
• Window Size = 10
• Epochs = 10
• Initial Learning Rate = 0.025
• Trained on unigrams and
phrases
Doc2Vec
Query Expansion
19
nlp
natural language processing
information extraction
machine translation
word sense disambiguation
knowledge discovery
text mining
text categorization
natural language generation
• Getting rid of thesaurus based or dictionary based query expansion.
• How many phrases to use for expansion ?
• Which model ?
• How to integrate the trained models with a search application for query expansion ?
20
Recommending Similar Research Articles
Content
Similarity
Vs
Semantic
Similarity
Ranked Keyphrase Extraction
21
• Known to give better results
• Drawbacks
– Domain Specific
– Training and Tuning of the models for
generalization
– Intelligent Feature Engineering
• Examples: KEA, MAUI
• Known to give worse results than
supervised in domain specific tasks
• Domain independent
• No need of feature engineering
• Algorithm determines relationships between
candidates for identifying the keyphrases
• Examples: TextRank, RAKE
UnsupervisedSupervised
Ranked Keyphrase Extraction
22
Text Preprocessing Candidate Phrase
Selection
Initialize Candidate
Phrases with Scores
from Classifier
Research
Article Supervised Model
Word2Vec Model
Build Strongly
Connected Graph
between the
Candidate Phrases
Ranked Keyphrases
Phrase 1
Phrase 2
Phrase K
Integration
23
Indexing
Preprocessing
Query
expansion
Ingestion
Document
Recommendation
Ranked
Keyword
Extraction
API
Content Selection
Enrichment
Word
Embedding
Model
Doc
Embedding
Model
Keyword
classifier
24
References
1. Sebastian Rudder, “On Word Embeddings”; http://sebastianruder.com/tag/word-embeddings/index.html
2. Christopher Olah, “Colah’s Blog”, http://colah.github.io/
3. Chris McCormick, “Word2Vec Tutorial – The Skip-Gram Model”, http://mccormickml.com/2016/04/19/word2vec-tutorial-
the-skip-gram-model/
4. Le, Quoc V., and Tomas Mikolov. "Distributed Representations of Sentences and Documents." ICML. Vol. 14. 2014.
5. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in
neural information processing systems. 2013.
6. Bojanowski, Piotr, et al. "Enriching word vectors with subword information." arXiv preprint arXiv:1607.04606 (2016).
7. Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
25
26
27

Weitere ähnliche Inhalte

Was ist angesagt?

“Semantic Technologies for Smart Services”
“Semantic Technologies for Smart Services” “Semantic Technologies for Smart Services”
“Semantic Technologies for Smart Services” diannepatricia
 
Paper id 26201475
Paper id 26201475Paper id 26201475
Paper id 26201475IJRAT
 
Analyzing observational data during qualitative research
Analyzing observational data during qualitative researchAnalyzing observational data during qualitative research
Analyzing observational data during qualitative researchWafa Iqbal
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksBICA Labs
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Simplilearn
 
Sale application year 3
Sale application year 3Sale application year 3
Sale application year 3Hay Phousadan
 
Introduction to Text Analysis
Introduction to Text AnalysisIntroduction to Text Analysis
Introduction to Text AnalysisLauren Klein
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxShanmugasundaram M
 

Was ist angesagt? (9)

“Semantic Technologies for Smart Services”
“Semantic Technologies for Smart Services” “Semantic Technologies for Smart Services”
“Semantic Technologies for Smart Services”
 
Paper id 26201475
Paper id 26201475Paper id 26201475
Paper id 26201475
 
Analyzing observational data during qualitative research
Analyzing observational data during qualitative researchAnalyzing observational data during qualitative research
Analyzing observational data during qualitative research
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
 
Sale application year 3
Sale application year 3Sale application year 3
Sale application year 3
 
AI-SDV 2020: IPscreener
AI-SDV 2020: IPscreenerAI-SDV 2020: IPscreener
AI-SDV 2020: IPscreener
 
Introduction to Text Analysis
Introduction to Text AnalysisIntroduction to Text Analysis
Introduction to Text Analysis
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 

Ähnlich wie Search Powered by Deep Learning SmartData 2017

RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
Candidate selection tutorial
Candidate selection tutorialCandidate selection tutorial
Candidate selection tutorialYiqun Liu
 
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...Aman Grover
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
Data analytics on Azure
Data analytics on AzureData analytics on Azure
Data analytics on AzureElena Lopez
 
Text Analytics for Non-Experts
Text Analytics for Non-ExpertsText Analytics for Non-Experts
Text Analytics for Non-ExpertsSynaptica, LLC
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...Michael Mortenson
 
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...Michael Mortenson
 
Certified Internet Research Specialists (CIRS) Online Training Course outline
Certified Internet Research Specialists (CIRS) Online Training Course outlineCertified Internet Research Specialists (CIRS) Online Training Course outline
Certified Internet Research Specialists (CIRS) Online Training Course outlineOlivia Russell CIRS, MS
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
 
AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016Manjula Ambur
 
Search technologies & aws cloud search
Search technologies & aws cloud searchSearch technologies & aws cloud search
Search technologies & aws cloud searchAmazon Web Services
 
Data and AI in education
Data and AI in educationData and AI in education
Data and AI in educationJisc
 
SPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #failSPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #failBen van Mol
 

Ähnlich wie Search Powered by Deep Learning SmartData 2017 (20)

RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Candidate selection tutorial
Candidate selection tutorialCandidate selection tutorial
Candidate selection tutorial
 
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Deep learning for NLP
Deep learning for NLPDeep learning for NLP
Deep learning for NLP
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Data analytics on Azure
Data analytics on AzureData analytics on Azure
Data analytics on Azure
 
Text Analytics for Non-Experts
Text Analytics for Non-ExpertsText Analytics for Non-Experts
Text Analytics for Non-Experts
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
 
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
 
Certified Internet Research Specialists (CIRS) Online Training Course outline
Certified Internet Research Specialists (CIRS) Online Training Course outlineCertified Internet Research Specialists (CIRS) Online Training Course outline
Certified Internet Research Specialists (CIRS) Online Training Course outline
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016
 
Dissertation literature search
Dissertation literature searchDissertation literature search
Dissertation literature search
 
Search technologies & aws cloud search
Search technologies & aws cloud searchSearch technologies & aws cloud search
Search technologies & aws cloud search
 
Data and AI in education
Data and AI in educationData and AI in education
Data and AI in education
 
SPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #failSPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #fail
 

Search Powered by Deep Learning SmartData 2017