IST 441
Query Formulation for Similarity Search
Student : Nitish Upreti
Customer : Kyle Williams
nzu100@cse.psu.edu
kwilliams@psu.edu
OUTLINE
• Introduction
• Motivation
• Challenges with Similarity Search
• Background & Reference Point
• Approaches to Similarity Search
• Our Approaches to Problem
• JateToolkit Introduction
• Solution Architecture
• Evaluation
• Conclusion
What is Similarity Search?
“ Given a sample document and a standard Web
search engine, the goal is to find similar
documents to the given document. ”
What is a similar document?
• Cosine Similarity
• Citation Similarity
• Code Similarity
• Multimedia Content Similarity
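As an illustration of the first notion above, cosine similarity between two documents can be sketched over simple bag-of-words term counts. This is a minimal sketch, not the project's implementation: the whitespace tokenization and the function name `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words term-count vectors."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty document has no direction, so treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score close to 1, and texts with no shared terms score 0.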
Motivation
Plagiarism Detection
Process of locating instances of plagiarism in a
suspicious document from the web.
Example : Turnitin™
Content Recommendation
Recommending articles from credible news sources based
on social media entities such as tweets.
Academic Scenario : Research Paper Recommendation
Finding relevant documents for research paper
recommendation.
Challenges Involved
• Constructing queries from the sample document
in order to find similar documents is not obvious.
• Constraints on the maximum number of
queries and results to be downloaded, imposed
for scalability.
• Capture different facets of Similarity :
How can we be general enough to capture the
theme but also specific to capture unique
document attributes? (Domain Dependent)
BACKGROUND
The Big Picture
Credits : Plagiarism Candidate Retrieval Using Selective Query Formulation
and Discriminative Query Scoring
Notebook for PAN at CLEF 2013
Our Reference Point
• Source Retrieval is the KEY component.
(Dictates the possibility of future steps)
• Query Formulation is at the heart of this
problem.
• Challenges with :
– How can we design better algorithms to formulate
accurate queries?
– What has been done and what can be explored?
Our Reference Point (Contd..)
• CLEF: Conference and Labs of the Evaluation
Forum.
• PAN Labs centers around the topics of
plagiarism, authorship, and social software
misuse.
– Author Identification
– Author Profiling
– Plagiarism Detection
• Evaluation possible in a Plagiarism domain.
Approaching Similarity Search
Major classes of Similarity Search :
• Choosing sentences from text corpus.
• Choosing a set of generic keywords.
• Term Extraction Algorithms.
• Topic Mining for document using Machine
Learning techniques.
Mix and match ideas and employ well-known
tweaks depending on the scenario.
(Most of it is experimental)
Query Formulation Approach
Term Extraction
(Automatic extraction of relevant terms from a given corpus)
Approach Contd…
• Central Theme : Term Extraction Algorithms
• Approach Similarity Search in context of Term
Extraction algorithms.
• Design a framework which incorporates
these algorithms.
• Evaluate the algorithms.
• Document all the approaches.
Enter JateToolkit
Java Automatic Term Extraction toolkit
A library of state-of-the-art term extraction
algorithms and framework for developing term
extraction algorithms.
https://code.google.com/p/jatetoolkit/
Term Extraction Approaches…
• Term Extraction Algorithms :
– TF-IDF
– RIDF
– Weirdness
– C-value
– GlossEx
– TermEx
(Open Ended Project : Work in Progress)
– Justeson & Katz Algorithm
– NC Value Algorithm
– Rake Algorithm
– Chi-squared Algorithm
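To make the first algorithm in the list concrete, here is a minimal TF-IDF sketch. It uses the textbook formulation (relative term frequency times the log of inverse document frequency); the exact variant implemented in JateToolkit may differ, and the function name `tf_idf_scores` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        term_freq = Counter(tokens)
        scores.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores
```

A term that occurs in every document gets a score of 0, so the highest-scoring terms are those that are frequent in one document but rare across the corpus.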
Solution Architecture
Phase 1 : Pre-Processing
Pre-Processing Document
StopList Pre-Processing
Extremely common words which would appear
to be of little value in helping select documents
matching a user need are excluded from the
vocabulary entirely. These words form the Stop
List.
• Use Jate’s built-in “StopList” for filtering.
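Stop-list filtering as described above can be sketched in a few lines. This does not use Jate's StopList class; the tiny `STOP_LIST` set and the function name `remove_stop_words` are assumptions for illustration only.

```python
# Hypothetical, deliberately tiny stop list; a real one has hundreds of entries.
STOP_LIST = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]
```

For example, `remove_stop_words("The goal is to find similar documents".split())` keeps only `goal`, `find`, `similar` and `documents`.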
Pre-Processing Document Contd…
Lemmatization
Group together words that are present in the
document as different inflected forms to a single
word so they can be analyzed as a single item.
Example : “run, runs, ran and running are forms
of the same lexeme, with run as the lemma.”
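The grouping of inflected forms can be illustrated with a lookup table. This is only a toy sketch under the assumption of a hand-written table (`LEMMA_TABLE` is hypothetical); a real lemmatizer relies on a dictionary and part-of-speech information.

```python
# Hypothetical lookup table covering only the example from the slide.
LEMMA_TABLE = {"runs": "run", "ran": "run", "running": "run"}


def lemmatize(tokens, table=LEMMA_TABLE):
    """Map each token to its lemma; unknown tokens are kept in lowercase."""
    return [table.get(t.lower(), t.lower()) for t in tokens]
```

After this step, `run`, `runs`, `ran` and `running` are all counted as the single item `run`.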
Phase 2 : Candidate Term Extraction
Candidate Term Extraction
• Approaches to Candidate Term Extraction :
1. Simply extracting single words as candidate
terms, for tasks where terms are single words.
(Naïve Approach)
2. A generic N-gram extractor that extracts
n-grams.
Final Approach : Apache OpenNLP NPE
(Noun Phrase Extractor) that extracts noun
phrases as candidate terms.
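The second approach, a generic n-gram extractor, can be sketched as a sliding window over the token list. The function name `extract_ngrams` is hypothetical and this is not the Jate implementation; setting `n=1` reduces it to the naïve single-word approach.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by single spaces."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # The window start ranges over every position where n tokens still fit.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A document shorter than `n` tokens simply yields no candidates.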
Why are the other two Approaches
worth mentioning?
Performance of Term Extraction Algorithms is
text corpus dependent.
(Our dataset was more receptive to NPE)
Phase 3 : Index Building
Building Document Index
• Using Jate toolkit to build a corpus index
(Prerequisite for Term Extraction).
• Memory Based / Disk Resident file / Exporting
to HSQL (HyperSQL).
Phase 4 : Building Statistical Features
Building Features for Jate Toolkit
• Word Count
• Feature Corpus Term Frequency (A feature
store that contains information of term
distributions over a corpus)
• Feature Term Nest Frequency (A feature store
that contains information of nested terms)
Example: "Hedgehog" is a nested term in
"European Hedgehog".
• Executing a single or multithreaded client.
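The idea behind the term nest frequency feature can be sketched as follows: for each candidate term, sum the frequencies of the longer candidate terms that contain it as a contiguous word sequence. This is an assumption about what the feature captures, written for illustration; the function name `nest_frequency` and the input shape are hypothetical and do not mirror Jate's feature store API.

```python
def nest_frequency(term_counts):
    """term_counts: {term: corpus frequency}.

    Returns {term: summed frequency of longer terms that contain it}.
    """
    nested = {}
    for term in term_counts:
        words = term.lower().split()
        total = 0
        for other, freq in term_counts.items():
            other_words = other.lower().split()
            if len(other_words) <= len(words):
                continue  # only strictly longer terms can nest this one
            # Word-level containment, so "hedge" is not nested in "hedgehog".
            windows = range(len(other_words) - len(words) + 1)
            if any(other_words[i:i + len(words)] == words for i in windows):
                total += freq
        nested[term] = total
    return nested
```

With counts `{"hedgehog": 5, "European hedgehog": 2}`, the nest frequency of `hedgehog` is 2.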
Phase 5 : Register and Execute
Algorithms
Jate Output File : term { variations } score
The output file is arranged in descending order
of score.
Phase 6 : Post Processing
Writing an Output file suitable for submission.
Format : DocumentId { query terms }
(Maximum 10 non-repeating query terms)
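The post-processing step can be sketched as selecting the top-scoring terms without repeats and writing one line per document. The comma separator, the case-insensitive duplicate check and the function name `format_query_line` are assumptions; only the general shape `DocumentId { query terms }` and the limit of 10 come from the slide.

```python
def format_query_line(document_id, ranked_terms, max_terms=10):
    """ranked_terms must already be sorted by descending score."""
    seen = set()
    query_terms = []
    for term in ranked_terms:
        key = term.lower()
        if key in seen:
            continue  # skip repeats, keeping the higher-ranked occurrence
        seen.add(key)
        query_terms.append(term)
        if len(query_terms) == max_terms:
            break
    # Assumption: terms are comma-separated inside the braces.
    return document_id + " { " + ", ".join(query_terms) + " }"
```

For example, `format_query_line("doc001", ["term extraction", "Term Extraction", "plagiarism"])` produces `doc001 { term extraction, plagiarism }`.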
Evaluation
• Last year's PAN CLEF Baseline :
Precision = 0.244388609715 (200 queries)
• Performance for Term Extraction Algorithms
(105 queries) :
1. IBM’s GlossEx : 0.171428571429
2. C Value : 0.0598255721489
3. TermEx : 0.0635
4. Weirdness : 0.03190851
5. RIDF : 0.176470588235
6. TF-IDF : 0.13058482157
RESULTS
• The code is live on github!
https://github.com/myth17/QF
• Code, Query Logs and entire results submitted to
Kyle.
• Working on incorporating the other alpha term
extraction algorithms.
• Future Work : How can the results be improved
and integrated with topic modeling?
Questions ?
(Thank You!)

Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 

Final presentation

  • 1. IST 441 Query Formulation for Similarity Search Student : Nitish Upreti Customer : Kyle Williams nzu100@cse.psu.edu kwilliams@psu.edu
  • 2. OUTLINE • Introduction • Motivation • Challenges with Similarity Search • Background & Reference Point • Approaches to Similarity Search • Our Approaches to Problem • JateToolkit Introduction • Solution Architecture • Evaluation • Conclusion
  • 3. What is Similarity Search? “ Given a sample document and a standard Web search engine, the goal is to find similar documents to the given document. ” What is a similar document? • Cosine Similarity • Citation Similarity • Code Similarity • Multimedia Content Similarity
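Of the similarity notions listed on this slide, cosine similarity is the easiest to make concrete. The following is a minimal sketch, assuming whitespace tokenization and raw term counts as vector weights; the function name `cosine_similarity` is illustrative and not taken from the project.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two documents with the same word distribution score close to 1, and documents with no shared words score 0.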
  • 4. Motivation Plagiarism Detection Process of locating instances of plagiarism in a suspicious document from the web. Example : Turnitin™ Content Recommendation Recommending articles from credible news sources based on social media entities such as tweets. Academic Scenario : Research Paper Recommendation Finding relevant documents for research paper recommendation.
  • 5. Challenge Involved • Constructing queries from the sample document in order to find similar documents is not obvious. • Several constraints on the maximum number of queries and results to be downloaded, for scalability. • Capture different facets of Similarity : How can we be general enough to capture the theme but also specific enough to capture unique document attributes? (Domain Dependent)
  • 7. The Big Picture Credits : Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring Notebook for PAN at CLEF 2013
  • 8. Our Reference Point • Source Retrieval is the KEY component. (Dictates the possibility of future steps) • Query Formulation is at the heart of this problem. • Challenges with : – How can we design better algorithms to formulate accurate queries? – What has been done and what can be explored?
  • 9. Our Reference Point (Contd..) • CLEF: Conference and Labs of the Evaluation Forum. • PAN Labs centers around the topics of plagiarism, authorship, and social software misuse. – Author Identification – Author Profiling – Plagiarism Detection • Evaluation possible in a Plagiarism domain.
  • 10. Approaching Similarity Search Major classes of Similarity Search : • Choosing sentences from text corpus. • Choosing a set of generic keywords. • Term Extraction Algorithms. • Topic Mining for document using Machine Learning techniques. Mix and match ideas and employ well-known tweaks depending on the scenario. (Most of it is experimental)
  • 11. Query Formulation Approach Term Extraction (Automatic extraction of relevant terms from a given corpus)
  • 12. Approach Contd… • Central Theme : Term Extraction Algorithms • Approach Similarity Search in the context of Term Extraction algorithms. • Design a framework which incorporates these algorithms. • Evaluate the algorithms. • Document all the approaches.
  • 13. Enter JateToolkit Java Automatic Term Extraction toolkit A library of state-of-the-art term extraction algorithms and framework for developing term extraction algorithms. https://code.google.com/p/jatetoolkit/
  • 14. Term Extraction Approaches… • Term Extraction Algorithms : – TF-IDF – RIDF – Weirdness – C-value – GlossEx – TermEx (Open Ended Project : Work in Progress) – Justeson & Katz Algorithm – NC Value Algorithm – Rake Algorithm – Chi-squared Algorithm
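As an illustration of the first algorithm in this list, here is a minimal TF-IDF sketch. It assumes one common variant (raw term count times log(N / document frequency)); JateToolkit's own formula may differ, and the name `tf_idf_scores` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents, doc_index):
    """Score every term of documents[doc_index] against the whole corpus.

    documents is a list of token lists. Assumption: score = tf * log(N / df),
    one common TF-IDF variant, not necessarily the one JateToolkit uses.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    term_freq = Counter(documents[doc_index])
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
```

A term that occurs in every document gets a score of 0, which is why frequent but uninformative words drop to the bottom of the ranking.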
  • 16. Phase 1 : Pre-Processing
  • 17. Pre-Processing Document StopList Pre-Processing Extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words form the Stop List. • Use Jate’s built-in “StopList” for filtering.
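Stop-list filtering can be sketched in a few lines. The word set below is a tiny hypothetical sample, not Jate's actual “StopList”, and `remove_stop_words` is an illustrative name.

```python
# Assumption: a tiny hand-picked stop list for illustration only;
# the project uses the list shipped with Jate instead.
STOP_LIST = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_list]
```

For example, `remove_stop_words("The goal is to find similar documents".split())` keeps only `goal`, `find`, `similar` and `documents`.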
  • 18. Pre-Processing Document Contd… Lemmatization Group together words that are present in the document as different inflected forms to a single word so they can be analyzed as a single item. Example : “run, runs, ran and running are forms of the same lexeme, with run as the lemma.”
  • 19. Phase 2 : Candidate Term Extraction
  • 20. Candidate Term Extraction • Approaches to Candidate Term Extraction : 1. Simply extracting single words as candidate terms, suitable if your task treats single words as terms. (Naïve Approach) 2. A generic N-gram extractor that extracts n-grams. Final Approach : Apache OpenNLP NPE (Noun Phrase Extractor) that extracts noun phrases as candidate terms.
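The second approach, a generic n-gram extractor, can be sketched as follows. This is an illustration under the assumption that the text is already tokenized; it is not Jate's extractor, and `extract_ngrams` is a hypothetical name.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined candidate term."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A document shorter than n tokens yields no candidates.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With `n=1` this reduces to the naïve single-word approach, which makes the two easy to compare on the same corpus.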
  • 21. Why are the other two Approaches worth mentioning? The performance of Term Extraction Algorithms is text-corpus dependent. (Our dataset was more receptive to NPE)
  • 22. Phase 3 : Index Building
  • 23. Building Document Index • Using Jate toolkit to build a corpus index (prerequisite for Term Extraction). • Memory Based / Disk Resident file / Exporting to HSQL (HyperSQL).
  • 24. Phase 4 : Building Statistical Features
  • 25. Building Features for Jate Toolkit • Word Count • Feature Corpus Term Frequency (A feature store that contains information of term distributions over a corpus) • Feature Term Nest Frequency (A feature store that contains information of nested terms) Example: "Hedgehog" is a nested term in "European Hedgehog". • Executing a single or multithreaded client.
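The nested-term feature can be illustrated with a small sketch. It assumes candidate terms and their corpus frequencies are already available as a dictionary, and it counts nesting at the word level; this shows the idea only and is not Jate's feature-store implementation. Both function names are hypothetical.

```python
from collections import Counter


def _contains(longer, shorter):
    """True if the token list `shorter` occurs contiguously inside `longer`."""
    width = len(shorter)
    return any(longer[i:i + width] == shorter
               for i in range(len(longer) - width + 1))


def nest_frequency(term_counts):
    """For each term, sum the frequencies of longer candidate terms containing it."""
    nested = Counter()
    tokenized = {term: term.lower().split() for term in term_counts}
    for term, tokens in tokenized.items():
        for other, other_tokens in tokenized.items():
            if len(other_tokens) > len(tokens) and _contains(other_tokens, tokens):
                nested[term] += term_counts[other]
    return {term: nested[term] for term in term_counts}
```

Matching on whole words means "hedgehog" is counted as nested in "European hedgehog", while a mere substring such as "hedge" is not.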
  • 26. Phase 5 : Register and Execute Algorithms Jate Output File : term { variations } score The output file is arranged in descending order of score.
  • 27. Phase 6 : Post Processing Writing an Output file suitable for submission. Format : DocumentId { query terms } (Maximum 10 non-repeating query terms)
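The post-processing step can be sketched as below. The sketch assumes that "non-repeating" means no word appears twice within a query, and that multi-word terms are split into words; the slide does not spell out either detail, and the function names are hypothetical.

```python
def select_query_terms(ranked_terms, max_terms=10):
    """Walk terms in descending score order, keeping up to max_terms distinct words."""
    seen = set()
    selected = []
    for term in ranked_terms:
        for word in term.lower().split():
            if len(selected) == max_terms:
                return selected
            if word not in seen:  # skip words already in the query
                seen.add(word)
                selected.append(word)
    return selected


def format_submission_line(document_id, ranked_terms):
    """Render one output line in the 'DocumentId { query terms }' shape."""
    words = select_query_terms(ranked_terms)
    return "{} {{ {} }}".format(document_id, " ".join(words))
```

For instance, `format_submission_line("doc42", ["similarity search", "search engine"])` produces `doc42 { similarity search engine }`.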
  • 28. Evaluation • Last year's PAN CLEF Baseline : Precision = 0.244388609715 (200 queries) • Performance for Term Extraction Algorithms (105 queries) : 1. IBM’s GlossEx : 0.171428571429 2. C Value : 0.0598255721489 3. TermEx : 0.0635 4. Weirdness : 0.03190851 5. RIDF : 0.176470588235 6. TF-IDF : 0.13058482157
  • 29. RESULTS • The code is live on GitHub! https://github.com/myth17/QF • Code, Query Logs and the entire results have been submitted to Kyle. • Working on incorporating the other alpha term extraction algorithms. • Future Work : How can the results be improved and integrated with topic modeling?