This document outlines an approach to query formulation for similarity search using term extraction algorithms. It discusses the challenges of similarity search and of constructing queries from documents. The solution involves preprocessing documents, extracting candidate terms, building an index, calculating statistical features, executing term extraction algorithms, and postprocessing the outputs. In an evaluation on a plagiarism detection dataset, RIDF and GlossEx scored highest among the algorithms tested, followed by TF-IDF. The code is available on GitHub, and further improvements could integrate topic modeling.
Final presentation
1. IST 441
Query Formulation for Similarity Search
Student : Nitish Upreti
Customer : Kyle Williams
nzu100@cse.psu.edu
kwilliams@psu.edu
2. OUTLINE
• Introduction
• Motivation
• Challenges with Similarity Search
• Background & Reference Point
• Approaches to Similarity Search
• Our Approaches to Problem
• JateToolkit Introduction
• Solution Architecture
• Evaluation
• Conclusion
3. What is Similarity Search?
“ Given a sample document and a standard Web
search engine, the goal is to find similar
documents to the given document. ”
What is a similar document?
• Cosine Similarity
• Citation Similarity
• Code Similarity
• Multimedia Content Similarity
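As an illustration of the first notion above, cosine similarity can be sketched over raw word counts. This is a minimal sketch, not the project's code: whitespace tokenization, lowercasing and the function name `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    # Assumption: naive whitespace tokenization on lowercased text.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction, so treat it as dissimilar to everything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score close to 1 and texts with no words in common score 0.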
4. Motivation
Plagiarism Detection
Process of locating, on the web, the sources of
plagiarised passages in a suspicious document.
Example : Turnitin™
Content Recommendation
Recommending articles from credible news sources based
on social media entities such as tweets.
Academic Scenario : Research Paper Recommendation
Finding relevant documents for research paper
recommendation.
5. Challenge Involved
• Constructing queries from the sample document
in order to find similar documents is not obvious.
• Constraints on the maximum number of queries
and of results to be downloaded, imposed for
scalability.
• Capture different facets of Similarity :
How can we be general enough to capture the
theme but also specific to capture unique
document attributes? (Domain Dependent)
7. The Big Picture
Credits : Plagiarism Candidate Retrieval Using Selective Query Formulation
and Discriminative Query Scoring
Notebook for PAN at CLEF 2013
8. Our Reference Point
• Source Retrieval is the KEY component.
(Dictates the possibility of future steps)
• Query Formulation is at the heart of this
problem.
• Open questions :
– How can we design better algorithms to formulate
accurate queries?
– What has been done and what can be explored?
9. Our Reference Point (Contd..)
• CLEF: Conference and Labs of the Evaluation
Forum.
• PAN Labs centers around the topics of
plagiarism, authorship, and social software
misuse.
– Author Identification
– Author Profiling
– Plagiarism Detection
• Evaluation possible in a Plagiarism domain.
10. Approaching Similarity Search
Major classes of approaches to Similarity Search :
• Choosing sentences from text corpus.
• Choosing a set of generic keywords.
• Term Extraction Algorithms.
• Topic Mining for document using Machine
Learning techniques.
Mix and match ideas and employ well-known
tweaks depending on the scenario.
(Most of it is experimental)
12. Approach Contd…
• Central Theme : Term Extraction Algorithms
• Approach Similarity Search in the context of Term
Extraction algorithms.
• Design a framework which incorporates
these algorithms.
• Evaluate the algorithms.
• Document all the approaches.
13. Enter JateToolkit
Java Automatic Term Extraction toolkit
A library of state-of-the-art term extraction
algorithms and framework for developing term
extraction algorithms.
https://code.google.com/p/jatetoolkit/
14. Term Extraction Approaches…
• Term Extraction Algorithms :
– TF-IDF
– RIDF
– Weirdness
– C-value
– GlossEx
– TermEx
(Open Ended Project : Work in Progress)
– Justeson & Katz Algorithm
– NC Value Algorithm
– Rake Algorithm
– Chi-squared Algorithm
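The first two algorithms on the list have compact closed forms. The sketch below uses the textbook definitions (TF-IDF as term frequency times log inverse document frequency, RIDF as observed IDF minus the IDF expected under a Poisson model). The function names are hypothetical, and Jate's own implementations may normalize or smooth differently.

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Term frequency weighted by inverse document frequency (log base 2).

    tf: occurrences of the term in the document being scored
    df: number of documents containing the term (must be > 0)
    n_docs: number of documents in the corpus
    """
    return tf * math.log2(n_docs / df)


def ridf(cf: int, df: int, n_docs: int) -> float:
    """Residual IDF: observed IDF minus the IDF predicted by a Poisson model.

    cf: total occurrences of the term in the corpus (must be > 0)
    df: number of documents containing the term (must be > 0)
    """
    observed_idf = -math.log2(df / n_docs)
    # Under a Poisson model with rate cf / n_docs, the probability that a
    # document contains the term at least once is 1 - exp(-cf / n_docs).
    expected_idf = -math.log2(1.0 - math.exp(-cf / n_docs))
    return observed_idf - expected_idf
```

A term whose occurrences are concentrated in a few documents gets a higher RIDF than a term with the same total count spread evenly over many documents, which is why RIDF tends to favour content-bearing terms.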
17. Pre-Processing Document
StopList Pre-Processing
Extremely common words which would appear
to be of little value in helping select documents
matching a user need are excluded from the
vocabulary entirely. These words form the Stop
List.
• Use Jate’s built-in “StopList” for filtering.
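Stop list filtering itself is a simple set lookup. The sketch below only illustrates the idea: the tiny `STOP_LIST` and the function name are made up for the example and are not Jate's “StopList” or its contents.

```python
# Illustrative stop list only; a real one has a few hundred entries.
STOP_LIST = {"a", "an", "and", "the", "of", "in", "to", "is"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop tokens whose lowercased form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]
```

For example, filtering the tokens of "The goal is to find similar documents" leaves only "goal", "find", "similar" and "documents".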
18. Pre-Processing Document Contd…
Lemmatization
Group the different inflected forms of a word
that appear in the document under a single
word, so they can be analyzed as a single item.
Example : “run, runs, ran and running are forms
of the same lexeme, with run as the lemma.”
20. Candidate Term Extraction
• Approaches to Candidate Term Extraction :
1. Simply extracting single words as candidate
terms, suitable if the task only needs single-word terms.
(Naïve Approach)
2. A generic N-gram extractor that extracts
n-grams.
Final Approach : An Apache OpenNLP based NPE
(Noun Phrase Extractor) that extracts noun
phrases as candidate terms.
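The first two approaches can be sketched with one small function, since single words are simply n-grams with n = 1. This is an illustration with a hypothetical function name, not Jate's extractor. Noun phrase extraction is not shown because it needs a trained chunking model.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined candidate term."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The range is empty when the text is shorter than n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With n = 2, the tokens "query formulation for search" yield the candidates "query formulation", "formulation for" and "for search". The second of these shows why n-grams produce noisier candidates than noun phrases.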
21. Why are the other two Approaches worth mentioning?
Performance of Term Extraction Algorithms is
text corpus dependent.
(Our dataset was more receptive to NPE)
23. Building Document Index
• Using Jate toolkit to build a corpus index
(Prerequisite for Term Extraction).
• Memory Based / Disk Resident file / Exporting
to HSQL (HyperSQL).
25. Building Features for Jate Toolkit
• Word Count
• Feature Corpus Term Frequency (A feature
store that contains information of term
distributions over a corpus)
• Feature Term Nest Frequency (A feature store
that contains information of nested terms)
Example: “Hedgehog” is a nested term in
“European Hedgehog”.
• Executing a single or multithreaded client.
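The two feature stores above can be pictured as plain counting structures. The sketch below is a simplified stand-in, not Jate's feature classes: the function names are hypothetical and candidate terms are assumed to be lowercased, space-separated strings.

```python
from collections import Counter


def corpus_term_frequency(docs_terms):
    """Total occurrences of each candidate term across all documents."""
    counts = Counter()
    for terms in docs_terms:
        counts.update(terms)
    return counts


def nested_in(term, candidates):
    """Longer candidates that contain `term` as a contiguous word sequence."""
    words = term.split()
    n = len(words)
    hits = []
    for cand in candidates:
        cand_words = cand.split()
        if len(cand_words) > n and any(
            cand_words[i:i + n] == words for i in range(len(cand_words) - n + 1)
        ):
            hits.append(cand)
    return hits
```

Matching on whole words rather than substrings keeps "hedge" from being counted as nested in "european hedgehog".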
26. Phase 5 : Register and Execute Algorithms
Jate Output File : term { variations } score
The output file is arranged in descending order
of score.
27. Phase 6 : Post Processing
Writing an Output file suitable for submission.
Format : DocumentId { query terms }
(Maximum 10 non-repeating query terms)
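The post-processing step can be sketched as follows. Several details are assumptions, since the slide only gives the overall format: the input is taken to be (term, score) pairs already parsed from the Jate output, multi-word terms are flattened into individual words, repeats are dropped case-insensitively, and words are separated by single spaces. The function name is hypothetical.

```python
def format_query_line(document_id, scored_terms, max_terms=10):
    """Build one 'DocumentId { query terms }' line from (term, score) pairs."""
    # Jate's output is already sorted by score; sorting again is defensive.
    ranked = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)

    seen = set()
    words = []
    for term, _score in ranked:
        for word in term.lower().split():
            # Keep only the first occurrence of each word.
            if word not in seen and len(words) < max_terms:
                seen.add(word)
                words.append(word)

    return document_id + " { " + " ".join(words) + " }"
```

For example, the pairs ("query formulation", 0.9), ("similarity search", 0.5) and ("query", 0.4) for document "doc1" give the line `doc1 { query formulation similarity search }`.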
28. Evaluation
• Last year's PAN CLEF Baseline :
Precision = 0.244388609715 (200 queries)
• Performance for Term Extraction Algorithms
(105 queries) :
1. IBM’s GlossEx : 0.171428571429
2. C Value : 0.0598255721489
3. TermEx : 0.0635
4. Weirdness : 0.03190851
5. RIDF : 0.176470588235
6. TF-IDF : 0.13058482157
29. RESULTS
• The code is live on GitHub!
https://github.com/myth17/QF
• Code, Query Logs and entire results submitted to
Kyle.
• Working on incorporating the other alpha term
extraction algorithms.
• Future Work : How can the results be improved
and integrated with topic modeling?