4. INTRODUCTION
WWW – a huge tangled web of information.
Issues faced – duplication, plagiarism, copyright violation, etc.
Aim: To detect and report duplicates.
Method: Compare documents and output their level of similarity – their "Text Similarity".
5. Text Similarity has two aspects:
Content Similarity: The words themselves are compared.
e.g. "I have a car" and "I have a vehicle" are 75% similar.
Expression Similarity: The meaning of the information is considered.
e.g. "I have a car" and "I have a vehicle" can be considered 100% similar.
Scope – Content Similarity
6. A two-step process:
STEP 1: Data Retrieval
"The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web." [1]
STEP 2: Similarity Measurements
To correlate the words or terms of two or more documents or web pages.
8. DATA RETRIEVAL
Translation of text into mathematical structures.
A variety of concrete techniques exist:
TF/IDF
Document-Term Matrix
VSM
LSA
The corresponding mathematical structure is derived based on the concrete data retrieval methodology used.
9. TF/IDF
Term Frequency / Inverse Document Frequency
Idea: The more common a term is across the document collection, the less important it is, and hence it should be weighted at the low end of the query spectrum.
Two independent components:
Term Frequency – the frequency of occurrence of a term in a given document.
Inverse Document Frequency – a measure of the general importance of the term across the collection.
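As an illustration (not part of the original deck), a minimal TF-IDF computation in Python over the three example documents used below; raw counts for TF and a base-10 logarithm for IDF are assumptions following [7]:

    import math

    # Minimal TF-IDF sketch over the deck's three example documents.
    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})

    def tf(term, doc):
        # Term frequency in its simplest form: the raw term count.
        return doc.count(term)

    def idf(term, corpus):
        # log10(N / df): the more documents a term appears in, the lower its weight.
        df = sum(1 for doc in corpus if term in doc)
        return math.log10(len(corpus) / df) if df else 0.0

    weights = [[tf(t, doc) * idf(t, tokenized) for t in vocab] for doc in tokenized]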
10. TF/IDF Example [7]
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Two steps:
1. Calculate the Term Frequency
2. Calculate the Inverse Document Frequency
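For instance (worked here for illustration, using base-10 logarithms as in [7]): with N = 3 documents, "silver" occurs only in D2, so idf(silver) = log(3/1) ≈ 0.477; "gold" occurs in D1 and D3, so idf(gold) = log(3/2) ≈ 0.176; and terms such as "a", "in", and "of" occur in all three documents, so their idf is log(3/3) = 0.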
12. Document-Term Matrix
"A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents." [2]
Rows – Documents
Columns – Terms
It only depicts which document contains which term and the number of occurrences of that term in that document.
13. Document-Term Matrix Example
D1 = “I like databases”
D2 = “I hate hate databases”
      I   like   databases   hate
D1    1     1        1         0
D2    1     0        1         2
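A small sketch (added for illustration) that reproduces this matrix in Python:

    from collections import Counter

    # Build the document-term matrix for the two example documents.
    docs = {"D1": "I like databases", "D2": "I hate hate databases"}
    terms = ["I", "like", "databases", "hate"]
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    matrix = {name: [counts[name][t] for t in terms] for name in docs}
    # matrix == {"D1": [1, 1, 1, 0], "D2": [1, 0, 1, 2]}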
14. VSM
"Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms." [3]
Each document and query is represented as a vector:
document: dj = (w1,j, w2,j, ..., wn,j)
query: q = (w1,q, w2,q, ..., wn,q)
Terms can be individual words, keywords, or phrases, depending on the application.
15. VSM Example [7]
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query – "Gold Silver Truck"
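A sketch (for illustration) of the VSM representation for these documents and the query; raw counts are used as weights here, though a weighting scheme such as TF-IDF is used in practice:

    # Represent each document and the query as a vector over the shared vocabulary.
    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    vocab = sorted({w for d in docs for w in d.split()})

    d_vecs = [[d.split().count(t) for t in vocab] for d in docs]   # document vectors
    q_vec = ["gold silver truck".split().count(t) for t in vocab]  # query vector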
17. LSA
"Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words." [4]
Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms.
A two-step process:
1. Construction of the Document-Term Matrix
2. Singular Value Decomposition
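A compact numpy sketch of the two steps (an illustration, not the deck's code); a rank-2 reduction and the query-folding formula qk = qᵀ Uk Sk⁻¹ from [7] are assumed:

    import numpy as np

    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    terms = sorted({w for d in docs for w in d.split()})

    # Step 1: term-document matrix A (rows = terms, columns = documents).
    A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

    # Step 2: singular value decomposition, keeping the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    # Fold the query "gold silver truck" into the reduced space.
    q = np.array([1.0 if t in {"gold", "silver", "truck"} else 0.0 for t in terms])
    q_k = q @ U_k @ np.linalg.inv(S_k)
    doc_coords = Vt_k.T  # one row per document in the reduced space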
18. LSA Example
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query – "Gold Silver Truck"
19. LSA Example contd...
STEP 1: Constructing the Term-Document Matrix & Query Matrix
22. A similar SVD evaluation and reduction is done for the query vector Q.
At the end we have:
Reduced SVD matrix V (for the documents)
Reduced SVD matrix Q (for the query)
(The reduced matrices V and Q are shown on the original slide.)
These can then be supplied to a similarity measurement technique.
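To make the hand-off concrete (a self-contained illustration with placeholder coordinates, since the slide's actual matrices are images), similarity measurement on the reduced vectors is a plain vector comparison:

    import numpy as np

    # Placeholder reduced coordinates (illustrative, not the slide's values).
    doc_coords = np.array([[-0.49,  0.65],
                           [-0.65, -0.72],
                           [-0.58,  0.25]])   # rows of the reduced matrix V
    q_k = np.array([-0.21, -0.18])            # the reduced query vector Q

    # Cosine similarity between the query and each document row.
    sims = doc_coords @ q_k / (np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_k))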
24. SIMILARITY MEASUREMENTS
The major focus of the "Text Similarity" methodology.
Uses the mathematical structures generated by the data retrieval techniques to evaluate the degree of likeness between two or more documents or web pages.
Two major techniques in focus here:
Cosine Similarity
SOC-PMI
25. COSINE SIMILARITY
Evaluates the similarity between two vectors by measuring the cosine of the angle between them.
The cosine of the angle determines whether the vectors are roughly pointing in the same direction.
In our scope, the similarity will range between 0 and 1, since term weights are always positive,
i.e. the angle between two considered vectors will never exceed 90°.
26. COSINE Example [7]
Example continued from VSM.
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query – "Gold Silver Truck"
We have calculated the weights using the TF-IDF scheme.
Next step – calculate the cosine similarity:
cos θ(Di) = (Q · Di) / (|Q| × |Di|)
i.e. first calculate the dot product Q · Di,
then the product of the magnitudes |Q| × |Di|.
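A direct transcription of this formula into Python (illustrative; q and d are assumed to be TF-IDF weight vectors such as those computed earlier):

    import math

    def cosine(q, d):
        # cos(theta) = (Q . D) / (|Q| * |D|)
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0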
28. SOC-PMI
"Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus." [5]
Considerable mathematics is involved in deriving the formula.
The resulting similarity measure is also normalized so that it lies between 0 and 1.
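SOC-PMI builds on first-order pointwise mutual information; a minimal PMI sketch (with made-up corpus counts, purely for illustration):

    import math

    def pmi(count_xy, count_x, count_y, total):
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        p_xy = count_xy / total
        return math.log2(p_xy / ((count_x / total) * (count_y / total)))

    # Made-up counts: two words co-occurring within an 11-word window.
    print(pmi(count_xy=30, count_x=500, count_y=400, total=1_000_000))  # ~7.23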
29. SOC-PMI with an example
A complicated method involving many mathematical formulae.
Example [6]:
W1 = car
W2 = automobile
m = 70, n = 43
Assumptions:
γ = 3, δ = 0.7
window of 11 words
β1 = β2 = 24.88
30. SOC-PMI example contd...
Bigram frequencies, types & frequencies, and the sets X and Y of words with their PMI values (tables shown on the original slide).
33. APPLICATIONS
Plagiarism Detection
Text similarity plays an important role in the field of plagiarism detection.
Copyright Violation
Copies of restricted software/data can be detected using text similarity.
Recommender Services
Similar documents or items can be suggested based on their text similarity.
34. PROTOTYPE
AIM: Finding the degree of similarity between files.
Two steps:
Data Retrieval
TF-IDF
Similarity Measurement
Cosine
Pearson Correlation
Distribution Matrix
Co-occurrence
35. Prototype – Data Retrieval
Steps followed to retrieve data using the TF-IDF scheme (a conceptual sketch follows this list):
SequenceFilesFromDirectory
Converts files into sequence files <Text, Text>
DocumentProcessor
Converts the sequence files into <Text, StringTuple>
DictionaryVectorizer
Creates TF vectors <Text, VectorWritable>
Creates dfcount <IntWritable, LongWritable>
Creates wordcount <Text, LongWritable>
TFIDFConverter
Creates TF-IDF vectors <Text, VectorWritable>
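The prototype itself runs these stages as Hadoop/Mahout jobs; the plain-Python sketch below (function names mirror the stages but are otherwise invented) only illustrates what each stage produces:

    import math
    import os

    def sequence_files_from_directory(path):
        # Stage 1: read each file into a (name, text) pair.
        for fname in os.listdir(path):
            with open(os.path.join(path, fname)) as f:
                yield fname, f.read()

    def document_processor(pairs):
        # Stage 2: tokenize each document into a sequence of terms.
        for name, text in pairs:
            yield name, text.lower().split()

    def dictionary_vectorizer(docs):
        # Stage 3: dictionary, per-document TF vectors, and document frequencies.
        docs = list(docs)
        vocab = sorted({t for _, terms in docs for t in terms})
        tf = {name: [terms.count(t) for t in vocab] for name, terms in docs}
        df = [sum(1 for _, terms in docs if t in terms) for t in vocab]
        return vocab, tf, df, len(docs)

    def tfidf_converter(tf, df, n_docs):
        # Stage 4: scale term frequencies by inverse document frequency.
        return {name: [f * (math.log10(n_docs / d) if d else 0.0)
                       for f, d in zip(vec, df)]
                for name, vec in tf.items()}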
36. Prototype – Similarity Measurement
Intermediate steps
Convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>
Similarity Measurement
Distribution Multiplication
Matrix * Matrix'
Cosine, Pearson Correlation and Co-occurrence (a small sketch follows below)
RowSimilarityJob (similarity classname):
SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_COOCCURRENCE
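A numpy sketch of the idea behind this step (illustrative values, not the RowSimilarityJob implementation): stacking the TF-IDF vectors as rows of a matrix M, M · Mᵀ gives the pairwise dot products of the "Matrix * Matrix'" step, and normalizing rows first yields pairwise cosine similarity:

    import numpy as np

    # Each row is one file's TF-IDF vector (illustrative numbers).
    M = np.array([[0.00, 0.48, 0.18],
                  [0.95, 0.00, 0.18],
                  [0.00, 0.48, 0.35]])

    dots = M @ M.T                                   # raw pairwise dot products
    rows = M / np.linalg.norm(M, axis=1, keepdims=True)
    cosine_sim = rows @ rows.T                       # pairwise cosine similarity

    # Pearson correlation is the cosine of the mean-centered rows.
    centered = M - M.mean(axis=1, keepdims=True)
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    pearson_sim = centered @ centered.T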
41. References
[1] Wikipedia: Information retrieval – Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
[2] Wikipedia: Document-term matrix – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
[3] Wikipedia: Vector space model – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
[4] Wikipedia: Latent semantic indexing – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
[5] Wikipedia: Second-order co-occurrence pointwise mutual information – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
[6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
[7] Garcia, E. Miislita.com, http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
Editor's Notes
Data retrieval – In layman's terms, data retrieval means that the words or terms within a document or web page are translated into some mathematical structure. Given a document, each distinct word or term within it is translated into a particular mathematical structure, e.g. a vector or a frequency matrix.
TF – In its simplest form, the term frequency is also called the term count, which is simply the number of occurrences of the term in the document. IDF – obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Note that if a term has a high term frequency in the given document and a low document frequency across the considered set of documents (implying a high inverse document frequency), a high tf-idf value is achieved.
Too simple to be used. Not realistic.
http://www.miislita.com/term-vector/term-vector-3.html – The vector value (or term weight) for each term existing in a document is non-zero; it is calculated using some scheme. One well-known scheme is TF/IDF. D1: "Shipment of gold damaged in a fire"; D2: "Delivery of silver arrived in a silver truck"; D3: "Shipment of gold arrived in a truck"; Q: "Gold Silver Truck"