4. INTRODUCTION
WWW – a huge tangled web of information.
Issues faced – duplication, plagiarism, copyright violation, etc.
Aim: To detect and report duplicates.
Method: Compare documents and output their level of similarity – their "Text Similarity".
5. Text Similarity has two aspects:
Content Similarity: The words themselves are compared.
e.g. "I have a car" and "I have a vehicle" are 75% similar.
Expression Similarity: The meaning of the information is considered.
e.g. "I have a car" and "I have a vehicle" can be considered 100% similar.
Scope – Content Similarity
6. A two-step process:
STEP 1: Data Retrieval
"The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web." [1]
STEP 2: Similarity Measurements
To correlate the words or terms of two or more documents or web pages.
8. DATA RETRIEVAL
Translation of text into mathematical structures.
A variety of concrete techniques exist:
TF/IDF
Document-Term Matrix
VSM
LSA
The corresponding mathematical structure is derived based on the concrete data retrieval methodology used.
9. TF/IDF
Term Frequency / Inverse Document Frequency
Idea: The more common a term is across the document collection, the less important it is, and hence it should be weighted at the low end of the query spectrum.
Two independent components:
Term Frequency – the frequency of occurrence of a term in a given document.
Inverse Document Frequency – a measure of the general importance of the term across the collection.
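As an illustration (not part of the original deck), a minimal TF-IDF computation in Python over the three example documents used below; raw counts for TF and a base-10 logarithm for IDF are assumptions following [7]:

    import math

    # Minimal TF-IDF sketch over the deck's three example documents.
    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})

    def tf(term, doc):
        # Term frequency in its simplest form: the raw term count.
        return doc.count(term)

    def idf(term, corpus):
        # log10(N / df): the more documents a term appears in, the lower its weight.
        df = sum(1 for doc in corpus if term in doc)
        return math.log10(len(corpus) / df) if df else 0.0

    weights = [[tf(t, doc) * idf(t, tokenized) for t in vocab] for doc in tokenized]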
10. TF/IDF Example [7]
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Two steps:
1. Calculate the Term Frequency
2. Calculate the Inverse Document Frequency
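For instance (worked here for illustration, using base-10 logarithms as in [7]): with N = 3 documents, "silver" occurs only in D2, so idf(silver) = log(3/1) ≈ 0.477; "gold" occurs in D1 and D3, so idf(gold) = log(3/2) ≈ 0.176; and terms such as "a", "in", and "of" occur in all three documents, so their idf is log(3/3) = 0.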
12. Document-Term Matrix
"A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents." [2]
Rows – Documents
Columns – Terms
It only depicts which document contains which term and the number of occurrences of that term in that document.
13. Document-Term Matrix Example
D1 = “I like databases”
D2 = “I hate hate databases”
      I   like   databases   hate
D1    1     1        1         0
D2    1     0        1         2
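A small sketch (added for illustration) that reproduces this matrix in Python:

    from collections import Counter

    # Build the document-term matrix for the two example documents.
    docs = {"D1": "I like databases", "D2": "I hate hate databases"}
    terms = ["I", "like", "databases", "hate"]
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    matrix = {name: [counts[name][t] for t in terms] for name in docs}
    # matrix == {"D1": [1, 1, 1, 0], "D2": [1, 0, 1, 2]}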
14. VSM
"Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms." [3]
Each document and query is represented as a vector:
document: dj = (w1,j, w2,j, ..., wn,j)
query: q = (w1,q, w2,q, ..., wn,q)
Terms can be individual words, keywords, or phrases, depending on the application.
15. VSM Example [7]
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query – "Gold Silver Truck"
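A sketch (for illustration) of the VSM representation for these documents and the query; raw counts are used as weights here, though a weighting scheme such as TF-IDF is used in practice:

    # Represent each document and the query as a vector over the shared vocabulary.
    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    vocab = sorted({w for d in docs for w in d.split()})

    d_vecs = [[d.split().count(t) for t in vocab] for d in docs]   # document vectors
    q_vec = ["gold silver truck".split().count(t) for t in vocab]  # query vector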
17. LSA
"Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words." [4]
Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms.
A two-step process:
1. Construction of the Document-Term Matrix
2. Singular Value Decomposition
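A compact numpy sketch of the two steps (an illustration, not the deck's code); a rank-2 reduction and the query-folding formula qk = qᵀ Uk Sk⁻¹ from [7] are assumed:

    import numpy as np

    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    terms = sorted({w for d in docs for w in d.split()})

    # Step 1: term-document matrix A (rows = terms, columns = documents).
    A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

    # Step 2: singular value decomposition, keeping the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    # Fold the query "gold silver truck" into the reduced space.
    q = np.array([1.0 if t in {"gold", "silver", "truck"} else 0.0 for t in terms])
    q_k = q @ U_k @ np.linalg.inv(S_k)
    doc_coords = Vt_k.T  # one row per document in the reduced space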
18. LSA Example
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query – "Gold Silver Truck"
19. LSA Example contd...
STEP 1: Constructing the Term-Document Matrix & Query Matrix
22. A similar SVD evaluation and reduction is done for the query vector Q.
At the end we have:
Reduced SVD matrix V (for the documents)
Reduced SVD matrix Q (for the query)
(The reduced matrices V and Q are shown on the original slide.)
These can then be supplied to a similarity measurement technique.
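To make the hand-off concrete (a self-contained illustration with placeholder coordinates, since the slide's actual matrices are images), similarity measurement on the reduced vectors is a plain vector comparison:

    import numpy as np

    # Placeholder reduced coordinates (illustrative, not the slide's values).
    doc_coords = np.array([[-0.49,  0.65],
                           [-0.65, -0.72],
                           [-0.58,  0.25]])   # rows of the reduced matrix V
    q_k = np.array([-0.21, -0.18])            # the reduced query vector Q

    # Cosine similarity between the query and each document row.
    sims = doc_coords @ q_k / (np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_k))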
24. SIMILARITY MEASUREMENTS
The major focus of the "Text Similarity" methodology.
Uses the mathematical structures generated by the data retrieval techniques to evaluate the degree of likeness between two or more documents or web pages.
Two major techniques in focus here:
Cosine Similarity
SOC-PMI
25. COSINE SIMILARITY
Evaluates the similarity between two vectors by measuring the cosine of the angle between them.
The cosine of the angle determines whether the vectors are roughly pointing in the same direction.
In our scope, the similarity will range between 0 and 1, since term weights are always positive,
i.e. the angle between two considered vectors will never exceed 90°.
26. COSINE Example [7]
Example continued from VSM.
Three Documents –
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query – "Gold Silver Truck"
We have calculated the weights using the TF-IDF scheme.
Next step – calculate the cosine similarity:
cos θ(Di) = (Q · Di) / (|Q| × |Di|)
i.e. first calculate the dot product Q · Di,
then the product of the magnitudes |Q| × |Di|.
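A direct transcription of this formula into Python (illustrative; q and d are assumed to be TF-IDF weight vectors such as those computed earlier):

    import math

    def cosine(q, d):
        # cos(theta) = (Q . D) / (|Q| * |D|)
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0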
28. SOC-PMI
"Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus." [5]
Considerable mathematics is involved in deriving the formula.
The resulting similarity measure is also normalized so that it lies between 0 and 1.
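SOC-PMI builds on first-order pointwise mutual information; a minimal PMI sketch (with made-up corpus counts, purely for illustration):

    import math

    def pmi(count_xy, count_x, count_y, total):
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        p_xy = count_xy / total
        return math.log2(p_xy / ((count_x / total) * (count_y / total)))

    # Made-up counts: two words co-occurring within an 11-word window.
    print(pmi(count_xy=30, count_x=500, count_y=400, total=1_000_000))  # ~7.23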
29. SOC-PMI with an example
A complicated method involving many mathematical formulae.
Example [6]:
W1 = car
W2 = automobile
m = 70, n = 43
Assumptions:
γ = 3, δ = 0.7
window of 11 words
β1 = β2 = 24.88
30. SOC-PMI example contd...
Bigram frequencies, types & frequencies, and the sets X and Y of words with their PMI values (tables shown on the original slide).
33. APPLICATIONS
Plagiarism Detection
Text similarity plays an important role in the field of plagiarism detection.
Copyright Violation
Copies of restricted software/data can be detected using text similarity.
Recommender Services
Similar documents or items can be suggested based on their text similarity.
34. PROTOTYPE
AIM: Finding the degree of similarity between files.
Two steps:
Data Retrieval
TF-IDF
Similarity Measurement
Cosine
Pearson Correlation
Distribution Matrix
Co-occurrence
35. Prototype – Data Retrieval
Steps followed to retrieve data using the TF-IDF scheme (a conceptual sketch follows this list):
SequenceFilesFromDirectory
Converts files into sequence files <Text, Text>
DocumentProcessor
Converts the sequence files into <Text, StringTuple>
DictionaryVectorizer
Creates TF vectors <Text, VectorWritable>
Creates dfcount <IntWritable, LongWritable>
Creates wordcount <Text, LongWritable>
TFIDFConverter
Creates TF-IDF vectors <Text, VectorWritable>
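The prototype itself runs these stages as Hadoop/Mahout jobs; the plain-Python sketch below (function names mirror the stages but are otherwise invented) only illustrates what each stage produces:

    import math
    import os

    def sequence_files_from_directory(path):
        # Stage 1: read each file into a (name, text) pair.
        for fname in os.listdir(path):
            with open(os.path.join(path, fname)) as f:
                yield fname, f.read()

    def document_processor(pairs):
        # Stage 2: tokenize each document into a sequence of terms.
        for name, text in pairs:
            yield name, text.lower().split()

    def dictionary_vectorizer(docs):
        # Stage 3: dictionary, per-document TF vectors, and document frequencies.
        docs = list(docs)
        vocab = sorted({t for _, terms in docs for t in terms})
        tf = {name: [terms.count(t) for t in vocab] for name, terms in docs}
        df = [sum(1 for _, terms in docs if t in terms) for t in vocab]
        return vocab, tf, df, len(docs)

    def tfidf_converter(tf, df, n_docs):
        # Stage 4: scale term frequencies by inverse document frequency.
        return {name: [f * (math.log10(n_docs / d) if d else 0.0)
                       for f, d in zip(vec, df)]
                for name, vec in tf.items()}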
36. Prototype – Similarity Measurement
Intermediate steps
Convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>
Similarity Measurement
Distribution Multiplication
Matrix * Matrix'
Cosine, Pearson Correlation and Co-occurrence (a small sketch follows below)
RowSimilarityJob (similarity classname):
SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_COOCCURRENCE
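A numpy sketch of the idea behind this step (illustrative values, not the RowSimilarityJob implementation): stacking the TF-IDF vectors as rows of a matrix M, M · Mᵀ gives the pairwise dot products of the "Matrix * Matrix'" step, and normalizing rows first yields pairwise cosine similarity:

    import numpy as np

    # Each row is one file's TF-IDF vector (illustrative numbers).
    M = np.array([[0.00, 0.48, 0.18],
                  [0.95, 0.00, 0.18],
                  [0.00, 0.48, 0.35]])

    dots = M @ M.T                                   # raw pairwise dot products
    rows = M / np.linalg.norm(M, axis=1, keepdims=True)
    cosine_sim = rows @ rows.T                       # pairwise cosine similarity

    # Pearson correlation is the cosine of the mean-centered rows.
    centered = M - M.mean(axis=1, keepdims=True)
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    pearson_sim = centered @ centered.T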
41. References
[1] Wikipedia: Information retrieval – Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
[2] Wikipedia: Document-term matrix – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
[3] Wikipedia: Vector space model – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
[4] Wikipedia: Latent semantic indexing – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
[5] Wikipedia: Second-order co-occurrence pointwise mutual information – Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
[6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
[7] Garcia, E. Miislita.com, http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
Editor's Notes
Data retrieval – In layman's terms, data retrieval means that the words or terms within a document or web page are translated into some mathematical structure. Given a document, each distinct word or term within it is translated into a particular mathematical structure, e.g. a vector or a frequency matrix.
TF – In its simplest form, the term frequency is also called the term count, which is simply the number of occurrences of the term in the document. IDF – obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Note that if a term has a high term frequency in the given document and a low document frequency across the considered set of documents (implying a high inverse document frequency), a high tf-idf value is achieved.
Too simple to be used. Not realistic.
http://www.miislita.com/term-vector/term-vector-3.html – The vector value (or term weight) for each term existing in a document is non-zero; it is calculated using some scheme. One well-known scheme is TF/IDF. D1: "Shipment of gold damaged in a fire"; D2: "Delivery of silver arrived in a silver truck"; D3: "Shipment of gold arrived in a truck"; Q: "Gold Silver Truck"