PUSHPIN
TEXT SIMILARITIES
           Junaid Surve
               6644418
AGENDA
       Introduction
       Data Retrieval
           TF/IDF
           Document-Term Matrix
           VSM
           LSA
       Similarity Measurements
           Cosine Similarity
           SOC-PMI
       Applications & Prototype
       Summary
    2
INTRODUCTION
       WWW – a huge tangled web of information.

       Issues faced – duplication, plagiarism, copyright
        violation, etc.

       Aim : To detect and report duplicates.

       Method : Compare documents and report their level of
        similarity, i.e. their “TEXT SIMILARITY”.



    4
       Text Similarity has 2 aspects :
           Content Similarity : Words are compared.
            e.g. “I have a car” and “I have a vehicle” are 75% similar
            (3 of 4 words match).

           Expression Similarity : Meaning of the information is
            considered.
            e.g. “I have a car” and “I have a vehicle” can be
            considered 100% similar.

       Scope – Content Similarity
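
A minimal sketch of how the 75% figure for content similarity could be obtained. The measure used here (shared distinct words divided by the size of the larger word set) is an assumption made for illustration; the slides do not state which word-overlap formula was used, and the function name `content_similarity` is hypothetical.

```python
def content_similarity(a: str, b: str) -> float:
    """Shared distinct words relative to the larger word set (assumed measure)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / max(len(words_a), len(words_b))


score = content_similarity("I have a car", "I have a vehicle")
print(f"{score:.0%}")  # prints 75%
```

Only the words themselves are compared, so "car" and "vehicle" count as a mismatch; treating them as equivalent would be expression similarity, which is outside the stated scope.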



    5
       2 step process:

           STEP 1 : Data Retrieval
            “The area of study concerned with searching for
            documents, for information within documents, and for
            metadata about documents, as well as that of searching
            structured storage, relational databases, and the World
            Wide Web” [1]

           STEP 2 : Similarity Measurements
            Correlating the words or terms of two or more documents
            or web pages.

    6
DATA RETRIEVAL
       Translation of text into a mathematical representation.

       Several concrete techniques exist –
           TF/IDF
           Document-Term Matrix
           VSM
           LSA

       The resulting mathematical structure depends on the
        data retrieval technique used.

    8
TF/IDF
       Term Frequency / Inverse Document Frequency

       Idea : The more documents a term appears in, the less it
        helps to tell them apart, so it should carry the least
        weight when matching a query.

       Two independent factors, multiplied together:
           Term Frequency - how often a term occurs in a
            given document.
           Inverse Document Frequency - how rare, and hence how
            informative, the term is across the whole collection.

    9
TF IDF Example [7]
    Three Documents –
        D1: “Shipment of gold damaged in a fire”
        D2: “Delivery of silver arrived in a silver truck”
        D3: “Shipment of gold arrived in a truck”


    Two steps
        Calculate the Term Frequency
        Calculate the Inverse Document Frequency




    10
TF IDF Example
     Terms      D1   D2   D3   df_i   D/df_i       IDF_i = log10(D/df_i)
     a          1    1    1    3      3/3 = 1      0
     arrived    0    1    1    2      3/2 = 1.5    0.1761
     damaged    1    0    0    1      3/1 = 3      0.4771
     delivery   0    1    0    1      3/1 = 3      0.4771
     fire       1    0    0    1      3/1 = 3      0.4771
     gold       1    0    1    2      3/2 = 1.5    0.1761
     in         1    1    1    3      3/3 = 1      0
     of         1    1    1    3      3/3 = 1      0
     silver     0    2    0    1      3/1 = 3      0.4771
     shipment   1    0    1    2      3/2 = 1.5    0.1761
     truck      0    1    1    2      3/2 = 1.5    0.1761
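
The document frequencies and IDF values in this table can be recomputed with a short sketch. It assumes whitespace tokenization, lowercasing, and a base-10 logarithm (which matches the tabulated values); the variable names are illustrative.

```python
import math

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Assumed preprocessing: lowercase and split on whitespace.
tokenized = {name: text.lower().split() for name, text in docs.items()}
vocab = sorted({term for tokens in tokenized.values() for term in tokens})

D = len(docs)  # number of documents in the collection
# df: in how many documents each term occurs (not how often it occurs).
df = {term: sum(term in tokens for tokens in tokenized.values()) for term in vocab}
idf = {term: math.log10(D / df[term]) for term in vocab}

for term in vocab:
    print(f"{term:<10}{df[term]:>3}{idf[term]:>9.4f}")
```

Note that "silver" occurs twice in D2 but still has a document frequency of 1, because df counts documents rather than occurrences.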


11
Document-Term Matrix
    “A Document-Term Matrix is a mathematical matrix
     that describes the frequency of terms that occur in a
     collection of documents.” [2]

    Rows – Documents
     Columns – Terms

    Only depicts which document contains which term
     and the number of occurrences of that term in the
     document.


    12
Document-Term Matrix Example
    D1 = “I like databases”
    D2 = “I hate hate databases”


                   I         like   databases   hate
    D1             1          1        1         0
    D2             1          0        1         2
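
A minimal sketch that builds this document-term matrix. It assumes whitespace tokenization and orders the columns by first appearance of each term, which happens to reproduce the column order shown above.

```python
docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Column order: terms in order of first appearance across the documents.
terms = []
for text in docs.values():
    for word in text.split():
        if word not in terms:
            terms.append(word)

# One row per document, one column per term, each cell a raw count.
matrix = {name: [text.split().count(term) for term in terms]
          for name, text in docs.items()}

print(terms)
for name, row in matrix.items():
    print(name, row)
```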




    13
VSM
    “Vector Space Model (VSM) is an algebraic model
     for representing text documents (and any objects, in
     general) as vectors of identifiers, such as, for example,
     index terms.” [3]

    Each document and query is represented as a
     vector:
        document : d_j = (w_{1,j}, w_{2,j}, ..., w_{n,j})
        query : q = (w_{1,q}, w_{2,q}, ..., w_{n,q})
     where w_{i,j} is the weight of term i in document j.


    Terms can be individual words, keywords, or
     phrases, based on the type of application.
    14
VSM Example [7]

    Three Documents –
        D1: “Shipment of gold damaged in a fire”
        D2: “Delivery of silver arrived in a silver truck”
        D3: “Shipment of gold arrived in a truck”

    Query –
        Gold Silver Truck




    15
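VSM Example continued...
        Calculating TF-IDF
     Terms      Q   D1   D2   D3   IDF_i    Q x IDF_i   D1 x IDF_i   D2 x IDF_i   D3 x IDF_i
     a          0   1    1    1    0        0           0            0            0
     arrived    0   0    1    1    0.1761   0           0            0.1761       0.1761
     damaged    0   1    0    0    0.4771   0           0.4771       0            0
     delivery   0   0    1    0    0.4771   0           0            0.4771       0
     fire       0   1    0    0    0.4771   0           0.4771       0            0
     gold       1   1    0    1    0.1761   0.1761      0.1761       0            0.1761
     in         0   1    1    1    0        0           0            0            0
     of         0   1    1    1    0        0           0            0            0
     silver     1   0    2    0    0.4771   0.4771      0            0.9542       0
     shipment   0   1    0    1    0.1761   0           0.1761       0            0.1761
     truck      1   0    1    1    0.1761   0.1761      0            0.1761       0.1761
    16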
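
The weighting step can be sketched as follows: each weight is the raw term count multiplied by the term's IDF. This is a sketch under assumptions (whitespace tokenization, lowercasing, base-10 logarithm, zero weights omitted from the output); the helper name `tfidf` is illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "Gold Silver Truck"

counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))
idf = {t: math.log10(len(docs) / sum(1 for c in counts.values() if t in c))
       for t in vocab}


def tfidf(text: str) -> dict:
    """Sparse TF-IDF vector: term -> count * IDF, zero weights left out."""
    tf = Counter(text.lower().split())
    return {t: tf[t] * idf[t] for t in vocab if tf[t] and idf[t]}


weights = {name: tfidf(text) for name, text in docs.items()}
weights["Q"] = tfidf(query)

for name, vector in weights.items():
    print(name, {t: round(w, 4) for t, w in vector.items()})
```

Terms that occur in every document ("a", "in", "of") have an IDF of 0 and therefore drop out of every vector.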
LSA
    “Latent Semantic Analysis (LSA) is a theory and
     method for extracting and representing the meaning
     of words and passages of words.” [4]

    Built on the assumption that words with similar meanings
     tend to occur in similar contexts, which makes correlation
     patterns between documents and terms easier to identify.

    2 step process:
        Construction of Document-Term Matrix
        Singular Value Decomposition

    17
LSA Example

    Three Documents –
        D1: “Shipment of gold damaged in a fire”
        D2: “Delivery of silver arrived in a silver truck”
        D3: “Shipment of gold arrived in a truck”

    Query –
        Gold Silver Truck




    18
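LSA Example contd...
     STEP 1 : Constructing the Term-Document Matrix & Query Matrix
     [Figure: term-document matrix and query matrix]
19
LSA Example contd...
     STEP 2 : Evaluating the Singular Value Decomposition
     [Figure: singular value decomposition of the term-document matrix]
20
LSA Example contd...
     STEP 3 : Reducing Dimensionality w.r.t. k
     [Figure: matrices truncated to the k largest singular values]
21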
    The query vector Q is reduced to the same k dimensions
     as the documents.

    At the end we have:
        Reduced SVD Matrix V (for the documents)
        Reduced SVD Matrix Q (for the query)

    [Figure: reduced matrices V and Q]

    These can then be supplied to a similarity
     measurement technique.
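
The three LSA steps can be sketched end to end on the example documents. The following assumptions are not taken from the slides: k = 2, raw term counts as matrix entries, and the query mapped into the reduced space as q_k = q^T U_k S_k^-1. To stay free of external libraries, the sketch obtains the singular values and right singular vectors by power iteration on A^T A instead of calling a library SVD routine; the result is the same decomposition, and a real system would normally use a linear-algebra library.

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
k = 2  # assumed number of retained dimensions

names = list(docs)
terms = sorted({w for text in docs.values() for w in text.split()})
# STEP 1: term-document matrix A (rows: terms, columns: documents) and query q.
A = [[docs[n].split().count(t) for n in names] for t in terms]
q = [query.split().count(t) for t in terms]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def top_eigenpairs(G, k, iters=500):
    """Largest k eigenpairs of a symmetric matrix: power iteration plus deflation."""
    G = [row[:] for row in G]
    size = len(G)
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(size)]  # arbitrary start vector
        for _ in range(iters):
            w = [dot(row, v) for row in G]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in G])
        pairs.append((lam, v))
        # Deflation: remove the component just found before the next round.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(size)] for i in range(size)]
    return pairs


# STEP 2: eigenvectors of A^T A are the right singular vectors of A,
# and its eigenvalues are the squared singular values.
cols = [[row[j] for row in A] for j in range(len(names))]
gram = [[dot(ci, cj) for cj in cols] for ci in cols]
pairs = top_eigenpairs(gram, k)

# STEP 3: keep k dimensions. Document j is row j of V_k.
doc_vecs = {n: [v[j] for _, v in pairs] for j, n in enumerate(names)}
# Query: q^T U_k S_k^-1, rewritten as (q^T A v_c) / sigma_c^2 per dimension c.
qA = [dot(q, col) for col in cols]
q_vec = [dot(qA, v) / lam for lam, v in pairs]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


sims = {n: cosine(q_vec, vec) for n, vec in doc_vecs.items()}
for name, value in sorted(sims.items(), key=lambda item: -item[1]):
    print(name, round(value, 4))
```

The sign of each singular vector is arbitrary, but it flips the document and query coordinates together, so the cosine values are unaffected.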
    22
SIMILARITY MEASUREMENTS
    Major focus of “Text Similarities” methodology.

    Uses the Mathematical Structures generated by the
     Data Retrieval techniques to evaluate the
     percentage of likeness between two or more
     documents or web pages.

    Two major techniques in focus here:
        Cosine Similarity
        SOC-PMI


    24
COSINE SIMILARITY
    Evaluates the similarity between 2 vectors by measuring
     the cosine of the angle between them.

    The cosine of the angle determines whether the
     vectors point in roughly the same direction.

    In our scope : similarity ranges between 0 and 1,
     since term weights are never negative,
     i.e. the angle between two vectors never
     exceeds 90°.


    25
COSINE Example [7]
    Example continued from VSM.
        Three Documents –
            D1: “Shipment of gold damaged in a fire”
            D2: “Delivery of silver arrived in a silver truck”
            D3: “Shipment of gold arrived in a truck”
        Query – Gold Silver Truck

    We have calculated weights using TF-IDF scheme.

    Next Step – Calculate Cosine Similarity:
        cos(θ_Di) = (Q · D_i) / (|Q| x |D_i|)
        i.e. first calculate the dot product: Q · D_i
        Then calculate the product of the magnitudes: |Q| x |D_i|

    26
COSINE Example continued...
    Dot Products: Q · D_i = ∑_j w_{Q,j} w_{i,j}
        Q · D_1 = 0.0310, Q · D_2 = 0.4862, Q · D_3 = 0.0620


    Products of magnitudes: |Q| x |D_i| = sqrt(∑_j w_{Q,j}^2) x sqrt(∑_j w_{i,j}^2)
        |Q| x |D_1| = 0.3871, |Q| x |D_2| = 0.5896, |Q| x |D_3| = 0.1896


    Cosine Similarity:
        cos(θ_D1) = 0.0801
        cos(θ_D2) = 0.8246
        cos(θ_D3) = 0.3271
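
The figures on this slide can be checked with a short sketch. The weights below are the rounded TF-IDF values that follow from the term counts and IDF values of the running example, stored as sparse dictionaries with zero weights omitted; the function name `cosine` is illustrative.

```python
import math

# Rounded TF-IDF weights of the running example (zero weights omitted).
Q = {"gold": 0.1761, "silver": 0.4771, "truck": 0.1761}
documents = {
    "D1": {"damaged": 0.4771, "fire": 0.4771, "gold": 0.1761, "shipment": 0.1761},
    "D2": {"arrived": 0.1761, "delivery": 0.4771, "silver": 0.9542, "truck": 0.1761},
    "D3": {"arrived": 0.1761, "gold": 0.1761, "shipment": 0.1761, "truck": 0.1761},
}


def cosine(u: dict, v: dict) -> float:
    """Dot product over shared terms divided by the product of the magnitudes."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


scores = {name: cosine(Q, vector) for name, vector in documents.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 4))
```

The results agree with the slide to about three decimal places (small differences in the fourth decimal come from rounding of intermediate values) and rank the documents D2, D3, D1 for this query.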


    27
SOC-PMI
    “Second-Order Co-occurrence Pointwise Mutual
     Information (SOC-PMI) is a semantic similarity
     measure using pointwise mutual information to sort
     lists of important neighbor words of the two target
     words from a large corpus.” [5]

    Deriving the formula involves several intermediate
     quantities (see [6]).

    The resulting similarity measure is normalized
     so that it ranges between 0 and 1.
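
As a hedged illustration of the building block only, the sketch below computes first-order pointwise mutual information between two words, using the form log2(f_b(w1, w2) x m / (f_t(w1) x f_t(w2))), where f_b is the co-occurrence count within a window, f_t the word frequency, and m the number of tokens. It is not the full SOC-PMI measure (neighbor lists, sorting, and normalization are left out), and the toy corpus is hypothetical.

```python
import math
from collections import Counter


def pmi(tokens, w1, w2, window=11):
    """First-order PMI of w1 and w2; returns None if they never co-occur."""
    m = len(tokens)
    freq = Counter(tokens)
    half = window // 2  # words considered on each side of w1
    co_occurrences = 0
    for i, token in enumerate(tokens):
        if token != w1:
            continue
        for j in range(max(0, i - half), min(m, i + half + 1)):
            if j != i and tokens[j] == w2:
                co_occurrences += 1
    if co_occurrences == 0:
        return None
    return math.log2(co_occurrences * m / (freq[w1] * freq[w2]))


# Hypothetical toy corpus, for illustration only.
corpus = "the car is an automobile and the car is fast".split()
value = pmi(corpus, "car", "automobile")
print(round(value, 4))
```

SOC-PMI then goes one step further: it compares the lists of words that have high PMI with each target word, rather than relying on the two target words co-occurring directly.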

    28
SOC-PMI with an example
    An involved method that chains several formulae.

    Example [6] :
        W1 = car
        W2 = automobile

        m = 70, n = 43

    Assumptions:
        γ = 3, δ = 0.7
        window of 11 words
    β1 = β2 = 24.88
    [Figure: corpus]
    29
SOC-PMI example contd...
    [Figures: types and frequencies; bigram frequencies and the
     sets X and Y of words with their PMI values]
30
SOC-PMI example contd...




31
APPLICATIONS
    Plagiarism Detection
     Text similarity plays an important role in
     plagiarism detection.
    Copyright Violation
     Copies of restricted software or data can be detected using
      text similarity.
    Recommender Services




    33
PROTOTYPE
    AIM : Finding the degree of Similarity between files.

    2 steps
        Data Retrieval
            TF-IDF
        Similarity Measurement
            Cosine
            Pearson Correlation
            Distribution Matrix
            Co-occurrence




    34
Prototype – Data Retrieval
    Steps followed to retrieve data using the TF-IDF scheme
        SequenceFilesFromDirectory
            Converts files into sequence files <Text, Text>

        DocumentProcessor
            Converts the sequence file into <Text, StringTuple>

        DictionaryVectorizer
            Creates TF vectors <Text, VectorWritable>
            Creates dfcount <IntWritable, LongWritable>
            Creates wordcount <Text, LongWritable>

        TFIDFConverter
            Creates TF-IDF vectors <Text, VectorWritable>

    35
Prototype – Similarity Measurement
    Intermediate steps
        Convert the TF-IDF into a Matrix <IntWritable,
         VectorWritable>



    Similarity Measurement
        Distribution Multiplication
            Matrix * Matrix´
        Cosine, Pearson Correlation and Co-occurrence
            RowSimilarityJob (Similarity Classname)
                SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
                SIMILARITY_PEARSON_CORRELATION
                SIMILARITY_COOCCURRENCE
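
For orientation, the measures named here can be sketched on two small row vectors in plain Python. These are generic textbook definitions written as assumptions for illustration; they are not the prototype's job implementations, the example rows are hypothetical, and "co-occurrence" is taken here to mean the number of terms present in both rows.

```python
import math

# Hypothetical matrix rows: one term-count vector per document.
rows = {"D1": [1, 1, 1, 0], "D2": [1, 0, 1, 2]}


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pearson(u, v):
    """Pearson correlation: cosine of the two vectors after mean-centering."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    denom = math.sqrt(dot(du, du)) * math.sqrt(dot(dv, dv))
    return dot(du, dv) / denom if denom else 0.0


def cooccurrence(u, v):
    """Assumed definition: number of positions where both rows are non-zero."""
    return sum(1 for x, y in zip(u, v) if x and y)


# Matrix times its transpose: every pairwise dot product between rows.
product = {(a, b): dot(rows[a], rows[b]) for a in rows for b in rows}

print(product)
print(round(pearson(rows["D1"], rows["D2"]), 4))
print(cooccurrence(rows["D1"], rows["D2"]))
```

Unlike cosine similarity on non-negative weights, Pearson correlation can be negative, because the vectors are centered on their means before being compared.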
    36
Prototype – Similarity Measurement
    Cosine

    Pearson Correlation

    Distribution Matrix

    Co-occurrence




    37
SUMMARY
    What is Text Similarity.
    Scope - Content Similarity
    Steps involved in the process:
        Data Retrieval
            TF/IDF
            Document-Term Matrix
            VSM
            LSA
        Similarity Measurements
            Cosine Similarity
            SOC-PMI
    Applications & Prototype

    39
References
[1] Wikipedia: Information retrieval - Wikipedia, the free encyclopedia
   (2012), http://en.wikipedia.org/wiki/Information_retrieval
[2] Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia
   (2011), http://en.wikipedia.org/wiki/Document-term_matrix
[3] Wikipedia: Vector space model - Wikipedia, the free encyclopedia
   (2011), http://en.wikipedia.org/wiki/Vector_space_model
[4] Wikipedia: Latent semantic indexing - Wikipedia, the free
   encyclopedia (2011),
   http://en.wikipedia.org/wiki/Latent_semantic_indexing
[5] Wikipedia: Second-order co-occurrence pointwise mutual
   information - Wikipedia, the free encyclopedia (2011),
   http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
[6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI
   for Determining the Semantic Similarity of Words, in Proceedings of
   the International Conference on Language Resources and
   Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
[7] Dr. E. Garcia. Mi Islita.com,
   http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
 41

Weitere ähnliche Inhalte

Andere mochten auch

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례Taejun Kim
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingFlorian Leitner
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining TechniquesHouw Liong The
 
Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Seth Grimes
 
Tutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsTutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsYONG ZHENG
 
Analysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataAnalysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataSeth Grimes
 
발표자료 11장
발표자료 11장발표자료 11장
발표자료 11장Juhui Park
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

Andere mochten auch (11)

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String Processing
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining Techniques
 
Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Text Analytics for Dummies 2010
Text Analytics for Dummies 2010
 
Tutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsTutorial: Context In Recommender Systems
Tutorial: Context In Recommender Systems
 
Analysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataAnalysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ Data
 
발표자료 11장
발표자료 11장발표자료 11장
발표자료 11장
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Ähnlich wie Text Similarities - PG Pushpin

4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.pptBereketAraya
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.pptBereketAraya
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdfHabtamu100
 
NDD Project presentation
NDD Project presentationNDD Project presentation
NDD Project presentationahmedmishfaq
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space modeldalal404
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGGeorge Simov
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanfordSakthivel C R
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델guesta34d441
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델JUNGEUN KANG
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar itemsViet-Trung TRAN
 
Exploiting Distributional Semantic Models in Question Answering
Exploiting Distributional Semantic Models in Question AnsweringExploiting Distributional Semantic Models in Question Answering
Exploiting Distributional Semantic Models in Question AnsweringPierpaolo Basile
 
CHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTXCHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTXVasudhaSrivatsa1
 
Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...
Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...
Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...Association for Computational Linguistics
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Kira
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptpepe3059
 

Ähnlich wie Text Similarities - PG Pushpin (20)

4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
NDD Project presentation
NDD Project presentationNDD Project presentation
NDD Project presentation
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space model
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
 
확률모델
확률모델확률모델
확률모델
 
확률모델
확률모델확률모델
확률모델
 
Exploiting Distributional Semantic Models in Question Answering
Exploiting Distributional Semantic Models in Question AnsweringExploiting Distributional Semantic Models in Question Answering
Exploiting Distributional Semantic Models in Question Answering
 
CHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTXCHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTX
 
Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...
Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...
Tatyana Makhalova - 2015 - News clustering approach based on discourse text s...
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
 

Kürzlich hochgeladen

Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 

Kürzlich hochgeladen (20)

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 

Text Similarities - PG Pushpin

  • 1. PUSHPIN TEXT SIMILARITIES Junaid Surve 6644418
  • 2. AGENDA  Introduction  Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA  Similarity Measurements  Cosine Similarity  SOC-PMI  Applications & Prototype  Summary 2
  • 3. AGENDA  Introduction  Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA  Similarity Measurements  Cosine Similarity  SOC-PMI  Applications & Prototype  Summary 3
  • 4. INTRODUCTION: The WWW is a huge tangled web of information. Issues faced: duplication, plagiarism, copyright violation, etc. Aim: to detect and report duplicates. Method: compare documents and output their level of similarity, i.e. the “text similarity”.
  • 5. Text similarity has two aspects. Content similarity: words are compared, e.g. “I have a car” and “I have a vehicle” are 75% similar (three of the four words match). Expression similarity: the meaning of the information is considered, e.g. “I have a car” and “I have a vehicle” can be considered 100% similar. Scope: content similarity.
  • 6. Two-step process. Step 1, Data Retrieval: “The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web” [1]. Step 2, Similarity Measurements: correlate the words or terms of two or more documents or web pages.
  • 8. DATA RETRIEVAL: Translation of literature to mathematics. A variety of concrete techniques exist: TF/IDF, Document-Term Matrix, VSM, LSA. The resulting mathematical structure depends on the data retrieval technique used.
  • 9. TF/IDF: Term Frequency / Inverse Document Frequency. Idea: the more common a term is, the less important it is, so it should carry the least weight in a query. Two independent components: Term Frequency, the frequency of occurrence of a term in a given document; Inverse Document Frequency, a measure of the general importance of the term.
  • 10. TF-IDF Example [7]. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Two steps: calculate the term frequency, then calculate the inverse document frequency.
  • 11. TF-IDF Example: term counts, document frequency df_i and IDF = log10(D/df_i), with D = 3 documents
        Term      D1  D2  D3  df_i  D/df_i       IDF
        a          1   1   1     3  3/3 = 1      0
        arrived    0   1   1     2  3/2 = 1.5    0.1761
        damaged    1   0   0     1  3/1 = 3      0.4771
        delivery   0   1   0     1  3/1 = 3      0.4771
        fire       1   0   0     1  3/1 = 3      0.4771
        gold       1   0   1     2  3/2 = 1.5    0.1761
        in         1   1   1     3  3/3 = 1      0
        of         1   1   1     3  3/3 = 1      0
        silver     0   2   0     1  3/1 = 3      0.4771
        shipment   1   0   1     2  3/2 = 1.5    0.1761
        truck      0   1   1     2  3/2 = 1.5    0.1761
  • 12. Document-Term Matrix: “A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.” [2] Rows correspond to documents and columns to terms. It only records which document contains which term and how often that term occurs in the document.
  • 13. Document-Term Matrix Example: D1 = “I like databases”, D2 = “I hate hate databases”
              I  like  databases  hate
        D1    1     1          1     0
        D2    1     0          1     2
  • 14. VSM: “Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for e.g. index terms.” [3] Each document and query is represented as a vector: document $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$; query $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$. Terms can be individual words, keywords, or phrases, depending on the application.
  • 15. VSM Example [7]. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Query: “Gold Silver Truck”.
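  • 16. VSM Example continued: calculating TF-IDF weights (term count × IDF_i)
        Term      Q  D1  D2  D3  IDF_i   Q×IDF_i  D1×IDF_i  D2×IDF_i  D3×IDF_i
        a         0   1   1   1  0       0        0         0         0
        arrived   0   0   1   1  0.1761  0        0         0.1761    0.1761
        damaged   0   1   0   0  0.4771  0        0.4771    0         0
        delivery  0   0   1   0  0.4771  0        0         0.4771    0
        fire      0   1   0   0  0.4771  0        0.4771    0         0
        gold      1   1   0   1  0.1761  0.1761   0.1761    0         0.1761
        in        0   1   1   1  0       0        0         0         0
        of        0   1   1   1  0       0        0         0         0
        silver    1   0   2   0  0.4771  0.4771   0         0.9542    0
        shipment  0   1   0   1  0.1761  0        0.1761    0         0.1761
        truck     1   0   1   1  0.1761  0.1761   0         0.1761    0.1761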
  • 17. LSA: “Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words.” [4] It is built on the assumption that similar terms tend to appear close together, which makes correlation patterns between documents or terms easier to identify. Two-step process: construction of the Document-Term Matrix, then Singular Value Decomposition.
  • 18. LSA Example. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Query: “Gold Silver Truck”.
  • 19. LSA Example continued. Step 1: constructing the Term-Document Matrix and the Query Matrix.
  • 20. LSA Example continued. Step 2: evaluating the Singular Value Decomposition.
  • 21. LSA Example continued. Step 3: reducing the dimensionality with respect to k.
  • 22. A similar SVD evaluation and reduction is done for the query vector Q. At the end we have the reduced SVD matrix V (for the documents) and the reduced SVD matrix Q (for the query). These can then be supplied to a similarity measurement technique.
  • 24. SIMILARITY MEASUREMENTS: The major focus of the “Text Similarities” methodology. Uses the mathematical structures generated by the data retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages. Two techniques are in focus here: Cosine Similarity and SOC-PMI.
  • 25. COSINE SIMILARITY: Evaluates the similarity between two vectors by measuring the cosine of the angle between them. The cosine of the angle determines whether the vectors point in roughly the same direction. In our scope, similarity ranges between 0 and 1, since term weights are never negative; i.e. the angle between two vectors never exceeds 90°.
  • 26. COSINE Example [7], continued from the VSM example. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Query: “Gold Silver Truck”. The weights have been calculated using the TF-IDF scheme. Next step, calculate the cosine similarity: $\cos\theta_{D_i} = (Q \cdot D_i) / (|Q| \times |D_i|)$, i.e. first calculate the dot product $Q \cdot D_i$, then the product of the vector magnitudes $|Q| \times |D_i|$.
  • 27. COSINE Example continued. Dot products, $Q \cdot D_i = \sum_j w_{Q,j}\, w_{i,j}$: $Q \cdot D_1 = 0.0310$, $Q \cdot D_2 = 0.4862$, $Q \cdot D_3 = 0.0620$. Products of magnitudes, $|Q| \times |D_i| = \sqrt{\sum_j w_{Q,j}^2}\, \sqrt{\sum_j w_{i,j}^2}$: $|Q| \times |D_1| = 0.3871$, $|Q| \times |D_2| = 0.5896$, $|Q| \times |D_3| = 0.1896$. Cosine similarity: $\cos\theta_{D_1} = 0.0801$, $\cos\theta_{D_2} = 0.8246$, $\cos\theta_{D_3} = 0.3271$, so the query ranks D2 first, then D3, then D1.
  • 28. SOC-PMI: “Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus.” [5] Deriving the formula involves a fair amount of mathematics. The resulting similarity measure is normalized so that it lies between 0 and 1.
  • 29. SOC-PMI with an example [6]: a complicated method with a lot of mathematical formulae. W1 = car, W2 = automobile; m = 70, n = 43. Assumptions: γ = 3, δ = 0.7; window of 11 words; β1 = β2 = 24.88.
  • 30. SOC-PMI example continued. [Figures: bigram frequencies and the set X; types and frequencies and the set Y of words with their PMI values]
  • 33. APPLICATIONS: Plagiarism detection: term similarity plays an important role in the field of plagiarism detection. Copyright violation: copies of restricted software or data can be detected using text similarity. Recommender services.
  • 34. PROTOTYPE: Aim: finding the degree of similarity between files. Two steps: Data Retrieval (TF-IDF); Similarity Measurement (Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence).
  • 35. Prototype, Data Retrieval: steps followed to retrieve data using the TF-IDF scheme. SequenceFilesFromDirectory: converts files into sequence files <Text, Text>. DocumentProcessor: converts the sequence file into <Text, StringTuple>. DictionaryVectorizer: creates TF vectors <Text, VectorWritable>, dfcount <IntWritable, LongWritable> and wordcount <Text, LongWritable>. TFIDFConverter: creates TF-IDF vectors <Text, VectorWritable>.
  • 36. Prototype, Similarity Measurement: Intermediate step: convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>. Similarity measurement: Distribution Multiplication (Matrix * Matrix´); Cosine, Pearson Correlation and Co-occurrence via RowSimilarityJob (similarity class name): SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_COOCCURRENCE.
  • 37. Prototype, Similarity Measurement: Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence.
  • 39. SUMMARY: What text similarity is. Scope: content similarity. Steps involved in the process: Data Retrieval (TF/IDF, Document-Term Matrix, VSM, LSA); Similarity Measurements (Cosine Similarity, SOC-PMI). Applications & Prototype.
  • 41. References
        [1] Wikipedia: Information retrieval - Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
        [2] Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
        [3] Wikipedia: Vector space model - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
        [4] Wikipedia: Latent semantic indexing - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
        [5] Wikipedia: Second-order co-occurrence pointwise mutual information - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
        [6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
        [7] Dr. E. Garcia. Mi Islita.com - http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
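
The TF-IDF weighting (slides 10, 11 and 16) and the cosine ranking (slides 26 and 27) can be sketched in Python as follows. This is a minimal sketch, not the prototype's code: the whitespace tokenizer and the function names `tokenize`, `idf_table`, `tfidf_vector` and `cosine` are assumptions made for illustration, and the base-10 logarithm is inferred from the IDF values on slide 11.

```python
import math
from collections import Counter

# Documents and query taken from the slides.
DOCS = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
QUERY = "Gold Silver Truck"


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def idf_table(docs):
    """IDF = log10(D / df) for every term in the collection."""
    n_docs = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))  # count each term once per document
    return {term: math.log10(n_docs / count) for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse vector of raw term count times IDF; unknown terms are dropped."""
    counts = Counter(tokenize(text))
    return {term: counts[term] * idf[term] for term in counts if term in idf}


def cosine(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


idf = idf_table(DOCS)
query_vec = tfidf_vector(QUERY, idf)
scores = {name: cosine(query_vec, tfidf_vector(text, idf)) for name, text in DOCS.items()}

# Print the documents ranked by similarity to the query, best match first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

Run as a script, this prints the three documents ranked by similarity to the query, with D2 first. The scores agree with slide 27 to about three decimal places; the small differences come from the slides rounding intermediate values to four digits.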

Editor's Notes

  1. Data retrieval - In layman's terms, data retrieval means that the words or terms within a document or web page are translated to some mathematical structure.
  2. This basically implies that, given a document, each distinct word or term within it is translated to a particular mathematical structure, e.g. a vector or a frequency matrix.
  3. TF - In its simplest form, the term frequency is also called the term count, which is simply the number of occurrences of the term in the document. IDF - Obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Note that if a term has a high term frequency in the given document and a low document frequency in the collection of documents considered (implying a high inverse document frequency), then a high TF-IDF weight is achieved.
  4. Too simple to be used. Not realistic.
  5. http://www.miislita.com/term-vector/term-vector-3.html The vector value (or term weight) for each term that occurs in a document is non-zero and is calculated using some scheme. One such well-known scheme is TF/IDF. D1: "Shipment of gold damaged in a fire" D2: "Delivery of silver arrived in a silver truck" D3: "Shipment of gold arrived in a truck" Q: "Gold Silver Truck"
  6. http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?start=1
  7. k should be high enough to remove unwanted and most common words (e.g. a, the) and low enough to keep the important words within the context.
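
The dimensionality reduction in the LSA example (slides 19 to 22) and the role of k in note 7 can be written out as follows. This uses the standard LSI notation and query fold-in formula, which is assumed here to be what the slides' reduced matrices V and Q refer to; the symbols $U$, $\Sigma$, $\hat{q}$ and $d_j$ do not appear in the original.

```latex
% A: term-document matrix (terms x documents); q: query vector (terms x 1).
\begin{align*}
A &= U \Sigma V^{T}
  && \text{full singular value decomposition} \\
A_k &= U_k \Sigma_k V_k^{T}
  && \text{keep only the $k$ largest singular values} \\
\hat{q} &= q^{T} U_k \Sigma_k^{-1}
  && \text{fold the query into the same $k$-dimensional space} \\
\operatorname{sim}(\hat{q}, d_j) &= \frac{\hat{q} \cdot d_j}{\lVert \hat{q} \rVert \, \lVert d_j \rVert}
  && \text{$d_j$ is the row of $V_k$ for document $j$}
\end{align*}
```

The last line is the same cosine measure used in the VSM example, applied to the reduced vectors instead of the full TF-IDF vectors.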