



Seminar on Text Mining

By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 15 Dec. 2009





Seminar on Text Mining


Outline
–   Basics
–   Latent Semantic Indexing
–   Part of Speech (POS) Tagging
–   Information Extraction
–   Clustering Documents
–   Text Categorization








             Seminar on Text Mining
                   Part One




                      Basics







           Definition: Text Mining
• Text Mining can be defined as a knowledge-intensive process
  in which a user interacts with a document collection over time
  by using a suite of analysis tools.

                                         And

• Text Mining seeks to extract useful information from data
  sources (document collections) through the identification and
  exploration of interesting patterns.








                   Similarities between
                Data Mining and Text Mining
• Both types of systems rely on:
   – Preprocessing routines
   – Pattern-discovery algorithms
   – Presentation-layer elements such as visualization tools







Preprocessing Operations in Data Mining and Text Mining

• Data Mining assumes data is
   – stored in a structured format,
     so preprocessing focuses on scrubbing and normalizing data
     and on creating extensive numbers of table joins


• Text Mining preprocessing operations center on
   – identification & extraction of representative features for
     NL documents,
     to transform unstructured data stored in doc collections
     into a more explicitly structured intermediate format




Weakly Structured and Semistructured Docs

Documents
  – that have relatively little in the way of strong
     • typographical, layout, or markup indicators
    to denote structure are referred to as free-format or
    weakly structured docs (such as most scientific research papers,
    business reports, and news stories)


  – with extensive and consistent format elements in
    which field-type metadata can be more easily
    inferred are described as semistructured docs (such as
    some e-mail, HTML web pages, PDF files)




Document Features

• Although many potential features can be employed to
  represent docs, the following four types are most commonly
  used:
   –   Characters
   –   Words
   –   Terms
   –   Concepts
• High Feature Dimensionality (HFD)
   – Problems relating to HFD are typically of much greater magnitude in
     TM systems than in classic DM systems.
• Feature Sparsity
   – Only a small percentage of all possible features for a document
     collection as a whole appears in any single doc.




Representational Model of a Document

• An essential task for most text mining systems is
  the identification of a simplified subset of document features
  that can be used to represent a particular document as
  a whole.
  We refer to such a set of features as the
  representational model of a document.




Character-Level Representation

• Without positional information
  – often of very limited utility in TM applications
• With positional information
  – somewhat more useful and common (e.g.,
    bigrams or trigrams)
• Disadvantage:
  – character-based representations can often be unwieldy for
    some types of text processing techniques because
    the feature space for a document is fairly unoptimized




Word-Level Representation

• Without positional information
  – often of very limited utility in TM applications
• With positional information
  – somewhat more useful and common (e.g.,
    bigrams or trigrams)
• Disadvantage:
  – word-based representations can often be unwieldy for
    some types of text processing techniques because
    the feature space for a document is fairly unoptimized




Term-Level Representation

• Normalized terms come out of a term-extraction
  methodology
  – sequences of one or more tokenized and lemmatized words
• What is a term-extraction methodology?




Concept-Level Representation

• Concepts are features generated for a document by means
  of manual, statistical, rule-based, or hybrid categorization
  methodologies




General Architecture of Text Mining Systems – Abstract Level

• A text mining system takes raw docs as input and
  generates various types of output, such as:
   – Patterns
   – Maps of connections
   – Trends

  [Figure: Documents (input) → text mining system → patterns, connections, trends (output)]




General Architecture of Text Mining Systems – Functional Level

• TM systems follow the general model provided by some classic
  DM applications and are thus divisible into 4 main areas:
   –   Preprocessing tasks
   –   Core mining operations
   –   Presentation-layer components and browsing functionality
   –   Refinement techniques




System Architecture for a Generic Text Mining System








System Architecture for a Domain-Oriented Text Mining System








System Architecture for an Advanced Text Mining System with a Background Knowledge Base








                 Seminar on Text Mining
                       Part Two




Latent Semantic Indexing (LSI)








  Problems with Lexical Semantics
• Ambiguity and association in natural language

  – Polysemy: Words often have a multitude of meanings
    and different types of usage such as bank (more severe
    in very heterogeneous collections).
  – The vector space model is unable to discriminate
    between different meanings of the same word.








Problems with Lexical Semantics
– Synonymy: Different terms may have an
  identical or a similar meaning (weaker:
  words indicating the same topic).
– No associations between words are made in
  the vector space representation.
– The problem of synonymy may be solved with
  LSI.




Polysemy and Context
• Document similarity on the single-word level must cope with
  polysemy and context.

  [Figure: the word pair "planet … saturn" contributes to document
  similarity when used in meaning 1 (ring, jupiter, space, voyager),
  but not when used in meaning 2 (car, company, dodge, ford).]




Latent Semantic Indexing – Introduction

• Problem: frequency-based indexing methods did not utilize
  any global relationships within the docs collection

• Solution: LSI is an indexing method based on the
  Singular Value Decomposition (SVD)

• How: SVD transforms the word-document matrix such
  that the major intrinsic associative patterns in the
  collection are revealed




Latent Semantic Indexing – Introduction

• Main advantage: it does not depend on individual words to
  locate documents, but rather uses the concept or topic
  to find relevant docs

• Usage: when a researcher submits a query, it is
  transformed into LSI space and compared with other
  docs in the same space




Singular Value Decomposition

For an M × N matrix A of rank r there exists a factorization
(Singular Value Decomposition = SVD) as follows:

    A = U Σ Vᵀ

where U is M × M, Σ is M × N, and V is N × N.

The columns of U are orthogonal eigenvectors of AAᵀ.
The columns of V are orthogonal eigenvectors of AᵀA.
The eigenvalues λ₁ … λ_r of AAᵀ are also the eigenvalues of AᵀA.

    σᵢ = √λᵢ,   Σ = diag(σ₁ … σ_r)   (the singular values)
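As a quick illustration (an addition, not part of the original slides), a minimal numpy sketch of the factorization:

    import numpy as np

    # Toy 4 x 3 term-document matrix (rows = terms, columns = documents).
    A = np.array([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])

    # Full SVD: U is M x M, Vt is N x N, s holds the singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=True)

    # Rebuild the M x N diagonal Sigma and verify A = U @ Sigma @ Vt.
    Sigma = np.zeros(A.shape)
    Sigma[:len(s), :len(s)] = np.diag(s)
    assert np.allclose(A, U @ Sigma @ Vt)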




   Singular Value Decomposition
• Illustration of SVD dimensions and sparseness








Low-rank Approximation

• Solution via SVD:

    A_k = U diag(σ₁, …, σ_k, 0, …, 0) Vᵀ
    (set the smallest r − k singular values to zero)

    A_k = Σᵢ₌₁ᵏ σᵢ uᵢ vᵢᵀ
    (column notation: sum of rank-1 matrices)
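A short numpy sketch (illustrative) of the rank-k truncation:

    import numpy as np

    def rank_k_approx(A, k):
        """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        # Keep only the k largest singular values/vectors; the rest are zeroed.
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]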




Reduced SVD

• If we retain only k singular values, and set the rest to 0, then
  we don’t need the matrix parts shown in red in the figure
• Then Σ is k×k, U is M×k, Vᵀ is k×N, and A_k is M×N
• This is referred to as the reduced SVD
• It is the convenient (space-saving) and usual form for
  computational applications
• It’s what Matlab gives you




Approximation error

• How good (bad) is this approximation?
• It’s the best possible, measured by the Frobenius
  norm of the error:

    min_{X : rank(X) = k} ‖A − X‖_F = ‖A − A_k‖_F = σ_{k+1}

where the σᵢ are ordered such that σᵢ ≥ σᵢ₊₁.
This suggests why the Frobenius error drops as k is increased.




      SVD Low-rank approximation
• Whereas the term-doc matrix A may have M=50000,
  N=10 million (and rank close to 50000)
• We can construct an approximation A100 with rank 100.
   – Of all rank 100 matrices, it would have the lowest Frobenius
     error.


• Great … but why would we??
• Answer: Latent Semantic Indexing







  Latent Semantic Indexing (LSI)
• Perform a low-rank approximation of document-
  term matrix (typical rank 100-300)
• General idea
  – Map documents (and terms) to a low-dimensional
    representation.
  – Design a mapping such that the low-dimensional space
    reflects semantic associations (latent semantic space).
  – Compute document similarity based on the inner product
    in this latent semantic space







Goals of LSI

• Similar terms map to similar locations in the
  low-dimensional space
• Noise reduction by dimension reduction




      Latent Semantic Analysis
• Latent semantic space: illustrating example




                                                  courtesy of Susan Dumais




Performing the maps

• Each row and column of A gets mapped into the
  k-dimensional LSI space, by the SVD.
• Claim – this is not only the mapping with the best
  (Frobenius error) approximation to A, but in fact
  improves retrieval.
• A query q is also mapped into this space, by

    q_k = qᵀ U_k Σ_k⁻¹

  – The query is NOT a sparse vector.
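A small numpy sketch (illustrative, not from the slides) of this query mapping, reusing U and s from a previously computed SVD:

    import numpy as np

    def map_query(q, U, s, k):
        """Project a term-space query vector q into the k-dim LSI space:
        q_k = q^T U_k Sigma_k^{-1} (dividing by s[:k] applies Sigma_k^{-1})."""
        return (q @ U[:, :k]) / s[:k]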




      But why is this clustering?
• We’ve talked about docs, queries, retrieval and
  precision here.
• What does this have to do with clustering?
• Intuition: Dimension reduction through LSI
  brings together “related” axes in the vector
  space.








Intuition from block matrices

[Figure: an M-terms × N-documents matrix made of k homogeneous
non-zero blocks along the diagonal, with 0’s elsewhere.
What’s the rank of this matrix?]




Intuition from block matrices

[Same block-diagonal figure.]
Vocabulary partitioned into k topics (clusters); each doc
discusses only one topic.




Intuition from block matrices

[Same figure of non-zero diagonal blocks.
What’s the best rank-k approximation to this matrix?]




Intuition from block matrices

[Figure: the same matrix, now with a few nonzero entries outside
the blocks; rows such as wiper, tire, V6 fall inside Block 1, while
the rows for car (1 0) and automobile (0 1) have entries that cross
blocks. Likely there’s a good rank-k approximation to this matrix.]




Simplistic picture

[Figure: three separate point clouds in the vector space,
labeled Topic 1, Topic 2, and Topic 3.]




             Some wild extrapolation

• The “dimensionality” of a corpus is the number
  of distinct topics represented in it.
• More mathematical wild extrapolation:
  – if A has a rank k approximation of low Frobenius
    error, then there are no more than k distinct topics
    in the corpus.








         LSI has many other applications
• In many settings in pattern recognition and retrieval,
  we have a feature-object matrix.
   –   For text, the terms are features and the docs are objects.
   –   Could be opinions and users …
   –   This matrix may be redundant in dimensionality.
   –   Can work with low-rank approximation.
   –   If entries are missing (e.g., users’ opinions), can recover if
       dimensionality is low.
• Powerful general analytical technique
   – Close, principled analog to clustering methods.





               Seminar on Text Mining
                     Part Three




Part of Speech (POS) Tagging








Definition of POS

“The process of assigning a part-of-speech or other
lexical class marker to each word in a corpus”
(Jurafsky and Martin)

[Figure: the words “the girl kissed the boy on the cheek”
being mapped to the tags N, V, P, DET.]




                 An Example

    WORD                      LEMMA          TAG


      the                          the       +DET
      girl                         girl      +NOUN
      kissed                       kiss      +VPAST
      the                          the       +DET
      boy                          boy       +NOUN
      on                           on        +PREP
      the                          the       +DET
      cheek                        cheek     +NOUN






Motivation of POS

• Speech synthesis — pronunciation
• Speech recognition — class-based N-grams
• Information retrieval — stemming, selection of high-
  content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography




                        Word Classes
Basic word classes:
      Noun, Verb, Adjective, Adverb, Preposition, …

Open vs. Closed classes
 Open:
      Nouns, Verbs, Adjectives, Adverbs
  Closed:
      determiners: a, an, the
      pronouns: she, he, I
      prepositions: on, under, over, near, by, …







            Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and
  purpose
   – Some tagging approaches (e.g., constraint grammar based)
     make fewer distinctions e.g., conflating prepositions,
     conjunctions, particles
   – Simple morphology = more ambiguity = fewer tags








Word Classes: Tag set example








The Problem

• Words often have more than one word class, e.g. this:
  – This is a nice day = PRP (pronoun)
  – This day is nice = DT (determiner)
  – You can go this far = RB (adverb)




                   Word Class Ambiguity
                         (in the Brown Corpus)

• Unambiguous (1 tag): 35,340
• Ambiguous (2-7 tags): 4,100


                              2 tags           3,760
                              3 tags             264
                              4 tags              61
                              5 tags              12
                              6 tags               2
                              7 tags               1   (Derose, 1988)






POS Tagging Methods

• Stochastic tagger: HMM-based (using the Viterbi algorithm)
• Rule-based tagger: ENGTWOL (ENGlish TWO Level analysis)
• Transformation-based tagger (Brill)




Stochastic Tagging

• Based on the probability of a certain tag occurring, given various
  possibilities
• Requires a training corpus
• No probabilities for words not in the corpus
• Simple method: choose the most frequent tag in the training text for
  each word!
   –   Result: 90% accuracy
   –   Baseline
   –   Others will do better
   –   HMM is an example
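A minimal sketch of this most-frequent-tag baseline, assuming a training corpus of (word, tag) pairs:

    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_corpus):
        """For each word, remember its most frequent tag in the training corpus."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    # Hypothetical toy corpus: "race" is tagged NN twice and VB once.
    model = train_most_frequent_tag([("the", "DT"), ("race", "NN"),
                                     ("race", "VB"), ("race", "NN")])
    assert model["race"] == "NN"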




HMM Tagger

• Intuition: pick the most likely tag for this word.
• HMM taggers choose the tag sequence that maximizes this
  formula:
   – P(word|tag) × P(tag|previous n tags)
• Let T = t1,t2,…,tn and W = w1,w2,…,wn
• Find the POS tags that generate the sequence of words, i.e., look for the
  most probable sequence of tags T underlying the observed
  words W.
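A compact Viterbi sketch of this maximization for a bigram HMM; log_p_emit and log_p_trans are hypothetical user-supplied log-probability functions for P(word|tag) and P(tag|previous tag), with "<s>" as an assumed start symbol:

    def viterbi(words, tags, log_p_emit, log_p_trans):
        """Most probable tag sequence under a bigram HMM."""
        # best[t] = (score, path) for the best tag sequence ending in tag t.
        best = {t: (log_p_trans("<s>", t) + log_p_emit(words[0], t), [t])
                for t in tags}
        for w in words[1:]:
            best = {t: max(((score + log_p_trans(prev, t) + log_p_emit(w, t),
                             path + [t])
                            for prev, (score, path) in best.items()),
                           key=lambda x: x[0])
                    for t in tags}
        return max(best.values(), key=lambda x: x[0])[1]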




                    Rule-Based Tagging
• Basic Idea:
   – Assign all possible tags to words
   – Remove tags according to set of rules of type:

              if word+1 is an adj, adv, or quantifier and the following is
            a sentence boundary and word-1 is not a verb like “consider”
                      then eliminate non-adv else eliminate adv.



   – Typically more than 1000 hand-written rules, but may be machine-
     learned








               Stage 1 of ENGTWOL Tagging
First Stage:
   – Run words through Kimmo-style morphological analyzer to get all
     parts of speech.

Example: Pavlov had shown that salivation …

  Pavlov         PAVLOV N NOM SG PROPER
  had            HAVE V PAST VFIN SVO
                 HAVE PCP2 SVO
  shown          SHOW PCP2 SVOO SVO SV
  that           ADV
                 PRON DEM SG
                 DET CENTRAL DEM SG
                 CS
  salivation     N NOM SG







          Stage 2 of ENGTWOL Tagging

• Second Stage:
   – Apply constraints.
• Constraints used in negative way.
• Example: Adverbial “that” rule
  Given input: “that”
  If
      (+1 A/ADV/QUANT)
      (+2 SENT-LIM)
      (NOT -1 SVOC/A)
  Then eliminate non-ADV tags
  Else eliminate ADV


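A sketch of how such a constraint might be encoded (a hypothetical encoding, not ENGTWOL's actual formalism); candidates is the tag set for “that”, and next1, next2, prev are the tag sets at the surrounding positions:

    def adverbial_that_rule(candidates, next1, next2, prev):
        """Adverbial-'that' constraint, applied in the negative way described above."""
        if (next1 & {"A", "ADV", "QUANT"}        # +1 is adjective/adverb/quantifier
                and "SENT-LIM" in next2          # +2 is a sentence boundary
                and not prev & {"SVOC/A"}):      # -1 is not a verb like "consider"
            return candidates & {"ADV"}          # eliminate non-ADV tags
        return candidates - {"ADV"}              # else eliminate ADV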




            Transformation-Based Tagging
                                   (Brill Tagging)

• Combination of Rule-based and stochastic tagging
  methodologies
   – Like rule-based because rules are used to specify tags in a certain
     environment
   – Like stochastic approach because machine learning is used—with
     tagged corpus as input
• Input:
   – tagged corpus
   – dictionary (with most frequent tags)
         + Usually constructed from the tagged corpus








             Transformation-Based Tagging
                                           (cont.)
• Basic Idea:
   – Set the most probable tag for each word as a start value
   – Change tags according to rules of type “if word-1 is a determiner and word is a
     verb then change the tag to noun” in a specific order


• Training is done on tagged corpus:
   –   Write a set of rule templates
   –   Among the set of rules, find one with highest score
   –   Continue from 2 until lowest score threshold is passed
   –   Keep the ordered set of rules


• Rules make errors that are corrected by later rules






                    TBL Rule Application
• Tagger labels every word with its most-likely tag
   – For example: race has the following probabilities in the
     Brown corpus:
      • P(NN|race) = .98
      • P(VB|race)= .02
• Transformation rules make changes to tags
   – “Change NN to VB when previous tag is TO”
     … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
     becomes
     … is/VBZ expected/VBN to/TO race/VB tomorrow/NN



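A small illustrative sketch of applying one such transformation to a tagged sentence:

    def apply_rule(tagged, from_tag, to_tag, prev_tag):
        """Change from_tag to to_tag wherever the previous tag is prev_tag."""
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == from_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, to_tag)
        return out

    sent = [("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
    assert apply_rule(sent, "NN", "VB", "TO")[1] == ("race", "VB")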




TBL: Rule Learning

• 2 parts to a rule
   – Triggering environment
   – Rewrite rule
• The range of triggering environments of templates (from
  Manning & Schutze 1999:363)

  [Table: nine schemas, each marking which of the neighboring tag
  positions t_{i-3} … t_{i+3} form the triggering environment.]




                   TBL: The Algorithm
• Step 1: Label every word with most likely tag (from
  dictionary)
• Step 2: Check every possible transformation & select one
  which most improves tagging
• Step 3: Re-tag corpus applying the rules
• Repeat 2-3 until some criterion is reached, e.g., X% correct
  with respect to training corpus
• RESULT: Sequence of transformation rules








               TBL: Rule Learning (cont’d)
• Problem: Could apply transformations ad infinitum!
• Constrain the set of transformations with “templates”:
   – Replace tag X with tag Y, provided tag Z or word Z’ appears in some
     position
• Rules are learned in ordered sequence
• Rules may interact.
• Rules are compact and can be inspected by humans








TBL: Problems

• Execution speed: the TBL tagger is slower than the HMM
  approach
   – Solution: compile the rules to a Finite State Transducer (FST)
• Learning speed: Brill’s implementation took over a day (600k
  tokens)




Tagging Unknown Words

• New words are added to (newspaper) language at a rate of 20+
  per month
• Plus many proper names …
• Unknown words increase error rates by 1-2%

• Method 1: assume they are nouns
• Method 2: assume the unknown words have a
  probability distribution similar to words occurring only
  once in the training set.
• Method 3: use morphological information, e.g., words
  ending with –ed tend to be tagged VBN.




Evaluation

• The result is compared with a manually coded “Gold
  Standard”
   – Typically accuracy reaches 96-97%
   – This may be compared with the result for a baseline tagger (one that
     uses no context).
• Important: 100% is impossible even for human annotators.

• Factors that affect the performance:
   –   The amount of training data available
   –   The tag set
   –   The difference between training corpus and test corpus
   –   Dictionary
   –   Unknown words




              Seminar on Text Mining
                    Part Four




Information Extraction (IE)








                                Definition
• An Information Extraction system generally converts
  unstructured text into a form that can be loaded into a
  database.








Information Retrieval vs. Information Extraction

• While
  information retrieval deals with the problem of
  finding relevant documents in a collection,
  information extraction identifies useful (relevant) text
  in a document.

 Useful information is defined as a text segment and its
 associated attributes.




An Example

• Query:
  – List the news reports of car bombings in Basra and
    surrounding areas between June and December 2004.
  Answering this query is difficult with an information-
    retrieval system alone.
  To answer such queries, we need additional semantic
    information to identify text segments that refer to an
    attribute.




             Elements Extracted from Text
• There are four basic types of elements that can be
  extracted from text
   – Entities: The basic building blocks that can be found in text documents.
     e.g. people, companies, locations, drugs

   – Attributes: features of the extracted entities.
     e.g. title of a person, age of person, type of an organization

   – Facts: The relations that exist between entities.
     e.g. relationship between a person and a company

   – Events: an activity or occurrence of interest in which entities participate.
     e.g. terrorist act, a merger between two companies







                           IE Applications

•   E-Recruitment
•   Extracting sales information
•   Intelligence collection for news articles
•   Message Understanding (MU)








Named Entity Recognition (NER)

• NER can be viewed as a classification problem in which
  words are assigned to one or more semantic classes.
• The same methods we used to assign POS tags to words can be
  applied here.
• Unlike POS tags, not every word is associated with a semantic
  class.
• Like POS taggers, we can train an entity extractor to find
  entities in text using a tagged data set.
• Decision trees, HMMs, and rule-based methods can be applied
  to the classification task.




Problems of NER

• Unknown words: they are difficult to categorize
• Finding the exact boundary of an entity
• Polysemy and synonymy – methods used for WSD are
  applicable here.




Architecture of an IE System

•  Extraction of tokens and tags
•  Semantic analysis: a partial parser is usually sufficient
•  Extractor: we look at domain-specific entities (e.g., a weather DB)
•  Merging multiple references to the same entity: finding a
   single canonical form
•  Template generation: a template contains a list of slots (fields)

[Pipeline: Text → Tokenization and tagging (tokens, POS tags) →
Sentence Analysis (POS groups) → Extractor (assigned entities) →
Merging (combined entities) → Template Generation]




IE tools

• Fastus
  – Finite State Automaton Text Understanding System
• Rapier
  – Robust Automated Production of Information Extraction Rules




Fastus

• It is based on a series of finite-state machines to solve specific
  problems for each stage of the IE pipeline.

• A Finite-State Machine (FSM) accepts a regular language, which
  can be described by a regular expression.

• A regular expression (regex) represents a string pattern.

• Regexes are used in IE to identify text segments that match some
  predefined pattern.
• An FSM applies a pattern to a window of text and transitions from
  one state to another until the pattern matches or fails to match.
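A toy sketch of regex-based extraction in this spirit (the pattern and event type are invented for illustration):

    import re

    # Hypothetical pattern: "<Company> acquired <Company>" events.
    ACQUISITION = re.compile(
        r"(?P<buyer>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) acquired "
        r"(?P<target>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)")

    m = ACQUISITION.search("Last week Acme Corp acquired Widget Works for $2M.")
    if m:
        print(m.group("buyer"), "->", m.group("target"))  # Acme Corp -> Widget Works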




Stages of Fastus

• In the first stage, composite words and proper nouns
  are extracted, e.g. “set up”, “carry out”.

[Pipeline: Text → Stage 1 (complex words) → Stage 2 (basic phrases)
→ Stage 3 (complex phrases) → Stage 4 (event structures) →
Stage 5 (merged structures)]




             Seminar on Text Mining
                   Part Five




   Clustering Documents








                       What is clustering?

• Clustering: the process of grouping a set of objects into classes
  of similar objects
   – Documents within a cluster should be similar.
   – Documents from different clusters should be dissimilar.


• The commonest form of unsupervised learning
   – Unsupervised learning = learning from raw data, as opposed to
     supervised data where a classification of examples is given
   – A common and important task that finds many applications in IR and
     other places








Applications of clustering in IR

• Whole corpus analysis/navigation (Scatter/Gather)
   – Better user interface: search without typing
• For improving recall in search applications
   – Better search results
• For better navigation of search results
   – Effective “user recall” will be higher
• For speeding up vector space retrieval
   – Cluster-based retrieval gives faster search




Google News: automatic clustering gives an effective
          news presentation metaphor








1. Scatter/Gather: Cutting, Karger, and Pedersen








         2. For improving search recall
• Cluster hypothesis - Documents in the same cluster behave
  similarly with respect to relevance to information needs
• Therefore, to improve search recall:
   – Cluster docs in corpus a priori
   – When a query matches a doc D, also return other docs in the
     cluster containing D


• Hope if we do this: The query “car” will also return docs
  containing automobile
   – Because clustering grouped together docs containing car with
     those containing automobile.







     3. For better navigation of search results

• For grouping search results thematically








         What makes docs “related”?

• Ideal: semantic similarity.
• Practical: statistical similarity
   – We will use cosine similarity.
   – Docs as vectors.
   – For many algorithms, easier to think in terms of a
     distance (rather than similarity) between docs.
   – We will use Euclidean distance.




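A short scikit-learn sketch (an illustration, assuming scikit-learn is available) of docs as tf-idf vectors with pairwise cosine similarity:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the car is fast", "the automobile is quick", "saturn has rings"]
    X = TfidfVectorizer().fit_transform(docs)  # each doc becomes a tf-idf vector
    print(cosine_similarity(X))                # 3 x 3 matrix of pairwise similarities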




                   Clustering Algorithms

• Flat algorithms
   – Usually start with a random (partial) partitioning
   – Refine it iteratively
      • K means clustering
      • (Model based clustering)
• Hierarchical algorithms
   – Bottom-up, agglomerative
   – (Top-down, divisive)








                 Hard vs. soft clustering
• Hard clustering: Each document belongs to exactly one cluster
   – More common and easier to do
• Soft clustering: A document can belong to more than one
  cluster.
   – Makes more sense for applications like creating browsable
     hierarchies
   – You may want to put a pair of sneakers in two clusters: (i) sports
     apparel and (ii) shoes
   – You can only do that with a soft clustering approach.








                  Partitioning Algorithms

• Partitioning method: Construct a partition of n documents into
  a set of K clusters

• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen
  partitioning criterion
   – Globally optimal: exhaustively enumerate all partitions
   – Effective heuristic methods: K-means and K-medoids algorithms








K-Means

• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or
  mean) of points in a cluster c:

    μ(c) = (1 / |c|) Σ_{x ∈ c} x

• Reassignment of instances to clusters is based on distance
  to the current cluster centroids.




K-Means Algorithm

Select K random docs {s1, s2, … sK} as seeds.

Until clustering converges or another stopping criterion is met:

   For each doc di:
     Assign di to the cluster cj such that dist(di, sj) is minimal.
   (Update the seeds to the centroid of each cluster)
   For each cluster cj:
       sj = μ(cj)
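A plain numpy sketch of this loop (illustrative; a real system would use an optimized library implementation):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """K-means on an (n_docs, n_features) array X."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seed docs
        for _ in range(iters):
            # Assign each doc to the nearest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its assigned docs.
            new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):  # centroids unchanged: converged
                break
            centroids = new
        return labels, centroids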




               Termination conditions
• Several possibilities, e.g.,
   – A fixed number of iterations.
   – Doc partition unchanged.
   – Centroid positions don’t change.








Seed Choice

• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or
  convergence to sub-optimal clusterings.
   – Select good seeds using a heuristic (e.g., the
     doc least similar to any existing mean)
   – Try out multiple starting points
   – Initialize with the results of another method.

[Example showing sensitivity to seeds: for the six points A–F in
the figure, starting with B and E as centroids converges to
{A,B,C} and {D,E,F}; starting with D and F converges to
{A,B,D,E} and {C,F}.]




                   How Many Clusters?
• Number of clusters K is given
   – Partition n docs into predetermined number of clusters
• Finding the “right” number of clusters is part of the
  problem
   – Given docs, partition into an “appropriate” number of subsets.
   – E.g., for query results - ideal value of K not known up front -
     though UI may impose limits.
• Can usually take an algorithm for one flavor and convert to
  the other.








           K not specified in advance
• Given a clustering, define the Benefit for a doc to be
  the cosine similarity to its centroid

• Define the Total Benefit to be the sum of the
  individual doc Benefits.








                 Penalize lots of clusters
• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total Cost is KC.
• Define the Value of a clustering to be =

   Total Benefit - Total Cost.


• Find the clustering of highest value, over all choices of K.
   – Total benefit increases with increasing K. But can stop when it
     doesn’t increase by “much”. The Cost term enforces this.




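A sketch of this criterion, assuming a fixed cost C per cluster and cosine similarity as the Benefit:

    import numpy as np

    def clustering_value(X, labels, centroids, C):
        """Value = Total Benefit - K * C, where a doc's Benefit is the cosine
        similarity to its cluster centroid."""
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        total_benefit = sum(cos(x, centroids[l]) for x, l in zip(X, labels))
        return total_benefit - C * len(centroids)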




Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram)
  from a set of documents.

    animal
    ├── vertebrate: fish, reptile, amphibian, mammal
    └── invertebrate: worm, insect, crustacean

• One approach: recursive application of a partitional
  clustering algorithm.




Dendrogram: Hierarchical Clustering

• A clustering is obtained by cutting
  the dendrogram at a desired
  level: each connected
  component forms a cluster.




 Hierarchical Agglomerative Clustering (HAC)
• Starts with each doc in a separate cluster
   – then repeatedly joins the closest pair of clusters, until
     there is only one cluster.
• The history of merging forms a binary tree or
  hierarchy.








               Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
   – Similarity of the most cosine-similar (single-link)
• Complete-link
   – Similarity of the “furthest” points, the least cosine-similar
• Centroid
   – Clusters whose centroids (centers of gravity) are the most cosine-
     similar
• Average-link
   – Average cosine between pairs of elements








     Closest pair of clusters








Single Link Agglomerative Clustering

• Use maximum similarity of pairs:

    sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)

• Can result in “straggly” (long and thin) clusters
  due to chaining effect.
• After merging ci and cj, the similarity of the
  resulting cluster to another cluster, ck, is:

    sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))




         Single Link Example








Complete Link Agglomerative Clustering

• Use minimum similarity of pairs:

    sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)

• Makes “tighter,” spherical clusters that are typically
  preferable.
• After merging ci and cj, the similarity of the resulting
  cluster to another cluster, ck, is:

    sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
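Both linkage variants are available off the shelf; a brief SciPy sketch (illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).random((6, 4))      # 6 docs, 4 features
    Z = linkage(X, method="complete")                 # or "single", "average"
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters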




      Complete Link Example








Group Average Agglomerative Clustering

• Similarity of two clusters = average similarity of all
  pairs within the merged cluster:

    sim(ci, cj) = 1 / (|ci ∪ cj| (|ci ∪ cj| − 1)) ·
                  Σ_{x ∈ (ci ∪ cj)} Σ_{y ∈ (ci ∪ cj), y ≠ x} sim(x, y)

• Compromise between single and complete link.
• Two options:
   – Averaged across all ordered pairs in the merged cluster
   – Averaged over all pairs between the two original clusters
• No clear difference in efficacy




Computing Group Average Similarity

• Always maintain the sum of vectors in each cluster:

    s(cj) = Σ_{x ∈ cj} x

• Compute similarity of clusters in constant time:

    sim(ci, cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ]
                  / [ (|ci| + |cj|) (|ci| + |cj| − 1) ]
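A sketch of the constant-time update, assuming all document vectors are unit-normalized (so x·x = 1):

    import numpy as np

    def group_average_sim(sum_i, sum_j, n_i, n_j):
        """Group-average similarity of two clusters from their vector sums."""
        s = sum_i + sum_j          # s(ci) + s(cj)
        n = n_i + n_j              # |ci| + |cj|
        return (s @ s - n) / (n * (n - 1))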




             Seminar on Text Mining
                    Part Six




Text Categorization (TC)








Approaches to TC

There are two main approaches to TC:
• Knowledge Engineering (KE)
   – The main drawback of the KE approach is what might be called the
     knowledge acquisition bottleneck: the huge amount of highly
     skilled labor and expert knowledge required to create and maintain
     the knowledge-encoding rules


• Machine Learning
   – Requires only a set of manually classified training instances that
     are much less costly to produce.




Applications of TC

Three common TC applications are:
• Text indexing
• Document sorting and text filtering
• Web page categorization




Text Indexing (TI)

• The task of assigning keywords from a controlled
  vocabulary to text documents is called TI. If the keywords
  are viewed as categories, then TI is an instance of the general
  TC problem.




Document sorting and text filtering

• Examples:
   – In a newspaper, the classified ads may need to be categorized
     into “Personal”, “Car Sales”, “Real Estate”
   – Emails can be sorted into categories such as “Complaints”,
     “Deals”, “Job applications”
• The text filtering activity can be seen as document sorting
  with only two bins – the “relevant” and “irrelevant” docs.




Web page categorization

• A common use of TC is the automatic classification of
  Web pages under the hierarchical catalogues posted by
  popular Internet portals such as Yahoo.

• Whenever the number of docs in a category exceeds k, it
  should be split into two or more subcategories.

• Web docs contain links, which may be an important
  source of information for the classifier because linked docs
  often share semantics.




Definition of the Problem

• The general text categorization task can be formally
  defined as the task of approximating an unknown category
  assignment function

    F : D × C → {0, 1}

• where D is the set of all possible docs and C is the set of
  predefined categories.
• The value of F(d, c) is 1 if the document d belongs to
  the category c and 0 otherwise.
• The approximating function M : D × C → {0, 1} is called a
  classifier, and the task is to build a classifier that produces
  results as “close” as possible to the true category
  assignment function F.




Types of Categorization

• Single-label versus multilabel categorization
   – In multilabel categorization the categories overlap, and a document
     may belong to any number of categories.
• Document-pivoted versus category-pivoted categorization
   – The difference is significant only in the case in which not all docs or
     not all categories are immediately available.
• Hard versus soft categorization
   – Fully automated vs. semiautomated




Machine Learning Approaches to TC

•   Decision tree classifiers
•   Naïve Bayes (probabilistic classifier)
•   K-nearest neighbor classification
•   Rocchio methods
•   Decision rule classifiers
•   Neural networks
•   Support vector machines
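As a quick illustration of the machine learning approach (a sketch with invented toy data; Naïve Bayes is one of the classifiers listed above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["cheap car for sale", "apartment to rent downtown",
            "used automobile, good price", "sunny flat, two rooms"]
    labels = ["Car Sales", "Real Estate", "Car Sales", "Real Estate"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["quick car, low price"]))  # expected: ['Car Sales']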




References

• Books
  –   Introduction to Information Retrieval (2008)
  –   Managing Gigabytes (1999)
  –   The Text Mining Handbook
  –   Text Mining Application Programming
  –   Web Data Mining




References

• Power Points
  –   Introduction to Information Retrieval (2008)
  –   Text Mining Application Programming
  –   Web Data Mining
  –   Word classes and part of speech tagging
       •   Rada Mihalcea. Note: some of the material in this slide set was adapted from Chris Brew’s (OSU) slides on part-of-speech tagging.

Introduction to Metadata for IDAH Fellows
 
Chapter 09
Chapter 09Chapter 09
Chapter 09
 
Mongo DB for Java, Python and PHP Developers
Mongo DB for Java, Python and PHP DevelopersMongo DB for Java, Python and PHP Developers
Mongo DB for Java, Python and PHP Developers
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
1 _text_mining_v0a
1  _text_mining_v0a1  _text_mining_v0a
1 _text_mining_v0a
 

Mehr von Hadi Mohammadzadeh (8)

TitleFinder Extracting the Headline of News Web Pages
TitleFinder Extracting the Headline of News Web PagesTitleFinder Extracting the Headline of News Web Pages
TitleFinder Extracting the Headline of News Web Pages
 
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
 
Webist2012 presentation
Webist2012 presentationWebist2012 presentation
Webist2012 presentation
 
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Docu...
Improving Retrieval Accuracy  in Main Content Extraction  from  HTML Web Docu...Improving Retrieval Accuracy  in Main Content Extraction  from  HTML Web Docu...
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Docu...
 
Accurate Main Content Extraction from Persian HTML Files
Accurate Main Content Extraction from Persian HTML FilesAccurate Main Content Extraction from Persian HTML Files
Accurate Main Content Extraction from Persian HTML Files
 
Main Content Extraction from Persian HTML Files
Main Content Extraction from Persian HTML FilesMain Content Extraction from Persian HTML Files
Main Content Extraction from Persian HTML Files
 
Information filtering, By Hadi Mohammadzadeh
Information filtering, By Hadi MohammadzadehInformation filtering, By Hadi Mohammadzadeh
Information filtering, By Hadi Mohammadzadeh
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Text mining, By Hadi Mohammadzadeh

  • 1. Seminar on Text Mining. By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm – 15 Dec. 2009
  • 2. Seminar on Text Mining. Outline – Basics – Latent Semantic Indexing – Part of Speech (POS) Tagging – Information Extraction – Clustering Documents – Text Categorization
  • 3. Seminar on Text Mining. Part One: Basics
  • 4. Definition: Text Mining • Text Mining can be defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. • Text Mining seeks to extract useful information from data sources (document collections) through the identification and exploration of interesting patterns.
  • 5. Similarities between Data Mining and Text Mining • Both types of systems rely on: – Preprocessing routines – Pattern-discovery algorithms – Presentation-layer elements such as visualization tools
  • 6. Preprocessing Operations in Data Mining and Text Mining • In Data Mining, data are assumed to be stored in a structured format, so preprocessing focuses on scrubbing and normalizing data and on creating extensive numbers of table joins. • In Text Mining, preprocessing operations center on the identification and extraction of representative features for natural-language documents, to transform unstructured data stored in document collections into a more explicitly structured intermediate format.
  • 7. Weakly Structured and Semistructured Docs • Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure are referred to as free-format or weakly structured docs (such as most scientific research papers, business reports, and news stories). • Documents with extensive and consistent format elements in which field-type metadata can be more easily inferred are described as semistructured docs (such as some e-mail, HTML web pages, PDF files).
  • 8. Document Features • Although many potential features can be employed to represent docs, the following four types are most commonly used: – Characters – Words – Terms – Concepts • High Feature Dimensionality (HFD) – Problems relating to HFD are typically of much greater magnitude in TM systems than in classic DM systems. • Feature Sparsity – Only a small percentage of all possible features for a document collection as a whole appear in any single document.
  • 9. Representational Model of a Document • An essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole. We refer to such a set of features as the representational model of a document.
  • 10. Character-Level Representations • Without positional information – often of very limited utility in TM applications. • With positional information – somewhat more useful and common (e.g. bigrams or trigrams). • Disadvantage: character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.
  • 11. Word-Level Representations • Without positional information – often of very limited utility in TM applications. • With positional information – somewhat more useful and common (e.g. bigrams or trigrams). • Disadvantage: word-based representations likewise leave the feature space for a document fairly unoptimized for some types of text processing techniques.
  • 12. Term-Level Representations • Normalized terms come out of a term-extraction methodology – a sequence of one or more tokenized and lemmatized words. • What is a term-extraction methodology?
  • 13. Concept-Level Representations • Concepts are features generated for a document by means of a manual, statistical, rule-based, or hybrid categorization methodology.
  • 14. General Architecture of Text Mining Systems – Abstract Level • A text mining system takes raw docs as input and generates various types of output, such as: – Patterns – Maps of connections – Trends. [Diagram: Documents (input) → Patterns, Connections, Trends (output).]
  • 15. General Architecture of Text Mining Systems – Functional Level • TM systems follow the general model provided by some classic DM applications and are thus divisible into 4 main areas: – Preprocessing tasks – Core mining operations – Presentation-layer components and browsing functionality – Refinement techniques
  • 16. System Architecture for a Generic Text Mining System
  • 17. System Architecture for a Domain-oriented Text Mining System
  • 18. System Architecture for an Advanced Text Mining System with a Background Knowledge Base
  • 19. Seminar on Text Mining. Part Two: Latent Semantic Indexing (LSI)
  • 20. Problems with Lexical Semantics • Ambiguity and association in natural language – Polysemy: words often have a multitude of meanings and different types of usage, such as bank (more severe in very heterogeneous collections). – The vector space model is unable to discriminate between different meanings of the same word.
  • 21. Problems with Lexical Semantics – Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic). – No associations between words are made in the vector space representation. – The problem of synonymy may be solved with LSI.
  • 22. Polysemy and Context • Document similarity on the single-word level: polysemy and context. [Diagram: two senses of saturn — meaning 1: planet (ring, jupiter, space, voyager); meaning 2: car company (dodge, ford); a term contributes to similarity only when used in the matching sense.]
  • 23. Latent Semantic Indexing – Introduction • Problem: the first frequency-based indexing methods did not utilize any global relationships within the document collection. • Solution: LSI is an indexing method based on the Singular Value Decomposition (SVD). • How: SVD transforms the word-document matrix such that the major intrinsic associative patterns in the collection are revealed.
  • 24. Latent Semantic Indexing – Introduction • Main advantage: it does not depend on individual words to locate documents, but rather uses the concept or topic to find relevant docs. • Usage: when a researcher submits a query, it is transformed to the LSI space and compared with other docs in the same space.
  • 25. Singular Value Decomposition • For an M × N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) A = U Σ V^T, where U is M × M, Σ is M × N, and V is N × N. • The columns of U are orthogonal eigenvectors of AA^T; the columns of V are orthogonal eigenvectors of A^TA. • The eigenvalues λ_1 … λ_r of AA^T are also the eigenvalues of A^TA, and σ_i = √λ_i with Σ = diag(σ_1 … σ_r) holding the singular values.
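A minimal sketch of this factorization using NumPy (numpy.linalg.svd returns U, the singular values, and V^T directly); the small term-document matrix here is an invented example:

    import numpy as np

    # Toy term-document matrix: 4 terms x 3 documents (invented example)
    A = np.array([[1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U diag(s) V^T
    # Rebuild A to confirm the factorization: pad diag(s) out to M x N
    Sigma = np.zeros(A.shape)
    Sigma[:len(s), :len(s)] = np.diag(s)
    assert np.allclose(A, U @ Sigma @ Vt)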
  • 26. Singular Value Decomposition • Illustration of SVD dimensions and sparseness
  • 27. Low-rank Approximation • Solution via SVD: set the smallest r − k singular values to zero, giving A_k = U diag(σ_1, …, σ_k, 0, …, 0) V^T. • In column notation this is a sum of k rank-1 matrices: A_k = Σ_{i=1}^{k} σ_i u_i v_i^T.
  • 28. Reduced SVD • If we retain only k singular values, and set the rest to 0, then we don’t need the dropped matrix parts. • Then Σ is k × k, U is M × k, V^T is k × N, and A_k is M × N. • This is referred to as the reduced SVD. • It is the convenient (space-saving) and usual form for computational applications. • It’s what Matlab gives you.
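Continuing the sketch above (same A, U, s, Vt), the reduced rank-k SVD simply keeps the first k singular triplets:

    k = 2
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # M x k, length k, k x N
    Ak = Uk @ np.diag(sk) @ Vtk                # best rank-k approximation of A
    # The Frobenius error of this approximation equals sigma_{k+1} (next slide)
    print(np.linalg.norm(A - Ak, 'fro'), s[k] if k < len(s) else 0.0)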
  • 29. Approximation Error • How good (bad) is this approximation? • It’s the best possible, measured by the Frobenius norm of the error: min_{X : rank(X) = k} ‖A − X‖_F = ‖A − A_k‖_F = σ_{k+1}, where the σ_i are ordered such that σ_i ≥ σ_{i+1}. • This suggests why the Frobenius error drops as k is increased.
  • 30. SVD Low-rank Approximation • Whereas the term-doc matrix A may have M = 50000, N = 10 million (and rank close to 50000), we can construct an approximation A_100 with rank 100. – Of all rank-100 matrices, it would have the lowest Frobenius error. • Great … but why would we? • Answer: Latent Semantic Indexing.
  • 31. Latent Semantic Indexing (LSI) • Perform a low-rank approximation of the document-term matrix (typical rank 100-300). • General idea – Map documents (and terms) to a low-dimensional representation. – Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space). – Compute document similarity based on the inner product in this latent semantic space.
  • 32. Goals of LSI • Similar terms map to similar locations in the low-dimensional space. • Noise reduction by dimension reduction.
  • 33. Latent Semantic Analysis • Latent semantic space: illustrating example courtesy of Susan Dumais
  • 34. Performing the Maps • Each row and column of A gets mapped into the k-dimensional LSI space by the SVD. • Claim – this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval. • A query q is also mapped into this space, by q_k = q^T U_k Σ_k^{-1}. – The mapped query is NOT a sparse vector.
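A sketch of this fold-in step, continuing the toy matrix above (Uk, sk, Vtk from the reduced-SVD snippet); representing documents as the rows of V_k Σ_k is one common convention, not the only one:

    q = np.array([1.0, 0.0, 1.0, 0.0])   # query vector over the 4 toy terms
    qk = q @ Uk @ np.diag(1.0 / sk)      # q_k = q^T U_k Sigma_k^{-1}
    docs_k = Vtk.T * sk                  # rows: documents in LSI space (V_k Sigma_k)
    # Rank documents by cosine similarity to the folded-in query
    sims = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
    print(sims.argsort()[::-1])          # document indices, best match first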
  • 35. But why is this clustering? • We’ve talked about docs, queries, retrieval and precision here. • What does this have to do with clustering? • Intuition: dimension reduction through LSI brings together “related” axes in the vector space.
  • 36. Intuition from block matrices • [Diagram: an M terms × N documents matrix consisting of homogeneous non-zero blocks Block 1 … Block k, with 0’s elsewhere.] What’s the rank of this matrix?
  • 37. Intuition from block matrices • Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.
  • 38. Intuition from block matrices • What’s the best rank-k approximation to this matrix, where the blocks are the non-zero entries?
  • 39. Intuition from block matrices • Likely there’s a good rank-k approximation to this matrix. [Diagram: blocks with few nonzero entries, e.g. terms wiper, tire, V6 in one block, with rows car = (1 0) and automobile = (0 1) that a rank-k approximation would bring together.]
  • 40. Simplistic picture. [Diagram: Topic 1, Topic 2, Topic 3 as separated clusters.]
  • 41. Some wild extrapolation • The “dimensionality” of a corpus is the number of distinct topics represented in it. • More mathematical wild extrapolation: if A has a rank-k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus.
  • 42. LSI has many other applications • In many settings in pattern recognition and retrieval, we have a feature-object matrix. – For text, the terms are features and the docs are objects. – Could be opinions and users … – This matrix may be redundant in dimensionality. – Can work with a low-rank approximation. – If entries are missing (e.g., users’ opinions), they can be recovered if the dimensionality is low. • Powerful general analytical technique – a close, principled analog to clustering methods.
  • 43. Seminar on Text Mining. Part Three: Part of Speech (POS) Tagging
  • 44. Definition of POS Tagging • “The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin). Example: the/DET girl/N kissed/V the/DET boy/N on/P the/DET cheek/N.
  • 45. An Example • WORD – LEMMA – TAG: the – the – +DET; girl – girl – +NOUN; kissed – kiss – +VPAST; the – the – +DET; boy – boy – +NOUN; on – on – +PREP; the – the – +DET; cheek – cheek – +NOUN.
  • 46. Motivation for POS Tagging • Speech synthesis — pronunciation • Speech recognition — class-based N-grams • Information retrieval — stemming, selection of high-content words • Word-sense disambiguation • Corpus analysis of language & lexicography
  • 47. Word Classes • Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, … • Open vs. closed classes – Open: nouns, verbs, adjectives, adverbs – Closed: determiners (a, an, the), pronouns (she, he, I), prepositions (on, under, over, near, by, …)
  • 48. Word Classes: Tag Sets • Vary in number of tags: a dozen to over 200. • The size of a tag set depends on the language, objectives and purpose. – Some tagging approaches (e.g., constraint-grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, particles. – Simple morphology = more ambiguity = fewer tags.
  • 49. Word Classes: Tag Set Example
  • 50. The Problem • Words often have more than one word class, e.g. this: – This is a nice day = PRP – This day is nice = DT (determiner) – You can go this far = RB (adverb)
  • 51. Word Class Ambiguity (in the Brown Corpus) • Unambiguous (1 tag): 35,340 • Ambiguous (2-7 tags): 4,100 — 2 tags: 3,760; 3 tags: 264; 4 tags: 61; 5 tags: 12; 6 tags: 2; 7 tags: 1. (DeRose, 1988)
  • 52. POS Tagging Methods • Stochastic tagger: HMM-based (using the Viterbi algorithm) • Rule-based tagger: ENGTWOL (ENGlish TWO Level analysis) • Transformation-based tagger (Brill)
  • 53. Stochastic Tagging • Based on the probability of a certain tag occurring given various possibilities. • Requires a training corpus. • No probabilities for words not in the corpus. • Simple method: choose the most frequent tag in the training text for each word! – Result: 90% accuracy – a baseline; others will do better. – HMM is an example.
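A minimal sketch of this most-frequent-tag baseline, assuming the training corpus is given as (word, tag) pairs; the function names are illustrative:

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        # tagged_corpus: iterable of (word, tag) pairs
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        # For each word keep its single most frequent tag
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_baseline(words, model, default='NN'):
        # Unknown words get a default tag (cf. the later slide: assume noun)
        return [(w, model.get(w, default)) for w in words]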
  • 54. HMM Tagger • Intuition: pick the most likely tag for this word. • HMM taggers choose the tag sequence that maximizes P(word|tag) × P(tag|previous n tags). • Let T = t_1,t_2,…,t_n and W = w_1,w_2,…,w_n. • Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.
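A compact Viterbi sketch for the bigram case (n = 1 previous tag), assuming the emission table P(word|tag), the transition table P(tag|previous tag), and start probabilities have already been estimated from a tagged corpus; all names are illustrative, and unseen words get a tiny floor probability:

    def viterbi(words, tags, p_emit, p_trans, p_start):
        # p_emit[t][w] = P(w|t); p_trans[s][t] = P(t|s); p_start[t] = P(t) at position 0
        V = [{t: p_start[t] * p_emit[t].get(words[0], 1e-9) for t in tags}]
        back = []
        for w in words[1:]:
            col, ptr = {}, {}
            for t in tags:
                prev = max(tags, key=lambda s: V[-1][s] * p_trans[s][t])
                col[t] = V[-1][prev] * p_trans[prev][t] * p_emit[t].get(w, 1e-9)
                ptr[t] = prev
            V.append(col)
            back.append(ptr)
        # Follow back-pointers from the best final tag
        best = max(tags, key=lambda t: V[-1][t])
        path = [best]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))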
  • 55. Rule-Based Tagging • Basic idea: – Assign all possible tags to words. – Remove tags according to a set of rules of the type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider”, then eliminate non-adv, else eliminate adv. – Typically more than 1000 hand-written rules, but they may be machine-learned.
  • 56. Stage 1 of ENGTWOL Tagging • First stage: run words through a Kimmo-style morphological analyzer to get all parts of speech. Example: Pavlov had shown that salivation … Pavlov: PAVLOV N NOM SG PROPER; had: HAVE V PAST VFIN SVO, HAVE PCP2 SVO; shown: SHOW PCP2 SVOO SVO SV; that: ADV, PRON DEM SG, DET CENTRAL DEM SG, CS; salivation: N NOM SG.
  • 57. Stage 2 of ENGTWOL Tagging • Second stage: apply constraints. • Constraints are used in a negative way. • Example: adverbial “that” rule. Given input “that”: If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A), then eliminate non-ADV tags, else eliminate ADV.
  • 58. Transformation-Based Tagging (Brill Tagging) • A combination of rule-based and stochastic tagging methodologies – like rule-based because rules are used to specify tags in a certain environment – like the stochastic approach because machine learning is used, with a tagged corpus as input. • Input: – tagged corpus – dictionary (with most frequent tags), usually constructed from the tagged corpus.
  • 59. Transformation-Based Tagging (cont.) • Basic idea: – Set the most probable tag for each word as a start value. – Change tags according to rules of the type “if word-1 is a determiner and word is a verb, then change the tag to noun”, applied in a specific order. • Training is done on a tagged corpus: 1. Write a set of rule templates. 2. Among the candidate rules, find the one with the highest score. 3. Continue from 2 until a lowest-score threshold is passed. 4. Keep the ordered set of rules. • Rules make errors that are corrected by later rules.
  • 60. TBL Rule Application • The tagger labels every word with its most-likely tag – for example, race has the following probabilities in the Brown corpus: P(NN|race) = .98, P(VB|race) = .02. • Transformation rules then make changes to tags – “Change NN to VB when the previous tag is TO”: … is/VBZ expected/VBN to/TO race/NN tomorrow/NN becomes … is/VBZ expected/VBN to/TO race/VB tomorrow/NN.
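A sketch of applying one such transformation to an already-tagged sentence; the rule and tags follow the race example on this slide, and the helper name is illustrative:

    def apply_rule(tagged, old_tag, new_tag, prev_tag):
        # "Change old_tag to new_tag when the previous tag is prev_tag"
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == old_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, new_tag)
        return out

    sent = [('to', 'TO'), ('race', 'NN'), ('tomorrow', 'NN')]
    print(apply_rule(sent, 'NN', 'VB', 'TO'))
    # [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]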
  • 61. TBL: Rule Learning • 2 parts to a rule: – a triggering environment – a rewrite rule. • The triggering environments of the templates span the tag positions t_{i-3} … t_{i+3} around the word being tagged; each of the nine schemas tests one position or span within that window (from Manning & Schütze 1999:363).
  • 62. TBL: The Algorithm • Step 1: label every word with its most likely tag (from the dictionary). • Step 2: check every possible transformation & select the one which most improves the tagging. • Step 3: re-tag the corpus applying the rules. • Repeat steps 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus. • RESULT: a sequence of transformation rules.
  • 63. TBL: Rule Learning (cont’d) • Problem: could apply transformations ad infinitum! • Constrain the set of transformations with “templates”: – replace tag X with tag Y, provided tag Z or word Z’ appears in some position. • Rules are learned in an ordered sequence. • Rules may interact. • Rules are compact and can be inspected by humans.
  • 64. TBL: Problems • Execution speed: a TBL tagger is slower than the HMM approach. – Solution: compile the rules to a Finite State Transducer (FST). • Learning speed: Brill’s implementation takes over a day (600k tokens).
  • 65. Tagging Unknown Words • New words are added to (newspaper) language at 20+ per month, plus many proper names … • Unknown words increase error rates by 1-2%. • Method 1: assume they are nouns. • Method 2: assume the unknown words have a probability distribution similar to words occurring only once in the training set. • Method 3: use morphological information, e.g., words ending with –ed tend to be tagged VBN.
  • 66. Evaluation • The result is compared with a manually coded “Gold Standard”. – Typically accuracy reaches 96-97%. – This may be compared with the result for a baseline tagger (one that uses no context). • Important: 100% is impossible even for human annotators. • Factors that affect the performance: – the amount of training data available – the tag set – the difference between training corpus and test corpus – the dictionary – unknown words.
  • 67. Seminar on Text Mining. Part Four: Information Extraction (IE)
  • 68. Definition • An Information Extraction system generally converts unstructured text into a form that can be loaded into a database.
  • 69. Information Retrieval vs. Information Extraction • While information retrieval deals with the problem of finding relevant documents in a collection, information extraction identifies useful (relevant) text in a document. Useful information is defined as a text segment and its associated attributes.
  • 70. An Example • Query: list the news reports of car bombings in Basra and surrounding areas between June and December 2004. Answering this query is difficult with an information-retrieval system alone. To answer such queries, we need additional semantic information to identify the text segments that refer to an attribute.
  • 71. Elements Extracted from Text • There are four basic types of elements that can be extracted from text: – Entities: the basic building blocks that can be found in text documents, e.g. people, companies, locations, drugs. – Attributes: features of the extracted entities, e.g. the title of a person, the age of a person, the type of an organization. – Facts: the relations that exist between entities, e.g. a relationship between a person and a company. – Events: an activity or occurrence of interest in which entities participate, e.g. a terrorist act, a merger between two companies.
  • 72. IE Applications • E-Recruitment • Extracting sales information • Intelligence collection from news articles • Message Understanding (MU)
  • 73. Named Entity Recognition (NER) • NER can be viewed as a classification problem in which words are assigned to one or more semantic classes. • The same methods we used to assign POS tags to words can be applied here. • Unlike POS tags, not every word is associated with a semantic class. • Like POS taggers, we can train an entity extractor to find entities in text using a tagged data set. • Decision trees, HMMs, and rule-based methods can be applied to the classification task.
  • 74. Problems of NER • Unknown words: they are difficult to categorize. • Finding the exact boundary of an entity. • Polysemy and synonymy – methods used for WSD are applicable here.
  • 75. Architecture of an IE System • Extraction of tokens and tags. • Semantic analysis: a partial parser is usually sufficient. • Extractor: we look at domain-specific entities (e.g. a weather DB). • Merging multiple references to the same entity: finding a single canonical form. • Template generation: a template contains a list of slots (fields). [Pipeline diagram: Tokenization and tagging → Sentence analysis → Extractor → Merging → Template generation.]
  • 76. IE Tools • Fastus – Finite State Automaton Text Understanding System • Rapier – Robust Automated Production of Information Extraction Rules
  • 77. Fastus • It is based on a series of finite-state machines that solve specific problems at each stage of the IE pipeline. • A finite-state machine (FSM) recognizes a regular language, which can be described by regular expressions. • A regular expression (regex) represents a string pattern. • Regexes are used in IE to identify text segments that match some predefined pattern. • An FSM applies a pattern to a window of text and transitions from one state to another until the pattern matches or fails to match.
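A small illustration of this pattern-matching step with Python’s re module; the pattern is an invented example of a simple monetary-amount entity, not FASTUS’s actual grammar:

    import re

    # Invented example: match simple monetary amounts such as "$12.5 million"
    MONEY = re.compile(r'\$\d+(?:\.\d+)?(?:\s+(?:thousand|million|billion))?')

    text = "The company paid $12.5 million for the plant and $300 thousand in fees."
    print(MONEY.findall(text))  # ['$12.5 million', '$300 thousand']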
  • 78. Stages of Fastus • In the first stage, composite words and proper nouns are extracted, e.g. “set up”, “carry out”. [Pipeline diagram: Text → Stage 1: Complex Words → Stage 2: Basic Phrases → Stage 3: Complex Phrases → Stage 4: Event Structures → Stage 5: Merged Structures.]
  • 79. Seminar on Text Mining. Part Five: Clustering Documents
  • 80. What is clustering? • Clustering: the process of grouping a set of objects into classes of similar objects. – Documents within a cluster should be similar. – Documents from different clusters should be dissimilar. • The commonest form of unsupervised learning. – Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given. – A common and important task that finds many applications in IR and other places.
  • 81. Applications of clustering in IR • Whole-corpus analysis/navigation (Scatter/Gather) – better user interface: search without typing. • For improving recall in search applications – better search results. • For better navigation of search results – effective “user recall” will be higher. • For speeding up vector space retrieval – cluster-based retrieval gives faster search.
  • 82. Google News: automatic clustering gives an effective news presentation metaphor
  • 83. 1. Scatter/Gather: Cutting, Karger, and Pedersen
  • 84. 2. For improving search recall • Cluster hypothesis – documents in the same cluster behave similarly with respect to relevance to information needs. • Therefore, to improve search recall: – cluster docs in the corpus a priori – when a query matches a doc D, also return other docs in the cluster containing D. • Hope if we do this: the query “car” will also return docs containing automobile – because clustering grouped together docs containing car with those containing automobile.
  • 85. 3. For better navigation of search results • For grouping search results thematically
  • 86. What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity. – We will use cosine similarity. – Docs as vectors. – For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs. – We will use Euclidean distance.
  • 87. Clustering Algorithms • Flat algorithms – usually start with a random (partial) partitioning – refine it iteratively • K-means clustering • (Model-based clustering) • Hierarchical algorithms – bottom-up, agglomerative – (top-down, divisive)
  • 88. Hard vs. soft clustering • Hard clustering: each document belongs to exactly one cluster. – More common and easier to do. • Soft clustering: a document can belong to more than one cluster. – Makes more sense for applications like creating browsable hierarchies. – You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes. – You can only do that with a soft clustering approach.
  • 89. Partitioning Algorithms • Partitioning method: construct a partition of n documents into a set of K clusters. • Given: a set of documents and the number K. • Find: a partition into K clusters that optimizes the chosen partitioning criterion. – Globally optimal: exhaustively enumerate all partitions. – Effective heuristic methods: K-means and K-medoids algorithms.
  • 90. K-Means • Assumes documents are real-valued vectors. • Clusters are based on the centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x∈c} x. • Reassignment of instances to clusters is based on distance to the current cluster centroids.
  • 91. K-Means Algorithm • Select K random docs {s_1, s_2, … s_K} as seeds. • Until the clustering converges or another stopping criterion is met: – for each doc d_i, assign d_i to the cluster c_j such that dist(x_i, s_j) is minimal; – then update the seeds to the centroid of each cluster: for each cluster c_j, s_j = μ(c_j).
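A direct NumPy sketch of this loop, using Euclidean distance as on the “what makes docs related” slide; the document-vector matrix X and K are assumed given:

    import numpy as np

    def kmeans(X, K, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]  # K random docs as seeds
        for _ in range(iters):
            # Assign each doc to its nearest centroid
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update each seed to the centroid (mean) of its cluster
            new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centroids[j] for j in range(K)])
            if np.allclose(new, centroids):   # centroids unchanged: converged
                break
            centroids = new
        return labels, centroids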
  • 92. Termination conditions • Several possibilities, e.g.: – a fixed number of iterations – doc partition unchanged – centroid positions don’t change.
  • 93. Seed Choice • Results can vary based on random seed selection. • Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. – Select good seeds using a heuristic (e.g., the doc least similar to any existing mean). – Try out multiple starting points. – Initialize with the results of another method. • Example showing sensitivity to seeds: with docs {A, …, F}, starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.
  • 94. How Many Clusters? • Number of clusters K is given – partition n docs into a predetermined number of clusters. • Finding the “right” number of clusters is part of the problem – given docs, partition them into an “appropriate” number of subsets. – E.g., for query results the ideal value of K is not known up front, though the UI may impose limits. • Can usually take an algorithm for one flavor and convert it to the other.
  • 95. K not specified in advance • Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid. • Define the Total Benefit to be the sum of the individual doc Benefits.
  • 96. Penalize lots of clusters • For each cluster, we have a Cost C. • Thus for a clustering with K clusters, the Total Cost is KC. • Define the Value of a clustering to be Total Benefit − Total Cost. • Find the clustering of highest value, over all choices of K. – Total Benefit increases with increasing K, but we can stop when it doesn’t increase by “much”; the Cost term enforces this.
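A sketch of this selection rule on top of the kmeans function above; the per-cluster cost C is a tuning constant you must supply, and nonzero document vectors are assumed so the cosines are defined:

    def best_k(X, k_max, C=1.0):
        # Value(K) = Total Benefit (sum of cosines to centroids) - K * C
        best = None
        for K in range(1, k_max + 1):
            labels, cents = kmeans(X, K)
            num = (X * cents[labels]).sum(axis=1)
            den = np.linalg.norm(X, axis=1) * np.linalg.norm(cents[labels], axis=1)
            value = (num / den).sum() - K * C
            if best is None or value > best[0]:
                best = (value, K)
        return best[1]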
  • 97. Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g. animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean). • One approach: recursive application of a partitional clustering algorithm.
  • 98. Dendrogram: Hierarchical Clustering • A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
  • 99. Hierarchical Agglomerative Clustering (HAC) • Starts with each doc in a separate cluster – then repeatedly joins the closest pair of clusters, until there is only one cluster. • The history of merging forms a binary tree or hierarchy.
  • 100. Closest pair of clusters • There are many variants of defining the closest pair of clusters: – Single-link: similarity of the most cosine-similar pair of points. – Complete-link: similarity of the “furthest” points, i.e. the least cosine-similar pair. – Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar. – Average-link: average cosine between pairs of elements.
  • 101. Closest pair of clusters
  • 102. Single Link Agglomerative Clustering • Uses the maximum similarity of pairs: sim(c_i, c_j) = max_{x∈c_i, y∈c_j} sim(x, y). • Can result in “straggly” (long and thin) clusters due to the chaining effect. • After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is: sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k)).
  • 103. Single Link Example
  • 104. Complete Link Agglomerative Clustering • Uses the minimum similarity of pairs: sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y). • Makes “tighter”, spherical clusters that are typically preferable. • After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is: sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k)).
  • 105. Complete Link Example
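In practice these variants are available off the shelf; a sketch with SciPy (scipy.cluster.hierarchy.linkage supports 'single', 'complete', and 'average'), reusing the document-vector matrix X assumed in the K-means sketch:

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    D = pdist(X, metric='cosine')        # condensed distance matrix over doc vectors
    Z = linkage(D, method='complete')    # or 'single' / 'average'
    labels = fcluster(Z, t=4, criterion='maxclust')  # cut the dendrogram into 4 clusters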
  • 106. Group Average Agglomerative Clustering • Similarity of two clusters = average similarity of all pairs within the merged cluster: sim(c_i, c_j) = (1 / (|c_i ∪ c_j| (|c_i ∪ c_j| − 1))) Σ_{x∈(c_i∪c_j)} Σ_{y∈(c_i∪c_j), y≠x} sim(x, y). • A compromise between single and complete link. • Two options: – averaged across all ordered pairs in the merged cluster – averaged over all pairs between the two original clusters. • No clear difference in efficacy.
  • 107. Computing Group Average Similarity • Always maintain the sum of vectors in each cluster: s(c_j) = Σ_{x∈c_j} x. • Compute the similarity of clusters in constant time: sim(c_i, c_j) = ((s(c_i) + s(c_j)) · (s(c_i) + s(c_j)) − (|c_i| + |c_j|)) / ((|c_i| + |c_j|)(|c_i| + |c_j| − 1)).
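A sketch of this constant-time update, assuming the cluster sums are NumPy vectors and the documents are unit-length so that dot products are cosine similarities (the usual setting for this formula):

    def group_average_sim(sum_i, n_i, sum_j, n_j):
        # s = s(c_i) + s(c_j); n = |c_i| + |c_j|
        s = sum_i + sum_j
        n = n_i + n_j
        # (s . s - n) / (n (n - 1)): subtracting n removes the self-similarities x . x = 1
        return (s @ s - n) / (n * (n - 1))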
  • 108. Seminar on Text Mining. Part Six: Text Categorization (TC)
  • 109. Approaches to TC • There are two main approaches to TC: • Knowledge engineering – its main drawback is what might be called the knowledge acquisition bottleneck: the huge amount of highly skilled labor and expert knowledge required to create and maintain the knowledge-encoding rules. • Machine learning – requires only a set of manually classified training instances, which are much less costly to produce.
  • 110. Applications of TC • Three common TC applications are: • Text indexing • Document sorting and text filtering • Web page categorization
  • 111. Text Indexing (TI) • The task of assigning keywords from a controlled vocabulary to text documents is called TI. If the keywords are viewed as categories, then TI is an instance of the general TC problem.
  • 112. Document sorting and text filtering • Examples: – In a newspaper, the classified ads may need to be categorized into “Personal”, “Car Sales”, “Real Estate”. – Emails can be sorted into categories such as “Complaints”, “Deals”, “Job applications”. • The text filtering activity can be seen as document sorting with only two bins: the “relevant” and the “irrelevant” docs.
  • 113. Web page categorization • A common use of TC is the automatic classification of Web pages under the hierarchical catalogues posted by popular Internet portals such as Yahoo. • Whenever the number of docs in a category exceeds k, it should be split into two or more subcategories. • Web docs contain links, which may be an important source of information for the classifier, because linked docs often share semantics.
  • 114. Definition of the Problem • The general text categorization task can be formally defined as the task of approximating an unknown category assignment function F : D × C → {0, 1}, where D is the set of all possible docs and C is the set of predefined categories. • The value of F(d, c) is 1 if the document d belongs to the category c, and 0 otherwise. • The approximating function M : D × C → {0, 1} is called a classifier, and the task is to build a classifier that produces results as “close” as possible to the true category assignment function F.
  • 115. Types of Categorization • Single-label versus multilabel categorization – in multilabel categorization the categories overlap, and a document may belong to any number of categories. • Document-pivoted versus category-pivoted categorization – the difference is significant only in the case in which not all docs or not all categories are immediately available. • Hard versus soft categorization – fully automated versus semiautomated.
  • 116. Machine Learning Approaches to TC • Decision tree classifiers • Naïve Bayes (probabilistic classifier) • K-nearest neighbor classification • Rocchio methods • Decision rule classifiers • Neural networks • Support vector machines
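As one concrete instance of this list, a minimal scikit-learn sketch of a Naïve Bayes text classifier; the training texts and labels are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["cheap deal buy now", "meeting agenda attached", "job application enclosed"]
    labels = ["spam", "work", "work"]   # placeholder training data

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)
    print(clf.predict(["please find my job application"]))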
  • 117. References • Books – Introduction to Information Retrieval (2008) – Managing Gigabytes (1999) – The Text Mining Handbook – Text Mining Application Programming – Web Data Mining
  • 118. References • Slide sets – Introduction to Information Retrieval (2008) – Text Mining Application Programming – Web Data Mining – Word classes and part-of-speech tagging (Rada Mihalcea). Note: some of the material in this slide set was adapted from Chris Brew’s (OSU) slides on part-of-speech tagging.