Seminar on
Text Mining
By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 15 Dec. 2009
Seminar on Text Mining
Outline
– Basics
– Latent Semantic Indexing
– Part of Speech (POS) Tagging
– Information Extraction
– Clustering Documents
– Text Categorization
Seminar on Text Mining
Part One
Basics
Definition: Text Mining
• Text Mining can be defined as a knowledge-intensive process
in which a user interacts with a document collection over time
by using a suite of analysis tools.
And
• Text Mining seeks to extract useful information from data
sources (document collections) through the identification and
exploration of interesting patterns.
Similarities between
Data Mining and Text Mining
• Both types of systems rely on:
– Preprocessing routines
– Pattern-discovery algorithms
– Presentation-layer elements such as visualization tools
Preprocessing Operations
in
Data Mining and Text Mining
• In Data Mining, the data is assumed to be
– stored in a structured format,
so preprocessing focuses on scrubbing and normalizing data
and on creating extensive numbers of table joins
• In Text Mining, preprocessing operations center on
– the identification and extraction of representative features for
natural-language documents,
to transform unstructured data stored in doc collections
into a more explicitly structured intermediate format
Weakly Structured and Semistructured Docs
Documents
– that have relatively little in the way of strong
• typographical, layout, or markup indicators
to denote structure are referred to as free-format or
weakly structured docs (such as most scientific research papers,
business reports, and news stories)
– With extensive and consistent format elements in
which field-type metadata can be more easily
inferred are described as semistructured docs (such as
some e-mail, HTML web pages, PDF files)
Document Features
• Although many potential features can be employed to
represent docs, the following four types are most commonly
used:
– Characters
– Words
– Terms
– Concepts
• High Feature Dimensionality (HFD)
– Problems relating to HFD are typically of much greater magnitude in
TM systems than in classic DM systems.
• Feature Sparsity
– Only a small percentage of all possible features for a document
collection as a whole appears in any single document.
Representational Model of a Document
• An essential task for most text mining systems is
The identification of a simplified subset of document features
that can be used to represent a particular document as
a whole.
We refer to such a set of features as the
representational model of a document
Character-level Representation
• Without Positional Information
– Are often of very limited utility in TM applications
• With Positional Information
– Are somewhat more useful and common (e.g.
bigrams or trigrams)
• Disadvantage:
– Character-based representations can often be unwieldy for
some types of text processing techniques because
the feature space for a document is fairly unoptimized
Word-level Representation
• Without Positional Information
– Are often of very limited utility in TM applications
• With Positional Information
– Are somewhat more useful and common(e.g.
bigrams or trigrams)
• Disadvantage:
– Word-based representations can often be unwieldy for
some types of text processing techniques because
the feature space for a document is fairly unoptimized
Term-level Representation
• Normalized terms come out of a term-extraction
methodology
– sequences of one or more tokenized and lemmatized words
• What is a term-extraction methodology?
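As a rough illustration of such a methodology, a hedged sketch using NLTK (an assumption; the slides name no toolkit, and NLTK's 'punkt' and 'wordnet' data must be downloaded once):

```python
import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('punkt'); nltk.download('wordnet')  # one-time setup
lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The companies were acquiring startups")
# Lemmatize each token (treating tokens as verbs here, for illustration).
terms = [lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens]
print(terms)  # e.g., ['the', 'companies', 'be', 'acquire', 'startups']
```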
Concept-level Representation
• Concepts are features generated for a document by means
of manual, statistical, rule-based, or hybrid categorization
methodology
General Architecture of Text Mining Systems
Abstract Level
• A text mining system takes raw documents as input and
generates various types of output such as:
– Patterns
– Maps of connections
– Trends
[Diagram: Documents (input) → text mining system → patterns, connections, trends (output)]
General Architecture of Text Mining Systems
Functional Level
• TM systems follow the general model provided by some classic
DM applications and are thus divisible into 4 main areas
– Preprocessing Tasks
– Core mining operations
– Presentation layer components and browsing functionality
– Refinement techniques
System Architecture for Generic
Text Mining System
System Architecture for Domain-oriented
Text Mining System
System Architecture for an advanced Text Mining System
with background knowledge base
Seminar on Text Mining
Part Two
Latent Semantic Indexing (LSI)
Problems with Lexical Semantics
• Ambiguity and association in natural language
– Polysemy: Words often have a multitude of meanings
and different types of usage such as bank (more severe
in very heterogeneous collections).
– The vector space model is unable to discriminate
between different meanings of the same word.
Problems with Lexical Semantics
– Synonymy: Different terms may have an
identical or a similar meaning (weaker:
words indicating the same topic).
– No associations between words are made in
the vector space representation.
– The problem of synonymy may be addressed with
LSI
Polysemy and Context
• Document similarity on the single-word level is affected by
polysemy and context.
[Figure: the word “saturn” in its first meaning relates to ring, jupiter,
space, voyager, planet; in its second meaning to car, company, dodge,
ford. A shared word contributes to similarity if used in the 1st meaning
in both documents, but not if in the 2nd.]
Latent Semantic Indexing
Introduction
• Problem: frequency-based indexing methods do not utilize
any global relationships within the
docs collection
• Solution: LSI is an indexing method based on the
Singular Value Decomposition (SVD)
• How: SVD transforms the word-document matrix such
that the major intrinsic associative patterns in the
collection are revealed
Latent Semantic Indexing
Introduction
• Main advantage: it does not depend on individual words to
locate documents, but rather uses the concept or topic
to find relevant docs
• Usage: when a researcher submits a query, it is
transformed into the LSI space and compared with other
docs in the same space
Singular Value Decomposition
For an M × N matrix A of rank r there exists a factorization
(Singular Value Decomposition = SVD) as follows:
$A = U \Sigma V^T$
where $U$ is $M \times M$, $\Sigma$ is $M \times N$, and $V$ is $N \times N$.
The columns of $U$ are orthogonal eigenvectors of $AA^T$.
The columns of $V$ are orthogonal eigenvectors of $A^TA$.
The eigenvalues $\lambda_1 \ldots \lambda_r$ of $AA^T$ are also the eigenvalues of $A^TA$.
$\sigma_i = \sqrt{\lambda_i}$, and $\Sigma = \mathrm{diag}(\sigma_1 \ldots \sigma_r)$ holds the singular values.
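As a quick sanity check of the factorization, a minimal NumPy sketch on a made-up matrix:

```python
import numpy as np

# Toy 4x3 "term-document" matrix (made-up counts, for illustration only).
A = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 0., 2.],
              [0., 1., 0.]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U Sigma V^T

# Build the M x N Sigma from the singular values.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(A, U @ Sigma @ Vt)  # the factorization holds
print("singular values:", s)
```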
Singular Value Decomposition
• Illustration of SVD dimensions and sparseness
Low-rank Approximation
• Solution via SVD: set the smallest $r-k$ singular values to zero:
$A_k = U\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\,V^T$
• In column notation, $A_k$ is a sum of $k$ rank-1 matrices:
$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$
Reduced SVD
• If we retain only k singular values, and set the rest to 0, then
we don’t need the matrix parts in red
• Then $\Sigma$ is $k \times k$, $U$ is $M \times k$, $V^T$ is $k \times N$, and $A_k$ is $M \times N$
• This is referred to as the reduced SVD
• It is the convenient (space-saving) and usual form for
computational applications
• It’s what Matlab gives you
Approximation error
• How good (bad) is this approximation?
• It’s the best possible, measured by the Frobenius
norm of the error:
$\min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$
where the $\sigma_i$ are ordered such that $\sigma_i \ge \sigma_{i+1}$.
(Measured by the 2-norm instead, the error is exactly $\sigma_{k+1}$.)
Suggests why the Frobenius error drops as $k$ is increased.
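A small NumPy check of the claim on a toy random matrix (pure illustration); the reduced SVD of the previous slide makes the truncation a simple slice:

```python
import numpy as np

A = np.random.default_rng(0).random((6, 4))       # made-up matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # reduced ("economy") SVD

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation

fro_err = np.linalg.norm(A - Ak, 'fro')
assert np.isclose(fro_err, np.sqrt((s[k:] ** 2).sum()))  # Frobenius error
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])       # 2-norm error = sigma_{k+1}
```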
SVD Low-rank approximation
• Whereas the term-doc matrix A may have M=50000,
N=10 million (and rank close to 50000)
• We can construct an approximation $A_{100}$ with rank 100.
– Of all rank 100 matrices, it would have the lowest Frobenius
error.
• Great … but why would we??
• Answer: Latent Semantic Indexing
Latent Semantic Indexing (LSI)
• Perform a low-rank approximation of document-
term matrix (typical rank 100-300)
• General idea
– Map documents (and terms) to a low-dimensional
representation.
– Design a mapping such that the low-dimensional space
reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product
in this latent semantic space
Goals of LSI
• Similar terms map to similar location in
low dimensional space
• Noise reduction by dimension reduction
Latent Semantic Analysis
• Latent semantic space: illustrating example
courtesy of Susan Dumais
Performing the maps
• Each row and column of A gets mapped into the
k-dimensional LSI space, by the SVD.
• Claim – this is not only the mapping with the best
(Frobenius error) approximation to A, but in fact
improves retrieval.
• A query q is also mapped into this space, by
$q_k = q^T U_k \Sigma_k^{-1}$
– The mapped query is NOT a sparse vector.
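Putting the pieces together, a minimal end-to-end LSI sketch; the term-document matrix and query are invented, and the fold-in follows the formula above:

```python
import numpy as np

# Made-up term-document counts: rows = terms, columns = docs.
A = np.array([[1., 1., 0., 0.],   # "car"
              [0., 1., 1., 0.],   # "automobile"
              [0., 0., 1., 1.],   # "insurance"
              [1., 0., 0., 1.]])  # "driver"

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk_inv = U[:, :k], np.diag(1.0 / s[:k])

docs_k = Vt[:k, :].T            # each row: one doc in the k-dim LSI space

q = np.array([1., 0., 0., 0.])  # query containing only "car"
q_k = q @ Uk @ Sk_inv           # fold the query into LSI space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank docs by cosine similarity in the latent space.
print([round(cos(q_k, d), 3) for d in docs_k])
```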
But why is this clustering?
• We’ve talked about docs, queries, retrieval and
precision here.
• What does this have to do with clustering?
• Intuition: Dimension reduction through LSI
brings together “related” axes in the vector
space.
Intuition from block matrices
[Figure: an M-terms × N-documents matrix arranged as k homogeneous
non-zero blocks (Block 1 … Block k) on the diagonal, with 0's elsewhere.]
What's the rank of this matrix?
Intuition from block matrices
[Figure: the same block-diagonal M-terms × N-documents matrix.]
Vocabulary partitioned into k topics (clusters); each doc
discusses only one topic.
Intuition from block matrices
[Figure: the same block-diagonal matrix of non-zero entries.]
What's the best rank-k approximation to this matrix?
Intuition from block matrices
Likely there’s a good rank-k
approximation to this matrix.
wiper
tire Block 1
V6
Few nonzero entries
Block 2
…
Few nonzero entries
Block k
car 10
automobile 0 1
Simplistic picture
[Figure: documents scattered in three clusters, one per topic: Topic 1, Topic 2, Topic 3.]
Some wild extrapolation
• The “dimensionality” of a corpus is the number
of distinct topics represented in it.
• More mathematical wild extrapolation:
– if A has a rank k approximation of low Frobenius
error, then there are no more than k distinct topics
in the corpus.
LSI has many other applications
• In many settings in pattern recognition and retrieval,
we have a feature-object matrix.
– For text, the terms are features and the docs are objects.
– Could be opinions and users …
– This matrix may be redundant in dimensionality.
– Can work with low-rank approximation.
– If entries are missing (e.g., users’ opinions), can recover if
dimensionality is low.
• Powerful general analytical technique
– Close, principled analog to clustering methods.
Seminar on Text Mining
Part Three
Part of Speech (POS) Tagging
Definition of POS
“The process of assigning a part-of-speech or other
lexical class marker to each word in a corpus”
(Jurafsky and Martin)
[Figure: the words the, girl, kissed, the, boy, on, the, cheek
mapped onto the tag set {N, V, P, DET}.]
An Example
WORD LEMMA TAG
the the +DET
girl girl +NOUN
kissed kiss +VPAST
the the +DET
boy boy +NOUN
on on +PREP
the the +DET
cheek cheek +NOUN
Motivation of POS
• Speech synthesis — pronunciation
• Speech recognition — class-based N-grams
• Information retrieval — stemming, selecting high-
content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography
Word Classes
Basic word classes:
Noun, Verb, Adjective, Adverb, Preposition, …
Open vs. Closed classes
Open:
Nouns, Verbs, Adjectives, Adverbs
Closed:
determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, …
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and
purpose
– Some tagging approaches (e.g., constraint grammar based)
make fewer distinctions e.g., conflating prepositions,
conjunctions, particles
– Simple morphology = more ambiguity = fewer tags
The Problem
• Words often have more than one word class, e.g., this:
– This is a nice day = PRP (pronoun)
– This day is nice = DT (determiner)
– You can go this far = RB (adverb)
Word Class Ambiguity
(in the Brown Corpus)
• Unambiguous (1 tag): 35,340
• Ambiguous (2-7 tags): 4,100
2 tags 3,760
3 tags 264
4 tags 61
5 tags 12
6 tags 2
7 tags 1 (DeRose, 1988)
POS Tagging Methods
• Stochastic Tagger: HMM-based(Using Viterbi Algorithm)
• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
• Transformation-Based Tagger (Brill)
Stochastic Tagging
• Based on the probability of a certain tag occurring, given the various
possibilities
• Requires a training corpus
• No probabilities for words not in corpus.
• Simple Method: Choose most frequent tag in training text for
each word!
– Result: 90% accuracy
– Baseline
– Others will do better
– HMM is an example
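A minimal sketch of this most-frequent-tag baseline, with an invented toy training set standing in for a real tagged corpus:

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus of (word, tag) pairs.
train = [("the", "DET"), ("girl", "N"), ("kissed", "V"), ("the", "DET"),
         ("boy", "N"), ("races", "N"), ("races", "V"), ("races", "N")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word):
    # Most frequent tag seen in training; back off to N for unknown words.
    return counts[word].most_common(1)[0][0] if word in counts else "N"

print([baseline_tag(w) for w in ["the", "races", "horse"]])
# ['DET', 'N', 'N']
```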
HMM Tagger
• Intuition: Pick the most likely tag for this word.
• HMM Taggers choose tag sequence that maximizes this
formula:
– P(word|tag) × P(tag|previous n tags)
• Let T = t1,t2,…,tn
Let W = w1,w2,…,wn
• Find POS tags that generate a sequence of words, i.e., look for
most probable sequence of tags T underlying the observed
words W.
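A hedged Viterbi sketch for such a bigram HMM (one previous tag); every probability below is invented for illustration, not estimated from a corpus:

```python
import math

tags = ["DET", "N", "V"]
trans = {("<s>", "DET"): .6, ("<s>", "N"): .3, ("<s>", "V"): .1,
         ("DET", "N"): .9, ("DET", "V"): .05, ("DET", "DET"): .05,
         ("N", "V"): .5, ("N", "N"): .3, ("N", "DET"): .2,
         ("V", "DET"): .6, ("V", "N"): .3, ("V", "V"): .1}
emit = {("the", "DET"): .7, ("the", "N"): .001, ("dog", "N"): .4,
        ("dog", "V"): .01, ("barks", "V"): .3, ("barks", "N"): .01}

def viterbi(words):
    # best[t] = (log-prob of the best tag path ending in t, that path)
    best = {t: (math.log(trans.get(("<s>", t), 1e-12)) +
                math.log(emit.get((words[0], t), 1e-12)), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max(((lp + math.log(trans.get((prev, t), 1e-12)) +
                         math.log(emit.get((w, t), 1e-12)), path + [t])
                        for prev, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # expected: ['DET', 'N', 'V']
```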
Rule-Based Tagging
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to set of rules of type:
if word+1 is an adj, adv, or quantifier and the following is
a sentence boundary and word-1 is not a verb like “consider”
then eliminate non-adv else eliminate adv.
– Typically more than 1000 hand-written rules, but may be machine-
learned
Stage 1 of ENGTWOL Tagging
First Stage:
– Run words through Kimmo-style morphological analyzer to get all
parts of speech.
Example: Pavlov had shown that salivation …
Pavlov PAVLOV N NOM SG PROPER
had HAVE V PAST VFIN SVO
HAVE PCP2 SVO
shown SHOW PCP2 SVOO SVO SV
that ADV
PRON DEM SG
DET CENTRAL DEM SG
CS
salivation N NOM SG
Stage 2 of ENGTWOL Tagging
• Second Stage:
– Apply constraints.
• Constraints are used in a negative way.
• Example: Adverbial “that” rule
Given input: “that”
If
(+1 A/ADV/QUANT)
(+2 SENT-LIM)
(NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
Transformation-Based Tagging
(Brill Tagging)
• Combination of Rule-based and stochastic tagging
methodologies
– Like rule-based because rules are used to specify tags in a certain
environment
– Like stochastic approach because machine learning is used—with
tagged corpus as input
• Input:
– tagged corpus
– dictionary (with most frequent tags)
+ Usually constructed from the tagged corpus
Transformation-Based Tagging
(cont.)
• Basic Idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of type “if word-1 is a determiner and word is a
verb then change the tag to noun” in a specific order
• Training is done on a tagged corpus:
1. Write a set of rule templates
2. Among the set of rules, find the one with the highest score
3. Continue from step 2 until a lowest-score threshold is passed
4. Keep the ordered set of rules
• Rules make errors that are corrected by later rules
TBL Rule Application
• Tagger labels every word with its most-likely tag
– For example: race has the following probabilities in the
Brown corpus:
• P(NN|race) = .98
• P(VB|race)= .02
• Transformation rules make changes to tags
– “Change NN to VB when previous tag is TO”
… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
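A toy sketch of applying one such transformation; the rule and tags mirror the example above, though this is in no way Brill's actual implementation:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    # "Change from_tag to to_tag when the previous tag is prev_tag."
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
        ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sent, "NN", "VB", "TO"))
# ('race', 'NN') becomes ('race', 'VB'); ('tomorrow', 'NN') is unchanged.
```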
TBL: Rule Learning
• 2 parts to a rule
– Triggering environment
– Rewrite rule
• The range of triggering environments of templates (from
Manning & Schütze 1999: 363):
[Table: nine schemas, each marking which neighboring tag positions
among t_{i-3} … t_{i+3} form the triggering environment for the tag t_i.]
TBL: The Algorithm
• Step 1: Label every word with most likely tag (from
dictionary)
• Step 2: Check every possible transformation & select one
which most improves tagging
• Step 3: Re-tag corpus applying the rules
• Repeat 2-3 until some criterion is reached, e.g., X% correct
with respect to training corpus
• RESULT: Sequence of transformation rules
TBL: Rule Learning (cont’d)
• Problem: Could apply transformations ad infinitum!
• Constrain the set of transformations with “templates”:
– Replace tag X with tag Y, provided tag Z or word Z’ appears in some
position
• Rules are learned in ordered sequence
• Rules may interact.
• Rules are compact and can be inspected by humans
TBL: Problems
• Execution Speed: TBL tagger is slower than HMM
approach
– Solution: compile the rules to a Finite State Transducer (FST)
• Learning Speed: Brill’s implementation over a day (600k
tokens)
Tagging Unknown Words
• New words are added to (newspaper) language at 20+ per
month
• Plus many proper names …
• Increases error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a
probability distribution similar to words only occurring
once in the training set.
• Method 3: Use morphological information, e.g., words
ending with –ed tend to be tagged VBN.
Evaluation
• The result is compared with a manually coded “Gold
Standard”
– Typically accuracy reaches 96-97%
– This may be compared with result for a baseline tagger (one that uses
no context).
• Important: 100% is impossible even for human annotators.
• Factors that affect the performance
– The amount of training data available
– The tag set
– The difference between training corpus and test corpus
– Dictionary
– Unknown words
Seminar on Text Mining
Part Four
Information Extraction (IE)
Definition
• An Information Extraction system generally converts
unstructured text into a form that can be loaded into a
database.
Information Retrieval vs. Information Extraction
• While
information retrieval deals with the problem of
finding relevant documents in a collection,
information extraction identifies useful (relevant) text
in a document.
Useful information is defined as a text segment and its
associated attributes.
An Example
• Query:
– List the news reports of car bombings in Basra and
surrounding areas between June and December 2004.
Answering this query is difficult with an information-
retrieval system alone.
To answer such queries, we need additional semantic
information to identify text segments that refer to an
attribute
Elements Extracted from Text
• There are four basic types of elements that can be
extracted from text
– Entities: The basic building blocks that can be found in text documents.
e.g. people, companies, locations, drugs
– Attributes: features of the extracted entities.
e.g. title of a person, age of person, type of an organization
– Facts: The relations that exist between entities.
e.g. relationship between a person and a company
– Events: an activity or occurrence of interest in which entities participate.
e.g. terrorist act, a merger between two companies
IE Applications
• E-Recruitment
• Extracting sales information
• Intelligence collection for news articles
• Message Understanding (MU)
Named Entity Recognition (NER)
• NER can be viewed as a classification problem in which
words are assigned to one or more semantic classes.
• The same methods we used to assign POS tags to words can be
applied here.
• Unlike POS tags, not every word is associated with a semantic
class.
• Like POS taggers, we can train an entity extractor to find
entities in text using a tagged data set.
• Decision Trees, HMM, and rule-based methods can be applied
to the classification task.
Problems of NER
• Unknown words: they are difficult to categorize
• Finding the exact boundary of an entity
• Polysemy and synonymy: methods used for WSD are
applicable here.
Architecture of an IE System
• Extraction of tokens and tags
• Semantic analysis: a partial parser is usually sufficient
• Extractor: looks for domain-specific entities (e.g., for a weather DB)
• Merging multiple references to the same entity: finding a
single canonical form
• Template generation: a template contains a list of slots (fields)
[Pipeline: Text → tokenization and tagging (tokens, POS tags) → sentence
analysis (groups) → extractor (assigned entities) → merging (combined
entities) → template generation]
IE tools
• Fastus
– Finite State Automaton Text Understanding System
• Rapier
– Robust Automated Production of Information Extraction Rules
Fastus
• It is based on a series of finite-state machines to solve specific
problems for each stage of the IE pipeline.
• A finite-state machine (FSM) generates a regular language, which
can be described by regular expressions.
• A regular expression (regex) represents a string pattern.
• Regexes are used in IE to identify text segments that match some
predefined pattern.
• An FSM applies a pattern to a window of text and transitions from
one state to another until the pattern matches or fails to match.
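A small illustration of the kind of regex matching these stages rely on; the patterns and sample text are invented:

```python
import re

text = "Profits rose to $4.5 million in March 2004, said ACME Corp."

# Invented patterns: money amounts and Month-Year dates.
money = re.compile(r"\$\d+(?:\.\d+)?\s+(?:million|billion)")
date = re.compile(r"(?:January|February|March|April|May|June|July|August|"
                  r"September|October|November|December)\s+\d{4}")

print(money.findall(text))  # ['$4.5 million']
print(date.findall(text))   # ['March 2004']
```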
Stages of Fastus
• In the first stage, composite words and proper nouns
are extracted. e.g. “set up” ,”carry out”
[Pipeline: Text → Stage 1 (complex words) → Stage 2 (basic phrases) →
Stage 3 (complex phrases) → Stage 4 (event structures) → Stage 5
(merged structures)]
Seminar on Text Mining
Part Five
Clustering Documents
What is clustering?
• Clustering: the process of grouping a set of objects into classes
of similar objects
– Documents within a cluster should be similar.
– Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
– Unsupervised learning = learning from raw data, as opposed to
supervised data where a classification of examples is given
– A common and important task that finds many applications in IR and
other places
Applications of clustering in IR
• Whole corpus analysis/navigation (Scatter/Gather)
– Better user interface: search without typing
• For improving recall in search applications
– Better search results
• For better navigation of search results
– Effective “user recall” will be higher
• For speeding up vector space retrieval
– Cluster-based retrieval gives faster search
Google News: automatic clustering gives an effective
news presentation metaphor
2. For improving search recall
• Cluster hypothesis - Documents in the same cluster behave
similarly with respect to relevance to information needs
• Therefore, to improve search recall:
– Cluster docs in corpus a priori
– When a query matches a doc D, also return other docs in the
cluster containing D
• Hope if we do this: The query “car” will also return docs
containing automobile
– Because clustering grouped together docs containing car with
those containing automobile.
3. For better navigation of search results
• For grouping search results thematically
What makes docs “related”?
• Ideal: semantic similarity.
• Practical: statistical similarity
– We will use cosine similarity.
– Docs as vectors.
– For many algorithms, easier to think in terms of a
distance (rather than similarity) between docs.
– We will use Euclidean distance.
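Both measures on two made-up term-count vectors:

```python
import numpy as np

d1 = np.array([2., 0., 1., 3.])  # made-up term counts for doc 1
d2 = np.array([1., 1., 0., 2.])  # made-up term counts for doc 2

cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
euclidean = np.linalg.norm(d1 - d2)
print(cosine, euclidean)
```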
Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
Hard vs. soft clustering
• Hard clustering: Each document belongs to exactly one cluster
– More common and easier to do
• Soft clustering: A document can belong to more than one
cluster.
– Makes more sense for applications like creating browsable
hierarchies
– You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
– You can only do that with a soft clustering approach.
Partitioning Algorithms
• Partitioning method: Construct a partition of n documents into
a set of K clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen
partitioning criterion
– Globally optimal: exhaustively enumerate all partitions
– Effective heuristic methods: K-means and K-medoids algorithms
K-Means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or
mean) of points in a cluster, c:
$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
• Reassignment of instances to clusters is based on distance
to the current cluster centroids.
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges or another stopping criterion holds:
For each doc di:
Assign di to the cluster cj whose seed sj minimizes dist(di, sj).
(Update the seeds to the centroid of each cluster:)
For each cluster cj:
sj = µ(cj)
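A compact sketch of this loop in plain NumPy; random 2-D points stand in for document vectors (a real system would use TF-IDF doc vectors):

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # K random docs as seeds
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(K)])
    return labels, centroids

X = np.random.default_rng(1).random((30, 2))  # stand-in "documents"
labels, centroids = kmeans(X, K=3)
```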
Termination conditions
• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.
Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or
convergence to sub-optimal clusterings.
– Select good seeds using a heuristic (e.g., the
doc least similar to any existing mean)
– Try out multiple starting points
– Initialize with the results of another method.
[Figure: an example showing sensitivity to seeds, with points A–F:
starting with B and E as centroids you converge to {A,B,C} and {D,E,F};
starting with D and F you converge to {A,B,D,E} and {C,F}.]
How Many Clusters?
• Number of clusters K is given
– Partition n docs into predetermined number of clusters
• Finding the “right” number of clusters is part of the
problem
– Given docs, partition into an “appropriate” number of subsets.
– E.g., for query results - ideal value of K not known up front -
though UI may impose limits.
• Can usually take an algorithm for one flavor and convert to
the other.
K not specified in advance
• Given a clustering, define the Benefit for a doc to be
the cosine similarity to its centroid
• Define the Total Benefit to be the sum of the
individual doc Benefits.
Penalize lots of clusters
• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total Cost is KC.
• Define the Value of a clustering to be
Total Benefit − Total Cost.
• Find the clustering of highest value, over all choices of K.
– Total benefit increases with increasing K. But can stop when it
doesn’t increase by “much”. The Cost term enforces this.
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram)
from a set of documents.
[Example dendrogram: animal → vertebrate (fish, reptile, amphibian,
mammal) and invertebrate (worm, insect, crustacean).]
• One approach: recursive application of a partitional
clustering algorithm.
Dendrogram: Hierarchical Clustering
• Clustering obtained by cutting
the dendrogram at a desired
level: each connected
component forms a cluster.
Hierarchical Agglomerative Clustering (HAC)
• Starts with each doc in a separate cluster
– then repeatedly joins the closest pair of clusters, until
there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Similarity of the most cosine-similar (single-link)
• Complete-link
– Similarity of the “furthest” points, the least cosine-similar
• Centroid
– Clusters whose centroids (centers of gravity) are the most cosine-
similar
• Average-link
– Average cosine between pairs of elements
Single Link Agglomerative Clustering
• Use maximum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Can result in “straggly” (long and thin) clusters
due to chaining effect.
• After merging ci and cj, the similarity of the
resulting cluster to another cluster, ck, is:
$\mathrm{sim}(c_i \cup c_j, c_k) = \max(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$
Single Link Example
Complete Link Agglomerative Clustering
• Use minimum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Makes “tighter,” spherical clusters that are typically
preferable.
• After merging ci and cj, the similarity of the resulting
cluster to another cluster, ck, is:
$\mathrm{sim}(c_i \cup c_j, c_k) = \min(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$
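In practice one rarely hand-rolls HAC; a hedged sketch with SciPy (assuming it is available), where the linkage method selects single-, complete-, or average-link behavior:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(2).random((10, 4))     # stand-in doc vectors
D = pdist(X, metric='cosine')                    # condensed pairwise distances
Z = linkage(D, method='complete')                # or 'single', 'average'
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```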
Complete Link Example
Group Average Agglomerative Clustering
• Similarity of two clusters = average similarity of all
pairs within merged cluster.
$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j}\ \sum_{\substack{y \in c_i \cup c_j \\ y \neq x}} \mathrm{sim}(x, y)$
• Compromise between single and complete link.
• Two options:
– Averaged across all ordered pairs in the merged cluster
– Averaged over all pairs between the two original clusters
• No clear difference in efficacy
Computing Group Average Similarity
• Always maintain sum of vectors in each cluster.
$s(c_j) = \sum_{x \in c_j} x$
• Compute similarity of clusters in constant time:
$\mathrm{sim}(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$
(assuming unit-length document vectors, so that $x \cdot x = 1$)
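A quick NumPy check that this constant-time formula matches the brute-force average over ordered pairs (made-up vectors, normalized to unit length as assumed above):

```python
import numpy as np

rng = np.random.default_rng(3)
ci, cj = rng.random((4, 5)), rng.random((3, 5))
ci /= np.linalg.norm(ci, axis=1, keepdims=True)  # unit-length doc vectors
cj /= np.linalg.norm(cj, axis=1, keepdims=True)

merged = np.vstack([ci, cj])
n = len(merged)
S = merged @ merged.T
brute = (S.sum() - n) / (n * (n - 1))  # average over ordered pairs, y != x

s = merged.sum(axis=0)                 # the maintained sum vector
fast = (s @ s - n) / (n * (n - 1))     # constant-time formula
assert np.isclose(brute, fast)
```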
Seminar on Text Mining
Part Six
Text Categorization (TC)
Approaches to TC
There are two main approaches to TC:
• Knowledge Engineering
– The main drawback of the knowledge engineering approach is what
might be called the knowledge acquisition bottleneck: the huge
amount of highly skilled labor and expert knowledge required to
create and maintain the knowledge-encoding rules
• Machine Learning
– Requires only a set of manually classified training instances, which
are much less costly to produce.
Applications of TC
Three common TC applications are:
• Text Indexing
• Document sorting and text filtering
• Web page categorization
Text Indexing (TI)
• The task of assigning keywords from a controlled
vocabulary to text documents is called TI. If the keywords
are viewed as categories, then TI is an instance of the general
TC problem.
Document sorting and text filtering
• Examples:
– In a newspaper, the classified ads may need to be categorized
into “Personal”, “Car Sales”, “Real Estate”
– Emails can be sorted into categories such as “Complaints”,
“Deals”, “Job applications”
• Text filtering can be seen as document sorting
with only two bins: the “relevant” and the “irrelevant” docs.
Web page categorization
• A common use of TC is the automatic classification of
Web pages under the hierarchical catalogues posted by
popular Internet portals such as Yahoo.
• Whenever the number of docs in a category exceeds k, it
should be split into two or more subcategories.
• Web docs contain links, which may be an important
source of information for the classifier because linked docs
often share semantics.
Definition of the Problem
• The general text categorization task can be formally
defined as the task of approximating an unknown category
assignment function
$F : D \times C \to \{0, 1\}$
where D is the set of all possible docs and C is the set of
predefined categories.
• The value of F(d, c) is 1 if the document d belongs to
the category c and 0 otherwise.
• The approximation function $M : D \times C \to \{0, 1\}$ is called a
classifier, and the task is to build a classifier that produces
results as “close” as possible to the true category
assignment function F.
Types of Categorization
• Single-Label versus Multilabel Categorization
– In multilabel categorization the categories overlap, and a document
may belong to any number of categories.
• Document-Pivoted versus Category-Pivoted Categorization
– The difference is significant only in the case in which not all docs or
not all categories are immediately available.
• Hard Versus Soft Categorization
– Fully automated versus semiautomated
Machine Learning Approaches to TC
• Decision Tree classifiers
• Naïve Bayes (probabilistic classifier)
• K-Nearest Neighbor classification
• Rocchio methods
• Decision Rule classifiers
• Neural Networks
• Support Vector Machines
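As one concrete instance of these learners, a tiny scikit-learn Naive Bayes sketch on invented training texts; any of the listed methods could be swapped in:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["cheap deal buy now", "meeting agenda for monday",
         "buy cheap pills", "project review meeting"]
labels = ["spam", "work", "spam", "work"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["cheap meeting deal"]))
```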
References
• Books
– Introduction to Information Retrieval (2008)
– Managing Gigabytes (1999)
– The Text Mining Handbook
– Text Mining Application Programming
– Web Data Mining
References
• Power Points
– Introduction to Information Retrieval (2008)
– Text Mining Application Programming
– Web Data Mining
– Word classes and part of speech tagging, by Rada Mihalcea
(note: some of the material in that slide set was adapted from Chris Brew's (OSU) slides on part-of-speech tagging)