1. Prepared by: Mahmoud Rafeek Alfarra
Seminar Program
Document
Clustering and Classification
2. Out Line
Classification and its techniques
Clustering its techniques
Document clustering !!
Comparison
3. Classification: Definition
Given a collection of records (training set )
– Each record contains a set of attributes, one of the
attributes is the class.
Find a model for class attribute as a function of
the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
4. Classification: Definition
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Test Set
Learning
algorithm
Training Set
5. Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
6. Artificial Neural Networks (ANN)
X1 X2 X3 Y
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 0
0 1 0 0
0 1 1 1
0 0 0 0
X1
X2
X3
Y
Black box
Output
Input
Output Y is 1 if at least two of the three inputs are equal
to 1.
8. Artificial Neural Networks (ANN)
Model is an assembly of
inter-connected nodes
and weighted links
Output node sums up
each of its input value
according to the weights
of its links
Compare output node
against some threshold t
X1
X2
X3
Y
Black box
w1
t
Output
node
Input
nodes
w2
w3
)( tXwIY
i
ii
Perceptron Model
)( tXwsignY
i
ii
or
9. Clustering Definition
Clustering is a division of data into groups of
similar objects.
Each group is called cluster and consists of
objects that are similar between themselves and
dissimilar to objects of other groups .
11. Document clustering
Document clustering is an automatic grouping
of text documents into clusters so that
documents within a cluster have high similarity
in comparison to one another, but are dissimilar
to documents in other clusters.
12. The challenge
The problem of Document clustering is how to organize
a large set of documents of various topics and reach
satisfy organization. It can display as follow:
Given: A huge set of documents of various topics (shared,
related, totally different).
Required: Group the documents into a number of clusters
such that the intra-cluster similarity is maximized, and the
inter-cluster similarity is minimized.
14. Clustering’s Process
Knowledge
Document Data Model
Representation
•Document Cleaning
•Feature Selection or Extraction.
Documents samples
Clustering Algorithm
• Similarity Measure
• Criterion of Clustering
Cluster Validation
• External Indices
• Internal Indices
Results Interpretation
Clusters
1 2
3
4
15. Clustering Techniques
Clustering methods in general can be viewed
from different perspectives, the most widely
applied to text domain are:
Hierarchical Clustering
Partitioning Clustering
Neural Network based Clustering
16. Clustering Techniques
Suffix Tree Clustering algorithm
2015-09-26 16
D1: cat ate cheese
D2: mouse ate cheese too
D3: cat ate mouse too
and