Web Documents Classification Methods
1. Text Classification in Deep Web
Mining
Presented by:
Zakaria Suliman Zubi
Associate Professor
Computer Science Department
Faculty of Science
Sirte University
Sirte, Libya
2. Contents
• Abstract.
• Introduction.
• Deep Content Mining.
• Modeling the Web Documents.
• Web Documents Classification Methods.
• Preprocessing phases.
• Classifier Naive Bayesian (CNB).
• Classifier K-Nearest Neighbor (CK-NN).
• Implementation.
• Results and Discussion.
• Conclusion.
• References.
3. Abstract
• The World Wide Web is a rich source of knowledge that can be
useful to many applications.
– Source?
• Billions of web pages and billions of visitors and
contributors.
– What knowledge?
• e.g., the hyperlink structure and a variety of languages.
– Purpose?
• To improve users’ effectiveness in searching for
information on the web.
• Decision-making support or business management.
4. Continue
• Web’s Characteristics:
– Large size
– Unstructured
– Different data types: text, image, hyperlinks and user usage
information
– Dynamic content
– Time dimension
– Multilingual (e.g., Latin and non-Latin languages)
• Data Mining (DM) is a significant subfield of this area.
• It uses classification methods such as Classifier K-Nearest Neighbor
(CK-NN) and Classifier Naïve Bayes (CNB).
• The various activities and efforts in this area are referred to as
Web Mining.
6. Introduction
The Internet is probably the world's biggest database, where the data is available through
easily accessible techniques.
Data is held in various forms: text, multimedia, databases.
Web pages follow the HTML standard (or another member of the ML family), which gives them a kind
of structure, but not enough to use them easily in data mining.
Web mining is the application of data mining techniques to extract knowledge from
Web content, structure, and usage.
The deep web, also called the hidden web, invisible web or invisible Internet, refers to the
lower layers of the global network.
The easiest way to place the Deep Web is as a part of data mining in which web resources are
explored. It is commonly divided into three areas:
7. Introduction
Web mining is the application of data mining techniques to
extract knowledge from Web content, structure, and usage.
• Web content mining (WCM) is the closest one to "classic" data mining, as WCM mostly
operates on text, and text is generally the common way to put information on the Internet.
• Web structure mining aims to use the nature of the Internet connection structure, as it is a
bunch of documents connected with links.
• Web usage mining looks for useful patterns in logs and documents containing the history
of users' activity.
Figure: Web Mining taxonomy.
– Web Content Mining: Text, Image, Audio, Video, Structured Records
– Web Structure Mining: Hyperlinks, Document Structure
– Web Usage Mining: Web Server Logs, Application Level Logs, Application Server Logs
9. Deep Web Content Mining
• "Deep Web Content Mining is the process of extracting
useful information from the contents of Web
documents. The content may consist of text, images, audio,
video, or structured records such as lists and tables."
• “Deep Web Content mining refers to the overall
process of discovering potentially useful and
previously unknown information or knowledge from
the Web data.”
11. Modeling the Web Documents
• We represent the Web data in a binary format in which all
of the keywords are derived from the schema.
• If a keyword is in a frequent schema, a 1 is stored in the related
cell; otherwise a 0 is stored.
• The attributes of the frequent schemas are stated as follows:
– QI1: Data Mining Extract Hidden Data from Database = {Data,
Mining, Hidden, Database}, stop word {from};
– QI2: Web Mining discovers Hidden information on the Web = {Web,
Mining, Hidden}, stop words {on, the};
– QI3: Web content Mining is a branch in Web Mining = {Web,
Mining}, stop words {is, a, in};
– QI4: Knowledge discovery in Database = {Database}, stop word {in}.
12. Cont…
       Data  Mining  Extract  Database  Hidden  Web  Other key words  Stop words
QI1    1     1       1        1         1       0    0                1
QI2    0     1       0        0         1       2    2                2
QI3    0     2       0        0         0       2    2                3
QI4    0     0       0        1         0       0    3                1
Table 1. Web data represented on a binary scale
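The keyword-presence encoding described on the previous slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name `binary_row` and the tokenizing regular expression are assumptions, while the keyword columns and stop words are taken from the slides. It records strict 0/1 presence, so it yields 1 where Table 1 shows a count of 2.

```python
import re

# Keyword columns and stop words as listed on the slides.
KEYWORDS = ["data", "mining", "extract", "database", "hidden", "web"]
STOP_WORDS = {"from", "on", "the", "is", "a", "in"}

def binary_row(text):
    """Return a 0/1 presence vector over KEYWORDS for one schema (hypothetical helper)."""
    tokens = {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}
    return [1 if keyword in tokens else 0 for keyword in KEYWORDS]

schemas = {
    "QI1": "Data Mining Extract Hidden Data from Database",
    "QI2": "Web Mining discovers Hidden information on the Web",
    "QI3": "Web content Mining is a branch in Web Mining",
    "QI4": "Knowledge discovery in Database",
}

# One row per schema, one column per keyword.
matrix = {name: binary_row(text) for name, text in schemas.items()}
```

Each row of `matrix` can then be used directly as a feature vector for a classifier.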
14. Web Documents Classification Methods
• Web documents consist of text, images, video and audio.
• Text is the most abundant type of data in web documents.
• Automatic text classification is the process of assigning a
text document to one or more predefined categories based on
its content.
• Automatic classification of text web documents requires three main
consecutive phases to construct a classification system:
1. Collect the text documents in corpora and tag them.
2. Select a set of features to represent the defined classes.
3. Train and test the classification algorithms using the
corpora collected in the first phase.
15. Cont…
The text classification problem is composed of several subproblems, such as:
Document indexing: Document indexing concerns the way the document's keywords are
extracted. There are two main approaches: the first considers index terms as bags of words,
and the second regards the index terms as phrases.
Weight assignment: Weight assignment techniques associate a real number
that ranges from 0 to 1 with each of a document's terms. These weights are required to
classify newly arrived documents.
Learning-based text classification algorithm: The text classification algorithm used
is an inductive learning algorithm based on probability theory. Different
models have been emphasized, such as Naive Bayesian models (which consistently show
good results and are widely used in text classification). Other text classification
methods have emerged to categorize documents, such as K-Nearest
Neighbor (KNN), which computes the distances between the document's index
terms and the known terms of each category. The accuracy will be tested by the
k-fold cross-validation method.
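The k-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `k_fold_indices` and `cross_validate`, and the `train_fn`/`predict_fn` callback interface, are hypothetical and not from the slides. The sketch splits the data in order without shuffling; real corpora sorted by category would normally be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Average accuracy over k folds for any classifier given as two callbacks."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Passing the classifier as callbacks lets the same loop score both CNB and CK-NN on identical folds, which keeps the comparison fair.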
17. Preprocessing phases
First phase: the preprocessing step, in which documents are prepared
to make them adequate for further use. Stop word removal and rearrangement of
the document contents are some of the steps in this phase.
The data used in this work were collected from many news web sites. The data set
consists of 1562 Arabic documents of different lengths that belong to 6
categories (Economic, Cultural, Political, Social, Sports, General). Table 2 gives
the number of documents for each category.
Table 2. Number of Documents per Category
18. Continue
Second phase: the weight assignment phase, defined as
the assignment of a real number between 0 and 1 to each
keyword. This number indicates the importance of the
keyword inside the document.
Many methods have been developed; the most widely used model is the tf-idf
weighting factor. The weight of each keyword is computed by multiplying the
term frequency (tf) by the inverse document frequency (idf), where:
f_ik = number of occurrences of term t_k in document D_i.
tf_ik = f_ik / max_l(f_il), the normalized frequency of the term in the document.
df_k = number of documents that contain t_k.
idf_k = log(d / df_k), where d is the total number of documents.
w_ik = tf_ik * idf_k is the term weight; the computed w_ik is a real number in [0, 1].
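The tf-idf weighting defined above can be sketched as follows. This is a minimal illustration, not the presenter's code; the function name `tf_idf` and its `normalize` flag are assumptions. With a raw logarithm, idf alone can exceed 1, so the sketch optionally divides idf by log(d) to keep every weight inside [0, 1]; that scaling is an assumption and is not stated on the slide.

```python
import math
from collections import Counter

def tf_idf(documents, normalize=True):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    d = len(documents)
    # df[term] = number of documents that contain the term.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    # Assumption: dividing idf by log(d) keeps weights in [0, 1].
    scale = math.log(d) if normalize and d > 1 else 1.0
    weights = []
    for tokens in documents:
        f = Counter(tokens)
        max_f = max(f.values()) if f else 1
        weights.append({
            # tf = f_ik / max_l(f_il); idf = log(d / df_k)
            term: (count / max_f) * (math.log(d / df[term]) / scale)
            for term, count in f.items()
        })
    return weights
```

A term that appears in every document gets weight 0, which matches the intuition that it cannot discriminate between categories.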
20. Classifier Naive Bayesian (CNB)
Bayesian learning is a probability-driven algorithm based on Bayes' probability theorem. It is
highly recommended in text classification.
The Naive Bayesian classifier can often outperform more sophisticated classification methods.
The classifier's task is to categorize incoming objects into their appropriate class.
$$P(\text{class} \mid \text{document}) = \frac{P(\text{class})\, P(\text{document} \mid \text{class})}{P(\text{document})}$$
Where:
P(class | document): the probability of a class given a document, i.e. the probability that a
given document D belongs to a given class C. This is our target.
P(document): the probability of a document. P(document) is a constant divisor in
every calculation, so we can ignore it.
P(class): the probability of a class (or category). We can compute it as the number of
documents in the category divided by the number of documents in all categories.
P(document | class): the probability of a document in a given class.
A document can be modeled as a set of words, thus P(document | class) can be written as:
$$P(\text{document} \mid \text{class}) = \prod_{i} p(\text{word}_i \mid C)$$
where p(word_i | C) is the probability that the i-th word of a given document occurs in a
document from class C, which can be estimated from the training documents of that class.
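The probabilities described on this slide can be combined into a small classifier, sketched below. This is a minimal illustration under stated assumptions: the function names `train_nb` and `classify_nb` are hypothetical, and the add-one (Laplace) smoothing and log-space scoring are common choices that the slide does not confirm. P(document) is dropped, as the slide suggests, because it is the same for every class.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels):
    """documents: list of token lists; labels: one class name per document."""
    class_docs = Counter(labels)          # documents per class, for P(class)
    word_counts = defaultdict(Counter)    # word frequencies per class
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary

def classify_nb(model, tokens):
    """Return the class maximizing log P(class) + sum of log p(word_i | class)."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_class, best_score = None, -math.inf
    for c, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[c].values())
        for w in tokens:
            # Assumption: add-one smoothing so unseen words do not zero the product.
            p = (word_counts[c][w] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Summing logarithms instead of multiplying probabilities avoids numeric underflow on long documents while selecting the same class.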
22. Classifier K-Nearest Neighbor
(CK-NN)
• K-Nearest Neighbor is a widely used classifier, especially in text mining,
because of its simplicity and efficiency.
• It is a supervised learning algorithm in which a new query instance
is classified based on the categories of its K nearest neighbors.
• Its training phase consists of nothing more than storing all training examples as the
classifier.
• It works based on the minimum distance from the query instance to the training
samples to determine the K nearest neighbors.
• After collecting the K nearest neighbors, we take a simple majority of these
K nearest neighbors as the prediction for the query instance.
• The CK-NN algorithm uses several multivariate attributes Xi to
classify the object Y. We deal only with quantitative Xi and
binary (nominal) Y.
23. Continue
• Example: Suppose that K is set to 8 (there are 8 nearest
neighbors) as a parameter of this algorithm. The distance between the
query instance and all the training samples is then computed, which is possible because
all Xi are quantitative.
• The distances from all training samples to the query are sorted, and the Kth
smallest distance is determined. A training sample is included as a nearest neighbor
if its distance to the query is less than or equal to the Kth smallest distance.
• The unknown sample is assigned the most common class among its K nearest
neighbors.
• The K training samples closest to the unknown sample are its K nearest neighbors.
Closeness is defined in terms of Euclidean distance, where the
Euclidean distance between two points X = (x1, x2,...,xn) and Y = (y1, y2,...,yn) is:
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
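The procedure on this slide (compute distances, keep the K closest samples, take the majority class) can be sketched as follows. This is a minimal illustration; the function names `euclidean` and `knn_predict` are assumptions, and ties between classes are broken in favor of the class whose member appears first among the nearest neighbors, which the slide does not specify.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(training, query, k):
    """training: list of (vector, label) pairs. Returns the majority label of the k nearest."""
    # Sort all training samples by distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In a text setting the vectors would be the tf-idf weights of each document over a shared vocabulary.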
25. Implementation
The implementation of the proposed Deep Web Text Classifier
(DWTC) demonstrates the importance of classifying Latin text in
web documents for information retrieval, and illustrates both the
keyword extraction and the text classifiers used in the algorithms
implementation:
Keywords extraction: Text web documents are scanned to find the
keywords, and each one is normalized. The normalization process consists of
removing stop words, punctuation marks and non-letters from the
Latin text, as shown in Table 3.
Table 3. Examples of stop words and non-letters.
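The normalization step described on this slide can be sketched as follows. This is a minimal illustration: the function name `extract_keywords` is an assumption, and the stop-word list is a hypothetical stand-in, since the contents of Table 3 are not reproduced in this text.

```python
import re

# Hypothetical stop-word list; the deck's Table 3 is not available here.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def extract_keywords(text, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, digits and other non-letters, then remove stop words."""
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in stop_words]
```

For example, `extract_keywords("Web mining, in 2010: the Deep-Web!")` keeps only the letter sequences that are not stop words.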
26. Continue
Terms weighting: There are two criteria:
First criterion: the more times a term occurs in documents that belong to
some category, the more relevant it is to that category.
Second criterion: the more a term appears in different documents representing different
categories, the less useful it is for discriminating between documents
belonging to different categories.
In this implementation we used the common approach, normalized
tf×idf, to overcome the problem of varying document lengths.
Algorithms implementation: this implementation was mainly developed for testing the
effectiveness of the CK-NN and CNB algorithms when applied to Latin text.
We supply a set of labeled text documents to the system; the labels
indicate the class or classes that each text document belongs to. All documents
in the data set should be labeled in order to train the system and then test it.
The system sees the labels of the training documents but not those of the test
set.
The system compares the two classifiers and reports the classifier with the higher
accuracy for the current labeled text documents.
The system compares these results, selects the best average accuracy rate for
each classifier, and uses the classifier with the greater average accuracy
to start the retrieval process.
28. Results and Discussion
• The data used in this work were collected from many web sites. The
data set consists of 3533 Latin and non-Latin text documents of
different lengths that belong to 6 categories: Economic, Cultural,
Political, Social, Sports and General (see Table 2).
• To test the system, the documents in the data set were preprocessed
to find the main categories.
• Various splitting percentages were used to see how the number of
training documents impacts classification effectiveness.
• Different k values, from 1 up to 20 in order, were used to
find the best results for CK-NN. Effectiveness started to decline at
k>15.
• The two algorithms were compared on the labeled sample data, and the
better classifier was indicated in the system.
29. Continue
• The k-fold cross-validation method is used to test the accuracy of the
system.
• Our results are roughly close to those of other developers.
• The results of the conducted experiments are included in the last columns of
Table 4 and Table 5.
• It can be inferred from the results that the Classifier K-Nearest Neighbor
(CK-NN), with an average of 93.08%, performs better than the Classifier Naïve
Bayesian, which had 90.03%, on Latin text.
• This means that the DWTC system in this case will use CK-NN for Latin
text classification and extraction instead of CNB.
• For non-Latin text, the DWTC system will use CNB for text
classification and extraction, which has an average of 91.05%,
instead of CK-NN, which has an average of 88.06%, as indicated in
Table 5.
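The selection rule described on this slide (use whichever classifier has the higher average accuracy for the given kind of text) can be sketched as follows. This is a minimal illustration; the dictionary layout and the function name `select_classifier` are assumptions, while the percentages are the averages reported above.

```python
# Average accuracies (%) as reported on this slide.
AVERAGE_ACCURACY = {
    "Latin": {"CK-NN": 93.08, "CNB": 90.03},
    "non-Latin": {"CK-NN": 88.06, "CNB": 91.05},
}

def select_classifier(script, accuracies=AVERAGE_ACCURACY):
    """Return the name of the classifier with the higher average accuracy for a script."""
    scores = accuracies[script]
    return max(scores, key=scores.get)
```

With the reported figures, the rule picks CK-NN for Latin text and CNB for non-Latin text, matching the choice stated on the slide.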
31. Conclusion
• The use of Classifier K-Nearest Neighbor (CK-NN) and
Classifier Naïve Bayes (CNB) for the classification of Arabic text was evaluated.
• A special corpus was developed, consisting of 3533 documents that
belong to 6 categories.
• A feature set of extracted keywords and terms weighting was used in order to improve
performance.
• As a result, we applied two algorithms to classify the text documents, with
a satisfactory number of patterns for each category.
• The accuracy of the system was measured using the k-fold cross-validation
method.
• We proposed an empirical Latin and non-Latin text classifier system called the
Deep Web Text Classifier (DWTC).
• The system compares the results of both classifiers (CK-NN, CNB)
and selects the one with the best average accuracy for Latin or non-Latin
text.