Text Classification in Deep Web Mining

Presented by:
Zakaria Suliman Zubi
Associate Professor
Computer Science Department
Faculty of Science
Sirte University
Sirte, Libya
Contents
•   Abstract.
•   Introduction.
•   Deep Content Mining.
•   Modeling the Web Documents.
•   Web Documents Classification Methods.
•   Preprocessing phases.
•   Classifier Naive Bayesian (CNB).
•   Classifier K-Nearest Neighbor (CK-NN).
•   Implementation.
•   Results and Discussion.
•   Conclusion.
•   References.



Abstract

• The World Wide Web is a rich source of knowledge that can be
  useful to many applications.

   – Source?
      • Billions of web pages and billions of visitors and
        contributors.

   – What knowledge?
      • e.g., the hyperlink structure and a variety of languages.

   – Purpose?
      • To improve users’ effectiveness in searching for
        information on the web.
      • Decision-making support or business management.
Continue
• Web’s Characteristics:
  – Large size
  – Unstructured
  – Different data types: text, images, hyperlinks, and user usage
    information
  – Dynamic content
  – Time dimension
  – Multilingual (e.g., Latin and non-Latin languages)

• Data Mining (DM) is a significant subfield of this area.

• Classification methods such as the Classifier K-Nearest Neighbor
  (CK-NN) and the Classifier Naïve Bayes (CNB) are used.

• The various activities and efforts in this area are referred to as
  Web Mining.
Introduction

 The Internet is probably the world’s biggest database, where data is available through
easily accessible techniques.

 Data is held in various forms: text, multimedia, and databases.

 Web pages keep to the HTML standard (or another markup-language family member), which
gives them a kind of structural form, but one not sufficient for easy use in data mining.

 Web mining – the application of data mining techniques to extract knowledge from
Web content, structure, and usage.

 The deep web, also called the hidden web, invisible web, or invisible Internet, refers to the
lower layers of the global network.

 The easiest way is to treat deep web mining as a part of data mining, where web resources are
explored. It is commonly divided into three branches.
Continue

Web mining – the application of data mining techniques to extract
knowledge from Web content, structure, and usage.

[Figure: taxonomy of Web Mining – Web Content Mining (text, image,
audio, video, structured records), Web Structure Mining (hyperlinks,
document structure), Web Usage Mining (web server logs,
application-level logs, application server logs)]

• Web content mining is the closest one to “classic” data mining;
  WCM mostly operates on text, since text is generally the common
  way to put information on the Internet.

• Web structure mining’s goal is to use the nature of the Internet’s
  connection structure, as it is a bunch of documents connected
  with links.

• Web usage mining is looking for useful patterns in logs and
  documents containing the history of users’ activity.
Deep Web Content Mining

• “Deep Web Content Mining is the process of extracting
  useful information from the contents of Web
  documents. It may consist of text, images, audio,
  video, or structured records such as lists and tables.”

• “Deep Web Content Mining refers to the overall
  process of discovering potentially useful and
  previously unknown information or knowledge from
  the Web data.”
Modeling the Web Documents
 • We represent the Web data in binary format, where all of
   the keywords are derived from the schema.
 • If a keyword is in a frequent schema, a 1 is stored in the related
   cell; otherwise a 0 is stored in it.
 • The attributes of frequent schemas are stated as follows:

    – QI1: Data Mining Extract Hidden Data from Database = {Data,
      Mining, Hidden, Database}, stop word {from};

    – QI2: Web Mining discovers Hidden information on the Web = {Web,
      Mining, Hidden}, stop words {on, the};

    – QI3: Web content Mining is a branch in Web Mining = {Web,
      Mining}, stop words {is, a, in};

    – QI4: Knowledge discovery in Database = {Database}, stop word {in}.
Cont…

       Data  Mining  Extract  Database  Hidden  Web  Other keywords  Stop words
QI1     1      1       1        1         1      0         0            1
QI2     0      1       0        0         1      2         2            2
QI3     0      2       0        0         0      2         2            3
QI4     0      0       0        1         0      0         3            1

             Table 1. Web data represented on a binary scale
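As a minimal sketch (not the authors' code) of the representation above, each query/schema string can be turned into a 0/1 vector over the schema keywords after stop-word removal; the stop-word and keyword lists here are taken from the QI examples:

```python
# Assumed lists taken from the QI examples above.
STOP_WORDS = {"from", "on", "the", "is", "a", "in"}
KEYWORDS = ["Data", "Mining", "Extract", "Database", "Hidden", "Web"]

def binary_vector(text: str) -> list[int]:
    """Return a 0/1 vector over KEYWORDS for one query/schema string."""
    # Tokenize, lowercase, and drop stop words before matching keywords.
    tokens = {t.lower() for t in text.split() if t.lower() not in STOP_WORDS}
    return [1 if k.lower() in tokens else 0 for k in KEYWORDS]

print(binary_vector("Data Mining Extract Hidden Data from Database"))
# [1, 1, 1, 1, 1, 0]  (row QI1 of the table, keyword columns)
```

This reproduces the keyword columns of row QI1; the table's count-valued cells (e.g., 2 for repeated terms) would need term counts rather than a strict 0/1 encoding.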
Web Documents Classification Methods

•    Web documents consist of text, images, videos, and audio.
•    Text is by far the most prevalent kind of data in web documents.
•    Automatic text classification is the process of assigning a
     text document to one or more predefined categories based on
     its content.
•    Automatic text web document classification requires three main
     consecutive phases in constructing a classification system, listed as
     follows:

        1.   Collect the text documents in corpora and tag them.

        2.   Select a set of features to represent the defined classes.

        3.   Train and test the classification algorithms using the
             corpora collected in the first stage.
Cont…
The text classification problem is composed of several sub-problems, such as:

Document indexing: concerned with the way a document's keywords are
     extracted. There are two main approaches to document indexing: the
     first considers index terms as bags of words, and the second regards
     the index terms as phrases.

Weight assignment: weight-assignment techniques associate with each of a
     document's terms a real number ranging from 0 to 1; these term weights
     are required to classify newly arrived documents.

Learning-based text classification algorithms: the text classification
     algorithm used here is an inductive learning algorithm based on
     probability theory, and different models have been emphasized, such as
     Naive Bayesian models (which consistently show good results and are
     widely used in text classification). Other text classification methods
     have emerged to categorize documents, such as K-Nearest Neighbor (KNN),
     which computes the distances between the document's index terms and the
     known terms of each category. Accuracy will be tested by the k-fold
     cross-validation method.
Preprocessing phases

The data used in this work are collected from many news web sites. The
data set consists of 1562 Arabic documents of different lengths belonging
to 6 categories (Economic, Cultural, Political, Social, Sports, General);
Table 2 gives the number of documents per category.

Table 2. Number of Documents per Category

First phase: the preprocessing step, where documents are prepared to make
them adequate for further use; stop-word removal and rearrangement of the
document contents are some of the steps in this phase.
Continue
The second phase is the weighting-assignment phase. It is defined as
the assignment of a real number between 0 and 1 to each keyword; this
number indicates the importance of the keyword inside the document.
Many methods have been developed, and the most widely used model is the tf-idf
weighting factor. The weight of each keyword is computed by multiplying the
term frequency (tf) with the inverse document frequency (idf), where:
f_ik = occurrences of term t_k in document D_i.
tf_ik = f_ik / max_l(f_il), the normalized term frequency in the document.
df_k = number of documents that contain t_k.
idf_k = log(d / df_k), where d is the total number of documents and df_k is
the number of documents that contain term t_k.
w_ik = tf_ik × idf_k, the term weight; the computed w_ik is a real number ∈ [0, 1].
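The formulas above can be sketched directly; this is a minimal illustrative implementation (function and variable names are my own, not the system's), using the max-frequency-normalized tf and a base-10 log idf:

```python
import math

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute w_ik = tf_ik * idf_k for each term in each tokenized document."""
    d = len(docs)
    # df_k: number of documents containing term t_k
    df: dict[str, int] = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        freq: dict[str, int] = {}
        for term in doc:
            freq[term] = freq.get(term, 0) + 1
        max_f = max(freq.values())
        # tf_ik = f_ik / max_l(f_il);  idf_k = log10(d / df_k)
        weights.append({t: (f / max_f) * math.log10(d / df[t])
                        for t, f in freq.items()})
    return weights

w = tfidf([["web", "mining", "web"], ["data", "mining"]])
# w[0]["web"] ≈ 0.301 (tf = 1, idf = log10(2));
# w[0]["mining"] = 0 ("mining" appears in every document)
```

A term occurring in every document gets idf = 0, matching the second weighting criterion discussed later: ubiquitous terms do not discriminate between categories.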
Classifier Naive Bayesian (CNB)

 Bayesian learning is a probability-driven algorithm based on Bayes'
probability theorem; it is highly recommended in text classification.
The Naive Bayesian classifier can often outperform more sophisticated
classification methods. The classifier's task is to categorize incoming
objects into their appropriate class.

P(class | document): the probability of a class given a document, i.e.,
the probability that a given document D belongs to a given class C; this
is our target.

P(document): the probability of a document. Note that P(document) is a
constant divisor in every calculation, so we can ignore it.

P(class): the probability of a class (or category); we can compute it as
the number of documents in the category divided by the number of
documents in all categories.

P(document | class): the probability of a document in a given class.
A document can be modeled as a set of words, thus P(document | class)
can be written in terms of p(word_i | C), the probability that the i-th
word occurs in a document from class C.
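A minimal multinomial Naive Bayes sketch of the idea above (the training data and category names are illustrative, not from the paper): pick argmax_C P(C) · Π_i P(word_i | C), working in log space and using Laplace smoothing for unseen words:

```python
import math
from collections import Counter

class NaiveBayes:
    def fit(self, docs: list[list[str]], labels: list[str]) -> None:
        self.classes = set(labels)
        # P(class) = documents in category / documents in all categories
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
        self.vocab = {w for cnt in self.word_counts.values() for w in cnt}

    def predict(self, doc: list[str]) -> str:
        best, best_lp = None, float("-inf")
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            lp = math.log(self.prior[c])  # P(document) is ignored, as above
            for w in doc:
                # Laplace smoothing: (count + 1) / (total + |V|)
                lp += math.log((self.word_counts[c][w] + 1) /
                               (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = NaiveBayes()
nb.fit([["goal", "match"], ["vote", "election"]], ["Sports", "Political"])
print(nb.predict(["match", "goal"]))  # Sports
```

Summing logs instead of multiplying raw probabilities avoids numeric underflow on long documents.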
Classifier K-Nearest Neighbor (CK-NN)

•    K-Nearest Neighbor is a widely used text classifier, especially in text
     mining, because of its simplicity and efficiency.

•    It is a supervised learning algorithm where the result of a new query
     instance is classified based on the K-nearest-neighbor category
     measurement.

•    Its training phase consists of nothing more than storing all training
     examples as the classifier.

•    It works based on the minimum distance from the query instance to the
     training samples to determine the K nearest neighbors.

•    After collecting the K nearest neighbors, we take a simple majority of
     these K nearest neighbors as the prediction for the query instance.

•    The CK-NN algorithm works with several multivariate attributes X_i that
     will be used to classify the object Y. We will deal only with
     quantitative X_i and binary (nominal) Y.
Continue
• Example: suppose the K factor is set to 8 (there are 8 nearest
  neighbors) as a parameter of this algorithm. The distance between the
  query instance and all the training samples is then computed; only
  quantitative X_i are involved.
• A training sample is included among the nearest neighbors if its
  distance to the query is less than or equal to the K-th smallest
  distance; in this case the distances of all training samples to the
  query are sorted, and the K-th one is taken as the cut-off distance.
• The unknown sample is assigned the most common class among its K
  nearest neighbors.
• The K closest training samples are the K nearest neighbors of the
  unknown sample. Closeness is defined in terms of Euclidean distance,
  where the Euclidean distance between two points X = (x1, x2, ..., xn)
  and Y = (y1, y2, ..., yn) is:

  d(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
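The steps above can be sketched as follows (a minimal illustration, with made-up feature vectors and category labels): compute Euclidean distances to all training samples, keep the k closest, and take a majority vote:

```python
import math
from collections import Counter

def euclidean(x, y) -> float:
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_x, train_y, query, k=3) -> str:
    # Sort all training samples by distance to the query instance.
    dists = sorted(zip((euclidean(x, query) for x in train_x), train_y))
    # Majority vote among the k nearest neighbors.
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

train_x = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
train_y = ["Sports", "Sports", "Economic", "Economic"]
print(knn_predict(train_x, train_y, (0.15, 0.85), k=3))  # Sports
```

In practice the feature vectors would be the tf-idf term weights computed in the preprocessing phase rather than these toy 2-D points.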
Implementation

The implementation of the proposed Deep Web Text Classifier
(DWTC) demonstrates the importance of classifying Latin text in
web documents for information-retrieval needs, and illustrates both
the keyword extraction and the text classifiers used in the algorithm
implementation:
     Keyword extraction: text web documents are scanned to find the
      keywords, and each one is normalized. The normalization process
      consists of removing stop words, punctuation marks, and non-letters,
      as shown in Table 3.
     Some stop words:

                         Table 3. Examples of stop words and non-letters.
Continue
 Term weighting: there are two criteria:
     First criterion: the more times a term occurs in documents belonging to
        some category, the more relevant it is to that category.
     Second criterion: the more a term appears in different documents representing
        different categories, the less useful it is for discriminating between
        documents belonging to different categories.
     In this implementation we used the common normalized tf×idf approach to
        overcome the problem of varying document lengths.
 Algorithm implementation: this implementation was mainly developed for testing the
  effectiveness of the CK-NN and CNB algorithms when applied to Latin text.
     A set of labeled text documents is supplied to the system; the labels indicate
        the class or classes each text document belongs to. All documents in the data
        set should be labeled in order to train the system and then test it.
     The system sees the labels of the training documents but not those of the test
        set.
     The system compares the two classifiers and reports the higher-accuracy
        classifier for the current labeled text documents.
 The system compares these results, selects the best average accuracy rate for each
  classifier, and uses the greater average accuracy rate; the system chooses the higher
  rate to start the retrieval process.
Results and Discussion
  • The data used in this work are collected from many web sites. The
    data set consists of 3533 Latin and non-Latin text documents of
    different lengths belonging to 6 categories (Economic, Cultural,
    Political, Social, Sports, and General); see Table 2.

 • To test the system, the documents in the data set were preprocessed
   to find the main categories.

• Various splitting percentages were used to see how the number of
  training documents impacts classification effectiveness.

• Different k values, from 1 up to 20, were tried in order to
  find the best results for CK-NN. Effectiveness started to decline at
  k > 15.
• A comparison between the two algorithms was made on the labeled
  sample data, and the better classifier is indicated in the system.
Continue
•   The k-fold cross-validation method is used to test the accuracy of the
    system.

•   Our results are roughly in line with other developers' results.

•   The results of the conducted experiments are included in the last columns
    of Table 4 and Table 5.

•   The results below show that the Classifier K-Nearest Neighbors
    (CK-NN), with an average of 93.08%, performed better than the Classifier
    Naïve Bayesian, which had 90.03%, on Latin text.

•   This means that in this case the DWTC system will use CK-NN for Latin
    text classification and extraction instead of CNB.

•   In the case of non-Latin text, the DWTC system will use CNB text
    classification, which has an average of 91.05% in non-Latin classification
    and extraction, instead of CK-NN with an average of 88.06%, as indicated in
    Table 5.
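The k-fold accuracy test mentioned above can be sketched generically (a simplified illustration: `classify` is a hypothetical placeholder standing in for either CK-NN or CNB, and leftover samples when n is not divisible by k are folded into the last partition):

```python
def k_fold_accuracy(samples, labels, classify, k=10) -> float:
    """Average accuracy over k folds; classify(train_x, train_y, x) -> label."""
    n = len(samples)
    fold = max(1, n // k)
    accs = []
    for i in range(0, n, fold):
        # Hold out one fold as the test set, train on the rest.
        test_x, test_y = samples[i:i + fold], labels[i:i + fold]
        train_x = samples[:i] + samples[i + fold:]
        train_y = labels[:i] + labels[i + fold:]
        correct = sum(1 for x, y in zip(test_x, test_y)
                      if classify(train_x, train_y, x) == y)
        accs.append(correct / len(test_x))
    return sum(accs) / len(accs)  # average accuracy over folds

# Trivial sanity check: a classifier that always answers "A" on all-"A" data.
acc = k_fold_accuracy(list(range(10)), ["A"] * 10,
                      lambda tx, ty, x: "A", k=5)
print(acc)  # 1.0
```

Averaging over folds is what yields the single accuracy figures (e.g., 93.08% vs. 90.03%) that DWTC compares when choosing a classifier.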
Conclusion
• An evaluation of the use of the Classifier K-Nearest Neighbor (CK-NN) and
  the Classifier Naïve Bayes (CNB) for the classification of Arabic text was
  considered.

• A special corpus was developed, consisting of 3533 documents belonging to
  6 categories.

• An extracted feature set of keywords and term weighting were used in order
  to improve the performance.

• As a result, we applied the two classification algorithms to the text
  documents with a satisfactory number of patterns for each category.

• Accuracy was measured using the k-fold cross-validation method to test the
  system.

• We proposed an empirical Latin and non-Latin text classifier system called
  the Deep Web Text Classifier (DWTC).
• The system compares the results of both classifiers used (CK-NN, CNB)
  and selects the best average accuracy rate in the case of Latin or
  non-Latin text.
Thank you !!!

A Comparative Study of Data Mining Methods to Analyzing Libyan National Crime...
 
Applying web mining application for user behavior understanding
Applying web mining application for user behavior understandingApplying web mining application for user behavior understanding
Applying web mining application for user behavior understanding
 
Edi text
Edi textEdi text
Edi text
 
Model
ModelModel
Model
 
Ibtc dwt hybrid coding of digital images
Ibtc dwt hybrid coding of digital imagesIbtc dwt hybrid coding of digital images
Ibtc dwt hybrid coding of digital images
 
Information communication technology in libya for educational purposes
Information communication technology in libya for educational purposesInformation communication technology in libya for educational purposes
Information communication technology in libya for educational purposes
 

Kürzlich hochgeladen

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Kürzlich hochgeladen (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Web Documents Classification Methods

  • 1. Text Classification in Deep Web Mining Presented by: Zakaria Suliman Zubi Associate Professor Computer Science Department Faculty of Science Sirte University Sirte, Libya 1
  • 2. Contents • Abstract. • Introduction. • Deep Content Mining. • Modeling the Web Documents. • Web Documents Classification Methods. • Preprocessing phases. • Classifier Naive Bayesian (CNB). • Classifier K-Nearest Neighbor (CK-NN). • Implementation. • Results and Discussion. • Conclusion. • References. 2
  • 3. Abstract • The World Wide Web is a rich source of knowledge that can be useful to many applications. – Source? • Billions of web pages and billions of visitors and contributors. – What knowledge? • e.g., the hyperlink structure and a variety of languages. – Purpose? • To improve users’ effectiveness in searching for information on the web. • Decision-making support or business management. 3
  • 4. Continue
    – Web's Characteristics:
      – Large size
      – Unstructured
      – Different data types: text, images, hyperlinks and user usage information
      – Dynamic content
      – Time dimension
      – Multilingual (i.e. Latin and non-Latin languages)
    – Data Mining (DM) is a significant subfield of this area.
    – Classification methods such as Classifier K-Nearest Neighbor (CK-NN) and Classifier Naïve Bayes (CNB) are used.
    – The various activities and efforts in this area are referred to as Web Mining.
  • 6. Introduction
    – The Internet is probably the world's biggest database, where data is available through easily accessible techniques. Data is held in various forms: text, multimedia, databases.
    – Web pages keep to the HTML standard (or another markup-language family member), which gives them a kind of structural form, but not one sufficient for direct use in data mining. Web mining is the application of data mining techniques to extract knowledge from Web content, structure, and usage.
    – The deep web, also called the hidden web, invisible web or invisible Internet, refers to the lower levels of the global network. The easiest way is to treat deep web mining as a part of data mining in which web resources are explored. It is commonly divided into three areas:
  • 7. Introduction
    – Web mining is the application of data mining techniques to extract knowledge from Web content, structure, and usage.
    – Web usage mining looks for useful patterns in logs and documents containing the history of a user's activity.
    – Web content mining is the closest to "classic" data mining; WCM mostly operates on text, since it is generally common to put information onto the Internet as text.
    – Web structure mining's goal is to use the nature of the Internet's connection structure, as it is a bunch of documents connected with links.
    – [Diagram] Web Mining branches into: Web Content Mining (text, images, audio, video, structured records), Web Structure Mining (hyperlinks, document structure), and Web Usage Mining (web server logs, application level logs, application server logs).
  • 9. Deep Web Content Mining
    – "Deep Web Content Mining is the process of extracting useful information from the contents of Web documents. It may consist of text, images, audio, video, or structured records such as lists and tables."
    – "Deep Web Content Mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from Web data."
  • 11. Modeling the Web Documents
    – We represent the Web data in a binary format where all of the keywords are derived from the schema.
    – If a keyword is in a frequent schema, a 1 is stored in the related cell; otherwise a 0 is stored in it.
    – The attributes of the frequent schemas are stated as follows:
      – QI1: Data Mining Extract Hidden Data from Database = {Data, Mining, Hidden, Database}, stop word {from};
      – QI2: Web Mining discovers Hidden information on the Web = {Web, Mining, Hidden}, stop words {on, the};
      – QI3: Web content Mining is a branch in Web Mining = {Web, Mining}, stop words {is, a, in};
      – QI4: Knowledge discovery in Database = {Database}, stop word {in}.
  • 12. Cont…

          Data  Mining  Extract  Database  Hidden  Web  Other key words  Stop words
    QI1    1      1       1        1         1      0         0              1
    QI2    0      1       0        0         1      2         2              2
    QI3    0      2       0        0         0      2         2              3
    QI4    0      0       0        1         0      0         3              1

    Tab 1. Representing web data on a binary scale
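The representation above can be sketched in a few lines of Python (an illustrative sketch, not the authors' code: the keyword and stop-word lists are small assumed samples taken from the QI examples, and the sketch counts term occurrences).

```python
# Term-count vector for a query/schema string, in the spirit of Tab 1.
STOP_WORDS = {"from", "on", "the", "is", "a", "in"}          # assumed sample list
KEYWORDS = ["data", "mining", "extract", "database", "hidden", "web"]

def vectorize(text):
    """Return counts for each keyword, plus 'other key words' and 'stop words'."""
    counts = {k: 0 for k in KEYWORDS}
    other = stop = 0
    for w in (t.lower() for t in text.split()):
        if w in STOP_WORDS:
            stop += 1
        elif w in counts:
            counts[w] += 1
        else:
            other += 1
    return [counts[k] for k in KEYWORDS] + [other, stop]
```

For example, `vectorize("Data Mining Extract Hidden Data from Database")` yields the keyword counts followed by the other-word and stop-word totals.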
  • 14. Web Documents Classification Methods
    – Web documents consist of text, images, videos and audio.
    – Text is by far the most prevalent kind of data in web documents.
    – Automatic text classification is the process of assigning a text document to one or more predefined categories based on its content.
    – Automatic web text document classification requires three main consecutive phases in constructing a classification system:
      1. Collect the text documents in corpora and tag them.
      2. Select a set of features to represent the defined classes.
      3. Train and test the classification algorithms using the corpora collected in the first phase.
  • 15. Cont…
    The text classification problem is composed of several sub-problems:
    – Document indexing: concerns the way the document's keywords are extracted. There are two main approaches: the first considers index terms as bags of words, the second regards index terms as phrases.
    – Weighting assignment: weight assignment techniques associate with each of a document's terms a real number ranging from 0 to 1; these weights are required to classify newly arrived documents.
    – Learning-based text classification: the algorithm used is an inductive learning algorithm based on probability theory, and different models have been emphasized, such as Naive Bayesian models (which consistently show good results and are widely used in text classification). Other text classification methods have emerged to categorize documents, such as K-Nearest Neighbor (KNN), which computes the distances between the document's index terms and the known terms of each category. Accuracy is tested by the k-fold cross-validation method.
  • 17. Preprocessing phases
    – The data used in this work are collected from many news web sites. The data set consists of 1562 Arabic documents of different lengths belonging to 6 categories (Economic, Cultural, Political, Social, Sports, General); Table 2 (Number of Documents per Category) gives the number of documents for each category.
    – First phase: the preprocessing step, where documents are prepared to make them adequate for further use; stop-word removal and rearranging the document contents are some of the steps in this phase.
  • 18. Continue
    – Second phase: the weighting assignment phase, defined as the assignment of a real number between 0 and 1 to each keyword; this number indicates the importance of the keyword inside the document. Many methods have been developed, and the most widely used model is the tf-idf weighting factor. The weight of each keyword is computed by multiplying the term frequency (tf) with the inverse document frequency (idf), where:
      – f_ik = occurrences of term t_k in document D_i;
      – tf_ik = f_ik / max_l(f_il), the normalized frequency of the term in the document;
      – df_k = number of documents that contain t_k;
      – idf_k = log(d / df_k), where d is the total number of documents;
      – w_ik = tf_ik × idf_k, the term weight; the computed w_ik is a real number ∈ [0, 1].
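The weighting formulas above can be sketched as follows (a minimal illustration, not the authors' implementation; base-10 logarithm is an assumption, since the slide does not state the base).

```python
import math

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using tf_ik = f_ik / max_l(f_il) and idf_k = log10(d / df_k)."""
    d = len(docs)
    df = {}                                   # document frequency of each term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        f = {}                                # raw term counts in this document
        for t in doc:
            f[t] = f.get(t, 0) + 1
        max_f = max(f.values())
        weights.append({t: (f[t] / max_f) * math.log10(d / df[t]) for t in f})
    return weights
```

A term that appears in every document gets idf = 0 and therefore weight 0, reflecting the slide's point that such terms do not discriminate between categories.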
  • 20. Classifier Naive Bayesian (CNB)
    – Bayesian learning is a probability-driven algorithm; it is based on Bayes' probability theorem and is highly recommended in text classification. The Naive Bayesian can often outperform more sophisticated classification methods. The classifier's task is to categorize incoming objects into their appropriate class.
    – P(class | document): the probability of a class given a document, i.e. the probability that a given document D belongs to a given class C; that is our target.
    – P(document): the probability of a document; notice that P(document) is a constant divisor in every calculation, so we can ignore it.
    – P(class): the probability of a class (or category); we can compute it as the number of documents in the category divided by the number of documents in all categories.
    – P(document | class): the probability of a document given a class. Documents can be modeled as sets of words, so P(document | class) can be written in terms of P(word_i | C): the probability that the i-th word occurs in a document from class C.
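A minimal classifier along these lines might look as follows (a sketch, not the authors' code; add-one smoothing is an added assumption, used here only to avoid zero probabilities for unseen words).

```python
import math
from collections import Counter

class NaiveBayesText:
    """Minimal multinomial Naive Bayes sketch for token-list documents."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        # P(class): documents in the category / documents in all categories
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def log_score(c):
            total = sum(self.word_counts[c].values())
            s = math.log(self.prior[c])          # log P(class)
            for w in doc:
                # log P(word_i | C) with add-one smoothing; P(document)
                # is a constant divisor and is ignored, as the slide notes.
                s += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            return s
        return max(self.classes, key=log_score)
```

Working in log space avoids numeric underflow when multiplying many small word probabilities.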
  • 22. Classifier K-Nearest Neighbor (CK-NN)
    – K-Nearest Neighbor is a widely used text classifier, especially in text mining, because of its simplicity and efficiency.
    – It is a supervised learning algorithm where a new occurrence query is classified based on the K-nearest-neighbor category measurement.
    – Its training phase consists of nothing more than storing all training examples as the classifier.
    – It works based on the minimum distance from the query instance to the training samples to determine the K nearest neighbors.
    – After collecting the K nearest neighbors, we take a simple majority vote among them to be the prediction for the query instance.
    – The CK-NN algorithm works with several multivariate attributes X_i that will be used to classify the object Y. We deal only with quantitative X_i and binary (nominal) Y.
  • 23. Continue
    – Example: suppose the K factor is set to 8 (there are 8 nearest neighbors) as a parameter of the algorithm. The distance between the query instance and all the training samples is then computed, using only quantitative X_i.
    – A training sample is included among the nearest neighbors if its distance to the query is less than or equal to the Kth smallest distance; the distances of all training samples to the query are sorted and the Kth is determined as the cut-off distance.
    – The unknown sample is assigned the most common class among its K nearest neighbors.
    – Closeness is defined in terms of Euclidean distance, where the Euclidean distance between two points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is d(X, Y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²).
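The procedure above (sort by Euclidean distance, keep the K closest, majority vote) can be sketched as follows; this is an illustrative sketch, not the authors' implementation.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, samples, k):
    """samples: list of (vector, label) pairs. Returns the majority label
    among the k training samples nearest to the query vector."""
    nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Choosing an odd k reduces the chance of ties in the two-class case.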
  • 25. Implementation
    – The implementation of the proposed Deep Web Text Classifier (DWTC) demonstrates the importance of classifying the Latin text of web documents for information retrieval, illustrating both the keyword extraction and the text classifiers used.
    – Keyword extraction: text web documents are scanned to find the keywords, and each one is normalized. The normalization process consists of removing stop words, punctuation marks and non-letters, as shown in Table 3 (examples of stop words and non-letters).
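The normalization step described above (lowercasing, stripping punctuation and non-letters, removing stop words) might be sketched as follows; the stop-word list here is a small assumed sample, and the regex keeps Latin letters only.

```python
import re

STOP_WORDS = {"the", "is", "a", "in", "on", "from", "and", "of"}  # assumed sample

def extract_keywords(text):
    """Normalize a document: lowercase, keep only runs of Latin letters
    (dropping punctuation and other non-letters), then remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list is what the weighting phase would then score.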
  • 26. Continue
    – Term weighting: there are two criteria. First, the more often a term occurs in documents belonging to some category, the more relevant it is to that category. Second, the more a term appears across documents representing different categories, the less useful it is for discriminating between documents belonging to different categories.
    – In this implementation we used the commonly used normalized tf×idf approach to overcome the problem of varying document lengths.
    – Algorithm implementation: this implementation was mainly developed for testing the effectiveness of the CK-NN and CNB algorithms when applied to Latin text.
    – A set of labeled text documents is supplied to the system; the labels indicate the class or classes each text document belongs to. All documents in the data set should be labeled in order to train the system and then test it.
    – The system sees the labels of the training documents but not those of the test set.
    – The system compares the two classifiers and reports the classifier with the higher accuracy for the current labeled text documents.
    – The system compares these results, selects the best average accuracy rate for each classifier, and uses the greater of the two; the higher rate is chosen to start the retrieval process.
  • 28. Results and Discussion
    – The data used in this work are collected from many web sites. The data set consists of 3533 Latin and non-Latin text documents of different lengths belonging to the 6 categories of Table 2 (Economic, Cultural, Political, Social, Sports and General).
    – To test the system, the documents in the data set were preprocessed to find the main categories.
    – Various splitting percentages were used to see how the number of training documents impacts classification effectiveness.
    – Different k values, from 1 up to 20, were used to find the best results for CK-NN; effectiveness started to decline at k > 15.
    – A comparison between the two algorithms was made after labeling the sample data; the selected classifier is indicated in the system.
  • 29. Continue
    – The k-fold cross-validation method is used to test the accuracy of the system; our results are roughly in line with other developers' results.
    – The results of the conducted experiments are included in the last columns of Table 4 and Table 5.
    – The results indicate that the Classifier K-Nearest Neighbors (CK-NN), with an average of 93.08%, performed better than the Classifier Naïve Bayesian, which had 90.03%, on Latin text.
    – This means that in this case the DWTC system will use CK-NN for Latin text classification and extraction instead of CNB.
    – For non-Latin text, the DWTC system will use CNB, which had an average of 91.05% for non-Latin classification and extraction, instead of CK-NN, with an average of 88.06%, as indicated in Table 5.
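The k-fold evaluation used to compare the two classifiers can be sketched as follows (an illustrative sketch, not the authors' code; `classify_factory` is a hypothetical helper that trains a classifier on the given training split and returns its predict function).

```python
def kfold_accuracy(docs, labels, classify_factory, k=10):
    """Average accuracy over k folds. classify_factory(train) receives a list
    of (doc, label) pairs and returns a predict(doc) function."""
    folds = [list(range(i, len(docs), k)) for i in range(k)]
    accs = []
    for fold in folds:
        train = [(docs[i], labels[i]) for i in range(len(docs)) if i not in fold]
        predict = classify_factory(train)
        correct = sum(predict(docs[i]) == labels[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)
```

Running this once per classifier and keeping the one with the higher average mirrors the selection step the slides describe.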
  • 31. Conclusion
    – An evaluation of the use of the Classifier K-Nearest Neighbor (CK-NN) and the Classifier Naïve Bayes (CNB) for the classification of Arabic text was considered.
    – A special corpus was developed, consisting of 3533 documents belonging to 6 categories.
    – A feature set of extracted keywords and term weights was used in order to improve performance.
    – As a result, we applied the two classification algorithms to the text documents with a satisfactory number of patterns for each category.
    – Accuracy was measured using the k-fold cross-validation method.
    – We proposed an empirical Latin and non-Latin text classifier system called the Deep Web Text Classifier (DWTC).
    – The system compares the results between the two classifiers used (CK-NN, CNB) and selects the best average accuracy rate for Latin or non-Latin text.