Using Some Web Content Mining
   Techniques for Arabic Text
         Classification
         Presented by:

                       Zakaria Suliman Zubi
                        Assistant Professor
                    Computer Science Department
                         Faculty of Science
                        Altahadi University
                            Sirte, Libya




Contents
•   Introduction to Web mining.
•   Web Mining Basics.
•   Web Content Mining: Definition.
•   Opportunities and Challenges.
•   Web Content Mining Subtasks.
•   Related Work.
•   Text Mining for Web Documents.
•   Multilingual Web Mining.
•   Arabic Language-Independent Specification.
•   Arabic Text Classification.
•   Arabic Corpora.
•   Methods and Algorithms.
•   Implementation of the ATC .
•   Results and Discussion.
•   Conclusion.
•   References.
Introduction to Web mining

• The World Wide Web is a rich source of knowledge that can be
  useful to many applications.

   – Source?
      • Billions of web pages and billions of visitors and
        contributors.

   – What knowledge?
      • e.g., the hyperlink structure and a variety of languages.

   – Purpose?
      • To improve users’ effectiveness in searching for
        information on the web.
      • Decision-making support or business management.
Continue

• Web’s Characteristics:
  – Large size
  – Unstructured
  – Different data types: text, image, hyperlinks and user usage
    information
  – Dynamic content
  – Time dimension
  – Multilingual

• Data Mining (DM) is a significant subfield of this area.

• The various activities and efforts in this area are referred to as
  Web Mining.
Web Mining Basics

  Web mining is the application of data mining techniques to
  extract knowledge from Web content, structure, and usage.

  Web Mining
    – Web Content Mining: Text, Image, Audio, Video, Structured Records
    – Web Structure Mining: Hyperlinks, Document Structure
    – Web Usage Mining: Web Server Logs, Application Level Logs,
      Application Server Logs
Web Content Mining: Definition
•   “Web Content Mining is the process of extracting
    useful information from the contents of Web
    documents. It may consist of text, images, audio,
    video, or structured records such as lists and tables.”

• “Web Content mining refers to the overall process of
  discovering potentially useful and previously unknown
  information or knowledge from the Web data.”


Opportunities and Challenges
 The Web offers a unique opportunity and challenge to data mining:
 • The amount of information on the Web is huge, and easily
   accessible.
 • The coverage of Web information is very wide and diverse. One
   can find information about almost anything.
 • Information/data of almost all types exist on the Web, e.g.,
   structured tables, texts, multimedia data, etc.
 • Much of the Web information is semi-structured due to the
   nested structure of HTML code.
 • Much of the Web information is linked. There are hyperlinks
   among pages within a site, and across different sites.
 • Much of the Web information is redundant. The same piece of
   information, or its variants, may appear in many pages.


Cont…

• The Web is noisy. A Web page typically contains a mixture of
  many kinds of information, e.g., main contents, advertisements,
  navigation panels, copyright notices, etc.
• The Web is also about services. Many Web sites and pages
  enable people to perform operations with input parameters, i.e.,
  they provide services.
• The Web is dynamic. Information on the Web changes
  constantly. Keeping up with the changes and monitoring the
  changes are important issues.
• Above all, the Web is a virtual society. It is not only about data,
  information and services, but also about interactions among
  people, organizations and automatic systems, i.e., communities.
Web Content Mining Subtasks


• Resource finding
   – Retrieving intended documents
• Information selection/pre-processing
   – Select and pre-process specific information from
     selected documents
• Generalization
   – Discover general patterns within and across web
     sites
• Analysis
   – Validation and/or interpretation of mined patterns
Related Work

There is a large body of research on classifying English documents. In addition to English,
there are many studies on European languages such as German, Italian, and Spanish, as well
as Asian languages such as Chinese and Japanese. For Arabic, only four studies are listed,
as follows:

 1.    El-Kourdi et al. use the Naïve Bayes algorithm for automatic Arabic document
       classification. The average accuracy reported was about 68.78%.

 2.    Siraj from Sakhr. The system is available at siraj.sakhr.com, but it has no
       technical documentation explaining the method used or the accuracy of the
       system.

 3.    Sawaf et al. use statistical classification methods such as maximum entropy
       to classify and cluster news articles. The best classification accuracy they reported
       was 62.7%, with a precision of 50%, which is very low for this field.

 4.    El-Halees describes a method based on association rules to classify Arabic
       documents. The classification accuracy reported was 74.41%.
Text Mining for Web Documents
• Text mining is often considered a sub-field of data mining and refers to the
  extraction of knowledge from text documents. These documents could be
  located on the web.
• Text mining for web documents is considered a sub-field of web mining, or,
  more specifically, of web content mining. Information extraction, text
  classification, and text clustering are examples of text-mining applications
  that have been applied to web documents.
• Most web documents are text documents in the form of HTML web pages,
  in which we can identify useful tags and then extract useful information from
  the web.
• In most cases web documents consist of unstructured data, which sometimes
  makes the data in web documents noisy.
• Some business applications are used frequently to extract English text
  patterns from web pages, for instance FlipDog, developed by WhizBang! Labs,
  and Lencom software.

Continue

• In some applications the HTML tags are simply stripped from
  the web documents, and traditional algorithms are then applied
  to perform text classification and clustering.
• This creates a unique problem for text classification
  and clustering: useful characteristics of web page design, such as
  hyperlinks, are ignored, while navigation text such as “Home,”
  “Click here,” and “Contact us” is included among a
  document’s features.
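
The tag-stripping problem described above can be illustrated with a minimal Python sketch. It uses only the standard-library `html.parser` module; the `TextExtractor` class, the `strip_tags` helper, and the sample page are hypothetical names and data introduced for illustration, not part of the ATC system.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every text run and discards the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the text runs of an HTML string in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up page: one content paragraph between two navigation links.
page = (
    '<html><body><a href="/">Home</a>'
    "<p>Oil prices rose sharply today.</p>"
    '<a href="/contact">Contact us</a></body></html>'
)
print(strip_tags(page))
```

After stripping, the link text "Home" and "Contact us" sits alongside the real content, while the information that these strings were hyperlinks is gone, which is the issue the slide raises.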




Multilingual Web Mining

•   The number of non-English documents on the web continues to grow: more
    than 30 percent of web pages are in a language other than English.

• In order to extract knowledge from non-English web pages, web mining
  systems have to deal with issues in language-specific text processing, also
  called the machine learning process or learning process.

• Text classification and clustering need a set of features (a vector of
  keywords) to start the learning process.

• The algorithms regularly depend on some phrase segmentation and
  extraction programs to generate a set of features or keywords to represent
  web documents.

• Many existing extraction programs, especially those employing a linguistic
  approach, are language-dependent and work only with English texts.
Continue
•   In order to perform analysis on non-English documents, web mining systems
    must use the corresponding phrase extraction program for each language.

•   Some segmentation and extraction programs are language-independent by
    using a machine learning approach.

•   The most common approaches to categorizing and classifying documents are
    K-Nearest Neighbor (K-NN) and Naïve Bayes.

•   These algorithms were tested on non-English documents as a language-
    independent technique for key phrase extraction and have shown promising
    results in many studies.

•   These algorithms do not rely on specific linguistic rules, so they can be easily
    modified to work with different languages as well.

•   We tried to identify the most relevant text classification algorithms for
    Arabic text. These algorithms will be evaluated on a data set in an Arabic
    corpus.
Arabic Language-Independent
             Specification
• A language-independent specification is a programming
  language specification that provides a common interface
  usable for defining semantics applicable to
  arbitrary language bindings.

• The rules that can be applied to any new language will be added
  to the system.

• The language-specific resources are morphological
  or subject domain-specific dictionaries, linguistic grammar
  rules, etc.

•   It keeps the application modular by storing the necessary
    language-specific resources outside the rules, in language-
    specific parameter files.
Continue

• It allows new languages to be plugged into the system
  more easily.

• It cannot be done without the use of bottom-up, data-driven
  bootstrapping methods to create the monolingual resources.

•   It avoids language pair-specific resources and procedures,
    because the almost exponential growth in the number of language pairs would
    automatically limit the number of languages a system can deal
    with.


Arabic Text Classification
Arabic is the mother tongue of more than 300 million people
  worldwide. It has the following characteristics:
    • A non-Latin alphabet.
    • Writing runs from right to left.
    • The Arabic alphabet consists of 28 letters.
    • Arabic words have two genders, feminine and masculine.
    • A very sophisticated grammar and transliteration.

    • Classifying Arabic text differs from classifying English text
      because Arabic is a highly inflectional and derivational language, which
      makes morphological analysis a very complex task.

    • Arabic script does not use capitalization for proper nouns, a cue that is
      useful in document classification.

    • In text classification we treat each document as a bag of words, representing
      the text as a vector of weighted frequencies, one for each individual
      word or token.
Continue
There are several efforts to improve text representation using concepts or multi-word
     terms in the Arabic language. They are the four studies listed under Related Work:
     El-Kourdi et al. (Naïve Bayes, about 68.78% average accuracy), Siraj from Sakhr
     (no technical documentation available), Sawaf et al. (maximum entropy, 62.7%
     accuracy with 50% precision), and El-Halees (association rules, 74.41% accuracy).

    In this paper we evaluate some recently reported average accuracies
     from studies that use the same classifiers we used in our proposed system.
Arabic Corpora

•   A corpus of Arabic text documents was collected from online
    Arabic newspaper archives, including Al-Jazeera, Al-Nahar,
    Al-Hayat, Al-Ahram, and Al-Dostor, as well as a few other
    specialized websites.
     • A small corpus was prepared, consisting of 1562
       documents belonging to 6 different categories.
     • Arabic morphology and text classification depend on the contents of
       documents.
     • A huge number of features or keywords can be found in Arabic text, such
       as the many morphemes that may be generated from a single root, which may
       lead to poor performance in terms of both accuracy and time.
     • We reduced the number of extracted features by focusing on
       unvowelized keywords.
     • All documents, whether training documents or documents to be
       classified, went through a preprocessing phase that removed punctuation
       marks, stop words, diacritics, and non-letters.

Methods and Algorithms
The text classification problem is composed of several subproblems:
Document indexing: Document indexing concerns how a document's keywords are
     extracted. There are two main approaches: the first treats index terms as
     bags of words, and the second treats index terms as phrases.

Weight assignment: Weight assignment techniques associate a real number
    between 0 and 1 with each document term. These weights are required to
    classify newly arrived documents.

Learning-based text classification algorithm: The text classification algorithm used
    is an inductive learning algorithm based on probability theory. Different
    models were emphasized, such as Naive Bayesian models (which generally show
    good results and are widely used in text classification). Other text classification
    methods have emerged to categorize documents, such as K-Nearest
    Neighbor (KNN), which computes the distances between the document index
    terms and the known terms of each category. The accuracy will be tested by
    the k-fold cross-validation method.
Preprocessing phases
First phase: the preprocessing step, in which documents are prepared
to make them adequate for further use. Stop-word removal and rearrangement
of the document contents are some of the steps in this phase.

The data used in this work were collected from many Arabic sites. The data set
consists of 1562 Arabic documents of different lengths that belong to 6
categories: Economic "اقتصادية", Cultural "ثقافة", Political "سياسية",
Social "اجتماعية", Sports "رياضة", and General "عامة". Table 1 gives the
number of documents for each category.
Continue
Second phase: the weighting assignment phase. It is defined as
the assignment of a real number between 0 and 1 to each
keyword; this number indicates the importance of the
keyword inside the document.
Many methods have been developed, and the most widely used model is the tf-idf
weighting factor. The weight of each keyword is computed by multiplying the
term frequency (tf) by the inverse document frequency (idf), where:
$f_{ik}$ = number of occurrences of term $t_k$ in document $D_i$.
$tf_{ik} = f_{ik} / \max_l f_{il}$ is the normalized frequency of term $t_k$ in document $D_i$.
$df_k$ = number of documents that contain $t_k$.
$idf_k = \log(d / df_k)$, where $d$ is the total number of documents.
$w_{ik} = tf_{ik} \cdot idf_k$ is the term weight; the computed $w_{ik}$ is a real number in $[0, 1]$.
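
The weighting definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the ATC implementation: the function name `tfidf_weights` and the toy documents are hypothetical, and the sketch assumes the natural logarithm, with which the weights are not automatically confined to [0, 1] (a further rescaling step would be needed for that).

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    d = len(docs)                                  # total number of documents
    counts = [Counter(doc) for doc in docs]        # f_ik for every document
    df = Counter()                                 # df_k: documents containing t_k
    for c in counts:
        df.update(c.keys())
    weights = []
    for c in counts:
        max_f = max(c.values()) if c else 1        # max_l f_il
        weights.append({
            term: (f / max_f) * math.log(d / df[term])   # tf_ik * idf_k
            for term, f in c.items()
        })
    return weights


# Toy corpus of already tokenized documents (illustrative only).
toy_docs = [["oil", "price", "oil"], ["match", "goal"], ["oil", "goal"]]
toy_weights = tfidf_weights(toy_docs)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to discrimination between categories.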

Classifier Naive Bayesian (CNB)
Bayesian learning is a probability-driven algorithm based on Bayes' probability
theorem. It is highly recommended in Arabic text classification. The Naive
Bayesian classifier can often outperform more sophisticated classification
methods. The classifier's task is to categorize incoming objects into their
appropriate class.

P(class | document): the probability of a class given a document, that is, the
probability that a given document D belongs to a given class C. This is our
target.
P(document): the probability of a document. Note that P(document) is a
constant divisor in every calculation, so we can ignore it.
P(class): the probability of a class (or category). We can compute it as the
number of documents in the category divided by the number of documents in all
categories.
P(document | class): the probability of a document given a class.

A document can be modeled as a set of words, so P(document | class) can be
written in terms of its words, where:
p(word_i | C): the probability that the i-th word of a given document occurs in
a document from class C. This probability is calculated from the training
documents of class C.
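
The quantities defined above can be combined into a small classifier sketch. It scores each class by log P(class) plus the sum of log p(word_i | C), ignoring P(document) as the slide suggests. The slide's own formula for p(word_i | C) is not preserved in the text, so the Laplace (add-one) smoothing used here is an assumption of this sketch; the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    class_docs = Counter(labels)             # number of documents per class
    word_counts = defaultdict(Counter)       # word occurrences per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify_nb(tokens, model):
    """Return the class with the highest log P(class) + sum of log p(word | class)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_class, best_score = None, -math.inf
    for c, n_docs in class_docs.items():
        total_words = sum(word_counts[c].values())
        score = math.log(n_docs / total_docs)            # log P(class)
        for w in tokens:
            # Laplace-smoothed p(word | class): an assumption, not from the slides.
            p = (word_counts[c][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy training data (illustrative only).
nb_model = train_nb(
    [["goal", "match"], ["match", "team"], ["price", "oil"], ["oil", "market"]],
    ["sports", "sports", "economy", "economy"],
)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.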




Classifier K-Nearest Neighbor (CK-NN)
•    K-Nearest Neighbor is a widely used text classifier, especially in non-English
     text mining, because of its simplicity and efficiency.

•    It is a supervised learning algorithm in which a new query instance
     is classified based on the categories of its K nearest neighbors.

•    Its training phase consists of nothing more than storing all training examples as
     the classifier.

•    It uses the minimum distance from the query instance to the training
     samples to determine the K nearest neighbors.

•    After gathering the K nearest neighbors, we take a simple majority vote among
     them as the prediction for the query instance.

•    The CK-NN algorithm works with several multivariate attributes Xi that are
     used to classify the object Y. We deal only with quantitative Xi and
     binary (nominal) Y.
Continue
• Example: Suppose that K is set to 8 (there are 8 nearest neighbors) as a
  parameter of the algorithm. The distance between the query instance and
  every training sample is then computed, which is possible because all Xi
  are quantitative.
  A training sample is included among the nearest neighbors if its distance
  to the query is less than or equal to the Kth smallest distance. In other
  words, the distances from all training samples to the query are sorted and
  the Kth smallest distance is used as the cutoff.
• The unknown sample is assigned the most common class among its K nearest
  neighbors.
• The K nearest neighbors are the K training samples closest to the unknown
  sample. Closeness is defined in terms of Euclidean distance, where the
  Euclidean distance between two points X = (x1, x2,...,xn) and Y = (y1, y2,...,yn) is:

  $$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
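
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are taken to be already converted to numeric vectors (for example tf-idf weights), the names `euclidean` and `knn_classify` and the toy points are hypothetical, and the sketch keeps exactly K neighbors rather than also admitting samples tied at the Kth distance.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_classify(query, samples, labels, k):
    """samples: list of numeric vectors; labels: parallel list of class labels."""
    # Sort all training samples by distance to the query (no training step needed).
    ranked = sorted(zip(samples, labels), key=lambda s: euclidean(query, s[0]))
    # Simple majority vote among the k closest samples.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional training set (illustrative only).
knn_samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
knn_labels = ["a", "a", "a", "b", "b", "b"]
```

Because every query is compared with every stored sample, prediction cost grows with the size of the training set, which is the price of the trivial training phase.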



Implementation of the ATC


The proposed Arabic Text Classifier (ATC) demonstrates the
importance of classifying Arabic text in web documents for
information retrieval. It covers both keyword extraction and the text
classifiers used in the algorithm implementation:
     Keyword extraction: Web text documents are scanned to find the keywords,
      and each keyword is normalized. Normalization consists of removing stop
      words and punctuation marks and unifying Arabic letter forms: hamza in all
      its forms (إ, أ, آ) is normalized to alef (ا) only, ى is replaced
      by ي, and ة by ه. In addition, any Maad (tatweel) character "ـ" is removed
      and the keyword's root or stem is found. In this work a light stemmer is
      used to find the requested keywords.
     Some stop words:
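
The normalization steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ATC code: the four-word stop list (fi, min, ila, ala) is a hypothetical stand-in for the list on the original slide, the diacritic range U+064B to U+0652 is assumed to cover the marks to strip, and the light stemming step is omitted. Characters are written as Unicode escapes so the mapping is unambiguous.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")     # fathatan .. sukun
TATWEEL = "\u0640"                              # elongation character
ARABIC_WORD = re.compile("[\u0621-\u064A]+")   # runs of Arabic letters


def normalize(text):
    """Apply the letter normalizations listed on the slide."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = re.sub("[\u0625\u0623\u0622]", "\u0627", text)  # hamza/madda alef forms -> alef
    text = text.replace("\u0649", "\u064A")                 # alef maksura -> yeh
    text = text.replace("\u0629", "\u0647")                 # teh marbuta -> heh
    return text


# Hypothetical stop words (fi, min, ila, ala), normalized the same way as the text.
STOP_WORDS = {
    normalize(w)
    for w in ("\u0641\u064A", "\u0645\u0646", "\u0625\u0644\u0649", "\u0639\u0644\u0649")
}


def extract_keywords(text):
    """Normalize, tokenize on Arabic letter runs, and drop stop words."""
    tokens = ARABIC_WORD.findall(normalize(text))
    return [t for t in tokens if t not in STOP_WORDS]
```

Normalizing the stop list with the same function matters: after normalization the word ila no longer matches its original spelling, so an unnormalized stop list would silently miss it.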




Continue
 Term weighting: There are two criteria:
     • First criterion: the more often a term occurs in documents that belong to
       some category, the more relevant it is to that category.
     • Second criterion: the more a term appears in different documents representing
       different categories, the less useful it is for discriminating between
       documents belonging to different categories.
     • In this implementation we used the common normalized
       tf×idf approach to overcome the problem of varying document lengths.
 Algorithm implementation: this implementation was developed mainly to test the
  effectiveness of the CK-NN and CNB algorithms when applied to Arabic text.
     • A set of labeled text documents is supplied to the system. The labels
       indicate the class or classes that each text document belongs to. All documents
       in the data set must be labeled in order to train the system and then test it.
     • The system is given the labels of the training documents but not those of the
       test set.
     • The system compares the two classifiers and reports the classifier with the
       higher accuracy for the current labeled text documents.
 The system compares the average accuracy rates of the two classifiers and uses
  the classifier with the higher rate in the ATC system to start the retrieval
  process.
Results and Discussion
• The system used a data set consisting of 1562 Arabic documents of
  different lengths that belong to 6 categories: Economic "اقتصادية",
  Cultural "ثقافة", Political "سياسية", Social "اجتماعية", Sports "رياضة",
  and General "عامة".
• To test the system, the documents in the data set were preprocessed
  to find the main categories.

• Various splitting percentages were used to see how the number of
  training documents affects classification effectiveness.

• Different k values, from 1 up to 20, were used to
  find the best results for CK-NN. Effectiveness started to decline at
  k > 15.
• The two algorithms were compared on labeled sample data; the
  classifier selected by the system also depends on the training data.
Continue

• The k-fold cross-validation method is used to test the accuracy of the
  system.
• Our results are roughly in line with those of other developers. The
  results indicate that Classifier K-Nearest Neighbor (CK-NN), with an
  average accuracy of 86.02%, performs better than Classifier Naïve Bayesian
  (CNB), with 77.03%.
• The ATC system will therefore use CK-NN for Arabic text
  classification and extraction instead of CNB. This does not mean that the
  ATC system will never use CNB as an Arabic text classifier: in some
  cases, such as multilingual text, CNB shows a higher average
  accuracy than other classification algorithms.
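
The k-fold cross-validation mentioned above can be sketched as follows. The slides do not state the value of k or how the folds were drawn, so the interleaved split here is an assumption; `k_fold_accuracy`, `train_majority`, and `predict_majority` are hypothetical names, and the majority-class baseline only stands in for a real classifier such as CK-NN or CNB.

```python
from collections import Counter


def k_fold_accuracy(docs, labels, k, train, predict):
    """Average accuracy over k folds; assumes len(docs) >= k."""
    n = len(docs)
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))       # every k-th item forms this fold
        train_docs = [docs[i] for i in range(n) if i not in test_idx]
        train_labels = [labels[i] for i in range(n) if i not in test_idx]
        model = train(train_docs, train_labels)
        correct = sum(predict(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_majority(docs, labels):
    """Toy baseline: remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, doc):
    """Toy baseline: always predict the remembered label."""
    return model


# Ten placeholder documents, eight labeled "a" and two labeled "b".
cv_score = k_fold_accuracy(
    list(range(10)), ["a"] * 8 + ["b"] * 2, 5, train_majority, predict_majority
)
```

Running the same function once per classifier and comparing the two averages mirrors the comparison the ATC system is described as making.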




Conclusion
• The use of Classifier K-Nearest Neighbor (CK-NN) and
  Classifier Naïve Bayes (CNB) for classifying Arabic text was evaluated.

• A special corpus was developed, consisting of 1562 documents that belong to 6
  categories.

• An extracted feature set of keywords and term weighting were used to improve
  performance.

• We applied two algorithms to classify the Arabic text, with a
  satisfactory number of patterns for each category.

• The accuracy of the system was measured with the k-fold cross-validation
  method.

• We proposed an empirical Arabic text classifier system called ATC.

• The ATC compares the results of both classifiers (CK-NN, CNB)
  and selects the one with the best average accuracy.
References
[1] Aas K. and Eikvil L. (1999). Text Categorisation: A survey. Technical report, Norwegian
    Computing Center.
[2] Alexandrov M., Gelbukh A. and Lozovo. (2001). Chi-square Classifier for Document
    Categorization. 2nd InternationalConference on Intelligent Text Processing and
    Computational Linguistics, Mexico City.
[3] Al-Shalabi R. and Evens M. (1998). A Computational Morphology System for Arabic.
    Workshop on Semitic Language Processing. COLING-ACL’98, University of Montreal,
    Montreal, PQ, Canada.pp. 66-72.
[4] Breiman, L., J.H. Friedman, R.A. Olshen and C.J. Stone, 1984. Classification and
    Regression Trees. 1st Edn., Chapman Hall, New York, ISBN: 0-412-04841-8.
[5] Bordogna, G. Pasi, G. , A user-adaptive indexing model of structured documents, Fuzzy
    Systems, 2001. The 10th IEEE International Conference on 2-5 Dec. 2001, Volume: 2, On
    page(s): 984- 989 vol.3,
    ISBN: 0-7803-7293-X.
[6] Ciravegna F., Gilardoni L., Lavelli A., Ferraro M., Mana N., Mazza, S., Matiasek J., Black
    W. and Rinaldi F. (2000). Flexible Text Classification for Financial Applications: the
    FACILE System. In Proceedings of PAIS-2000, Prestigious Applications of Intelligent
    Systems sub-conference of ECAI2000.
[7] Chang, C. and Lui, S. 2001. IEPAD: information extraction based on pattern discovery. In
    Proceedings of the 10th international Conference on World Wide Web (Hong Kong, Hong
    Kong, May 01 - 05, 2001). WWW '01. ACM, New York, NY, 681-688. DOI=
    http://doi.acm.org/10.1145/371920.372182
                                                                           46
Thank you !!!
           47
48

Arabic Text mining Classification

  • 1. Using Some Web Content Mining Techniques for Arabic Text Classification. Presented by: Zakaria Suliman Zubi, Assistant Professor, Computer Science Department, Faculty of Science, Altahadi University, Sirte, Libya.
  • 2. Contents
      • Introduction to Web Mining.
      • Web Mining Basics.
      • Web Content Mining: Definition.
      • Opportunities and Challenges.
      • Web Content Mining Subtasks.
      • Related Work.
      • Text Mining for Web Documents.
      • Multilingual Web Mining.
      • Arabic Language-Independent Specification.
      • Arabic Text Classification.
      • Arabic Corpora.
      • Methods and Algorithms.
      • Implementation of the ATC.
      • Results and Discussion.
      • Conclusion.
      • References.
  • 3. Introduction to Web Mining
      • The World Wide Web is a rich source of knowledge that can be useful to many applications.
          – Source? Billions of web pages and billions of visitors and contributors.
          – What knowledge? E.g., the hyperlink structure and a variety of languages.
          – Purpose? To improve users' effectiveness in searching for information on the web, and to support decision-making or business management.
  • 4. Introduction to Web Mining (continued)
      • The Web's characteristics:
          – Large size
          – Unstructured
          – Different data types: text, images, hyperlinks and user usage information
          – Dynamic content
          – Time dimension
          – Multilingual
      • Data mining (DM) is a significant subfield of this area.
      • The various activities and efforts in this area are referred to as Web mining.
  • 6. Web Mining Basics
      • Web mining is the application of data mining techniques to extract knowledge from Web content, structure, and usage.
      • Web mining taxonomy:
          – Web Content Mining: Text, Image, Audio, Video, Structured Records
          – Web Structure Mining: Hyperlinks, Document Structure
          – Web Usage Mining: Web Server Logs, Application Level Logs, Application Server Logs
  • 8. Web Content Mining: Definition
      • "Web Content Mining is the process of extracting useful information from the contents of Web documents. It may consist of text, images, audio, video, or structured records such as lists and tables."
      • "Web Content mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data."
  • 10. Opportunities and Challenges
      The Web offers a unique opportunity and challenge to data mining:
      • The amount of information on the Web is huge and easily accessible.
      • The coverage of Web information is very wide and diverse. One can find information about almost anything.
      • Information/data of almost all types exist on the Web, e.g., structured tables, texts, multimedia data, etc.
      • Much of the Web information is semi-structured due to the nested structure of HTML code.
      • Much of the Web information is linked. There are hyperlinks among pages within a site, and across different sites.
      • Much of the Web information is redundant. The same piece of information or its variants may appear in many pages.
  • 11. Opportunities and Challenges (continued)
      • The Web is noisy. A Web page typically contains a mixture of many kinds of information, e.g., main contents, advertisements, navigation panels, copyright notices, etc.
      • The Web is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.
      • The Web is dynamic. Information on the Web changes constantly. Keeping up with the changes and monitoring them are important issues.
      • Above all, the Web is a virtual society. It is not only about data, information and services, but also about interactions among people, organizations and automatic systems, i.e., communities.
  • 13. Web Content Mining Subtasks
      • Resource finding: retrieving the intended documents.
      • Information selection/pre-processing: selecting and pre-processing specific information from the selected documents.
      • Generalization: discovering general patterns within and across web sites.
      • Analysis: validation and/or interpretation of the mined patterns.
  • 15. Related Work
      Much research addresses the classification of English documents. There is also considerable work on European languages such as German, Italian and Spanish, and on Asian languages such as Chinese and Japanese. For Arabic, only four studies are listed, as follows:
      1. El-Kourdi et al. used the naïve Bayes algorithm for automatic Arabic document classification. The average accuracy reported was about 68.78%.
      2. Siraj from Sakhr. The system is available at siraj.sakhr.com, but it has no technical documentation explaining the method used or the accuracy of the system.
      3. Sawaf et al. used statistical classification methods such as maximum entropy to classify and cluster news articles. The best classification accuracy they reported was 62.7%, with a precision of 50%, which is a very low precision in this field.
      4. El-Halees described a method based on association rules to classify Arabic documents. The classification accuracy reported was 74.41%.
  • 17. Text Mining for Web Documents
      • Text mining is often considered a sub-field of data mining and refers to the extraction of knowledge from text documents. These documents may be located on the web.
      • Text mining for web documents is considered a sub-field of web mining or, more specifically, of web content mining. Information extraction, text classification, and text clustering are examples of text-mining applications that have been applied to web documents.
      • Most web documents are text documents in the form of HTML web pages, in which we can identify useful tags and then extract useful information.
      • In most cases web documents consist of unstructured data, which sometimes makes the data in them noisy.
      • Some business applications are frequently used to extract English text patterns from web pages, for instance FlipDog, developed by WhizBang Labs, and Lencom software.
  • 18. Text Mining for Web Documents (continued)
      • In some applications the HTML tags are simply stripped from the web documents, and traditional algorithms are then applied to perform text classification and clustering.
      • This creates a unique problem for text classification and clustering: useful characteristics of web page design, such as hyperlinks, are ignored, while navigation text such as "Home", "Click here", and "Contact us" is included among a document's features.
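      The tag-stripping step described on this slide can be sketched as follows. This is only an illustration built on Python's standard html.parser module, not the system from the presentation; the NAVIGATION_PHRASES set and the strip_html helper are hypothetical names introduced here to show how navigation text could be kept out of the feature set.

```python
from html.parser import HTMLParser

# Hypothetical list of navigation strings to drop (assumption, not from the slides).
NAVIGATION_PHRASES = {"home", "click here", "contact us"}


class TextExtractor(HTMLParser):
    """Collects visible text chunks and ignores <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_html(html, drop_navigation=True):
    """Return the text of an HTML page with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    if drop_navigation:
        # Remove chunks that consist only of a known navigation phrase.
        chunks = [c for c in chunks if c.lower() not in NAVIGATION_PHRASES]
    return " ".join(chunks)


page = ('<html><body><a href="/">Home</a>'
        '<p>Arabic text classification</p>'
        '<a href="/c">Contact us</a></body></html>')
print(strip_html(page))
```

      With drop_navigation=False the navigation strings stay in the output, which is the situation the slide warns about.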
  • 20. Multilingual Web Mining
      • The number of non-English documents on the web continues to grow: more than 30 percent of web pages are in a language other than English.
      • To extract non-English knowledge from the web, web mining systems have to deal with language-specific text processing, also called the machine learning process or simply the learning process.
      • Text classification and clustering need a set of features (a vector of keywords) to start the learning process.
      • The algorithms regularly depend on phrase segmentation and extraction programs to generate a set of features or keywords that represent web documents.
      • Many existing extraction programs, especially those employing a linguistic approach, are language-dependent and work only with English texts.
  • 21. Multilingual Web Mining (continued)
      • To perform analysis on non-English documents, web mining systems must use the corresponding phrase extraction program for each language.
      • Some segmentation and extraction programs are language-independent because they use a machine learning approach.
      • The most common approaches to categorizing and classifying documents are K-Nearest Neighbor (K-NN) and Naïve Bayes.
      • These algorithms have been tested on non-English documents as a language-independent technique for key phrase extraction and have shown promising results in many studies.
      • These algorithms do not rely on specific linguistic rules, so they can easily be modified to work with different languages.
      • We tried to identify the most relevant text classification algorithms for Arabic text. These algorithms will be evaluated on a data set from an Arabic corpus.
  • 23. Arabic Language-Independent Specification
      • A language-independent specification is a programming language specification that provides a common interface for defining semantics applicable to arbitrary language bindings.
      • Rules that can be applied to any new language are added to the system.
      • Language-specific resources include morphological or subject-domain-specific dictionaries, linguistic grammar rules, etc.
      • It keeps the application modular by storing the necessary language-specific resources outside the rules, in language-specific parameter files.
  • 24. Arabic Language-Independent Specification (continued)
      • It makes it easier to plug new languages into the system.
      • It cannot be done without bottom-up, data-driven bootstrapping methods to create the monolingual resources.
      • It avoids language-pair-specific resources and procedures, because the almost exponential growth in the number of language pairs would automatically limit the number of languages a system can deal with.
  • 26. Arabic Text Classification
      Arabic is the mother tongue of more than 300 million people worldwide and has the following characteristics:
      • A non-Latin-based alphabet.
      • Arabic is written from right to left.
      • The Arabic alphabet consists of 28 letters.
      • Arabic words have two genders, feminine and masculine.
      • A very sophisticated grammar and transliteration.
      • Classifying Arabic text differs from classifying English text because Arabic is a highly inflectional and derivational language, which makes morphological analysis a very complex task.
      • Arabic script does not use capitalization for proper nouns, a cue that is useful in document classification.
      • In text classification we treat each document as a bag of words, with the text represented as a vector of weighted frequencies for each of the individual words or tokens.
  • 27. Arabic Text Classification (continued)
      There have been several efforts to improve text representation using concepts or multi-word terms in Arabic, as follows:
      1. El-Kourdi et al. used the naïve Bayes algorithm for automatic Arabic document classification. The average accuracy reported was about 68.78%.
      2. Siraj from Sakhr. The system is available at siraj.sakhr.com, but it has no technical documentation explaining the method used or the accuracy of the system.
      3. Sawaf et al. used statistical classification methods such as maximum entropy to classify and cluster news articles. The best classification accuracy they reported was 62.7%, with a precision of 50%, which is a very low precision in this field.
      4. El-Halees described a method based on association rules to classify Arabic documents. The classification accuracy reported was 74.41%.
      • In this paper we evaluate some recently reported average accuracies from studies that use the same classifiers we used in our proposed ...
  • 29. Arabic Corpora
      • A corpus of Arabic text documents was collected from the archives of online Arabic newspapers, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram, and Al-Dostor, as well as a few other specialized websites.
      • A small corpus was prepared, consisting of 1562 documents belonging to 6 different categories.
      • Arabic morphology and text classification depend on the contents of the documents.
      • Arabic text can yield a huge number of features or keywords, such as the many morphemes that may be generated from one root, which may lead to poor performance in terms of both accuracy and time.
      • We reduced the number of extracted features by focusing on unvowelized keywords.
      • All documents, whether training documents or documents to be classified, went through a preprocessing phase that removed punctuation marks, stop words, diacritics, and non-letters.
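      The preprocessing phase described on this slide can be sketched roughly as below. This is a minimal illustration and not the presentation's implementation: the STOP_WORDS set is a tiny hypothetical list, and treating "letters" as the basic Arabic letter range U+0621 to U+063A and U+0641 to U+064A is an assumption made for the example.

```python
import re
import unicodedata

# Hypothetical, very small stop-word list used only for illustration.
STOP_WORDS = {"في", "من", "على", "عن"}

# Assumption: "letters" means the basic Arabic letters (tatweel U+0640 excluded).
NON_LETTERS = re.compile(r"[^\u0621-\u063A\u0641-\u064A]+")


def preprocess(text):
    """Return unvowelized tokens without punctuation, digits or stop words."""
    # Diacritics (short vowels, shadda, sukun) are combining marks, category Mn.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Punctuation, digits and any other non-letters become token separators.
    text = NON_LETTERS.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("كَتَبَ الطالبُ في الدفترِ، 2010!")
print(tokens)
```

      Diacritics are deleted before the non-letter pass on purpose: replacing them with spaces would split every vowelized word into single letters.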
  • 31. Methods and Algorithms
      The text classification problem is composed of several sub-problems:
      • Document indexing: document indexing concerns the way a document's keywords are extracted. There are two main approaches: the first treats index terms as bags of words, and the second treats index terms as phrases.
      • Weight assignment: weight assignment techniques associate a real number between 0 and 1 with each of a document's terms. These weights are required to classify newly arrived documents.
      • Learning-based text classification algorithm: the text classification algorithm used is an inductive learning algorithm based on probability theory. Different models have been emphasized, such as Naive Bayesian models, which often show good results and are widely used in text classification. Other text classification methods have emerged, such as K-Nearest Neighbor (KNN), which computes the distances between the document's index terms and the known terms of each category. Accuracy is tested with the K-fold cross-validation method.
  • 32. Preprocessing phases
      • The data used in this work were collected from many Arabic sites. The data set consists of 1562 Arabic documents of different lengths that belong to 6 categories: Economic "اقتصادية", Cultural "ثقافة", Political "سياسية", Social "اجتماعية", Sports "رياضة", and General "عامة". Table 1 shows the number of documents for each category.
      • First phase: the preprocessing step, in which documents are prepared for further use. Stop-word removal and rearrangement of the document contents are some of the steps in this phase.
  • 33. Preprocessing phases (continued)
      • Second phase: the weight assignment phase, defined as the assignment of a real number between 0 and 1 to each keyword. This number indicates the importance of the keyword inside the document.
      • Many methods have been developed; the most widely used model is the tf-idf weighting factor. The weight of each keyword is computed by multiplying the term frequency (tf) by the inverse document frequency (idf), where:
          – $f_{ik}$ is the number of occurrences of term $t_k$ in document $D_i$.
          – $tf_{ik} = f_{ik} / \max_l f_{il}$ is the normalized frequency of the term in the document.
          – $df_k$ is the number of documents that contain $t_k$.
          – $idf_k = \log(d / df_k)$, where $d$ is the total number of documents.
          – $w_{ik} = tf_{ik} \cdot idf_k$ is the term weight; the computed $w_{ik}$ is intended to be a real number in $[0, 1]$.
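      As an illustration of the tf-idf definitions on this slide, the sketch below computes the weights for a small collection of token lists. Two points are assumptions rather than details from the slides: the natural logarithm is used because the base is not stated, and the optional rescale_to_unit_interval helper (a hypothetical name) divides by the largest weight, since the raw product tf * idf can exceed 1 when log(d / df) is greater than 1.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    d = len(documents)
    # df_k: number of documents that contain term k.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)                        # f_ik
        max_count = max(counts.values(), default=1)  # max_l f_il
        weights.append({
            term: (f / max_count) * math.log(d / df[term])  # tf_ik * idf_k
            for term, f in counts.items()
        })
    return weights


def rescale_to_unit_interval(weights):
    """Assumption: divide by the largest weight so every value lies in [0, 1]."""
    largest = max((w for doc in weights for w in doc.values()), default=0.0)
    if largest == 0:
        return weights
    return [{t: w / largest for t, w in doc.items()} for doc in weights]


docs = [["economy", "market", "economy"],
        ["match", "goal"],
        ["market", "goal", "goal"]]
raw = tfidf_weights(docs)
scaled = rescale_to_unit_interval(raw)
print(raw[0], scaled[0])
```

      A term that occurs in every document gets idf = log(1) = 0 and therefore weight 0, which matches the intent of down-weighting uninformative keywords.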
  • 34. Classifier Naive Bayesian (CNB)
      • Bayesian learning is a probability-driven algorithm based on Bayes' probability theorem. It is highly recommended for Arabic text classification.
      • The Naive Bayesian classifier can often outperform more sophisticated classification methods.
      • The classifier's task is to categorize incoming objects into their appropriate class.
      • By Bayes' theorem, $P(\text{class} \mid \text{document}) = P(\text{class}) \, P(\text{document} \mid \text{class}) / P(\text{document})$, where:
          – $P(\text{class} \mid \text{document})$ is the probability of a class given a document, i.e., the probability that a given document D belongs to a given class C. This is our target.
          – $P(\text{document})$ is the probability of a document. It is a constant divisor in every calculation, so we can ignore it.
          – $P(\text{class})$ is the probability of a class (or category). It can be computed as the number of documents in the category divided by the number of documents in all categories.
          – $P(\text{document} \mid \text{class})$ is the probability of a document given a class.
      • Documents can be modeled as sets of words, so $P(\text{document} \mid \text{class})$ can be written as the product $\prod_i p(word_i \mid C)$, where $p(word_i \mid C)$ is the probability that the i-th word of a given document occurs in a document from class C.
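      A minimal sketch of a Naive Bayes text classifier along the lines of this slide is given below. The slide's formula for p(word_i | C) is not available in the transcript, so the add-one (Laplace) smoothed word frequency used here is an assumption, as are the class and method names. Log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow, and P(document) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists (illustrative sketch)."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.total_docs = len(labels)
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_posterior(self, tokens, label):
        # log P(class) + sum_i log p(word_i | class); P(document) is omitted.
        score = math.log(self.class_doc_counts[label] / self.total_docs)
        total = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for word in tokens:
            if word not in self.vocabulary:
                continue  # words never seen in training carry no evidence
            # Assumption: add-one smoothing for p(word | class).
            score += math.log((self.word_counts[label][word] + 1) / (total + vocab_size))
        return score

    def predict(self, tokens):
        return max(self.class_doc_counts, key=lambda c: self.log_posterior(tokens, c))


train_docs = [["goal", "match", "team"], ["match", "player"],
              ["market", "bank", "price"], ["bank", "stock"]]
train_labels = ["sports", "sports", "economy", "economy"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["match", "goal"]))
```

      The English tokens are placeholders; in the setting of the slides the inputs would be preprocessed Arabic keywords.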
  • 35. Classifier K-Nearest Neighbor (CK-NN) • K-Nearest Neighbor is a widely used text classifier, especially in non-English text mining, because of its simplicity and efficiency. • It is a supervised learning algorithm in which a new query instance is classified based on the categories of its K nearest neighbors. • Its training phase consists of nothing more than storing all training examples as the classifier. • It works by using the minimum distance from the query instance to the training samples to determine the K nearest neighbors. • After gathering the K nearest neighbors, we take a simple majority vote among them as the prediction for the query instance. • The CK-NN algorithm uses several multivariate attributes $X_i$ to classify the object Y. We will deal only with quantitative $X_i$ and binary (nominal) Y.
  • 36. Continue • Example: Suppose that the K factor is set to 8 (there are 8 nearest neighbors) as a parameter of this algorithm. The distance between the query instance and every training sample is then computed, which is possible because all $X_i$ are quantitative. The distances of all training samples to the query are sorted to determine the K-th smallest distance, and every training sample whose distance to the query is less than or equal to this K-th smallest distance is included as a nearest neighbor. • These K training samples are the K nearest neighbors of the unknown sample, and the unknown sample is assigned the most common class among them. • Closeness is defined in terms of Euclidean distance, where the Euclidean distance between two points $X = (x_1, x_2, \dots, x_n)$ and $Y = (y_1, y_2, \dots, y_n)$ is: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
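The procedure described on the two CK-NN slides can be sketched as follows. This is a minimal illustration rather than the ATC code: the names `euclidean` and `knn_classify` and the two-dimensional toy samples are assumptions. As in the example above, every training sample whose distance is at most the K-th smallest distance counts as a neighbor, so ties at that distance are included. The sketch assumes K is not larger than the number of training samples.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, samples, k):
    """samples: list of (vector, label) pairs; returns the majority label."""
    dists = [(euclidean(query, x), label) for x, label in samples]
    kth = sorted(d for d, _ in dists)[k - 1]      # K-th smallest distance
    # Neighbors: all samples at distance <= the K-th smallest distance
    votes = Counter(label for d, label in dists if d <= kth)
    return votes.most_common(1)[0][0]

# Toy training samples with hypothetical labels
samples = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.0), "B"),
]
print(knn_classify((0.5, 0.5), samples, k=3))
print(knn_classify((5.5, 5.0), samples, k=3))
```

In a text classifier the vectors would be the tf-idf weights of the keywords, so every document is compared with the query in the same term space.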
  • 38. Implementation of the ATC The proposed Arabic Text Classifier (ATC) demonstrates the importance of classifying Arabic text in web documents for information retrieval. It illustrates both the keyword extraction and the text classifiers used in the implementation of the algorithms: • Keyword extraction: Text web documents are scanned to find the keywords, and each one is normalized. The normalization process consists of removing stop words, removing punctuation marks, and normalizing Arabic letters: the hamza forms (إ), (أ) and (آ) are all normalized to alef (ا) only, ى is replaced by ي, and ة by ه. Moreover, any elongation (tatweel) character "ـ" is removed, and the root or stem of each keyword is found. In this work a light stemmer is used to find the requested keywords. • Some stop words:
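The letter normalization rules listed for keyword extraction can be sketched as below. This is an illustration under assumptions, not the ATC code: the stop-word set is a tiny illustrative sample (the list shown on the slide is not reproduced in this transcript), the names `normalize` and `extract_keywords` are hypothetical, and light stemming is left out. The tokenizer keeps only runs of basic Arabic letters, so punctuation and digits are dropped; diacritics are not handled.

```python
import re

def normalize(token):
    """Apply the letter normalization rules to one token."""
    token = token.replace("\u0640", "")                       # remove tatweel
    token = re.sub("[\u0625\u0623\u0622]", "\u0627", token)   # hamza forms -> alef
    token = token.replace("\u0649", "\u064A")                 # alef maqsura -> yeh
    token = token.replace("\u0629", "\u0647")                 # teh marbuta -> heh
    return token

# Illustrative stop words only (an assumption), stored in normalized form
STOP_WORDS = {normalize(w) for w in ["في", "من", "على", "إلى"]}

def extract_keywords(text):
    """Tokenize, normalize, and drop stop words; stemming is not included."""
    tokens = re.findall("[\u0621-\u064A]+", text)  # runs of Arabic letters
    normalized = [normalize(t) for t in tokens]
    return [t for t in normalized if t not in STOP_WORDS]

keywords = extract_keywords("ذهب أحمد إلى المدرسة، في الصباح!")
print(keywords)
```

Stop words are compared after normalization, so a stop word written with a different hamza form or with elongation is still removed.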
  • 39. Continue • Term weighting: There are two criteria: • First criterion: the more times a term occurs in documents that belong to some category, the more relevant it is to that category. • Second criterion: the more a term appears in different documents representing different categories, the less useful it is for discriminating between documents belonging to different categories. • In this implementation we used the common approach of normalized tf×idf to overcome the problem of varying document lengths. • Algorithm implementation: this implementation was mainly developed to test the effectiveness of the CK-NN and CNB algorithms when applied to Arabic text. • A set of labeled text documents is supplied to the system; the labels indicate the class or classes that each text document belongs to. All documents in the data set must be labeled in order to train the system and then test it. • The system sees the labels of the training documents but not those of the test set. • The system compares the two classifiers and reports the one with the higher accuracy for the current labeled text documents. • The system compares the average accuracy rates of the two classifiers and uses the classifier with the higher average accuracy in the ATC system to start the retrieval process.
  • 41. Results and Discussion • The system used a data set consisting of 1562 Arabic documents of different lengths that belong to 6 categories: Economic "اقتصادية", Cultural "ثقافة", Political "سياسية", Social "اجتماعية", Sports "رياضة" and General "عامة". • To test the system, the documents in the data set were preprocessed to find the main categories. • Various splitting percentages were used to see how the number of training documents affects classification effectiveness. • Different k values, from 1 up to 20, were tried in order to find the best results for CK-NN. Effectiveness started to decline at k > 15. • The two algorithms were compared on the labeled sample data; the classifier selected by the system also depends on the training data.
  • 42. Continue • The k-fold cross-validation method is used to test the accuracy of the system. • Our results are roughly close to those reported by other developers. The results below indicate that the Classifier K-Nearest Neighbor (CK-NN), with an average accuracy of 86.02%, performs better than the Classifier Naïve Bayesian (CNB), which achieved 77.03%. • The ATC system in this case will use CK-NN for Arabic text classification and extraction instead of CNB. This does not mean that the ATC system will never use CNB for Arabic text classification, because in some cases, such as multilingual text, CNB shows a higher average accuracy than the other classification algorithm.
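The evaluation and selection step can be sketched as k-fold cross-validation followed by picking the classifier with the higher average accuracy. This is an illustration under assumptions, not the ATC code: the function `k_fold_accuracy`, the two stand-in classifiers `majority` and `nearest`, and the one-dimensional toy data are hypothetical, and the folds are built by simple striding without shuffling. The sketch assumes there are at least k samples.

```python
from collections import Counter

def k_fold_accuracy(samples, train_and_predict, k):
    """Average accuracy over k folds.
    samples: list of (features, label); train_and_predict(train, queries) -> labels."""
    folds = [samples[i::k] for i in range(k)]     # strided, unshuffled folds
    accuracies = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predicted = train_and_predict(train, [x for x, _ in test])
        correct = sum(p == y for p, (_, y) in zip(predicted, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

def majority(train, queries):
    """Stand-in baseline: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return [label for _ in queries]

def nearest(train, queries):
    """Stand-in 1-nearest-neighbor on a single numeric feature."""
    def closest(q):
        return min(train, key=lambda s: abs(s[0][0] - q[0]))[1]
    return [closest(q) for q in queries]

# Toy data: one numeric feature, two hypothetical classes
samples = [((float(v),), "low") for v in range(6)] + \
          [((float(v),), "high") for v in range(10, 16)]
scores = {name: k_fold_accuracy(samples, clf, k=3)
          for name, clf in [("majority", majority), ("nearest", nearest)]}
best = max(scores, key=scores.get)   # classifier with the higher average accuracy
print(scores, best)
```

In the ATC setting the two stand-ins would be replaced by the CK-NN and CNB classifiers, and the one with the higher average accuracy would be used for retrieval.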
  • 44. Conclusion • The use of the Classifier K-Nearest Neighbor (CK-NN) and the Classifier Naïve Bayes (CNB) for the classification of Arabic text was evaluated. • A special corpus was developed, consisting of 1562 documents that belong to 6 categories. • A feature set of extracted keywords and term weighting, used to improve performance, was presented as well. • We applied two algorithms for classifying Arabic text, with a satisfactory number of patterns for each category. • The accuracy of the system was measured using the k-fold cross-validation method. • We proposed an empirical Arabic text classifier system called ATC. • The ATC compares the results of both classifiers used (CK-NN, CNB) and selects the one with the best average accuracy rate.
  • 46. References [1] Aas K. and Eikvil L. (1999). Text Categorisation: A Survey. Technical report, Norwegian Computing Center. [2] Alexandrov M., Gelbukh A. and Lozovo. (2001). Chi-square Classifier for Document Categorization. 2nd International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City. [3] Al-Shalabi R. and Evens M. (1998). A Computational Morphology System for Arabic. Workshop on Semitic Language Processing, COLING-ACL'98, University of Montreal, Montreal, PQ, Canada, pp. 66-72. [4] Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. (1984). Classification and Regression Trees. 1st Edn., Chapman Hall, New York, ISBN: 0-412-04841-8. [5] Bordogna G. and Pasi G. (2001). A user-adaptive indexing model of structured documents. Fuzzy Systems, 2001. The 10th IEEE International Conference on, 2-5 Dec. 2001, Volume: 2, pp. 984-989 vol.3, ISBN: 0-7803-7293-X. [6] Ciravegna F., Gilardoni L., Lavelli A., Ferraro M., Mana N., Mazza S., Matiasek J., Black W. and Rinaldi F. (2000). Flexible Text Classification for Financial Applications: the FACILE System. In Proceedings of PAIS-2000, Prestigious Applications of Intelligent Systems sub-conference of ECAI2000. [7] Chang C. and Lui S. (2001). IEPAD: information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web (Hong Kong, May 01-05, 2001). WWW '01. ACM, New York, NY, 681-688. DOI= http://doi.acm.org/10.1145/371920.372182